2005-01-07 02:29:27 +00:00
|
|
|
/*-
|
2017-11-30 15:48:35 +00:00
|
|
|
* SPDX-License-Identifier: (BSD-4-Clause AND MIT-CMU)
|
2017-11-18 14:26:50 +00:00
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Copyright (c) 1991, 1993
|
|
|
|
* The Regents of the University of California. All rights reserved.
|
1994-05-25 09:21:21 +00:00
|
|
|
* Copyright (c) 1994 John S. Dyson
|
|
|
|
* All rights reserved.
|
|
|
|
* Copyright (c) 1994 David Greenman
|
|
|
|
* All rights reserved.
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
|
|
|
* This code is derived from software contributed to Berkeley by
|
|
|
|
* The Mach Operating System project at Carnegie-Mellon University.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 3. All advertising materials mentioning features or use of this software
|
2000-03-27 20:41:17 +00:00
|
|
|
* must display the following acknowledgement:
|
1994-05-24 10:09:53 +00:00
|
|
|
* This product includes software developed by the University of
|
|
|
|
* California, Berkeley and its contributors.
|
|
|
|
* 4. Neither the name of the University nor the names of its contributors
|
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1994-08-02 07:55:43 +00:00
|
|
|
* from: @(#)vm_fault.c 8.4 (Berkeley) 1/12/94
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
|
|
|
*
|
|
|
|
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
|
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. This
work represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Permission to use, copy, modify and distribute this software and
|
|
|
|
* its documentation is hereby granted, provided that both the copyright
|
|
|
|
* notice and this permission notice appear in all copies of the
|
|
|
|
* software, derivative works or modified versions, and any portions
|
|
|
|
* thereof, and that both notices appear in supporting documentation.
|
1995-01-09 16:06:02 +00:00
|
|
|
*
|
|
|
|
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
|
|
|
|
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
|
1994-05-24 10:09:53 +00:00
|
|
|
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
|
1995-01-09 16:06:02 +00:00
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Carnegie Mellon requests users of this software to return to
|
|
|
|
*
|
|
|
|
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
|
|
|
|
* School of Computer Science
|
|
|
|
* Carnegie Mellon University
|
|
|
|
* Pittsburgh PA 15213-3890
|
|
|
|
*
|
|
|
|
* any improvements or extensions that they make and grant Carnegie the
|
|
|
|
* rights to redistribute these changes.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Page fault handling module.
|
|
|
|
*/
|
2003-06-11 23:50:51 +00:00
|
|
|
|
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2012-04-05 17:13:14 +00:00
|
|
|
#include "opt_ktrace.h"
|
2007-12-29 19:53:04 +00:00
|
|
|
#include "opt_vm.h"
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
2001-05-22 00:56:25 +00:00
|
|
|
#include <sys/kernel.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/lock.h>
|
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case. The infrequent case being reading a single file that
is much larger than memory using mmap(2). And, in this case, the page
daemon isn't the bottleneck; it's the I/O.
In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused. To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes. In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times. With the new heuristic, I see fewer
false positives and they have a much lower cost.
I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic. However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise(). In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.
Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.
Reviewed by: jeff, kib
Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
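The commit message above describes the new heuristic as the equivalent of an automatic madvise(..., MADV_DONTNEED). A minimal userland sketch of doing the same thing by hand during a sequential scan of a mapping might look like the following. The helper name `advise_behind` and the constant `DONTNEED_MIN_BYTES` are hypothetical; the constant only borrows its value from `VM_FAULT_DONTNEED_MIN`. This is not the kernel's implementation.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical threshold; same value as VM_FAULT_DONTNEED_MIN. */
#define DONTNEED_MIN_BYTES	((size_t)1048576)

/*
 * Advise the kernel that the first `consumed` bytes of the mapping at
 * `base` are no longer needed, but only once at least
 * DONTNEED_MIN_BYTES have been consumed.
 *
 * Returns 1 if advice was issued, 0 if it was skipped, -1 on error.
 * Note: the effect of MADV_DONTNEED differs between systems; on some
 * it is only a hint, on others it may discard anonymous pages.
 */
static int
advise_behind(void *base, size_t consumed)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t len;

	assert(pagesz > 0);
	/* madvise() requires a page-aligned start address. */
	assert((uintptr_t)base % (size_t)pagesz == 0);

	if (consumed < DONTNEED_MIN_BYTES)
		return (0);
	len = consumed - consumed % (size_t)pagesz;	/* whole pages only */
	if (madvise(base, len, MADV_DONTNEED) != 0)
		return (-1);
	return (1);
}
```

An application scanning a large file would call `advise_behind()` periodically with the number of bytes it has already processed. The heuristic in the commit aims to make such calls unnecessary for the sequential case.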
|
|
|
#include <sys/mman.h>
|
2019-07-08 19:46:20 +00:00
|
|
|
#include <sys/mutex.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/proc.h>
|
2016-04-07 04:23:25 +00:00
|
|
|
#include <sys/racct.h>
|
2019-09-12 16:26:59 +00:00
|
|
|
#include <sys/refcount.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/resourcevar.h>
|
2013-03-09 02:32:23 +00:00
|
|
|
#include <sys/rwlock.h>
|
2019-09-27 18:43:36 +00:00
|
|
|
#include <sys/signalvar.h>
|
2001-05-19 01:28:09 +00:00
|
|
|
#include <sys/sysctl.h>
|
2019-09-27 18:43:36 +00:00
|
|
|
#include <sys/sysent.h>
|
2001-05-22 00:56:25 +00:00
|
|
|
#include <sys/vmmeter.h>
|
|
|
|
#include <sys/vnode.h>
|
2012-04-05 17:13:14 +00:00
|
|
|
#ifdef KTRACE
|
|
|
|
#include <sys/ktrace.h>
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <vm/vm.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_param.h>
|
|
|
|
#include <vm/pmap.h>
|
|
|
|
#include <vm/vm_map.h>
|
|
|
|
#include <vm/vm_object.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <vm/vm_page.h>
|
|
|
|
#include <vm/vm_pageout.h>
|
1994-11-06 09:55:31 +00:00
|
|
|
#include <vm/vm_kern.h>
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although it will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #13 and #14
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
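Item 2 of the commit message above says that objects now carry a type, and that this type indexes the appropriate pager. A rough sketch of that dispatch pattern follows. Every name here (`enum obj_type`, `struct pager_ops`, `pager_table`, `pager_has_page`) is hypothetical and simplified for illustration; the real kernel structures and the real haspage signature differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object types; the kernel's enumeration differs. */
enum obj_type { OBJ_DEFAULT_T, OBJ_SWAP_T, OBJ_VNODE_T, OBJ_NTYPES };

struct object {
	enum obj_type type;	/* selects the pager */
	unsigned long size;	/* object size in pages */
};

/* Hypothetical subset of the per-pager operations. */
struct pager_ops {
	int (*haspage)(struct object *obj, unsigned long pindex);
};

static int
default_haspage(struct object *obj, unsigned long pindex)
{
	(void)obj;
	(void)pindex;
	return (0);	/* toy rule: nothing backs a fresh anonymous object */
}

static int
sized_haspage(struct object *obj, unsigned long pindex)
{
	return (pindex < obj->size);	/* toy rule: any in-range page exists */
}

static const struct pager_ops default_ops = { default_haspage };
static const struct pager_ops swap_ops = { sized_haspage };
static const struct pager_ops vnode_ops = { sized_haspage };

/* The object's type is the index; no separate pager structure exists. */
static const struct pager_ops *const pager_table[OBJ_NTYPES] = {
	[OBJ_DEFAULT_T] = &default_ops,
	[OBJ_SWAP_T] = &swap_ops,
	[OBJ_VNODE_T] = &vnode_ops,
};

static int
pager_has_page(struct object *obj, unsigned long pindex)
{
	assert(obj->type < OBJ_NTYPES);
	return (pager_table[obj->type]->haspage(obj, pindex));
}
```

Callers pass only the object; the table lookup replaces the older arrangement in which a pager was both a data structure and a set of routines.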
|
|
|
#include <vm/vm_pager.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_extern.h>
|
Fix the root cause of the "vm_reserv_populate: reserv <address> is already
promoted" panics. The sequence of events that leads to a panic is rather
long and circuitous. First, suppose that process P has a promoted
superpage S within vm object O that it can write to. Then, suppose that P
forks, which leads to S being write protected. Now, before P's child
exits, suppose that P writes to another virtual page within O. Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page. Then, before
P can fault on S, P's child exits. Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page. Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way. However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics! Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O. This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted. Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken. It makes little sense. Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.
The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier. There, we will quietly break the promoted reservation. While
illogical and unintended, breaking the reservation is essentially
harmless.
PR: 198163
Reviewed by: kib
Tested by: pho
X-MFC after: r267213
Sponsored by: EMC / Isilon Storage Division
2015-03-19 01:40:43 +00:00
|
|
|
#include <vm/vm_reserv.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2003-10-03 22:46:53 +00:00
|
|
|
#define PFBAK 4
|
|
|
|
#define PFFOR 4
|
|
|
|
|
Revamp the default page clustering strategy that is used by the page fault
handler. For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on. Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.
The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library. (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.) Suppose that we have a
code section consisting of 32 pages. Further, suppose that we access
pages 4, 28, and 16 in that order. Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages. In general, truncated
clusters are more common than full clusters.
To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size. This results in many fewer
truncated clusters. Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will now include page 16. So, the access to page 16 will
no longer page fault and perform an I/O operation.
Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed. However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed. Moreover, the extra resident
pages allowed for many more superpage mappings. For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%. Finally, this leads to a small but
measurable reduction in execution time.
In collaboration with: Emily Pettigrew <ejp1@rice.edu>
Differential Revision: https://reviews.freebsd.org/D1500
Reviewed by: jhb, kib
MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
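The worked example in the commit message above (a 32-page code section, 8 pages read behind and 7 ahead) can be replayed with a small model. This is only a sketch under simplifying assumptions: it tracks residency in a flat array, ignores the pager and the header read, and the names `count_faults`, `NPAGES`, `BEHIND`, `AHEAD`, and `CLUSTER` are hypothetical. It is not the code from the revision.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES	32			/* pages in the example code section */
#define BEHIND	8			/* pages read behind the fault */
#define AHEAD	7			/* pages read ahead of the fault */
#define CLUSTER	(BEHIND + 1 + AHEAD)	/* full cluster: 16 pages */

/*
 * Count hard page faults for an access sequence.  With aligned == false
 * the cluster is positioned relative to the faulting page and truncated
 * at the section boundaries; with aligned == true the cluster starts at
 * a multiple of CLUSTER.
 */
static int
count_faults(const int *access, int n, bool aligned)
{
	bool resident[NPAGES];
	int faults = 0;

	memset(resident, 0, sizeof(resident));
	for (int i = 0; i < n; i++) {
		int p = access[i];
		int start, end;

		assert(p >= 0 && p < NPAGES);
		if (resident[p])
			continue;	/* already in memory: no fault */
		faults++;
		if (aligned) {
			start = p - p % CLUSTER;
			end = start + CLUSTER - 1;
		} else {
			start = p - BEHIND;
			end = p + AHEAD;
		}
		if (start < 0)
			start = 0;
		if (end >= NPAGES)
			end = NPAGES - 1;
		for (int q = start; q <= end; q++)
			resident[q] = true;
	}
	return (faults);
}
```

In this model, the access order 4, 28, 16 costs three faults with the unaligned strategy and two with the aligned one, while 8, 24, 16 costs two faults even without alignment, matching the counts given in the commit message.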
|
|
|
#define VM_FAULT_READ_DEFAULT (1 + VM_FAULT_READ_AHEAD_INIT)
|
2012-05-10 15:16:42 +00:00
|
|
|
#define VM_FAULT_READ_MAX (1 + VM_FAULT_READ_AHEAD_MAX)
|
2015-04-04 19:10:22 +00:00
|
|
|
|
|
|
|
#define VM_FAULT_DONTNEED_MIN 1048576
|
1994-05-25 09:21:21 +00:00
|
|
|
|
1998-03-07 20:45:47 +00:00
|
|
|
struct faultstate {
|
2020-01-23 05:03:34 +00:00
|
|
|
/* Fault parameters. */
|
|
|
|
vm_offset_t vaddr;
|
|
|
|
vm_page_t *m_hold;
|
|
|
|
vm_prot_t fault_type;
|
|
|
|
vm_prot_t prot;
|
|
|
|
int fault_flags;
|
2020-01-23 05:19:39 +00:00
|
|
|
int oom;
|
2020-01-23 05:03:34 +00:00
|
|
|
boolean_t wired;
|
|
|
|
|
|
|
|
/* Page reference for cow. */
|
2020-01-17 03:44:04 +00:00
|
|
|
vm_page_t m_cow;
|
2020-01-23 05:03:34 +00:00
|
|
|
|
|
|
|
/* Current object. */
|
|
|
|
vm_object_t object;
|
|
|
|
vm_pindex_t pindex;
|
|
|
|
vm_page_t m;
|
|
|
|
|
|
|
|
/* Top-level map object. */
|
1998-03-07 20:45:47 +00:00
|
|
|
vm_object_t first_object;
|
2020-01-23 05:03:34 +00:00
|
|
|
vm_pindex_t first_pindex;
|
|
|
|
vm_page_t first_m;
|
|
|
|
|
|
|
|
/* Map state. */
|
|
|
|
vm_map_t map;
|
|
|
|
vm_map_entry_t entry;
|
|
|
|
int map_generation;
|
|
|
|
bool lookup_still_valid;
|
|
|
|
|
|
|
|
/* Vnode if locked. */
|
|
|
|
struct vnode *vp;
|
1998-03-07 20:45:47 +00:00
|
|
|
};
|
|
|
|
|
2015-04-04 19:10:22 +00:00
|
|
|
static void vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr,
|
|
|
|
int ahead);
|
2014-02-02 20:21:53 +00:00
|
|
|
static void vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
|
2018-04-29 12:43:08 +00:00
|
|
|
int backward, int forward, bool obj_locked);
|
2012-05-10 15:16:42 +00:00
|
|
|
|
2019-08-16 09:43:49 +00:00
|
|
|
static int vm_pfault_oom_attempts = 3;
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, pfault_oom_attempts, CTLFLAG_RWTUN,
|
|
|
|
&vm_pfault_oom_attempts, 0,
|
|
|
|
"Number of page allocation attempts in page fault handler before it "
|
|
|
|
"triggers OOM handling");
|
|
|
|
|
|
|
|
static int vm_pfault_oom_wait = 10;
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, pfault_oom_wait, CTLFLAG_RWTUN,
|
|
|
|
&vm_pfault_oom_wait, 0,
|
|
|
|
"Number of seconds to wait for free pages before retrying "
|
|
|
|
"the page fault handler");
|
|
|
|
|
2006-03-08 06:31:46 +00:00
|
|
|
static inline void
|
2019-12-15 04:08:24 +00:00
|
|
|
fault_page_release(vm_page_t *mp)
|
1998-03-07 20:45:47 +00:00
|
|
|
{
|
2019-12-15 04:08:24 +00:00
|
|
|
vm_page_t m;
|
2009-02-08 19:37:01 +00:00
|
|
|
|
2019-12-15 04:08:24 +00:00
|
|
|
m = *mp;
|
|
|
|
if (m != NULL) {
|
2019-11-06 16:59:16 +00:00
|
|
|
/*
|
2019-12-15 04:08:24 +00:00
|
|
|
* We are likely to loop around again and attempt to busy
|
|
|
|
* this page. Deactivating it leaves it available for
|
|
|
|
* pageout while optimizing fault restarts.
|
2019-11-06 16:59:16 +00:00
|
|
|
*/
|
2019-12-15 04:08:24 +00:00
|
|
|
vm_page_deactivate(m);
|
|
|
|
vm_page_xunbusy(m);
|
|
|
|
*mp = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
fault_page_free(vm_page_t *mp)
|
|
|
|
{
|
|
|
|
vm_page_t m;
|
|
|
|
|
|
|
|
m = *mp;
|
|
|
|
if (m != NULL) {
|
|
|
|
VM_OBJECT_ASSERT_WLOCKED(m->object);
|
|
|
|
if (!vm_page_wired(m))
|
|
|
|
vm_page_free(m);
|
2019-12-22 20:35:50 +00:00
|
|
|
else
|
|
|
|
vm_page_xunbusy(m);
|
2019-12-15 04:08:24 +00:00
|
|
|
*mp = NULL;
|
2019-10-23 20:39:21 +00:00
|
|
|
}
|
1998-03-07 20:45:47 +00:00
|
|
|
}
|
|
|
|
|
2006-03-08 06:31:46 +00:00
|
|
|
static inline void
|
1998-03-07 20:45:47 +00:00
|
|
|
unlock_map(struct faultstate *fs)
|
|
|
|
{
|
2009-02-08 19:37:01 +00:00
|
|
|
|
2002-03-18 15:08:09 +00:00
|
|
|
if (fs->lookup_still_valid) {
|
1998-03-07 20:45:47 +00:00
|
|
|
vm_map_lookup_done(fs->map, fs->entry);
|
2016-10-29 21:01:49 +00:00
|
|
|
fs->lookup_still_valid = false;
|
1998-03-07 20:45:47 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-10-29 18:03:29 +00:00
|
|
|
static void
|
|
|
|
unlock_vp(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (fs->vp != NULL) {
|
|
|
|
vput(fs->vp);
|
|
|
|
fs->vp = NULL;
|
|
|
|
}
|
|
|
|
}

static void
fault_deallocate(struct faultstate *fs)
{

	fault_page_release(&fs->m_cow);
	fault_page_release(&fs->m);
	vm_object_pip_wakeup(fs->object);
	if (fs->object != fs->first_object) {
		VM_OBJECT_WLOCK(fs->first_object);
		fault_page_free(&fs->first_m);
		VM_OBJECT_WUNLOCK(fs->first_object);
		vm_object_pip_wakeup(fs->first_object);
	}
	vm_object_deallocate(fs->first_object);
	unlock_map(fs);
	unlock_vp(fs);
}

static void
unlock_and_deallocate(struct faultstate *fs)
{

	VM_OBJECT_WUNLOCK(fs->object);
	fault_deallocate(fs);
}

static void
vm_fault_dirty(struct faultstate *fs, vm_page_t m)
{
	bool need_dirty;

	if (((fs->prot & VM_PROT_WRITE) == 0 &&
	    (fs->fault_flags & VM_FAULT_DIRTY) == 0) ||
	    (m->oflags & VPO_UNMANAGED) != 0)
		return;

	VM_PAGE_OBJECT_BUSY_ASSERT(m);

	need_dirty = ((fs->fault_type & VM_PROT_WRITE) != 0 &&
	    (fs->fault_flags & VM_FAULT_WIRE) == 0) ||
	    (fs->fault_flags & VM_FAULT_DIRTY) != 0;

	vm_object_set_writeable_dirty(m->object);

	/*
	 * If the fault is a write, we know that this page is being
	 * written NOW so dirty it explicitly to save on
	 * pmap_is_modified() calls later.
	 *
	 * Also, since the page is now dirty, we can possibly tell
	 * the pager to release any swap backing the page.
	 */
	if (need_dirty && vm_page_set_dirty(m) == 0) {
		/*
		 * If this is a NOSYNC mmap we do not want to set PGA_NOSYNC
		 * if the page is already dirty to prevent data written with
		 * the expectation of being synced from not being synced.
		 * Likewise if this entry does not request NOSYNC then make
		 * sure the page isn't marked NOSYNC. Applications sharing
		 * data should use the same flags to avoid ping ponging.
		 */
		if ((fs->entry->eflags & MAP_ENTRY_NOSYNC) != 0)
			vm_page_aflag_set(m, PGA_NOSYNC);
		else
			vm_page_aflag_clear(m, PGA_NOSYNC);
	}

}
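
The two predicates in vm_fault_dirty() can be easier to follow when pulled out as pure functions. The following is a minimal user-space sketch, not kernel code: the `TOY_*` constants and the `toy_skip_dirty()` / `toy_need_dirty()` helpers are hypothetical stand-ins chosen for illustration, and their numeric values are not claimed to match the kernel's `VM_PROT_WRITE`, `VM_FAULT_WIRE`, or `VM_FAULT_DIRTY`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define TOY_PROT_WRITE	0x02
#define TOY_FAULT_WIRE	0x01
#define TOY_FAULT_DIRTY	0x02

/*
 * Mirrors the early-return test: nothing to do when the mapping is not
 * writable and no dirtying was requested, or when the page is unmanaged.
 */
static bool
toy_skip_dirty(int prot, int fault_flags, bool unmanaged)
{
	return (((prot & TOY_PROT_WRITE) == 0 &&
	    (fault_flags & TOY_FAULT_DIRTY) == 0) || unmanaged);
}

/*
 * Mirrors the need_dirty expression: a write fault that is not part of
 * wiring dirties the page, as does an explicit dirty request.
 */
static bool
toy_need_dirty(int fault_type, int fault_flags)
{
	return (((fault_type & TOY_PROT_WRITE) != 0 &&
	    (fault_flags & TOY_FAULT_WIRE) == 0) ||
	    (fault_flags & TOY_FAULT_DIRTY) != 0);
}
```

Under these assumptions, a write fault taken while wiring does not mark the page dirty by itself, whereas an explicit dirty request always does.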

/*
 * Unlocks fs.first_object and fs.map on success.
 */
static int
vm_fault_soft_fast(struct faultstate *fs)
{
	vm_page_t m, m_map;
#if VM_NRESERVLEVEL > 0
	vm_page_t m_super;
	int flags;
#endif
	int psind, rv;
	vm_offset_t vaddr;

	MPASS(fs->vp == NULL);
	vaddr = fs->vaddr;
	vm_object_busy(fs->first_object);
	m = vm_page_lookup(fs->first_object, fs->first_pindex);
	/* A busy page can be mapped for read|execute access. */
	if (m == NULL || ((fs->prot & VM_PROT_WRITE) != 0 &&
	    vm_page_busied(m)) || !vm_page_all_valid(m)) {
		rv = KERN_FAILURE;
		goto out;
	}
	m_map = m;
	psind = 0;
#if VM_NRESERVLEVEL > 0
	if ((m->flags & PG_FICTITIOUS) == 0 &&
	    (m_super = vm_reserv_to_superpage(m)) != NULL &&
	    rounddown2(vaddr, pagesizes[m_super->psind]) >= fs->entry->start &&
	    roundup2(vaddr + 1, pagesizes[m_super->psind]) <= fs->entry->end &&
	    (vaddr & (pagesizes[m_super->psind] - 1)) == (VM_PAGE_TO_PHYS(m) &
	    (pagesizes[m_super->psind] - 1)) && !fs->wired &&
	    pmap_ps_enabled(fs->map->pmap)) {
		flags = PS_ALL_VALID;
		if ((fs->prot & VM_PROT_WRITE) != 0) {
			/*
			 * Create a superpage mapping allowing write access
			 * only if none of the constituent pages are busy and
			 * all of them are already dirty (except possibly for
			 * the page that was faulted on).
			 */
			flags |= PS_NONE_BUSY;
			if ((fs->first_object->flags & OBJ_UNMANAGED) == 0)
				flags |= PS_ALL_DIRTY;
		}
		if (vm_page_ps_test(m_super, flags, m)) {
			m_map = m_super;
			psind = m_super->psind;
			vaddr = rounddown2(vaddr, pagesizes[psind]);
			/* Preset the modified bit for dirty superpages. */
			if ((flags & PS_ALL_DIRTY) != 0)
				fs->fault_type |= VM_PROT_WRITE;
		}
	}
#endif
	rv = pmap_enter(fs->map->pmap, vaddr, m_map, fs->prot, fs->fault_type |
	    PMAP_ENTER_NOSLEEP | (fs->wired ? PMAP_ENTER_WIRED : 0), psind);
	if (rv != KERN_SUCCESS)
		goto out;
	if (fs->m_hold != NULL) {
		(*fs->m_hold) = m;
Change synchronization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
		vm_page_wire(m);
	}
	if (psind == 0 && !fs->wired)
		vm_fault_prefault(fs, vaddr, PFBAK, PFFOR, true);
	VM_OBJECT_RUNLOCK(fs->first_object);
	vm_fault_dirty(fs, m);
	vm_map_lookup_done(fs->map, fs->entry);
	curthread->td_ru.ru_minflt++;

out:
	vm_object_unbusy(fs->first_object);
	return (rv);
}
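
The address arithmetic that vm_fault_soft_fast() uses to decide whether a superpage mapping fits can be isolated as follows. This is a hedged, user-space sketch of only the three address conditions (the aligned range lies inside the map entry, and the virtual and physical addresses share the same offset within the superpage). The `toy_superpage_fits()` name and its parameters are assumptions made for illustration; the kernel's additional checks (fictitious pages, wiring, pmap_ps_enabled(), vm_page_ps_test()) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * "size" must be a power of two, as superpage sizes are.
 * entry_start/entry_end describe the half-open range [start, end).
 */
static bool
toy_superpage_fits(uint64_t vaddr, uint64_t paddr, uint64_t entry_start,
    uint64_t entry_end, uint64_t size)
{
	uint64_t mask = size - 1;
	/* Equivalent of rounddown2(vaddr, size). */
	uint64_t lo = vaddr & ~mask;
	/* Equivalent of roundup2(vaddr + 1, size). */
	uint64_t hi = (vaddr + 1 + mask) & ~mask;

	return (lo >= entry_start && hi <= entry_end &&
	    (vaddr & mask) == (paddr & mask));
}
```

With a 2 MB superpage size, a fault at 0x3ff000 backed by a physical page at offset 0x1ff000 within its superpage qualifies only if the entry covers all of [0x200000, 0x400000).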
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object's type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager's vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there are no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
static void
vm_fault_restore_map_lock(struct faultstate *fs)
{

	VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
	MPASS(blockcount_read(&fs->first_object->paging_in_progress) > 0);

	if (!vm_map_trylock_read(fs->map)) {
		VM_OBJECT_WUNLOCK(fs->first_object);
		vm_map_lock_read(fs->map);
		VM_OBJECT_WLOCK(fs->first_object);
	}
	fs->lookup_still_valid = true;
}

static void
vm_fault_populate_check_page(vm_page_t m)
{

	/*
	 * Check each page to ensure that the pager is obeying the
	 * interface: the page must be installed in the object, fully
	 * valid, and exclusively busied.
	 */
	MPASS(m != NULL);
	MPASS(vm_page_all_valid(m));
	MPASS(vm_page_xbusied(m));
}

static void
vm_fault_populate_cleanup(vm_object_t object, vm_pindex_t first,
    vm_pindex_t last)
{
	vm_page_t m;
	vm_pindex_t pidx;

	VM_OBJECT_ASSERT_WLOCKED(object);
	MPASS(first <= last);
	for (pidx = first, m = vm_page_lookup(object, pidx);
	    pidx <= last; pidx++, m = vm_page_next(m)) {
		vm_fault_populate_check_page(m);
		vm_page_deactivate(m);
		vm_page_xunbusy(m);
	}
}

static int
vm_fault_populate(struct faultstate *fs)
{
	vm_offset_t vaddr;
	vm_page_t m;
	vm_pindex_t map_first, map_last, pager_first, pager_last, pidx;
	int i, npages, psind, rv;

	MPASS(fs->object == fs->first_object);
	VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
	MPASS(blockcount_read(&fs->first_object->paging_in_progress) > 0);
	MPASS(fs->first_object->backing_object == NULL);
	MPASS(fs->lookup_still_valid);

	pager_first = OFF_TO_IDX(fs->entry->offset);
	pager_last = pager_first + atop(fs->entry->end - fs->entry->start) - 1;
	unlock_map(fs);
	unlock_vp(fs);

	/*
	 * Call the pager (driver) populate() method.
	 *
	 * There is no guarantee that the method will be called again
	 * if the current fault is for read, and a future fault is
	 * for write. Report the entry's maximum allowed protection
	 * to the driver.
	 */
	rv = vm_pager_populate(fs->first_object, fs->first_pindex,
	    fs->fault_type, fs->entry->max_protection, &pager_first, &pager_last);

	VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
	if (rv == VM_PAGER_BAD) {
		/*
		 * VM_PAGER_BAD is the backdoor for a pager to request
		 * normal fault handling.
		 */
		vm_fault_restore_map_lock(fs);
		if (fs->map->timestamp != fs->map_generation)
			return (KERN_RESTART);
		return (KERN_NOT_RECEIVER);
	}
	if (rv != VM_PAGER_OK)
		return (KERN_FAILURE); /* AKA SIGSEGV */

	/* Ensure that the driver is obeying the interface. */
	MPASS(pager_first <= pager_last);
	MPASS(fs->first_pindex <= pager_last);
	MPASS(fs->first_pindex >= pager_first);
	MPASS(pager_last < fs->first_object->size);

	vm_fault_restore_map_lock(fs);
	if (fs->map->timestamp != fs->map_generation) {
		vm_fault_populate_cleanup(fs->first_object, pager_first,
		    pager_last);
		return (KERN_RESTART);
	}
|
2016-12-08 11:26:11 +00:00
|
|
|
|
2016-12-30 18:55:33 +00:00
|
|
|
/*
|
|
|
|
* The map is unchanged after our last unlock. Process the fault.
|
|
|
|
*
|
|
|
|
* The range [pager_first, pager_last] that is given to the
|
|
|
|
* pager is only a hint. The pager may populate any range
|
|
|
|
* within the object that includes the requested page index.
|
|
|
|
* In case the pager expanded the range, clip it to fit into
|
|
|
|
* the map entry.
|
|
|
|
*/
|
2017-03-19 19:52:47 +00:00
|
|
|
map_first = OFF_TO_IDX(fs->entry->offset);
|
|
|
|
if (map_first > pager_first) {
|
2016-12-30 18:55:33 +00:00
|
|
|
vm_fault_populate_cleanup(fs->first_object, pager_first,
|
|
|
|
map_first - 1);
|
2017-03-19 19:52:47 +00:00
|
|
|
pager_first = map_first;
|
|
|
|
}
|
|
|
|
map_last = map_first + atop(fs->entry->end - fs->entry->start) - 1;
|
|
|
|
if (map_last < pager_last) {
|
2016-12-30 18:55:33 +00:00
|
|
|
vm_fault_populate_cleanup(fs->first_object, map_last + 1,
|
|
|
|
pager_last);
|
2017-03-19 19:52:47 +00:00
|
|
|
pager_last = map_last;
|
|
|
|
}
|
|
|
|
for (pidx = pager_first, m = vm_page_lookup(fs->first_object, pidx);
|
2018-05-26 02:59:34 +00:00
|
|
|
pidx <= pager_last;
|
|
|
|
pidx += npages, m = vm_page_next(&m[npages - 1])) {
|
|
|
|
vaddr = fs->entry->start + IDX_TO_OFF(pidx) - fs->entry->offset;
|
Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping. (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.
Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).
In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter(). Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[]. All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.
Reviewed by: kib, markj, mmel
Tested by: mmel
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16555
2018-08-08 16:55:01 +00:00
|
|
|
#if defined(__aarch64__) || defined(__amd64__) || (defined(__arm__) && \
|
2019-02-13 17:19:37 +00:00
|
|
|
__ARM_ARCH >= 6) || defined(__i386__) || defined(__riscv)
|
2018-05-26 02:59:34 +00:00
|
|
|
psind = m->psind;
|
|
|
|
if (psind > 0 && ((vaddr & (pagesizes[psind] - 1)) != 0 ||
|
|
|
|
pidx + OFF_TO_IDX(pagesizes[psind]) - 1 > pager_last ||
|
2020-01-23 05:03:34 +00:00
|
|
|
!pmap_ps_enabled(fs->map->pmap) || fs->wired))
|
2018-05-26 02:59:34 +00:00
|
|
|
psind = 0;
|
|
|
|
#else
|
|
|
|
psind = 0;
|
|
|
|
#endif
|
|
|
|
npages = atop(pagesizes[psind]);
|
|
|
|
for (i = 0; i < npages; i++) {
|
|
|
|
vm_fault_populate_check_page(&m[i]);
|
2020-01-23 05:03:34 +00:00
|
|
|
vm_fault_dirty(fs, &m[i]);
|
2018-05-26 02:59:34 +00:00
|
|
|
}
|
2016-12-08 11:26:11 +00:00
|
|
|
VM_OBJECT_WUNLOCK(fs->first_object);
|
2020-01-23 05:03:34 +00:00
|
|
|
rv = pmap_enter(fs->map->pmap, vaddr, m, fs->prot, fs->fault_type |
|
|
|
|
(fs->wired ? PMAP_ENTER_WIRED : 0), psind);
|
2019-02-20 09:51:13 +00:00
|
|
|
#if defined(__amd64__)
|
|
|
|
if (psind > 0 && rv == KERN_FAILURE) {
|
|
|
|
for (i = 0; i < npages; i++) {
|
|
|
|
rv = pmap_enter(fs->map->pmap, vaddr + ptoa(i),
|
2020-01-23 05:03:34 +00:00
|
|
|
&m[i], fs->prot, fs->fault_type |
|
|
|
|
(fs->wired ? PMAP_ENTER_WIRED : 0), 0);
|
2019-02-20 09:51:13 +00:00
|
|
|
MPASS(rv == KERN_SUCCESS);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
MPASS(rv == KERN_SUCCESS);
|
|
|
|
#endif
|
2016-12-08 11:26:11 +00:00
|
|
|
VM_OBJECT_WLOCK(fs->first_object);
|
2018-05-26 02:59:34 +00:00
|
|
|
for (i = 0; i < npages; i++) {
|
2020-01-23 05:03:34 +00:00
|
|
|
if ((fs->fault_flags & VM_FAULT_WIRE) != 0)
|
2018-05-26 02:59:34 +00:00
|
|
|
vm_page_wire(&m[i]);
|
2019-12-28 19:04:00 +00:00
|
|
|
else
|
2018-05-26 02:59:34 +00:00
|
|
|
vm_page_activate(&m[i]);
|
2020-01-23 05:03:34 +00:00
|
|
|
if (fs->m_hold != NULL && m[i].pindex == fs->first_pindex) {
|
|
|
|
(*fs->m_hold) = &m[i];
|
2019-07-08 19:46:20 +00:00
|
|
|
vm_page_wire(&m[i]);
|
2018-05-26 02:59:34 +00:00
|
|
|
}
|
2019-09-10 18:27:45 +00:00
|
|
|
vm_page_xunbusy(&m[i]);
|
2016-12-08 11:26:11 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
curthread->td_ru.ru_majflt++;
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
}
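The range clipping performed in vm_fault_populate() above can be sketched in isolation. This is a minimal illustration, not the kernel code: `pindex_t`, `struct pindex_range` and `clip_to_entry()` are hypothetical names introduced here, and the sketch omits the vm_fault_populate_cleanup() calls that the real function makes on the pages it trims away.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's vm_pindex_t. */
typedef uint64_t pindex_t;

/* Inclusive range of page indexes, like [pager_first, pager_last]. */
struct pindex_range {
	pindex_t first;
	pindex_t last;
};

/*
 * Clip the range that the pager populated to the range covered by the
 * map entry, which starts at map_first and spans npages pages.  The
 * caller is assumed to guarantee that the ranges overlap, because the
 * faulting page index lies in both of them.
 */
static struct pindex_range
clip_to_entry(struct pindex_range pager, pindex_t map_first, pindex_t npages)
{
	pindex_t map_last = map_first + npages - 1;

	if (pager.first < map_first)
		pager.first = map_first;
	if (pager.last > map_last)
		pager.last = map_last;
	return (pager);
}
```

For example, a pager range of [2, 20] clipped to an entry that starts at index 4 and spans 8 pages yields [4, 11].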
|
|
|
|
|
2019-09-27 18:43:36 +00:00
|
|
|
static int prot_fault_translation;
|
|
|
|
SYSCTL_INT(_machdep, OID_AUTO, prot_fault_translation, CTLFLAG_RWTUN,
|
|
|
|
&prot_fault_translation, 0,
|
|
|
|
"Control signal to deliver on protection fault");
|
|
|
|
|
|
|
|
/* compat definition to keep common code for signal translation */
|
|
|
|
#define UCODE_PAGEFLT 12
|
|
|
|
#ifdef T_PAGEFLT
|
|
|
|
_Static_assert(UCODE_PAGEFLT == T_PAGEFLT, "T_PAGEFLT");
|
|
|
|
#endif
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2019-09-27 18:43:36 +00:00
|
|
|
* vm_fault_trap:
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
2000-03-26 15:20:23 +00:00
|
|
|
* Handle a page fault occurring at the given address,
|
1994-05-24 10:09:53 +00:00
|
|
|
* requiring the given permissions, in the map specified.
|
|
|
|
* If successful, the page is inserted into the
|
|
|
|
* associated physical map.
|
|
|
|
*
|
|
|
|
* NOTE: the given address should be truncated to the
|
|
|
|
* proper page address.
|
|
|
|
*
|
|
|
|
* KERN_SUCCESS is returned if the page fault is handled; otherwise,
|
|
|
|
* a standard error specifying why the fault is fatal is returned.
|
|
|
|
*
|
|
|
|
* The map in question must be referenced, and remains so.
|
2001-07-04 16:20:28 +00:00
|
|
|
* Caller may hold no locks.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
int
|
2019-09-27 18:43:36 +00:00
|
|
|
vm_fault_trap(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
|
|
|
|
int fault_flags, int *signo, int *ucode)
|
2010-12-20 22:49:31 +00:00
|
|
|
{
|
2012-04-05 17:13:14 +00:00
|
|
|
int result;
|
2010-12-20 22:49:31 +00:00
|
|
|
|
2019-09-27 18:43:36 +00:00
|
|
|
MPASS(signo == NULL || ucode != NULL);
|
2012-04-05 17:13:14 +00:00
|
|
|
#ifdef KTRACE
|
2019-10-13 06:56:45 +00:00
|
|
|
if (map != kernel_map && KTRPOINT(curthread, KTR_FAULT))
|
2012-04-05 17:13:14 +00:00
|
|
|
ktrfault(vaddr, fault_type);
|
|
|
|
#endif
|
2019-09-27 18:43:36 +00:00
|
|
|
result = vm_fault(map, trunc_page(vaddr), fault_type, fault_flags,
|
2013-08-05 08:55:35 +00:00
|
|
|
NULL);
|
2019-09-27 18:43:36 +00:00
|
|
|
KASSERT(result == KERN_SUCCESS || result == KERN_FAILURE ||
|
|
|
|
result == KERN_INVALID_ADDRESS ||
|
|
|
|
result == KERN_RESOURCE_SHORTAGE ||
|
|
|
|
result == KERN_PROTECTION_FAILURE ||
|
|
|
|
result == KERN_OUT_OF_BOUNDS,
|
|
|
|
("Unexpected Mach error %d from vm_fault()", result));
|
2012-04-05 17:13:14 +00:00
|
|
|
#ifdef KTRACE
|
2019-10-13 06:56:45 +00:00
|
|
|
if (map != kernel_map && KTRPOINT(curthread, KTR_FAULTEND))
|
2012-04-05 17:13:14 +00:00
|
|
|
ktrfaultend(result);
|
|
|
|
#endif
|
2019-09-27 18:43:36 +00:00
|
|
|
if (result != KERN_SUCCESS && signo != NULL) {
|
|
|
|
switch (result) {
|
|
|
|
case KERN_FAILURE:
|
|
|
|
case KERN_INVALID_ADDRESS:
|
|
|
|
*signo = SIGSEGV;
|
|
|
|
*ucode = SEGV_MAPERR;
|
|
|
|
break;
|
|
|
|
case KERN_RESOURCE_SHORTAGE:
|
|
|
|
*signo = SIGBUS;
|
|
|
|
*ucode = BUS_OOMERR;
|
|
|
|
break;
|
|
|
|
case KERN_OUT_OF_BOUNDS:
|
|
|
|
*signo = SIGBUS;
|
|
|
|
*ucode = BUS_OBJERR;
|
|
|
|
break;
|
|
|
|
case KERN_PROTECTION_FAILURE:
|
|
|
|
if (prot_fault_translation == 0) {
|
|
|
|
/*
|
|
|
|
* Autodetect. This check also covers
|
|
|
|
* the images without the ABI-tag ELF
|
|
|
|
* note.
|
|
|
|
*/
|
|
|
|
if (SV_CURPROC_ABI() == SV_ABI_FREEBSD &&
|
|
|
|
curproc->p_osrel >= P_OSREL_SIGSEGV) {
|
|
|
|
*signo = SIGSEGV;
|
|
|
|
*ucode = SEGV_ACCERR;
|
|
|
|
} else {
|
|
|
|
*signo = SIGBUS;
|
|
|
|
*ucode = UCODE_PAGEFLT;
|
|
|
|
}
|
|
|
|
} else if (prot_fault_translation == 1) {
|
|
|
|
/* Always compat mode. */
|
|
|
|
*signo = SIGBUS;
|
|
|
|
*ucode = UCODE_PAGEFLT;
|
|
|
|
} else {
|
|
|
|
/* Always SIGSEGV mode. */
|
|
|
|
*signo = SIGSEGV;
|
|
|
|
*ucode = SEGV_ACCERR;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
KASSERT(0, ("Unexpected Mach error %d from vm_fault()",
|
|
|
|
result));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2012-04-05 17:13:14 +00:00
|
|
|
return (result);
|
2010-12-20 22:49:31 +00:00
|
|
|
}
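The result-to-signal translation in vm_fault_trap() above can be summarized as a small pure function. This is a sketch under stated assumptions: the enum names are hypothetical stand-ins for the kernel's KERN_* and SIG* macros, the ucode values are left out, and the `modern_abi` argument stands in for the SV_CURPROC_ABI() and p_osrel check.

```c
/* Hypothetical stand-ins for the KERN_* result codes. */
enum fault_result {
	R_SUCCESS,
	R_FAILURE,
	R_INVALID_ADDRESS,
	R_RESOURCE_SHORTAGE,
	R_OUT_OF_BOUNDS,
	R_PROTECTION_FAILURE
};

/* Hypothetical stand-ins for "no signal", SIGSEGV and SIGBUS. */
enum fault_signal { FS_NONE, FS_SEGV, FS_BUS };

/*
 * Pick the signal for a fault result.  prot_fault_translation mirrors
 * the sysctl: 0 autodetects from the ABI, 1 forces the compat SIGBUS,
 * any other value forces SIGSEGV.
 */
static enum fault_signal
translate_fault(enum fault_result r, int prot_fault_translation,
    int modern_abi)
{
	switch (r) {
	case R_SUCCESS:
		return (FS_NONE);
	case R_FAILURE:
	case R_INVALID_ADDRESS:
		return (FS_SEGV);
	case R_RESOURCE_SHORTAGE:
	case R_OUT_OF_BOUNDS:
		return (FS_BUS);
	case R_PROTECTION_FAILURE:
		if (prot_fault_translation == 0)
			return (modern_abi ? FS_SEGV : FS_BUS);
		return (prot_fault_translation == 1 ? FS_BUS : FS_SEGV);
	}
	return (FS_NONE);
}
```

Only the protection-failure case depends on the sysctl; the other results always map to the same signal.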
|
|
|
|
|
2019-10-22 15:59:16 +00:00
|
|
|
static int
|
2020-01-20 22:49:52 +00:00
|
|
|
vm_fault_lock_vnode(struct faultstate *fs, bool objlocked)
|
2019-10-22 15:59:16 +00:00
|
|
|
{
|
|
|
|
struct vnode *vp;
|
|
|
|
int error, locked;
|
|
|
|
|
|
|
|
if (fs->object->type != OBJT_VNODE)
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
vp = fs->object->handle;
|
2019-10-23 07:36:26 +00:00
|
|
|
if (vp == fs->vp) {
|
|
|
|
ASSERT_VOP_LOCKED(vp, "saved vnode is not locked");
|
2019-10-22 15:59:16 +00:00
|
|
|
return (KERN_SUCCESS);
|
2019-10-23 07:36:26 +00:00
|
|
|
}
|
2019-10-22 15:59:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Perform an unlock in case the desired vnode changed while
|
|
|
|
* the map was unlocked during a retry.
|
|
|
|
*/
|
|
|
|
unlock_vp(fs);
|
|
|
|
|
|
|
|
locked = VOP_ISLOCKED(vp);
|
|
|
|
if (locked != LK_EXCLUSIVE)
|
|
|
|
locked = LK_SHARED;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We must not sleep acquiring the vnode lock while we have
|
|
|
|
* the page exclusive busied or the object's
|
|
|
|
* paging-in-progress count incremented. Otherwise, we could
|
|
|
|
* deadlock.
|
|
|
|
*/
|
|
|
|
error = vget(vp, locked | LK_CANRECURSE | LK_NOWAIT, curthread);
|
|
|
|
if (error == 0) {
|
|
|
|
fs->vp = vp;
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
}
|
|
|
|
|
|
|
|
vhold(vp);
|
2020-01-20 22:49:52 +00:00
|
|
|
if (objlocked)
|
|
|
|
unlock_and_deallocate(fs);
|
|
|
|
else
|
|
|
|
fault_deallocate(fs);
|
2019-10-22 15:59:16 +00:00
|
|
|
error = vget(vp, locked | LK_RETRY | LK_CANRECURSE, curthread);
|
|
|
|
vdrop(vp);
|
|
|
|
fs->vp = vp;
|
|
|
|
KASSERT(error == 0, ("vm_fault: vget failed %d", error));
|
|
|
|
return (KERN_RESOURCE_SHORTAGE);
|
|
|
|
}
|
|
|
|
|
2020-01-21 00:12:57 +00:00
|
|
|
/*
|
|
|
|
* Calculate the desired readahead. Handle drop-behind.
|
|
|
|
*
|
|
|
|
* Returns the number of readahead blocks to pass to the pager.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
vm_fault_readahead(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
int era, nera;
|
|
|
|
u_char behavior;
|
|
|
|
|
|
|
|
KASSERT(fs->lookup_still_valid, ("map unlocked"));
|
|
|
|
era = fs->entry->read_ahead;
|
|
|
|
behavior = vm_map_entry_behavior(fs->entry);
|
|
|
|
if (behavior == MAP_ENTRY_BEHAV_RANDOM) {
|
|
|
|
nera = 0;
|
|
|
|
} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
|
|
|
|
nera = VM_FAULT_READ_AHEAD_MAX;
|
|
|
|
if (fs->vaddr == fs->entry->next_read)
|
|
|
|
vm_fault_dontneed(fs, fs->vaddr, nera);
|
|
|
|
} else if (fs->vaddr == fs->entry->next_read) {
|
|
|
|
/*
|
|
|
|
* This is a sequential fault. Arithmetically
|
|
|
|
* increase the requested number of pages in
|
|
|
|
* the read-ahead window. The requested
|
|
|
|
* number of pages is "# of sequential faults
|
|
|
|
* x (read ahead min + 1) + read ahead min"
|
|
|
|
*/
|
|
|
|
nera = VM_FAULT_READ_AHEAD_MIN;
|
|
|
|
if (era > 0) {
|
|
|
|
nera += era + 1;
|
|
|
|
if (nera > VM_FAULT_READ_AHEAD_MAX)
|
|
|
|
nera = VM_FAULT_READ_AHEAD_MAX;
|
|
|
|
}
|
|
|
|
if (era == VM_FAULT_READ_AHEAD_MAX)
|
|
|
|
vm_fault_dontneed(fs, fs->vaddr, nera);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* This is a non-sequential fault.
|
|
|
|
*/
|
|
|
|
nera = 0;
|
|
|
|
}
|
|
|
|
if (era != nera) {
|
|
|
|
/*
|
|
|
|
* A read lock on the map suffices to update
|
|
|
|
* the read ahead count safely.
|
|
|
|
*/
|
|
|
|
fs->entry->read_ahead = nera;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (nera);
|
|
|
|
}
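The arithmetic growth of the read-ahead window in vm_fault_readahead() above can be tried out with a stand-alone sketch. It covers only the default-behavior branch (neither the RANDOM nor the SEQUENTIAL map entry behavior) and skips the vm_fault_dontneed() calls. The constant values 7 and 31 are assumptions chosen for illustration; the kernel defines its own VM_FAULT_READ_AHEAD_MIN and VM_FAULT_READ_AHEAD_MAX.

```c
/* Assumed illustrative values, not necessarily the kernel's. */
#define READ_AHEAD_MIN	7
#define READ_AHEAD_MAX	31

/*
 * Return the next read-ahead count given the previous count (era) and
 * whether this fault hit the predicted next address (sequential).
 */
static int
next_readahead(int era, int sequential)
{
	int nera;

	if (!sequential)
		return (0);		/* non-sequential fault: reset */
	nera = READ_AHEAD_MIN;
	if (era > 0) {
		nera += era + 1;	/* grow by (min + 1) per fault */
		if (nera > READ_AHEAD_MAX)
			nera = READ_AHEAD_MAX;
	}
	return (nera);
}
```

With these values, consecutive sequential faults request 7, 15, 23 and then 31 pages, after which the count stays at the maximum until a non-sequential fault resets it to 0.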
|
|
|
|
|
2020-01-23 05:05:39 +00:00
|
|
|
static int
|
|
|
|
vm_fault_lookup(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
int result;
|
|
|
|
|
|
|
|
KASSERT(!fs->lookup_still_valid,
|
|
|
|
("vm_fault_lookup: Map already locked."));
|
|
|
|
result = vm_map_lookup(&fs->map, fs->vaddr, fs->fault_type |
|
|
|
|
VM_PROT_FAULT_LOOKUP, &fs->entry, &fs->first_object,
|
|
|
|
&fs->first_pindex, &fs->prot, &fs->wired);
|
|
|
|
if (result != KERN_SUCCESS) {
|
|
|
|
unlock_vp(fs);
|
|
|
|
return (result);
|
|
|
|
}
|
|
|
|
|
|
|
|
fs->map_generation = fs->map->timestamp;
|
|
|
|
|
|
|
|
if (fs->entry->eflags & MAP_ENTRY_NOFAULT) {
|
|
|
|
panic("%s: fault on nofault entry, addr: %#lx",
|
|
|
|
__func__, (u_long)fs->vaddr);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fs->entry->eflags & MAP_ENTRY_IN_TRANSITION &&
|
|
|
|
fs->entry->wiring_thread != curthread) {
|
|
|
|
vm_map_unlock_read(fs->map);
|
|
|
|
vm_map_lock(fs->map);
|
|
|
|
if (vm_map_lookup_entry(fs->map, fs->vaddr, &fs->entry) &&
|
|
|
|
(fs->entry->eflags & MAP_ENTRY_IN_TRANSITION)) {
|
|
|
|
unlock_vp(fs);
|
|
|
|
fs->entry->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
|
|
|
|
vm_map_unlock_and_wait(fs->map, 0);
|
|
|
|
} else
|
|
|
|
vm_map_unlock(fs->map);
|
|
|
|
return (KERN_RESOURCE_SHORTAGE);
|
|
|
|
}
|
|
|
|
|
|
|
|
MPASS((fs->entry->eflags & MAP_ENTRY_GUARD) == 0);
|
|
|
|
|
|
|
|
if (fs->wired)
|
|
|
|
fs->fault_type = fs->prot | (fs->fault_type & VM_PROT_COPY);
|
|
|
|
else
|
|
|
|
KASSERT((fs->fault_flags & VM_FAULT_WIRE) == 0,
|
|
|
|
("!fs->wired && VM_FAULT_WIRE"));
|
|
|
|
fs->lookup_still_valid = true;
|
|
|
|
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
}
|
|
|
|
|
2020-01-23 05:07:01 +00:00
|
|
|
static int
|
|
|
|
vm_fault_relookup(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
vm_object_t retry_object;
|
|
|
|
vm_pindex_t retry_pindex;
|
|
|
|
vm_prot_t retry_prot;
|
|
|
|
int result;
|
|
|
|
|
|
|
|
if (!vm_map_trylock_read(fs->map))
|
|
|
|
return (KERN_RESTART);
|
|
|
|
|
|
|
|
fs->lookup_still_valid = true;
|
|
|
|
if (fs->map->timestamp == fs->map_generation)
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
|
|
|
|
result = vm_map_lookup_locked(&fs->map, fs->vaddr, fs->fault_type,
|
|
|
|
&fs->entry, &retry_object, &retry_pindex, &retry_prot,
|
|
|
|
&fs->wired);
|
|
|
|
if (result != KERN_SUCCESS) {
|
|
|
|
/*
|
|
|
|
* If retry of map lookup would have blocked then
|
|
|
|
* retry fault from start.
|
|
|
|
*/
|
|
|
|
if (result == KERN_FAILURE)
|
|
|
|
return (KERN_RESTART);
|
|
|
|
return (result);
|
|
|
|
}
|
|
|
|
if (retry_object != fs->first_object ||
|
|
|
|
retry_pindex != fs->first_pindex)
|
|
|
|
return (KERN_RESTART);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether the protection has changed or the object has
|
|
|
|
* been copied while we left the map unlocked. Changing from
|
|
|
|
* read to write permission is OK - we leave the page
|
|
|
|
* write-protected, and catch the write fault. Changing from
|
|
|
|
* write to read permission means that we can't mark the page
|
|
|
|
* write-enabled after all.
|
|
|
|
*/
|
|
|
|
fs->prot &= retry_prot;
|
|
|
|
fs->fault_type &= retry_prot;
|
|
|
|
if (fs->prot == 0)
|
|
|
|
return (KERN_RESTART);
|
|
|
|
|
|
|
|
/* Reassert because wired may have changed. */
|
|
|
|
KASSERT(fs->wired || (fs->fault_flags & VM_FAULT_WIRE) == 0,
|
|
|
|
("!wired && VM_FAULT_WIRE"));
|
|
|
|
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
}
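The protection narrowing at the end of vm_fault_relookup() above amounts to a bitwise intersection. The sketch below isolates that step; the P_* bit names and `narrow_prot()` are hypothetical stand-ins for the VM_PROT_* flags and the inline code, and the return value models the choice between continuing and returning KERN_RESTART.

```c
/* Hypothetical stand-ins for the VM_PROT_* bits. */
#define P_READ	0x1u
#define P_WRITE	0x2u
#define P_EXEC	0x4u

/*
 * Intersect the cached protection and fault type with the protection
 * found by the repeated lookup.  Returns 1 if the fault may proceed,
 * or 0 if no permission is left and the fault must be restarted.
 */
static int
narrow_prot(unsigned *prot, unsigned *fault_type, unsigned retry_prot)
{
	*prot &= retry_prot;
	*fault_type &= retry_prot;
	return (*prot != 0);
}
```

If the entry was downgraded from read/write to read-only while the map was unlocked, the write bit drops out of both values, so the page cannot be mapped write-enabled.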
|
|
|
|
|
2020-01-23 05:11:01 +00:00
|
|
|
static void
|
|
|
|
vm_fault_cow(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
bool is_first_object_locked;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This allows pages to be virtually copied from a backing_object
|
|
|
|
* into the first_object, where the backing object has no other
|
|
|
|
* refs to it, and cannot gain any more refs. Instead of a bcopy,
|
|
|
|
* we just move the page from the backing object to the first
|
|
|
|
* object. Note that we must mark the page dirty in the first
|
|
|
|
* object so that it will go out to swap when needed.
|
|
|
|
*/
|
|
|
|
is_first_object_locked = false;
|
|
|
|
if (
|
|
|
|
/*
|
|
|
|
* Only one shadow object and no other refs.
|
|
|
|
*/
|
|
|
|
fs->object->shadow_count == 1 && fs->object->ref_count == 1 &&
|
|
|
|
/*
|
|
|
|
* No other ways to look the object up
|
|
|
|
*/
|
|
|
|
fs->object->handle == NULL && (fs->object->flags & OBJ_ANON) != 0 &&
|
|
|
|
/*
|
|
|
|
* We don't chase down the shadow chain and we can acquire locks.
|
|
|
|
*/
|
|
|
|
(is_first_object_locked = VM_OBJECT_TRYWLOCK(fs->first_object)) &&
|
|
|
|
fs->object == fs->first_object->backing_object &&
|
|
|
|
VM_OBJECT_TRYWLOCK(fs->object)) {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove but keep xbusy for replace. fs->m is moved into
|
|
|
|
* fs->first_object and left busy while fs->first_m is
|
|
|
|
* conditionally freed.
|
|
|
|
*/
|
|
|
|
vm_page_remove_xbusy(fs->m);
|
|
|
|
vm_page_replace(fs->m, fs->first_object, fs->first_pindex,
|
|
|
|
fs->first_m);
|
|
|
|
vm_page_dirty(fs->m);
|
|
|
|
#if VM_NRESERVLEVEL > 0
|
|
|
|
/*
|
|
|
|
* Rename the reservation.
|
|
|
|
*/
|
|
|
|
vm_reserv_rename(fs->m, fs->first_object, fs->object,
|
|
|
|
OFF_TO_IDX(fs->first_object->backing_object_offset));
|
|
|
|
#endif
|
|
|
|
VM_OBJECT_WUNLOCK(fs->object);
|
|
|
|
VM_OBJECT_WUNLOCK(fs->first_object);
|
|
|
|
fs->first_m = fs->m;
|
|
|
|
fs->m = NULL;
|
|
|
|
VM_CNT_INC(v_cow_optim);
|
|
|
|
} else {
|
|
|
|
if (is_first_object_locked)
|
|
|
|
VM_OBJECT_WUNLOCK(fs->first_object);
|
|
|
|
/*
|
|
|
|
* Oh, well, let's copy it.
|
|
|
|
*/
|
|
|
|
pmap_copy_page(fs->m, fs->first_m);
|
|
|
|
vm_page_valid(fs->first_m);
|
|
|
|
if (fs->wired && (fs->fault_flags & VM_FAULT_WIRE) == 0) {
|
|
|
|
vm_page_wire(fs->first_m);
|
|
|
|
vm_page_unwire(fs->m, PQ_INACTIVE);
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Save the cow page to be released after
|
|
|
|
* pmap_enter is complete.
|
|
|
|
*/
|
|
|
|
fs->m_cow = fs->m;
|
|
|
|
fs->m = NULL;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* fs->object != fs->first_object due to above
|
|
|
|
* conditional
|
|
|
|
*/
|
|
|
|
vm_object_pip_wakeup(fs->object);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only use the new page below...
|
|
|
|
*/
|
|
|
|
fs->object = fs->first_object;
|
|
|
|
fs->pindex = fs->first_pindex;
|
|
|
|
fs->m = fs->first_m;
|
|
|
|
VM_CNT_INC(v_cow_faults);
|
|
|
|
curthread->td_cow++;
|
|
|
|
}
|
|
|
|
|
2020-01-23 05:14:41 +00:00
|
|
|
static bool
|
|
|
|
vm_fault_next(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
vm_object_t next_object;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The requested page does not exist at this object/
|
|
|
|
* offset. Remove the invalid page from the object,
|
|
|
|
* waking up anyone waiting for it, and continue on to
|
|
|
|
* the next object. However, if this is the top-level
|
|
|
|
* object, we must leave the busy page in place to
|
|
|
|
* prevent another process from rushing past us, and
|
|
|
|
* inserting the page in that object at the same time
|
|
|
|
* that we are.
|
|
|
|
*/
|
|
|
|
if (fs->object == fs->first_object) {
|
|
|
|
fs->first_m = fs->m;
|
|
|
|
fs->m = NULL;
|
|
|
|
} else
|
|
|
|
fault_page_free(&fs->m);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move on to the next object. Lock the next object before
|
|
|
|
* unlocking the current one.
|
|
|
|
*/
|
|
|
|
VM_OBJECT_ASSERT_WLOCKED(fs->object);
|
|
|
|
next_object = fs->object->backing_object;
|
2020-01-23 05:23:37 +00:00
|
|
|
if (next_object == NULL)
|
2020-01-23 05:14:41 +00:00
|
|
|
return (false);
|
|
|
|
MPASS(fs->first_m != NULL);
|
|
|
|
KASSERT(fs->object != next_object, ("object loop %p", next_object));
|
|
|
|
VM_OBJECT_WLOCK(next_object);
|
|
|
|
vm_object_pip_add(next_object, 1);
|
|
|
|
if (fs->object != fs->first_object)
|
|
|
|
vm_object_pip_wakeup(fs->object);
|
|
|
|
fs->pindex += OFF_TO_IDX(fs->object->backing_object_offset);
|
|
|
|
VM_OBJECT_WUNLOCK(fs->object);
|
|
|
|
fs->object = next_object;
|
|
|
|
|
|
|
|
return (true);
|
|
|
|
}
|
|
|
|
|
2020-01-23 05:23:37 +00:00
|
|
|
static void
|
|
|
|
vm_fault_zerofill(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there's no object left, fill the page in the top
|
|
|
|
* object with zeros.
|
|
|
|
*/
|
|
|
|
if (fs->object != fs->first_object) {
|
|
|
|
vm_object_pip_wakeup(fs->object);
|
|
|
|
fs->object = fs->first_object;
|
|
|
|
fs->pindex = fs->first_pindex;
|
|
|
|
}
|
|
|
|
MPASS(fs->first_m != NULL);
|
|
|
|
MPASS(fs->m == NULL);
|
|
|
|
fs->m = fs->first_m;
|
|
|
|
fs->first_m = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Zero the page if necessary and mark it valid.
|
|
|
|
*/
|
|
|
|
if ((fs->m->flags & PG_ZERO) == 0) {
|
|
|
|
pmap_zero_page(fs->m);
|
|
|
|
} else {
|
|
|
|
VM_CNT_INC(v_ozfod);
|
|
|
|
}
|
|
|
|
VM_CNT_INC(v_zfod);
|
|
|
|
vm_page_valid(fs->m);
|
|
|
|
}
|
|
|
|
|
2020-01-23 05:19:39 +00:00
|
|
|
/*
|
|
|
|
* Allocate a page directly or via the object populate method.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
vm_fault_allocate(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
struct domainset *dset;
|
|
|
|
int alloc_req;
|
|
|
|
int rv;
|
|
|
|
|
|
|
|
|
|
|
|
if ((fs->object->flags & OBJ_SIZEVNLOCK) != 0) {
|
|
|
|
rv = vm_fault_lock_vnode(fs, true);
|
|
|
|
MPASS(rv == KERN_SUCCESS || rv == KERN_RESOURCE_SHORTAGE);
|
|
|
|
if (rv == KERN_RESOURCE_SHORTAGE)
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fs->pindex >= fs->object->size)
|
|
|
|
return (KERN_OUT_OF_BOUNDS);
|
|
|
|
|
|
|
|
if (fs->object == fs->first_object &&
|
|
|
|
(fs->first_object->flags & OBJ_POPULATE) != 0 &&
|
|
|
|
fs->first_object->shadow_count == 0) {
|
|
|
|
rv = vm_fault_populate(fs);
|
|
|
|
switch (rv) {
|
|
|
|
case KERN_SUCCESS:
|
|
|
|
case KERN_FAILURE:
|
|
|
|
case KERN_RESTART:
|
|
|
|
return (rv);
|
|
|
|
case KERN_NOT_RECEIVER:
|
|
|
|
/*
|
|
|
|
* Pager's populate() method
|
|
|
|
* returned VM_PAGER_BAD.
|
|
|
|
*/
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
panic("inconsistent return codes");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a new page for this object/offset pair.
|
|
|
|
*
|
|
|
|
* An unlocked read of p_flag is harmless. At worst, P_KILLED
|
|
|
|
* might not be observed here, and the allocation can fail, causing
|
|
|
|
* a restart and a fresh read of p_flag.
|
|
|
|
*/
|
|
|
|
dset = fs->object->domain.dr_policy;
|
|
|
|
if (dset == NULL)
|
|
|
|
dset = curthread->td_domain.dr_policy;
|
|
|
|
if (!vm_page_count_severe_set(&dset->ds_mask) || P_KILLED(curproc)) {
|
|
|
|
#if VM_NRESERVLEVEL > 0
|
|
|
|
vm_object_color(fs->object, atop(fs->vaddr) - fs->pindex);
|
|
|
|
#endif
|
|
|
|
alloc_req = P_KILLED(curproc) ?
|
|
|
|
VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL;
|
|
|
|
if (fs->object->type != OBJT_VNODE &&
|
|
|
|
fs->object->backing_object == NULL)
|
|
|
|
alloc_req |= VM_ALLOC_ZERO;
|
|
|
|
fs->m = vm_page_alloc(fs->object, fs->pindex, alloc_req);
|
|
|
|
}
|
|
|
|
if (fs->m == NULL) {
|
|
|
|
unlock_and_deallocate(fs);
|
|
|
|
if (vm_pfault_oom_attempts < 0 ||
|
|
|
|
fs->oom < vm_pfault_oom_attempts) {
|
|
|
|
fs->oom++;
|
|
|
|
vm_waitpfault(dset, vm_pfault_oom_wait * hz);
|
2020-01-29 12:02:47 +00:00
|
|
|
} else {
|
|
|
|
if (bootverbose)
|
|
|
|
printf(
|
|
|
|
"proc %d (%s) failed to alloc page on fault, starting OOM\n",
|
|
|
|
curproc->p_pid, curproc->p_comm);
|
|
|
|
vm_pageout_oom(VM_OOM_MEM_PF);
|
|
|
|
fs->oom = 0;
|
2020-01-23 05:19:39 +00:00
|
|
|
}
|
|
|
|
return (KERN_RESOURCE_SHORTAGE);
|
|
|
|
}
|
|
|
|
fs->oom = 0;
|
|
|
|
|
|
|
|
return (KERN_NOT_RECEIVER);
|
|
|
|
}
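The allocation-failure policy at the end of vm_fault_allocate() above is a bounded retry counter. The following sketch models only the counting; `struct oom_state` and `should_wait()` are hypothetical names, `max_attempts` plays the role of vm_pfault_oom_attempts, and the calls to vm_waitpfault() and vm_pageout_oom() are represented by the return value.

```c
/* Hypothetical holder for the per-fault retry counter (fs->oom). */
struct oom_state {
	int oom;
};

/*
 * Decide what to do after a failed page allocation.  Returns 1 when
 * the caller should wait for free pages and retry, or 0 when the retry
 * budget is exhausted and OOM handling should start.  A negative
 * max_attempts means "wait forever".
 */
static int
should_wait(struct oom_state *st, int max_attempts)
{
	if (max_attempts < 0 || st->oom < max_attempts) {
		st->oom++;
		return (1);
	}
	st->oom = 0;	/* start counting afresh after OOM handling */
	return (0);
}
```

In both outcomes the real function returns KERN_RESOURCE_SHORTAGE, so the difference lies only in whether the thread sleeps or the OOM path runs before that return.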
|
2020-01-23 05:18:00 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Call the pager to retrieve the page if there is a chance
|
|
|
|
* that the pager has it, and potentially retrieve additional
|
|
|
|
* pages at the same time.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
vm_fault_getpages(struct faultstate *fs, int nera, int *behindp, int *aheadp)
|
|
|
|
{
|
|
|
|
vm_offset_t e_end, e_start;
|
|
|
|
int ahead, behind, cluster_offset, rv;
|
|
|
|
u_char behavior;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prepare for unlocking the map. Save the map
|
|
|
|
* entry's start and end addresses, which are used to
|
|
|
|
* optimize the size of the pager operation below.
|
|
|
|
* Even if the map entry's addresses change after
|
|
|
|
* unlocking the map, using the saved addresses is
|
|
|
|
* safe.
|
|
|
|
*/
|
|
|
|
e_start = fs->entry->start;
|
|
|
|
e_end = fs->entry->end;
|
|
|
|
behavior = vm_map_entry_behavior(fs->entry);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Release the map lock before locking the vnode or
|
|
|
|
* sleeping in the pager. (If the current object has
|
|
|
|
* a shadow, then an earlier iteration of this loop
|
|
|
|
* may have already unlocked the map.)
|
|
|
|
*/
|
|
|
|
unlock_map(fs);
|
|
|
|
|
|
|
|
rv = vm_fault_lock_vnode(fs, false);
|
|
|
|
MPASS(rv == KERN_SUCCESS || rv == KERN_RESOURCE_SHORTAGE);
|
|
|
|
if (rv == KERN_RESOURCE_SHORTAGE)
|
|
|
|
return (rv);
|
|
|
|
KASSERT(fs->vp == NULL || !fs->map->system_map,
|
|
|
|
("vm_fault: vnode-backed object mapped by system map"));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Page in the requested page and hint to the pager
|
|
|
|
* that it may bring in surrounding pages.
|
|
|
|
*/
|
|
|
|
if (nera == -1 || behavior == MAP_ENTRY_BEHAV_RANDOM ||
|
|
|
|
P_KILLED(curproc)) {
|
|
|
|
behind = 0;
|
|
|
|
ahead = 0;
|
|
|
|
} else {
|
|
|
|
/* Is this a sequential fault? */
|
|
|
|
if (nera > 0) {
|
|
|
|
behind = 0;
|
|
|
|
ahead = nera;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Request a cluster of pages that is
|
|
|
|
* aligned to a VM_FAULT_READ_DEFAULT
|
|
|
|
* page offset boundary within the
|
|
|
|
* object. Alignment to a page offset
|
|
|
|
* boundary is more likely to coincide
|
|
|
|
* with the underlying file system
|
|
|
|
* block than alignment to a virtual
|
|
|
|
* address boundary.
|
|
|
|
*/
|
|
|
|
cluster_offset = fs->pindex % VM_FAULT_READ_DEFAULT;
|
|
|
|
behind = ulmin(cluster_offset,
|
|
|
|
atop(fs->vaddr - e_start));
|
|
|
|
ahead = VM_FAULT_READ_DEFAULT - 1 - cluster_offset;
|
|
|
|
}
|
|
|
|
ahead = ulmin(ahead, atop(e_end - fs->vaddr) - 1);
|
|
|
|
}
|
|
|
|
*behindp = behind;
|
|
|
|
*aheadp = ahead;
|
|
|
|
rv = vm_pager_get_pages(fs->object, &fs->m, 1, behindp, aheadp);
|
|
|
|
if (rv == VM_PAGER_OK)
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
if (rv == VM_PAGER_ERROR)
|
|
|
|
printf("vm_fault: pager read error, pid %d (%s)\n",
|
|
|
|
curproc->p_pid, curproc->p_comm);
|
|
|
|
/*
|
|
|
|
* If an I/O error occurred or the requested page was
|
|
|
|
* outside the range of the pager, clean up and return
|
|
|
|
* an error.
|
|
|
|
*/
|
|
|
|
if (rv == VM_PAGER_ERROR || rv == VM_PAGER_BAD)
|
|
|
|
return (KERN_OUT_OF_BOUNDS);
|
|
|
|
return (KERN_NOT_RECEIVER);
|
|
|
|
}
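/*
 * Illustrative sketch, not part of vm_fault.c: the read-behind/read-ahead
 * window computed in vm_fault_getpages() can be reproduced with plain
 * integers. All names prefixed with sketch_/SKETCH_ are hypothetical.
 * SKETCH_READ_DEFAULT stands in for VM_FAULT_READ_DEFAULT and its value
 * (16) is an assumption. The MAP_ENTRY_BEHAV_RANDOM and P_KILLED checks
 * are folded into the nera == -1 case. The pager may still shrink the
 * window it is handed.
 */

```c
#include <assert.h>

/* Assumed cluster size in pages; stands in for VM_FAULT_READ_DEFAULT. */
#define SKETCH_READ_DEFAULT 16UL

struct sketch_cluster {
	unsigned long behind;	/* pages requested before the faulting page */
	unsigned long ahead;	/* pages requested after the faulting page */
};

static unsigned long
sketch_min(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

/*
 * pindex:       page index of the faulting page within the object
 * pages_before: pages between the map entry start and the faulting page,
 *               i.e. atop(vaddr - e_start)
 * pages_to_end: pages from the faulting page to the entry end, counting
 *               the faulting page, i.e. atop(e_end - vaddr)
 * nera:         -1 = no clustering, > 0 = sequential read-ahead count,
 *               0 = default aligned cluster
 */
static struct sketch_cluster
sketch_cluster_bounds(unsigned long pindex, unsigned long pages_before,
    unsigned long pages_to_end, int nera)
{
	struct sketch_cluster c = { 0, 0 };
	unsigned long off;

	assert(pages_to_end >= 1);
	if (nera == -1)
		return (c);
	if (nera > 0) {
		/* Sequential fault: read ahead only. */
		c.ahead = (unsigned long)nera;
	} else {
		/* Align the cluster to a page-offset boundary in the object. */
		off = pindex % SKETCH_READ_DEFAULT;
		c.behind = sketch_min(off, pages_before);
		c.ahead = SKETCH_READ_DEFAULT - 1 - off;
	}
	/* Never read past the end of the map entry. */
	c.ahead = sketch_min(c.ahead, pages_to_end - 1);
	return (c);
}
```

/*
 * For example, with pindex 21 the offset into the cluster is 5; if only
 * 3 pages of the entry precede the fault, the window is 3 behind and
 * 10 ahead.
 */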
|
|
|
|
|
2019-12-22 04:21:16 +00:00
|
|
|
/*
|
|
|
|
* Wait/Retry if the page is busy. We have to do this if the page is
|
|
|
|
* either exclusive or shared busy because the vm_pager may be using
|
|
|
|
* read busy for pageouts (and even pageins if it is the vnode pager),
|
|
|
|
* and we could end up trying to pagein and pageout the same page
|
|
|
|
* simultaneously.
|
|
|
|
*
|
|
|
|
* We can theoretically allow the busy case on a read fault if the page
|
|
|
|
* is marked valid, but since such pages are typically already pmap'd,
|
|
|
|
* putting that special case in might be more effort than it is worth.
|
|
|
|
* We cannot under any circumstances mess around with a shared busied
|
|
|
|
* page except, perhaps, to pmap it.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vm_fault_busy_sleep(struct faultstate *fs)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Reference the page before unlocking and
|
|
|
|
* sleeping so that the page daemon is less
|
|
|
|
* likely to reclaim it.
|
|
|
|
*/
|
|
|
|
vm_page_aflag_set(fs->m, PGA_REFERENCED);
|
|
|
|
if (fs->object != fs->first_object) {
|
|
|
|
fault_page_release(&fs->first_m);
|
|
|
|
vm_object_pip_wakeup(fs->first_object);
|
|
|
|
}
|
|
|
|
vm_object_pip_wakeup(fs->object);
|
|
|
|
unlock_map(fs);
|
|
|
|
if (fs->m == vm_page_lookup(fs->object, fs->pindex))
|
2019-12-24 18:38:06 +00:00
|
|
|
vm_page_busy_sleep(fs->m, "vmpfw", false);
|
|
|
|
else
|
|
|
|
VM_OBJECT_WUNLOCK(fs->object);
|
2019-12-22 04:21:16 +00:00
|
|
|
VM_CNT_INC(v_intrans);
|
|
|
|
vm_object_deallocate(fs->first_object);
|
|
|
|
}
|
|
|
|
|
2010-12-20 22:49:31 +00:00
|
|
|
int
|
2019-09-27 18:43:36 +00:00
|
|
|
vm_fault(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
|
2010-12-20 22:49:31 +00:00
|
|
|
int fault_flags, vm_page_t *m_hold)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1998-03-07 20:45:47 +00:00
|
|
|
struct faultstate fs;
|
2020-01-23 05:19:39 +00:00
|
|
|
int ahead, behind, faultcount;
|
|
|
|
int nera, result, rv;
|
2020-01-23 05:11:01 +00:00
|
|
|
bool dead, hardfault;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
|
|
|
VM_CNT_INC(v_vm_faults);
|
2019-10-13 06:56:45 +00:00
|
|
|
|
|
|
|
if ((curthread->td_pflags & TDP_NOFAULTING) != 0)
|
|
|
|
return (KERN_PROTECTION_FAILURE);
|
|
|
|
|
2009-02-08 20:23:46 +00:00
|
|
|
fs.vp = NULL;
|
2020-01-21 00:12:57 +00:00
|
|
|
fs.vaddr = vaddr;
|
2020-01-23 05:03:34 +00:00
|
|
|
fs.m_hold = m_hold;
|
|
|
|
fs.fault_flags = fault_flags;
|
2020-01-23 05:05:39 +00:00
|
|
|
fs.map = map;
|
|
|
|
fs.lookup_still_valid = false;
|
2020-01-23 05:19:39 +00:00
|
|
|
fs.oom = 0;
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strengthening the promises on the busy state of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
faultcount = 0;
|
2016-07-18 04:20:26 +00:00
|
|
|
nera = -1;
|
2016-10-29 19:22:38 +00:00
|
|
|
hardfault = false;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2019-08-16 09:43:49 +00:00
|
|
|
RetryFault:
|
2020-01-23 05:03:34 +00:00
|
|
|
fs.fault_type = fault_type;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Find the backing store object and offset into it to begin the
|
|
|
|
* search.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2020-01-23 05:05:39 +00:00
|
|
|
result = vm_fault_lookup(&fs);
|
2002-04-19 04:20:31 +00:00
|
|
|
if (result != KERN_SUCCESS) {
|
2020-01-23 05:05:39 +00:00
|
|
|
if (result == KERN_RESOURCE_SHORTAGE)
|
|
|
|
goto RetryFault;
|
2009-11-18 18:05:54 +00:00
|
|
|
return (result);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1995-04-09 06:03:56 +00:00
|
|
|
|
2016-07-20 17:20:22 +00:00
|
|
|
/*
|
|
|
|
* Try to avoid lock contention on the top-level object through
|
|
|
|
* special-case handling of some types of page faults, specifically,
|
2019-10-29 21:06:34 +00:00
|
|
|
* those that are mapping an existing page from the top-level object.
|
|
|
|
* Under this condition, a read lock on the object suffices, allowing
|
|
|
|
* multiple page faults of a similar type to run in parallel.
|
2016-07-20 17:20:22 +00:00
|
|
|
*/
|
Implement 'fast path' for the vm page fault handler. Or, it could be
called a scalable path. When several preconditions hold, the vm
object lock for the object containing the faulted page is taken in
read mode, instead of write, which allows parallel faults processing
in the region.
Namely, the fast path is taken when the faulted page already exists
and does not need copy on write, is already fully valid, and not busy.
For technical reasons, fast path is avoided when the fault is the
first write on the vnode object, or when the fault is for wiring or
debugger read or write.
On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag,
since object lock is kept. Pmap might fail to create the entry, in
which case the fallback to slow path is performed.
Reviewed by: alc
Tested by: pho (previous version)
Hardware provided and hosted by: The FreeBSD Foundation and
Sentex Data Communications
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
2014-08-15 07:30:14 +00:00
|
|
|
if (fs.vp == NULL /* avoid locked vnode leak */ &&
|
2020-01-23 05:03:34 +00:00
|
|
|
(fs.fault_flags & (VM_FAULT_WIRE | VM_FAULT_DIRTY)) == 0) {
|
2014-08-15 07:30:14 +00:00
|
|
|
VM_OBJECT_RLOCK(fs.first_object);
|
2020-01-23 05:03:34 +00:00
|
|
|
rv = vm_fault_soft_fast(&fs);
|
2019-10-29 21:06:34 +00:00
|
|
|
if (rv == KERN_SUCCESS)
|
|
|
|
return (rv);
|
2014-08-15 07:30:14 +00:00
|
|
|
if (!VM_OBJECT_TRYUPGRADE(fs.first_object)) {
|
|
|
|
VM_OBJECT_RUNLOCK(fs.first_object);
|
|
|
|
VM_OBJECT_WLOCK(fs.first_object);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
VM_OBJECT_WLOCK(fs.first_object);
|
|
|
|
}
|
|
|
|
|
Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.
1998-01-06 05:26:17 +00:00
|
|
|
/*
|
|
|
|
* Make a reference to this object to prevent its disposal while we
|
|
|
|
* are messing with it. Once we have the reference, the map is free
|
|
|
|
* to be diddled. Since objects reference their shadows (and copies),
|
|
|
|
* they will stay around as well.
|
2001-11-09 21:34:45 +00:00
|
|
|
*
|
|
|
|
* Bump the paging-in-progress count to prevent size changes (e.g.
|
2016-11-01 17:11:10 +00:00
|
|
|
* truncation operations) during I/O.
|
1998-01-06 05:26:17 +00:00
|
|
|
*/
|
2003-12-26 23:33:37 +00:00
|
|
|
vm_object_reference_locked(fs.first_object);
|
1998-08-06 08:33:19 +00:00
|
|
|
vm_object_pip_add(fs.first_object, 1);
|
1998-01-06 05:26:17 +00:00
|
|
|
|
2020-01-17 03:44:04 +00:00
|
|
|
fs.m_cow = fs.m = fs.first_m = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Search for the page at object/offset.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1998-03-07 20:45:47 +00:00
|
|
|
fs.object = fs.first_object;
|
|
|
|
fs.pindex = fs.first_pindex;
|
1994-05-24 10:09:53 +00:00
|
|
|
while (TRUE) {
|
2019-12-15 04:08:24 +00:00
|
|
|
KASSERT(fs.m == NULL,
|
|
|
|
("page still set %p at loop start", fs.m));
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
If the vm_fault() handler raced with the vm_object_collapse()
sleepable scan, iteration over the shadow chain looking for a page
could find an OBJ_DEAD object. Such state of the mapping is only
transient, the dead object will be terminated and removed from the
chain shortly. We must not return KERN_PROTECTION_FAILURE unless the
object type is changed to OBJT_DEAD in the chain, indicating that
paging on this address is really impossible. Returning
KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered
to processes, or kernel accesses to UVA spuriously failing with
EFAULT.
If the object with OBJ_DEAD flag is found, only return
KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD.
Otherwise, sleep a tick and retry the fault handling.
Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by
waiting until the paging on this object is finished. But to do so, we
need to reference the dead object, while vm_object_collapse() insists
on owning the final reference on the collapsed object. This could be
fixed by e.g. changing the assert to shared reference release between
vm_fault() and vm_object_collapse(), but it seems to be too much
complications for rare boundary condition.
PR: 204426
Tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
X-Differential revision: https://reviews.freebsd.org/D6085
MFC after: 2 weeks
Approved by: re (gjb)
2016-06-27 21:54:19 +00:00
|
|
|
* If the object is marked for imminent termination,
|
|
|
|
* we retry here, since the collapse pass has raced
|
|
|
|
* with us.  Otherwise, if we see a terminally dead
|
|
|
|
* object, return failure.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2016-06-27 21:54:19 +00:00
|
|
|
if ((fs.object->flags & OBJ_DEAD) != 0) {
|
|
|
|
dead = fs.object->type == OBJT_DEAD;
|
1998-03-07 20:45:47 +00:00
|
|
|
unlock_and_deallocate(&fs);
|
2016-06-27 21:54:19 +00:00
|
|
|
if (dead)
|
|
|
|
return (KERN_PROTECTION_FAILURE);
|
|
|
|
pause("vmf_de", 1);
|
|
|
|
goto RetryFault;
|
1998-01-17 09:17:02 +00:00
|
|
|
}
|
1999-01-21 08:29:12 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* See if the page is resident.
|
|
|
|
*/
|
1998-03-07 20:45:47 +00:00
|
|
|
fs.m = vm_page_lookup(fs.object, fs.pindex);
|
|
|
|
if (fs.m != NULL) {
|
2019-10-15 03:35:11 +00:00
|
|
|
if (vm_page_tryxbusy(fs.m) == 0) {
|
2019-12-22 04:21:16 +00:00
|
|
|
vm_fault_busy_sleep(&fs);
|
1994-05-24 10:09:53 +00:00
|
|
|
goto RetryFault;
|
|
|
|
}
|
1999-01-23 06:00:27 +00:00
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
2019-10-15 03:35:11 +00:00
|
|
|
* The page is marked busy for other processes and the
|
2020-01-23 05:19:39 +00:00
|
|
|
* pagedaemon. If it still is completely valid we
|
|
|
|
* are done.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2020-01-23 05:19:39 +00:00
|
|
|
if (vm_page_all_valid(fs.m)) {
|
|
|
|
VM_OBJECT_WUNLOCK(fs.object);
|
|
|
|
break; /* break to PAGE HAS BEEN FOUND. */
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2020-01-20 22:49:52 +00:00
|
|
|
VM_OBJECT_ASSERT_WLOCKED(fs.object);
|
1999-01-21 08:29:12 +00:00
|
|
|
|
|
|
|
/*
|
2016-05-23 16:59:05 +00:00
|
|
|
* Page is not resident. If the pager might contain the page
|
|
|
|
* or this is the beginning of the search, allocate a new
|
|
|
|
* page. (Default objects are zero-fill, so there is no real
|
|
|
|
* pager for them.)
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2020-01-23 05:19:39 +00:00
|
|
|
if (fs.m == NULL && (fs.object->type != OBJT_DEFAULT ||
|
|
|
|
fs.object == fs.first_object)) {
|
|
|
|
rv = vm_fault_allocate(&fs);
|
|
|
|
switch (rv) {
|
|
|
|
case KERN_RESTART:
|
1998-03-07 20:45:47 +00:00
|
|
|
unlock_and_deallocate(&fs);
|
2020-01-23 05:19:39 +00:00
|
|
|
/* FALLTHROUGH */
|
|
|
|
case KERN_RESOURCE_SHORTAGE:
|
1994-05-24 10:09:53 +00:00
|
|
|
goto RetryFault;
|
2020-01-23 05:19:39 +00:00
|
|
|
case KERN_SUCCESS:
|
|
|
|
case KERN_FAILURE:
|
|
|
|
case KERN_OUT_OF_BOUNDS:
|
|
|
|
unlock_and_deallocate(&fs);
|
|
|
|
return (rv);
|
|
|
|
case KERN_NOT_RECEIVER:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
panic("vm_fault: Unhandled rv %d", rv);
|
2016-11-15 18:22:50 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1998-01-17 09:17:02 +00:00
|
|
|
|
2020-01-20 22:49:52 +00:00
|
|
|
/*
|
|
|
|
* Default objects have no pager so no exclusive busy exists
|
|
|
|
* to protect this page in the chain. Skip to the next
|
|
|
|
* object without dropping the lock to preserve atomicity of
|
|
|
|
* shadow faults.
|
|
|
|
*/
|
2020-01-23 05:22:02 +00:00
|
|
|
if (fs.object->type != OBJT_DEFAULT) {
|
|
|
|
/*
|
|
|
|
* At this point, we have either allocated a new page
|
|
|
|
* or found an existing page that is only partially
|
|
|
|
* valid.
|
|
|
|
*
|
|
|
|
* We hold a reference on the current object and the
|
|
|
|
* page is exclusive busied. The exclusive busy
|
|
|
|
* prevents simultaneous faults and collapses while
|
|
|
|
* the object lock is dropped.
|
|
|
|
*/
|
|
|
|
VM_OBJECT_WUNLOCK(fs.object);
|
2020-01-20 22:49:52 +00:00
|
|
|
|
2020-01-23 05:22:02 +00:00
|
|
|
/*
|
|
|
|
* If the pager for the current object might have
|
|
|
|
* the page, then determine the number of additional
|
|
|
|
* pages to read and potentially reprioritize
|
|
|
|
* previously read pages for earlier reclamation.
|
|
|
|
* These operations should only be performed once per
|
|
|
|
* page fault. Even if the current pager doesn't
|
|
|
|
* have the page, the number of additional pages to
|
|
|
|
* read will apply to subsequent objects in the
|
|
|
|
* shadow chain.
|
|
|
|
*/
|
|
|
|
if (nera == -1 && !P_KILLED(curproc))
|
|
|
|
nera = vm_fault_readahead(&fs);
|
2016-11-03 16:44:55 +00:00
|
|
|
|
2020-01-23 05:22:02 +00:00
|
|
|
rv = vm_fault_getpages(&fs, nera, &behind, &ahead);
|
|
|
|
if (rv == KERN_SUCCESS) {
|
|
|
|
faultcount = behind + 1 + ahead;
|
|
|
|
hardfault = true;
|
|
|
|
break; /* break to PAGE HAS BEEN FOUND. */
|
|
|
|
}
|
|
|
|
if (rv == KERN_RESOURCE_SHORTAGE)
|
|
|
|
goto RetryFault;
|
|
|
|
VM_OBJECT_WLOCK(fs.object);
|
|
|
|
if (rv == KERN_OUT_OF_BOUNDS) {
|
|
|
|
fault_page_free(&fs.m);
|
|
|
|
unlock_and_deallocate(&fs);
|
|
|
|
return (rv);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1999-09-21 00:36:16 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2020-01-23 05:18:00 +00:00
|
|
|
* The page was not found in the current object. Try to
|
|
|
|
* traverse into a backing object or zero fill if none is
|
|
|
|
* found.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2020-01-23 05:23:37 +00:00
|
|
|
if (vm_fault_next(&fs))
|
|
|
|
continue;
|
|
|
|
VM_OBJECT_WUNLOCK(fs.object);
|
|
|
|
vm_fault_zerofill(&fs);
|
|
|
|
/* Don't try to prefault neighboring pages. */
|
|
|
|
faultcount = 1;
|
|
|
|
break; /* break to PAGE HAS BEEN FOUND. */
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1999-01-21 08:29:12 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2020-01-20 22:49:52 +00:00
|
|
|
* PAGE HAS BEEN FOUND. A valid page has been found and exclusively
|
|
|
|
* busied. The object lock must no longer be held.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2020-01-20 22:49:52 +00:00
|
|
|
vm_page_assert_xbusied(fs.m);
|
|
|
|
VM_OBJECT_ASSERT_UNLOCKED(fs.object);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* If the page is being written, but isn't already owned by the
|
|
|
|
* top-level object, we have to copy it into a new page owned by the
|
|
|
|
* top-level object.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1998-03-07 20:45:47 +00:00
|
|
|
if (fs.object != fs.first_object) {
|
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
|
|
|
* We only really need to copy if we want to write it.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2020-01-23 05:03:34 +00:00
|
|
|
if ((fs.fault_type & (VM_PROT_COPY | VM_PROT_WRITE)) != 0) {
|
2020-01-23 05:11:01 +00:00
|
|
|
vm_fault_cow(&fs);
|
Eliminate typically pointless calls to vm_fault_prefault() on soft, copy-
on-write faults. On a page fault, when we call vm_fault_prefault(), it
probes the pmap and the shadow chain of vm objects to see if there are
opportunities to create read and/or execute-only mappings to neighboring
pages. For example, in the case of hard faults, such effort typically pays
off, that is, mappings are created that eliminate future soft page faults.
However, in the case of soft, copy-on-write faults, the effort very
rarely pays off. (See the review for some specific data.)
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D17367
2018-10-27 17:49:46 +00:00
|
|
|
/*
|
|
|
|
* We only try to prefault read-only mappings to the
|
|
|
|
* neighboring pages when this copy-on-write fault is
|
|
|
|
* a hard fault. In other cases, trying to prefault
|
|
|
|
* is typically wasted effort.
|
|
|
|
*/
|
|
|
|
if (faultcount == 0)
|
|
|
|
faultcount = 1;
|
|
|
|
|
1995-01-09 16:06:02 +00:00
|
|
|
} else {
|
2020-01-23 05:03:34 +00:00
|
|
|
fs.prot &= ~VM_PROT_WRITE;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* We must verify that the maps have not changed since our last
|
|
|
|
* lookup.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2004-08-12 20:14:49 +00:00
|
|
|
if (!fs.lookup_still_valid) {
|
2020-01-23 05:07:01 +00:00
|
|
|
result = vm_fault_relookup(&fs);
|
|
|
|
if (result != KERN_SUCCESS) {
|
2020-01-20 22:49:52 +00:00
|
|
|
fault_deallocate(&fs);
|
2020-01-23 05:07:01 +00:00
|
|
|
if (result == KERN_RESTART)
|
2018-02-14 00:25:18 +00:00
|
|
|
goto RetryFault;
|
2020-01-23 05:07:01 +00:00
|
|
|
return (result);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2020-01-20 22:49:52 +00:00
|
|
|
VM_OBJECT_ASSERT_UNLOCKED(fs.object);
|
2016-07-07 20:58:16 +00:00
|
|
|
|
2009-02-08 20:23:46 +00:00
|
|
|
/*
|
2016-07-07 20:58:16 +00:00
|
|
|
* If the page was filled by a pager, save the virtual address that
|
|
|
|
* should be faulted on next under a sequential access pattern to the
|
|
|
|
* map entry. A read lock on the map suffices to update this address
|
|
|
|
* safely.
|
2009-02-08 20:23:46 +00:00
|
|
|
*/
|
2009-02-25 07:52:53 +00:00
|
|
|
if (hardfault)
|
2016-07-07 20:58:16 +00:00
|
|
|
fs.entry->next_read = vaddr + ptoa(ahead) + PAGE_SIZE;
|
2009-02-08 20:23:46 +00:00
|
|
|
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
2009-04-26 20:54:57 +00:00
|
|
|
* Page must be completely valid or it is not fit to
|
1999-05-02 23:57:16 +00:00
|
|
|
* map into user space. vm_pager_get_pages() ensures this.
|
|
|
|
*/
|
2020-01-20 22:49:52 +00:00
|
|
|
vm_page_assert_xbusied(fs.m);
|
2019-10-15 03:45:41 +00:00
|
|
|
KASSERT(vm_page_all_valid(fs.m),
|
2009-04-26 20:54:57 +00:00
|
|
|
("vm_fault: page %p partially invalid", fs.m));
|
2020-01-20 22:49:52 +00:00
|
|
|
|
2020-01-23 05:03:34 +00:00
|
|
|
vm_fault_dirty(&fs, fs.m);
|
2003-10-04 21:35:48 +00:00
|
|
|
|
2004-08-09 18:46:39 +00:00
|
|
|
/*
|
|
|
|
* Put this page into the physical map. We had to do the unlock above
|
|
|
|
* because pmap_enter() may sleep. We don't put the page
|
|
|
|
* back on the active queue until later so that the pageout daemon
|
|
|
|
* won't find it (yet).
|
|
|
|
*/
|
2020-01-23 05:03:34 +00:00
|
|
|
pmap_enter(fs.map->pmap, vaddr, fs.m, fs.prot,
|
|
|
|
fs.fault_type | (fs.wired ? PMAP_ENTER_WIRED : 0), 0);
|
|
|
|
if (faultcount != 1 && (fs.fault_flags & VM_FAULT_WIRE) == 0 &&
|
|
|
|
fs.wired == 0)
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strengthening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
vm_fault_prefault(&fs, vaddr,
|
|
|
|
faultcount > 0 ? behind : PFBAK,
|
2018-04-29 12:43:08 +00:00
|
|
|
faultcount > 0 ? ahead : PFFOR, false);
|
1996-06-01 20:50:57 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* If the page is not wired down, then put it where the pageout daemon
|
|
|
|
* can find it.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2020-01-23 05:03:34 +00:00
|
|
|
if ((fs.fault_flags & VM_FAULT_WIRE) != 0)
|
2015-07-30 18:28:34 +00:00
|
|
|
vm_page_wire(fs.m);
|
2019-12-28 19:04:00 +00:00
|
|
|
else
|
1998-03-07 20:45:47 +00:00
|
|
|
vm_page_activate(fs.m);
|
2020-01-23 05:03:34 +00:00
|
|
|
if (fs.m_hold != NULL) {
|
|
|
|
(*fs.m_hold) = fs.m;
|
2019-07-08 19:46:20 +00:00
|
|
|
vm_page_wire(fs.m);
|
2010-12-20 22:49:31 +00:00
|
|
|
}
|
2013-08-09 11:11:11 +00:00
|
|
|
vm_page_xunbusy(fs.m);
|
2019-12-15 04:08:24 +00:00
|
|
|
fs.m = NULL;
|
2003-04-22 20:01:56 +00:00
|
|
|
|
2004-08-09 06:01:46 +00:00
|
|
|
/*
|
|
|
|
* Unlock everything, and return
|
|
|
|
*/
|
2019-10-29 20:46:25 +00:00
|
|
|
fault_deallocate(&fs);
|
2013-01-28 12:54:53 +00:00
|
|
|
if (hardfault) {
|
- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
|
|
|
VM_CNT_INC(v_io_faults);
|
2007-06-01 01:12:45 +00:00
|
|
|
curthread->td_ru.ru_majflt++;
|
2016-04-07 04:23:25 +00:00
|
|
|
#ifdef RACCT
|
|
|
|
if (racct_enable && fs.object->type == OBJT_VNODE) {
|
|
|
|
PROC_LOCK(curproc);
|
2020-01-23 05:03:34 +00:00
|
|
|
if ((fs.fault_type & (VM_PROT_COPY | VM_PROT_WRITE)) != 0) {
|
2016-04-07 04:23:25 +00:00
|
|
|
racct_add_force(curproc, RACCT_WRITEBPS,
|
|
|
|
PAGE_SIZE + behind * PAGE_SIZE);
|
|
|
|
racct_add_force(curproc, RACCT_WRITEIOPS, 1);
|
|
|
|
} else {
|
|
|
|
racct_add_force(curproc, RACCT_READBPS,
|
|
|
|
PAGE_SIZE + ahead * PAGE_SIZE);
|
|
|
|
racct_add_force(curproc, RACCT_READIOPS, 1);
|
|
|
|
}
|
|
|
|
PROC_UNLOCK(curproc);
|
|
|
|
}
|
|
|
|
#endif
|
2013-01-28 12:54:53 +00:00
|
|
|
} else
|
2007-06-01 01:12:45 +00:00
|
|
|
curthread->td_ru.ru_minflt++;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_SUCCESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
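The bookkeeping that vm_fault() performs after a successful pager read can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the kernel's implementation: the struct, the function names, and the fixed 4 KiB page size are all assumptions made for illustration. It mirrors four expressions from the function above: `faultcount = behind + 1 + ahead`, the `next_read` address stored in the map entry, the condition guarding vm_fault_prefault(), and the byte counts passed to racct_add_force().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for illustration; the kernel uses the platform's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for the values vm_fault() tracks around a pager read. */
struct fault_sketch {
	uint64_t vaddr;	/* page-aligned faulting address */
	int behind;	/* pages the pager read before the faulting page */
	int ahead;	/* pages the pager read after the faulting page */
};

/* Pages covered by the fault: read-behind run, faulting page, read-ahead run. */
static int
sketch_faultcount(const struct fault_sketch *f)
{
	assert(f->behind >= 0 && f->ahead >= 0);
	return (f->behind + 1 + f->ahead);
}

/* Address expected to fault next under sequential access: one page past the read-ahead run. */
static uint64_t
sketch_next_read(const struct fault_sketch *f)
{
	return (f->vaddr + (uint64_t)f->ahead * SKETCH_PAGE_SIZE +
	    SKETCH_PAGE_SIZE);
}

/* Prefaulting is attempted only when faultcount is not 1 and no wiring is involved. */
static bool
sketch_should_prefault(int faultcount, bool fault_wire, bool entry_wired)
{
	return (faultcount != 1 && !fault_wire && !entry_wired);
}

/* Bytes charged for a hard fault: write faults count read-behind, read faults count read-ahead. */
static uint64_t
sketch_io_bytes(const struct fault_sketch *f, bool is_write)
{
	int extra = is_write ? f->behind : f->ahead;

	return (SKETCH_PAGE_SIZE + (uint64_t)extra * SKETCH_PAGE_SIZE);
}
```

With `behind = 2` and `ahead = 3`, the fault covers six pages and the next sequential fault is expected four pages past `vaddr`. A zero-filled page sets `faultcount` to 1, which is why that path never prefaults its neighbors.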
|
|
|
|
|
2012-05-10 15:16:42 +00:00
|
|
|
/*
|
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case. The infrequent case being reading a single file that
is much larger than memory using mmap(2). And, in this case, the page
daemon isn't the bottleneck; it's the I/O.
In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused. To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes. In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times. With the new heuristic, I see fewer
false positives and they have a much lower cost.
I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic. However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise(). In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.
Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.
Reviewed by: jeff, kib
Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
|
|
|
* Speed up the reclamation of pages that precede the faulting pindex within
|
|
|
|
* the first object of the shadow chain. Essentially, perform the equivalent
|
|
|
|
* to madvise(..., MADV_DONTNEED) on a large cluster of pages that precedes
|
|
|
|
* the faulting pindex by the cluster size when the pages read by vm_fault()
|
|
|
|
* cross a cluster-size boundary. The cluster size is the greater of the
|
|
|
|
* smallest superpage size and VM_FAULT_DONTNEED_MIN.
|
|
|
|
*
|
|
|
|
* When "fs->first_object" is a shadow object, the pages in the backing object
|
|
|
|
* that precede the faulting pindex are deactivated by vm_fault(). So, this
|
|
|
|
* function must only be concerned with pages in the first object.
|
2012-05-10 15:16:42 +00:00
|
|
|
*/
|
|
|
|
static void
|
2015-04-04 19:10:22 +00:00
|
|
|
vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr, int ahead)
|
2012-05-10 15:16:42 +00:00
|
|
|
{
|
2015-04-04 19:10:22 +00:00
|
|
|
vm_map_entry_t entry;
|
2012-05-10 15:16:42 +00:00
|
|
|
vm_object_t first_object, object;
|
2015-04-04 19:10:22 +00:00
|
|
|
vm_offset_t end, start;
|
|
|
|
vm_page_t m, m_next;
|
|
|
|
vm_pindex_t pend, pstart;
|
|
|
|
vm_size_t size;
|
2012-05-10 15:16:42 +00:00
|
|
|
|
|
|
|
object = fs->object;
|
2020-01-20 22:49:52 +00:00
|
|
|
VM_OBJECT_ASSERT_UNLOCKED(object);
|
2012-05-10 15:16:42 +00:00
|
|
|
first_object = fs->first_object;
|
2015-04-04 19:10:22 +00:00
|
|
|
/* Neither fictitious nor unmanaged pages can be reclaimed. */
|
In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.
To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.
Reviewed and tested by: kib
2012-12-09 00:32:38 +00:00
|
|
|
if ((first_object->flags & (OBJ_FICTITIOUS | OBJ_UNMANAGED)) == 0) {
|
2020-01-20 22:49:52 +00:00
|
|
|
VM_OBJECT_RLOCK(first_object);
|
2015-04-04 19:10:22 +00:00
|
|
|
size = VM_FAULT_DONTNEED_MIN;
|
|
|
|
if (MAXPAGESIZES > 1 && size < pagesizes[1])
|
|
|
|
size = pagesizes[1];
|
|
|
|
end = rounddown2(vaddr, size);
|
|
|
|
if (vaddr - end >= size - PAGE_SIZE - ptoa(ahead) &&
|
|
|
|
(entry = fs->entry)->start < end) {
|
|
|
|
if (end - entry->start < size)
|
|
|
|
start = entry->start;
|
|
|
|
else
|
|
|
|
start = end - size;
|
|
|
|
pmap_advise(fs->map->pmap, start, end, MADV_DONTNEED);
|
|
|
|
pstart = OFF_TO_IDX(entry->offset) + atop(start -
|
|
|
|
entry->start);
|
|
|
|
m_next = vm_page_find_least(first_object, pstart);
|
|
|
|
pend = OFF_TO_IDX(entry->offset) + atop(end -
|
|
|
|
entry->start);
|
|
|
|
while ((m = m_next) != NULL && m->pindex < pend) {
|
|
|
|
m_next = TAILQ_NEXT(m, listq);
|
2019-10-15 03:45:41 +00:00
|
|
|
if (!vm_page_all_valid(m) ||
|
2015-04-04 19:10:22 +00:00
|
|
|
vm_page_busied(m))
|
|
|
|
continue;
|
2015-08-03 20:30:27 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't clear PGA_REFERENCED, since it would
|
|
|
|
* likely represent a reference by a different
|
|
|
|
* process.
|
|
|
|
*
|
|
|
|
* Typically, at this point, prefetched pages
|
|
|
|
* are still in the inactive queue. Only
|
|
|
|
* pages that triggered page faults are in the
|
2019-12-28 19:04:00 +00:00
|
|
|
* active queue. The test for whether the page
|
|
|
|
* is in the inactive queue is racy; in the
|
|
|
|
* worst case we will requeue the page
|
|
|
|
* unnecessarily.
|
2015-08-03 20:30:27 +00:00
|
|
|
*/
|
2018-03-18 16:40:56 +00:00
|
|
|
if (!vm_page_inactive(m))
|
|
|
|
vm_page_deactivate(m);
|
2012-05-10 15:16:42 +00:00
|
|
|
}
|
|
|
|
}
|
2020-01-20 22:49:52 +00:00
|
|
|
VM_OBJECT_RUNLOCK(first_object);
|
2012-05-10 15:16:42 +00:00
|
|
|
}
|
|
|
|
}
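As a rough illustration of the cluster-boundary test in vm_fault_dontneed() above, the following userspace sketch recomputes the [start, end) range that would be handed to pmap_advise(). It is not kernel code: the `sk_` names, the 4 KB page size, and the plain integer arguments are assumptions made for the example, and the page-queue work done under the object lock is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u	/* assumed base page size */

/* Round x down to a multiple of y, where y must be a power of two. */
static uint64_t
sk_rounddown2(uint64_t x, uint64_t y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return (x & ~(y - 1));
}

/*
 * Hypothetical helper: returns true and fills [*start, *end) when the
 * fault at vaddr, plus "ahead" read-ahead pages, reaches the last page
 * of its cluster and the map entry begins before that cluster.  The
 * range covers at most one cluster preceding the faulting cluster.
 */
static bool
sk_dontneed_window(uint64_t vaddr, uint64_t entry_start, uint64_t cluster,
    int ahead, uint64_t *start, uint64_t *end)
{
	uint64_t e, threshold;

	e = sk_rounddown2(vaddr, cluster);
	/* As in the original, this is unsigned arithmetic. */
	threshold = cluster - SK_PAGE_SIZE - (uint64_t)ahead * SK_PAGE_SIZE;
	if (vaddr - e < threshold || entry_start >= e)
		return (false);
	/* Do not reach back past the start of the map entry. */
	*start = (e - entry_start < cluster) ? entry_start : e - cluster;
	*end = e;
	return (true);
}
```

With a 2 MB cluster, a fault on the last page of the cluster starting at 0x400000 selects the preceding cluster, 0x200000 to 0x400000, whereas a fault in the middle of the cluster selects nothing.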
|
|
|
|
|
2003-10-03 22:46:53 +00:00
|
|
|
/*
|
|
|
|
* vm_fault_prefault provides a quick way of clustering
|
|
|
|
* pagefaults into a process's address space. It is a "cousin"
|
|
|
|
* of vm_map_pmap_enter, except it runs at page fault time instead
|
|
|
|
* of mmap time.
|
|
|
|
*/
|
|
|
|
static void
|
2014-02-02 20:21:53 +00:00
|
|
|
vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
|
2018-04-29 12:43:08 +00:00
|
|
|
int backward, int forward, bool obj_locked)
|
2003-10-03 22:46:53 +00:00
|
|
|
{
|
2014-02-02 20:21:53 +00:00
|
|
|
pmap_t pmap;
|
|
|
|
vm_map_entry_t entry;
|
|
|
|
vm_object_t backing_object, lobject;
|
2003-10-03 22:46:53 +00:00
|
|
|
vm_offset_t addr, starta;
|
|
|
|
vm_pindex_t pindex;
|
2006-06-15 01:01:06 +00:00
|
|
|
vm_page_t m;
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strengthening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
int i;
|
2003-10-03 22:46:53 +00:00
|
|
|
|
2014-02-02 20:21:53 +00:00
|
|
|
pmap = fs->map->pmap;
|
2004-10-17 20:29:28 +00:00
|
|
|
if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace))
|
2003-10-03 22:46:53 +00:00
|
|
|
return;
|
|
|
|
|
2014-02-02 20:21:53 +00:00
|
|
|
entry = fs->entry;
|
2003-10-03 22:46:53 +00:00
|
|
|
|
2017-02-24 08:09:16 +00:00
|
|
|
if (addra < backward * PAGE_SIZE) {
|
2003-10-03 22:46:53 +00:00
|
|
|
starta = entry->start;
|
2017-02-24 08:09:16 +00:00
|
|
|
} else {
|
|
|
|
starta = addra - backward * PAGE_SIZE;
|
|
|
|
if (starta < entry->start)
|
|
|
|
starta = entry->start;
|
2003-10-03 22:46:53 +00:00
|
|
|
}
|
|
|
|
|
2014-02-02 20:21:53 +00:00
|
|
|
/*
|
|
|
|
* Generate the sequence of virtual addresses that are candidates for
|
|
|
|
* prefaulting in an outward spiral from the faulting virtual address,
|
|
|
|
* "addra". Specifically, the sequence is "addra - PAGE_SIZE", "addra
|
|
|
|
* + PAGE_SIZE", "addra - 2 * PAGE_SIZE", "addra + 2 * PAGE_SIZE", ...
|
|
|
|
* If the candidate address doesn't have a backing physical page, then
|
|
|
|
* the loop immediately terminates.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 2 * imax(backward, forward); i++) {
|
|
|
|
addr = addra + ((i >> 1) + 1) * ((i & 1) == 0 ? -PAGE_SIZE :
|
|
|
|
PAGE_SIZE);
|
|
|
|
if (addr > addra + forward * PAGE_SIZE)
|
2003-10-03 22:46:53 +00:00
|
|
|
addr = 0;
|
|
|
|
|
|
|
|
if (addr < starta || addr >= entry->end)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!pmap_is_prefaultable(pmap, addr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
pindex = ((addr - entry->start) + entry->offset) >> PAGE_SHIFT;
|
2014-02-02 20:21:53 +00:00
|
|
|
lobject = entry->object.vm_object;
|
2018-04-29 12:43:08 +00:00
|
|
|
if (!obj_locked)
|
|
|
|
VM_OBJECT_RLOCK(lobject);
|
2003-10-03 22:46:53 +00:00
|
|
|
while ((m = vm_page_lookup(lobject, pindex)) == NULL &&
|
|
|
|
lobject->type == OBJT_DEFAULT &&
|
|
|
|
(backing_object = lobject->backing_object) != NULL) {
|
2009-10-25 17:30:50 +00:00
|
|
|
KASSERT((lobject->backing_object_offset & PAGE_MASK) ==
|
|
|
|
0, ("vm_fault_prefault: unaligned object offset"));
|
2003-10-03 22:46:53 +00:00
|
|
|
pindex += lobject->backing_object_offset >> PAGE_SHIFT;
|
2013-05-17 19:02:36 +00:00
|
|
|
VM_OBJECT_RLOCK(backing_object);
|
2018-04-29 12:43:08 +00:00
|
|
|
if (!obj_locked || lobject != entry->object.vm_object)
|
|
|
|
VM_OBJECT_RUNLOCK(lobject);
|
2003-10-03 22:46:53 +00:00
|
|
|
lobject = backing_object;
|
|
|
|
}
|
2003-10-04 21:35:48 +00:00
|
|
|
if (m == NULL) {
|
2018-04-29 12:43:08 +00:00
|
|
|
if (!obj_locked || lobject != entry->object.vm_object)
|
|
|
|
VM_OBJECT_RUNLOCK(lobject);
|
2003-10-03 22:46:53 +00:00
|
|
|
break;
|
2003-10-04 21:35:48 +00:00
|
|
|
}
|
2019-10-15 03:45:41 +00:00
|
|
|
if (vm_page_all_valid(m) &&
|
2010-05-08 20:34:01 +00:00
|
|
|
(m->flags & PG_FICTITIOUS) == 0)
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
pmap_enter_quick(pmap, addr, m, entry->protection);
|
2018-04-29 12:43:08 +00:00
|
|
|
if (!obj_locked || lobject != entry->object.vm_object)
|
|
|
|
VM_OBJECT_RUNLOCK(lobject);
|
2003-10-03 22:46:53 +00:00
|
|
|
}
|
|
|
|
}
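The "outward spiral" described in the comment inside vm_fault_prefault() can be sketched in userspace as follows. This is only an illustration under stated assumptions: the `sk_` names, the 4 KB page size, and the output array are hypothetical, and the sketch lists candidate addresses only. It omits the pmap_is_prefaultable() check, the walk down the backing-object chain, and the early exit when no resident page is found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096	/* assumed base page size */

/*
 * Hypothetical helper: store the candidate addresses around addra in
 * visiting order and return how many were stored (at most cap).
 */
static size_t
sk_prefault_candidates(uint64_t addra, uint64_t entry_start,
    uint64_t entry_end, int backward, int forward, uint64_t *out, size_t cap)
{
	uint64_t addr, starta;
	int64_t step;
	size_t n;
	int i, limit;

	assert(backward >= 0 && forward >= 0);

	/* Clamp the backward limit to the start of the map entry. */
	if (addra < (uint64_t)backward * SK_PAGE_SIZE) {
		starta = entry_start;
	} else {
		starta = addra - (uint64_t)backward * SK_PAGE_SIZE;
		if (starta < entry_start)
			starta = entry_start;
	}

	limit = 2 * (backward > forward ? backward : forward);
	n = 0;
	for (i = 0; i < limit && n < cap; i++) {
		/* Even i steps backward, odd i steps forward. */
		step = (int64_t)((i >> 1) + 1) *
		    ((i & 1) == 0 ? -SK_PAGE_SIZE : SK_PAGE_SIZE);
		addr = addra + (uint64_t)step;
		/*
		 * Past the forward limit: zero the address, as the
		 * original does, so that it usually falls below starta.
		 */
		if (addr > addra + (uint64_t)forward * SK_PAGE_SIZE)
			addr = 0;
		if (addr < starta || addr >= entry_end)
			continue;
		out[n++] = addr;
	}
	return (n);
}
```

For a fault at 0x10000 in an entry spanning 0xe000 to 0x14000 with backward = 2 and forward = 3, the candidates come out as 0xf000, 0x11000, 0xe000, 0x12000, 0x13000. The third backward step (0xd000) is skipped because it lies below the clamped start.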
|
|
|
|
|
2010-12-25 21:26:56 +00:00
|
|
|
/*
|
|
|
|
* Hold each of the physical pages that are mapped by the specified range of
|
|
|
|
* virtual addresses, ["addr", "addr" + "len"), if those mappings are valid
|
|
|
|
* and allow the specified types of access, "prot". If all of the implied
|
|
|
|
* pages are successfully held, then the number of held pages is returned
|
|
|
|
* together with pointers to those pages in the array "ma". However, if any
|
|
|
|
* of the pages cannot be held, -1 is returned.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
vm_fault_quick_hold_pages(vm_map_t map, vm_offset_t addr, vm_size_t len,
|
|
|
|
vm_prot_t prot, vm_page_t *ma, int max_count)
|
|
|
|
{
|
|
|
|
vm_offset_t end, va;
|
|
|
|
vm_page_t *mp;
|
2013-11-20 08:45:26 +00:00
|
|
|
int count;
|
2010-12-25 21:26:56 +00:00
|
|
|
boolean_t pmap_failed;
|
|
|
|
|
2011-03-25 16:38:10 +00:00
|
|
|
if (len == 0)
|
|
|
|
return (0);
|
2013-11-20 08:45:26 +00:00
|
|
|
end = round_page(addr + len);
|
2010-12-25 21:26:56 +00:00
|
|
|
addr = trunc_page(addr);
|
|
|
|
|
2020-06-19 03:32:04 +00:00
|
|
|
if (!vm_map_range_valid(map, addr, end))
|
2010-12-25 21:26:56 +00:00
|
|
|
return (-1);
|
|
|
|
|
2013-11-20 08:45:26 +00:00
|
|
|
if (atop(end - addr) > max_count)
|
2010-12-25 21:26:56 +00:00
|
|
|
panic("vm_fault_quick_hold_pages: count > max_count");
|
2013-11-20 08:45:26 +00:00
|
|
|
count = atop(end - addr);
|
2010-12-25 21:26:56 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Most likely, the physical pages are resident in the pmap, so it is
|
|
|
|
* faster to try pmap_extract_and_hold() first.
|
|
|
|
*/
|
|
|
|
pmap_failed = FALSE;
|
|
|
|
for (mp = ma, va = addr; va < end; mp++, va += PAGE_SIZE) {
|
|
|
|
*mp = pmap_extract_and_hold(map->pmap, va, prot);
|
|
|
|
if (*mp == NULL)
|
|
|
|
pmap_failed = TRUE;
|
|
|
|
else if ((prot & VM_PROT_WRITE) != 0 &&
|
2010-12-28 20:02:30 +00:00
|
|
|
(*mp)->dirty != VM_PAGE_BITS_ALL) {
|
2010-12-25 21:26:56 +00:00
|
|
|
/*
|
|
|
|
* Explicitly dirty the physical page. Otherwise, the
|
|
|
|
* caller's changes may go unnoticed because they are
|
|
|
|
* performed through an unmanaged mapping or by a DMA
|
|
|
|
* operation.
|
2011-06-19 19:13:24 +00:00
|
|
|
*
|
2011-09-28 14:57:50 +00:00
|
|
|
* The object lock is not held here.
|
|
|
|
* See vm_page_clear_dirty_mask().
|
2010-12-25 21:26:56 +00:00
|
|
|
*/
|
2011-06-19 19:13:24 +00:00
|
|
|
vm_page_dirty(*mp);
|
2010-12-25 21:26:56 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
if (pmap_failed) {
|
|
|
|
/*
|
|
|
|
* One or more pages could not be held by the pmap. Either no
|
|
|
|
* page was mapped at the specified virtual address or that
|
|
|
|
* mapping had insufficient permissions. Attempt to fault in
|
|
|
|
* and hold these pages.
|
2018-03-26 16:31:12 +00:00
|
|
|
*
|
|
|
|
* If vm_fault_disable_pagefaults() was called,
|
|
|
|
* i.e., TDP_NOFAULTING is set, we must not sleep nor
|
|
|
|
* acquire MD VM locks, which means we must not call
|
2019-09-27 18:43:36 +00:00
|
|
|
* vm_fault(). Some (out of tree) callers mark
|
2018-03-26 16:31:12 +00:00
|
|
|
* too wide a code area with vm_fault_disable_pagefaults()
|
|
|
|
* already, use the VM_PROT_QUICK_NOFAULT flag to request
|
|
|
|
* the proper behaviour explicitly.
|
2010-12-25 21:26:56 +00:00
|
|
|
*/
|
2018-03-26 16:31:12 +00:00
|
|
|
if ((prot & VM_PROT_QUICK_NOFAULT) != 0 &&
|
|
|
|
(curthread->td_pflags & TDP_NOFAULTING) != 0)
|
|
|
|
goto error;
|
2010-12-25 21:26:56 +00:00
|
|
|
for (mp = ma, va = addr; va < end; mp++, va += PAGE_SIZE)
|
2019-09-27 18:43:36 +00:00
|
|
|
if (*mp == NULL && vm_fault(map, va, prot,
|
2010-12-25 21:26:56 +00:00
|
|
|
VM_FAULT_NORMAL, mp) != KERN_SUCCESS)
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
return (count);
|
|
|
|
error:
|
|
|
|
for (mp = ma; mp < ma + count; mp++)
|
Change synchronization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
|
|
|
if (*mp != NULL)
|
|
|
|
vm_page_unwire(*mp, PQ_INACTIVE);
|
2010-12-25 21:26:56 +00:00
|
|
|
return (-1);
|
|
|
|
}
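A caller of vm_fault_quick_hold_pages() has to size the "ma" array for the number of pages the byte range touches, since the function panics when that count exceeds max_count. The sketch below mirrors the round_page()/trunc_page()/atop() arithmetic from the function in userspace. The `sk_` names and the 4 KB page size are assumptions for the example, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12	/* assumed: 4 KB pages */
#define SK_PAGE_SIZE ((uint64_t)1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: number of pages spanned by the byte range
 * [addr, addr + len), after page-aligning both ends.
 */
static uint64_t
sk_span_pages(uint64_t addr, uint64_t len)
{
	uint64_t end, start;

	if (len == 0)
		return (0);
	assert(addr + len >= addr);	/* the sketch ignores wraparound */
	end = (addr + len + SK_PAGE_MASK) & ~SK_PAGE_MASK;	/* round_page */
	start = addr & ~SK_PAGE_MASK;				/* trunc_page */
	return ((end - start) >> SK_PAGE_SHIFT);		/* atop */
}
```

Note that a short range can still span two pages: 16 bytes starting at 0x1ff8 straddle a page boundary, so two page pointers are needed.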
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Routine:
|
|
|
|
* vm_fault_copy_entry
|
|
|
|
* Function:
|
2009-10-27 10:15:58 +00:00
|
|
|
* Create new shadow object backing dst_entry with private copy of
|
|
|
|
* all underlying pages. When src_entry is equal to dst_entry,
|
|
|
|
* function implements COW for wired-down map entry. Otherwise,
|
|
|
|
* it forks wired entry into dst_map.
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
|
|
|
* In/out conditions:
|
|
|
|
* The source and destination maps must be locked for write.
|
|
|
|
* The source map entry must be wired down (or be a sharing map
|
|
|
|
* entry corresponding to a main map entry that is wired down).
|
|
|
|
*/
|
1994-05-25 09:21:21 +00:00
|
|
|
void
|
2009-07-03 22:17:37 +00:00
|
|
|
vm_fault_copy_entry(vm_map_t dst_map, vm_map_t src_map,
|
|
|
|
vm_map_entry_t dst_entry, vm_map_entry_t src_entry,
|
|
|
|
vm_ooffset_t *fork_charge)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2009-10-27 10:15:58 +00:00
|
|
|
vm_object_t backing_object, dst_object, object, src_object;
|
2009-10-26 00:01:52 +00:00
|
|
|
vm_pindex_t dst_pindex, pindex, src_pindex;
|
2009-10-27 10:15:58 +00:00
|
|
|
vm_prot_t access, prot;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
vm_offset_t vaddr;
|
|
|
|
vm_page_t dst_m;
|
|
|
|
vm_page_t src_m;
|
2014-04-27 05:19:01 +00:00
|
|
|
boolean_t upgrade;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#ifdef lint
|
|
|
|
src_map++;
|
1995-01-09 16:06:02 +00:00
|
|
|
#endif /* lint */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-10-27 10:15:58 +00:00
|
|
|
upgrade = src_entry == dst_entry;
|
2014-05-10 17:03:33 +00:00
|
|
|
access = prot = dst_entry->protection;
|
2009-10-27 10:15:58 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
src_object = src_entry->object.vm_object;
|
2009-10-26 00:01:52 +00:00
|
|
|
src_pindex = OFF_TO_IDX(src_entry->offset);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-05-10 17:03:33 +00:00
|
|
|
if (upgrade && (dst_entry->eflags & MAP_ENTRY_NEEDS_COPY) == 0) {
|
|
|
|
dst_object = src_object;
|
|
|
|
vm_object_reference(dst_object);
|
|
|
|
} else {
|
|
|
|
/*
|
2019-12-01 20:43:04 +00:00
|
|
|
* Create the top-level object for the destination entry.
|
|
|
|
* Doesn't actually shadow anything - we copy the pages
|
|
|
|
* directly.
|
2014-05-10 17:03:33 +00:00
|
|
|
*/
|
2019-12-01 20:43:04 +00:00
|
|
|
dst_object = vm_object_allocate_anon(atop(dst_entry->end -
|
|
|
|
dst_entry->start), NULL, NULL, 0);
|
2007-12-29 19:53:04 +00:00
|
|
|
#if VM_NRESERVLEVEL > 0
|
2014-05-10 17:03:33 +00:00
|
|
|
dst_object->flags |= OBJ_COLORED;
|
|
|
|
dst_object->pg_color = atop(dst_entry->start);
|
2007-12-29 19:53:04 +00:00
|
|
|
#endif
|
2018-09-28 14:10:12 +00:00
|
|
|
dst_object->domain = src_object->domain;
|
|
|
|
dst_object->charge = dst_entry->end - dst_entry->start;
|
2014-05-10 17:03:33 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(dst_object);
|
2009-10-27 10:15:58 +00:00
|
|
|
KASSERT(upgrade || dst_entry->object.vm_object == NULL,
|
2009-07-03 22:17:37 +00:00
|
|
|
("vm_fault_copy_entry: vm_object not NULL"));
|
2014-05-10 17:03:33 +00:00
|
|
|
if (src_object != dst_object) {
|
|
|
|
dst_entry->object.vm_object = dst_object;
|
|
|
|
dst_entry->offset = 0;
|
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
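The sign convention described in the commit message above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct toy_vnode` and the `toy_*` helpers are hypothetical names invented for illustration, the vnode interlock that protects the real counter is omitted, and the kernel's VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() implementations are not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one field of struct vnode that matters here. */
struct toy_vnode {
	int writecount;	/* > 0: writers present; < 0: text references present */
};

/*
 * Add (inc = +1) or drop (inc = -1) a write reference.  The change is
 * refused if text references exist or if the count would go negative.
 */
static bool
toy_add_writecount(struct toy_vnode *vp, int inc)
{
	if (vp->writecount < 0 || vp->writecount + inc < 0)
		return (false);
	vp->writecount += inc;
	return (true);
}

/* Take a text reference; refused while any writer holds the file. */
static bool
toy_set_text(struct toy_vnode *vp)
{
	if (vp->writecount > 0)
		return (false);
	vp->writecount--;
	return (true);
}

/* Drop a text reference, as happens when the text map entry is removed. */
static void
toy_unset_text(struct toy_vnode *vp)
{
	if (vp->writecount < 0)
		vp->writecount++;
}
```

Because one signed counter carries both meanings, "mapped as text" and "open for write" cannot both hold at once, which is the mutual exclusion the removed VV_TEXT flag used to express.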
|
|
|
dst_entry->eflags &= ~MAP_ENTRY_VN_EXEC;
|
2014-05-10 17:03:33 +00:00
|
|
|
}
|
2009-10-27 10:15:58 +00:00
|
|
|
if (fork_charge != NULL) {
|
2010-12-02 17:37:16 +00:00
|
|
|
KASSERT(dst_entry->cred == NULL,
|
2009-10-27 10:15:58 +00:00
|
|
|
("vm_fault_copy_entry: leaked swp charge"));
|
2010-12-02 17:37:16 +00:00
|
|
|
dst_object->cred = curthread->td_ucred;
|
|
|
|
crhold(dst_object->cred);
|
2009-10-27 10:15:58 +00:00
|
|
|
*fork_charge += dst_object->charge;
|
2018-09-28 14:11:01 +00:00
|
|
|
} else if ((dst_object->type == OBJT_DEFAULT ||
|
|
|
|
dst_object->type == OBJT_SWAP) &&
|
|
|
|
dst_object->cred == NULL) {
|
2014-05-10 17:03:33 +00:00
|
|
|
KASSERT(dst_entry->cred != NULL, ("no cred for entry %p",
|
|
|
|
dst_entry));
|
2010-12-02 17:37:16 +00:00
|
|
|
dst_object->cred = dst_entry->cred;
|
|
|
|
dst_entry->cred = NULL;
|
2009-10-27 10:15:58 +00:00
|
|
|
}
|
2014-05-10 17:03:33 +00:00
|
|
|
|
2009-10-27 10:15:58 +00:00
|
|
|
/*
|
|
|
|
* If not an upgrade, then enter the mappings in the pmap as
|
|
|
|
* read and/or execute accesses. Otherwise, enter them as
|
|
|
|
* write accesses.
|
|
|
|
*
|
|
|
|
* A writeable large page mapping is only created if all of
|
|
|
|
* the constituent small page mappings are modified. Marking
|
|
|
|
* PTEs as modified on inception allows promotion to happen
|
|
|
|
* without taking a potentially large number of soft faults.
|
|
|
|
*/
|
|
|
|
if (!upgrade)
|
|
|
|
access &= ~VM_PROT_WRITE;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
2012-10-24 18:32:37 +00:00
|
|
|
* Loop through all of the virtual pages within the entry's
|
|
|
|
* range, copying each page from the source object to the
|
|
|
|
* destination object. Since the source is wired, those pages
|
|
|
|
* must exist. In contrast, the destination is pageable.
|
2018-05-30 16:48:48 +00:00
|
|
|
* Since the destination object doesn't share any backing storage
|
2012-10-24 18:32:37 +00:00
|
|
|
* with the source object, all of its pages must be dirtied,
|
|
|
|
* regardless of whether they can be written.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2009-10-26 00:01:52 +00:00
|
|
|
for (vaddr = dst_entry->start, dst_pindex = 0;
|
1995-01-09 16:06:02 +00:00
|
|
|
vaddr < dst_entry->end;
|
2009-10-26 00:01:52 +00:00
|
|
|
vaddr += PAGE_SIZE, dst_pindex++) {
|
2014-05-10 17:03:33 +00:00
|
|
|
again:
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Find the page in the source object, and copy it in.
|
2014-04-27 05:19:01 +00:00
|
|
|
* Because the source is wired down, the page will be
|
|
|
|
* in memory.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2014-05-10 17:03:33 +00:00
|
|
|
if (src_object != dst_object)
|
|
|
|
VM_OBJECT_RLOCK(src_object);
|
2003-10-15 08:00:45 +00:00
|
|
|
object = src_object;
|
2009-10-26 00:01:52 +00:00
|
|
|
pindex = src_pindex + dst_pindex;
|
|
|
|
while ((src_m = vm_page_lookup(object, pindex)) == NULL &&
|
2003-10-15 08:00:45 +00:00
|
|
|
(backing_object = object->backing_object) != NULL) {
|
|
|
|
/*
|
2014-04-27 05:19:01 +00:00
|
|
|
* Unless the source mapping is read-only or
|
|
|
|
* it is presently being upgraded from
|
|
|
|
* read-only, the first object in the shadow
|
|
|
|
* chain should provide all of the pages. In
|
|
|
|
* other words, this loop body should never be
|
|
|
|
* executed when the source mapping is already
|
|
|
|
* read/write.
|
2003-10-15 08:00:45 +00:00
|
|
|
*/
|
2014-04-27 05:19:01 +00:00
|
|
|
KASSERT((src_entry->protection & VM_PROT_WRITE) == 0 ||
|
|
|
|
upgrade,
|
|
|
|
("vm_fault_copy_entry: main object missing page"));
|
|
|
|
|
2013-05-22 15:11:00 +00:00
|
|
|
VM_OBJECT_RLOCK(backing_object);
|
2003-10-15 08:00:45 +00:00
|
|
|
pindex += OFF_TO_IDX(object->backing_object_offset);
|
2014-05-10 17:03:33 +00:00
|
|
|
if (object != dst_object)
|
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2003-10-15 08:00:45 +00:00
|
|
|
object = backing_object;
|
|
|
|
}
|
2014-04-27 05:19:01 +00:00
|
|
|
KASSERT(src_m != NULL, ("vm_fault_copy_entry: page missing"));
|
2014-05-10 17:03:33 +00:00
|
|
|
|
|
|
|
if (object != dst_object) {
|
|
|
|
/*
|
|
|
|
* Allocate a page in the destination object.
|
|
|
|
*/
|
2014-05-21 08:19:04 +00:00
|
|
|
dst_m = vm_page_alloc(dst_object, (src_object ==
|
|
|
|
dst_object ? src_pindex : 0) + dst_pindex,
|
|
|
|
VM_ALLOC_NORMAL);
|
|
|
|
if (dst_m == NULL) {
|
|
|
|
VM_OBJECT_WUNLOCK(dst_object);
|
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2018-02-20 10:13:13 +00:00
|
|
|
vm_wait(dst_object);
|
2014-05-21 08:19:04 +00:00
|
|
|
VM_OBJECT_WLOCK(dst_object);
|
|
|
|
goto again;
|
|
|
|
}
|
2014-05-10 17:03:33 +00:00
|
|
|
pmap_copy_page(src_m, dst_m);
|
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2019-03-20 13:07:57 +00:00
|
|
|
dst_m->dirty = dst_m->valid = src_m->valid;
|
2014-05-10 17:03:33 +00:00
|
|
|
} else {
|
|
|
|
dst_m = src_m;
|
2019-10-15 03:35:11 +00:00
|
|
|
if (vm_page_busy_acquire(dst_m, VM_ALLOC_WAITFAIL) == 0)
|
2014-05-10 17:03:33 +00:00
|
|
|
goto again;
|
2019-10-15 03:35:11 +00:00
|
|
|
if (dst_m->pindex >= dst_object->size) {
|
2018-09-28 14:11:38 +00:00
|
|
|
/*
|
|
|
|
* We are upgrading. Index can occur
|
|
|
|
* out of bounds if the object type is
|
|
|
|
* vnode and the file was truncated.
|
|
|
|
*/
|
2019-10-15 03:35:11 +00:00
|
|
|
vm_page_xunbusy(dst_m);
|
2018-09-28 14:11:38 +00:00
|
|
|
break;
|
2019-10-15 03:35:11 +00:00
|
|
|
}
|
2014-05-10 17:03:33 +00:00
|
|
|
}
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(dst_object);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
2009-10-27 10:15:58 +00:00
|
|
|
* Enter it in the pmap. If a wired, copy-on-write
|
|
|
|
* mapping is being replaced by a write-enabled
|
|
|
|
* mapping, then wire that new mapping.
|
2019-03-20 13:07:57 +00:00
|
|
|
*
|
|
|
|
* The page can be invalid if the user called
|
|
|
|
* msync(MS_INVALIDATE) or truncated the backing vnode
|
|
|
|
* or shared memory object. In this case, do not
|
|
|
|
* insert it into pmap, but still do the copy so that
|
|
|
|
* all copies of the wired map entry have similar
|
|
|
|
* backing pages.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-10-15 03:45:41 +00:00
|
|
|
if (vm_page_all_valid(dst_m)) {
|
2019-03-20 13:07:57 +00:00
|
|
|
pmap_enter(dst_map->pmap, vaddr, dst_m, prot,
|
|
|
|
access | (upgrade ? PMAP_ENTER_WIRED : 0), 0);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Mark it no longer busy, and put it on the active list.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(dst_object);
|
2010-04-30 00:46:43 +00:00
|
|
|
|
2009-10-27 10:15:58 +00:00
|
|
|
if (upgrade) {
|
2014-05-10 17:03:33 +00:00
|
|
|
if (src_m != dst_m) {
|
2014-06-16 18:15:27 +00:00
|
|
|
vm_page_unwire(src_m, PQ_INACTIVE);
|
2014-05-10 17:03:33 +00:00
|
|
|
vm_page_wire(dst_m);
|
|
|
|
} else {
|
2019-06-02 01:00:17 +00:00
|
|
|
KASSERT(vm_page_wired(dst_m),
|
2014-05-10 17:03:33 +00:00
|
|
|
("dst_m %p is not wired", dst_m));
|
|
|
|
}
|
2010-04-30 00:46:43 +00:00
|
|
|
} else {
|
2009-10-27 10:15:58 +00:00
|
|
|
vm_page_activate(dst_m);
|
2010-04-30 00:46:43 +00:00
|
|
|
}
|
2013-08-09 11:11:11 +00:00
|
|
|
vm_page_xunbusy(dst_m);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(dst_object);
|
2009-10-27 10:15:58 +00:00
|
|
|
if (upgrade) {
|
|
|
|
dst_entry->eflags &= ~(MAP_ENTRY_COW | MAP_ENTRY_NEEDS_COPY);
|
|
|
|
vm_object_deallocate(src_object);
|
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
}
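The inner `while` loop of the function above finds the source page by walking the shadow chain, adjusting the page index by each object's backing offset as it descends. A minimal userspace sketch of that walk follows. The `toy_object` type and `toy_chain_lookup()` are hypothetical simplifications: offsets are kept in pages rather than bytes, pages are plain pointers, and none of the object locking done by the real code is modeled.

```c
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical, simplified stand-in for vm_object. */
struct toy_object {
	const char *pages[TOY_NPAGES];		/* NULL: page not resident here */
	struct toy_object *backing_object;	/* next object in the shadow chain */
	size_t backing_offset;			/* offset into backing object, in pages */
};

/*
 * Look up 'pindex' in 'object'; on a miss, translate the index by the
 * backing offset and retry in the backing object.  Returns NULL if no
 * object in the chain holds the page.
 */
static const char *
toy_chain_lookup(const struct toy_object *object, size_t pindex)
{
	while (object != NULL) {
		if (pindex < TOY_NPAGES && object->pages[pindex] != NULL)
			return (object->pages[pindex]);
		pindex += object->backing_offset;
		object = object->backing_object;
	}
	return (NULL);
}
```

In the kernel function the walk is expected to succeed because the source entry is wired, which is why a missing page there is treated as an assertion failure rather than a normal return value.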
|
|
|
|
|
Handle spurious page faults that may occur in no-fault sections of the
kernel.
When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.
In collaboration with: kib
Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.
MFC after: 1 week
2012-03-22 04:52:51 +00:00
|
|
|
/*
|
|
|
|
* Block entry into the machine-independent layer's page fault handler by
|
|
|
|
* the calling thread. Subsequent calls to vm_fault() by that thread will
|
|
|
|
* return KERN_PROTECTION_FAILURE. Enable machine-dependent handling of
|
|
|
|
* spurious page faults.
|
|
|
|
*/
|
2011-07-09 15:21:10 +00:00
|
|
|
int
|
|
|
|
vm_fault_disable_pagefaults(void)
|
|
|
|
{
|
|
|
|
|
2012-03-22 04:52:51 +00:00
|
|
|
return (curthread_pflags_set(TDP_NOFAULTING | TDP_RESETSPUR));
|
2011-07-09 15:21:10 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vm_fault_enable_pagefaults(int save)
|
|
|
|
{
|
|
|
|
|
|
|
|
curthread_pflags_restore(save);
|
|
|
|
}
|
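The two functions above form a save/restore pair: the disable call returns a token that the enable call later consumes, so nested no-fault sections unwind correctly. The sketch below models one way such a pair can be built with a bit mask. It is an illustration under assumptions, not the kernel's definition: `toy_pflags` stands in for the per-thread flag word, the `TOY_*` flag values are made up, and the bodies of `toy_pflags_set()` and `toy_pflags_restore()` are a plausible implementation of the idea rather than a copy of `curthread_pflags_set()` and `curthread_pflags_restore()`.

```c
/* Hypothetical flag values; the kernel's TDP_* constants differ. */
#define TOY_NOFAULTING	0x1
#define TOY_RESETSPUR	0x2

static int toy_pflags;	/* stands in for the current thread's private flags */

/*
 * Set 'flags' and return a mask that, when ANDed back in, clears only
 * those bits of 'flags' that were not already set before this call.
 */
static int
toy_pflags_set(int flags)
{
	int save;

	save = ~flags | (toy_pflags & flags);
	toy_pflags |= flags;
	return (save);
}

/* Undo the matching toy_pflags_set() call. */
static void
toy_pflags_restore(int save)
{
	toy_pflags &= save;
}

static int
toy_disable_pagefaults(void)
{
	return (toy_pflags_set(TOY_NOFAULTING | TOY_RESETSPUR));
}

static void
toy_enable_pagefaults(int save)
{
	toy_pflags_restore(save);
}
```

With this encoding an inner disable/enable pair leaves the flags set for the enclosing section, and only the outermost enable clears them; unrelated flag bits are never touched.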