2005-01-07 02:29:27 +00:00
|
|
|
/*-
|
1994-05-25 09:21:21 +00:00
|
|
|
* Copyright (c) 1991 Regents of the University of California.
|
|
|
|
* All rights reserved.
|
|
|
|
* Copyright (c) 1994 John S. Dyson
|
|
|
|
* All rights reserved.
|
|
|
|
* Copyright (c) 1994 David Greenman
|
|
|
|
* All rights reserved.
|
2005-08-10 00:17:36 +00:00
|
|
|
* Copyright (c) 2005 Yahoo! Technologies Norway AS
|
|
|
|
* All rights reserved.
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
|
|
|
* This code is derived from software contributed to Berkeley by
|
|
|
|
* The Mach Operating System project at Carnegie-Mellon University.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 3. All advertising materials mentioning features or use of this software
|
2000-03-27 20:41:17 +00:00
|
|
|
* must display the following acknowledgement:
|
1994-05-24 10:09:53 +00:00
|
|
|
* This product includes software developed by the University of
|
|
|
|
* California, Berkeley and its contributors.
|
|
|
|
* 4. Neither the name of the University nor the names of its contributors
|
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1994-08-02 07:55:43 +00:00
|
|
|
* from: @(#)vm_pageout.c 7.4 (Berkeley) 5/7/91
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
|
|
|
*
|
|
|
|
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
|
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Permission to use, copy, modify and distribute this software and
|
|
|
|
* its documentation is hereby granted, provided that both the copyright
|
|
|
|
* notice and this permission notice appear in all copies of the
|
|
|
|
* software, derivative works or modified versions, and any portions
|
|
|
|
* thereof, and that both notices appear in supporting documentation.
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
*
|
|
|
|
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
|
|
|
|
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
|
1994-05-24 10:09:53 +00:00
|
|
|
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Carnegie Mellon requests users of this software to return to
|
|
|
|
*
|
|
|
|
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
|
|
|
|
* School of Computer Science
|
|
|
|
* Carnegie Mellon University
|
|
|
|
* Pittsburgh PA 15213-3890
|
|
|
|
*
|
|
|
|
* any improvements or extensions that they make and grant Carnegie the
|
|
|
|
* rights to redistribute these changes.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The proverbial page-out daemon.
|
|
|
|
*/
|
|
|
|
|
2003-06-11 23:50:51 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
1998-09-29 17:33:59 +00:00
|
|
|
#include "opt_vm.h"
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/systm.h>
|
1995-03-16 18:17:34 +00:00
|
|
|
#include <sys/kernel.h>
|
2002-11-21 09:17:56 +00:00
|
|
|
#include <sys/eventhandler.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/lock.h>
|
|
|
|
#include <sys/mutex.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/proc.h>
|
1999-07-01 13:21:46 +00:00
|
|
|
#include <sys/kthread.h>
|
2000-09-07 01:33:02 +00:00
|
|
|
#include <sys/ktr.h>
|
2007-06-26 18:24:05 +00:00
|
|
|
#include <sys/mount.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/resourcevar.h>
|
2002-10-12 05:32:24 +00:00
|
|
|
#include <sys/sched.h>
|
1995-02-14 06:14:28 +00:00
|
|
|
#include <sys/signalvar.h>
|
1995-04-09 06:03:56 +00:00
|
|
|
#include <sys/vnode.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <sys/vmmeter.h>
|
2001-03-28 11:52:56 +00:00
|
|
|
#include <sys/sx.h>
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#include <sys/sysctl.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <vm/vm.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_param.h>
|
|
|
|
#include <vm/vm_object.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <vm/vm_page.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_map.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <vm/vm_pageout.h>
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
#include <vm/vm_pager.h>
|
1994-10-09 01:52:19 +00:00
|
|
|
#include <vm/swap_pager.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_extern.h>
|
2002-03-20 04:02:59 +00:00
|
|
|
#include <vm/uma.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1995-08-28 09:19:25 +00:00
|
|
|
/*
|
|
|
|
* System initialization
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* the kernel process "vm_pageout"*/
|
2002-03-19 22:20:14 +00:00
|
|
|
static void vm_pageout(void);
|
|
|
|
static int vm_pageout_clean(vm_page_t);
|
|
|
|
static void vm_pageout_scan(int pass);
|
2003-09-19 05:03:45 +00:00
|
|
|
|
1995-08-28 09:19:25 +00:00
|
|
|
struct proc *pageproc;
|
|
|
|
|
|
|
|
static struct kproc_desc page_kp = {
|
|
|
|
"pagedaemon",
|
|
|
|
vm_pageout,
|
|
|
|
&pageproc
|
|
|
|
};
|
2008-03-16 10:58:09 +00:00
|
|
|
SYSINIT(pagedaemon, SI_SUB_KTHREAD_PAGE, SI_ORDER_FIRST, kproc_start,
|
|
|
|
&page_kp);
|
1995-08-28 09:19:25 +00:00
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
1995-08-28 09:19:25 +00:00
|
|
|
/* the kernel process "vm_daemon"*/
|
2002-03-19 22:20:14 +00:00
|
|
|
static void vm_daemon(void);
|
1995-12-14 09:55:16 +00:00
|
|
|
static struct proc *vmproc;
|
1995-08-28 09:19:25 +00:00
|
|
|
|
|
|
|
static struct kproc_desc vm_kp = {
|
|
|
|
"vmdaemon",
|
|
|
|
vm_daemon,
|
|
|
|
&vmproc
|
|
|
|
};
|
2008-03-16 10:58:09 +00:00
|
|
|
SYSINIT(vmdaemon, SI_SUB_KTHREAD_VM, SI_ORDER_FIRST, kproc_start, &vm_kp);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#endif
|
1995-08-28 09:19:25 +00:00
|
|
|
|
|
|
|
|
2003-02-01 21:55:30 +00:00
|
|
|
int vm_pages_needed; /* Event on which pageout daemon sleeps */
|
|
|
|
int vm_pageout_deficit; /* Estimated number of pages deficit */
|
|
|
|
int vm_pageout_pages_needed; /* flag saying that the pageout daemon needs pages */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
1995-12-14 09:55:16 +00:00
|
|
|
static int vm_pageout_req_swapout; /* XXX */
|
|
|
|
static int vm_daemon_needed;
|
2007-06-26 18:24:05 +00:00
|
|
|
static struct mtx vm_daemon_mtx;
|
|
|
|
/* Allow for use by vm_pageout before vm_daemon is initialized. */
|
|
|
|
MTX_SYSINIT(vm_daemon, &vm_daemon_mtx, "vm daemon", MTX_DEF);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#endif
|
2000-12-26 19:41:38 +00:00
|
|
|
static int vm_max_launder = 32;
|
1998-02-09 06:11:36 +00:00
|
|
|
static int vm_pageout_stats_max=0, vm_pageout_stats_interval = 0;
|
|
|
|
static int vm_pageout_full_stats_interval = 0;
|
2004-07-12 17:45:37 +00:00
|
|
|
static int vm_pageout_algorithm=0;
|
1998-02-09 06:11:36 +00:00
|
|
|
static int defer_swap_pageouts=0;
|
|
|
|
static int disable_swap_pageouts=0;
|
1997-12-05 05:41:06 +00:00
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if defined(NO_SWAPPING)
|
1998-02-09 06:11:36 +00:00
|
|
|
static int vm_swap_enabled=0;
|
|
|
|
static int vm_swap_idle_enabled=0;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#else
|
1998-02-09 06:11:36 +00:00
|
|
|
static int vm_swap_enabled=1;
|
|
|
|
static int vm_swap_idle_enabled=0;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
SYSCTL_INT(_vm, VM_PAGEOUT_ALGORITHM, pageout_algorithm,
|
2000-12-26 19:41:38 +00:00
|
|
|
CTLFLAG_RW, &vm_pageout_algorithm, 0, "LRU page mgmt");
|
|
|
|
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, max_launder,
|
|
|
|
CTLFLAG_RW, &vm_max_launder, 0, "Limit dirty flushes in pageout");
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
|
1997-07-27 04:49:19 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, pageout_stats_max,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &vm_pageout_stats_max, 0, "Max pageout stats scan length");
|
1997-07-27 04:49:19 +00:00
|
|
|
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, pageout_full_stats_interval,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &vm_pageout_full_stats_interval, 0, "Interval for full stats scan");
|
1997-07-27 04:49:19 +00:00
|
|
|
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, pageout_stats_interval,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &vm_pageout_stats_interval, 0, "Interval for partial stats scan");
|
1997-07-27 04:49:19 +00:00
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if defined(NO_SWAPPING)
|
1997-12-06 02:23:36 +00:00
|
|
|
SYSCTL_INT(_vm, VM_SWAPPING_ENABLED, swap_enabled,
|
2008-08-03 14:26:15 +00:00
|
|
|
CTLFLAG_RD, &vm_swap_enabled, 0, "Enable entire process swapout");
|
1997-12-06 02:23:36 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, swap_idle_enabled,
|
2008-08-03 14:26:15 +00:00
|
|
|
CTLFLAG_RD, &vm_swap_idle_enabled, 0, "Allow swapout on idle criteria");
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#else
|
1997-12-06 02:23:36 +00:00
|
|
|
SYSCTL_INT(_vm, VM_SWAPPING_ENABLED, swap_enabled,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &vm_swap_enabled, 0, "Enable entire process swapout");
|
1997-12-06 02:23:36 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, swap_idle_enabled,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &vm_swap_idle_enabled, 0, "Allow swapout on idle criteria");
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-12-06 02:23:36 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, defer_swapspace_pageouts,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &defer_swap_pageouts, 0, "Give preference to dirty pages in mem");
|
1997-12-04 19:00:56 +00:00
|
|
|
|
1997-12-06 02:23:36 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, disable_swapspace_pageouts,
|
1998-10-31 17:21:31 +00:00
|
|
|
CTLFLAG_RW, &disable_swap_pageouts, 0, "Disallow swapout of dirty pages");
|
1997-12-04 19:00:56 +00:00
|
|
|
|
2001-12-20 22:42:27 +00:00
|
|
|
static int pageout_lock_miss;
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, pageout_lock_miss,
|
|
|
|
CTLFLAG_RD, &pageout_lock_miss, 0, "vget() lock misses during pageout");
|
|
|
|
|
1998-03-01 04:18:54 +00:00
|
|
|
#define VM_PAGEOUT_PAGE_COUNT 16
|
1994-08-04 03:06:48 +00:00
|
|
|
int vm_pageout_page_count = VM_PAGEOUT_PAGE_COUNT;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1995-04-16 12:56:22 +00:00
|
|
|
int vm_page_max_wired; /* XXX max # of wired pages system-wide */
|
2007-11-23 00:30:19 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, max_wired,
|
|
|
|
CTLFLAG_RW, &vm_page_max_wired, 0, "System-wide limit to wired page count");
|
1994-05-24 10:09:53 +00:00
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
2003-07-07 07:16:29 +00:00
|
|
|
static void vm_pageout_map_deactivate_pages(vm_map_t, long);
|
|
|
|
static void vm_pageout_object_deactivate_pages(pmap_t, vm_object_t, long);
|
2007-06-26 18:24:05 +00:00
|
|
|
static void vm_req_vmdaemon(int req);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#endif
|
1997-07-27 04:49:19 +00:00
|
|
|
static void vm_pageout_page_stats(void);
|
1995-10-07 19:02:56 +00:00
|
|
|
|
2005-08-10 00:17:36 +00:00
|
|
|
/*
|
|
|
|
* vm_pageout_fallback_object_lock:
|
|
|
|
*
|
|
|
|
* Lock vm object currently associated with `m'. VM_OBJECT_TRYLOCK is
|
|
|
|
* known to have failed and page queue must be either PQ_ACTIVE or
|
|
|
|
* PQ_INACTIVE. To avoid lock order violation, unlock the page queues
|
|
|
|
* while locking the vm object. Use marker page to detect page queue
|
|
|
|
* changes and maintain notion of next page on page queue. Return
|
|
|
|
* TRUE if no changes were detected, FALSE otherwise. vm object is
|
|
|
|
* locked on return.
|
|
|
|
*
|
|
|
|
* This function depends on both the lock portion of struct vm_object
|
|
|
|
* and normal struct vm_page being type stable.
|
|
|
|
*/
|
2007-11-25 20:37:29 +00:00
|
|
|
boolean_t
|
2005-08-10 00:17:36 +00:00
|
|
|
vm_pageout_fallback_object_lock(vm_page_t m, vm_page_t *next)
|
|
|
|
{
|
|
|
|
struct vm_page marker;
|
|
|
|
boolean_t unchanged;
|
|
|
|
u_short queue;
|
|
|
|
vm_object_t object;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize our marker
|
|
|
|
*/
|
|
|
|
bzero(&marker, sizeof(marker));
|
2006-10-22 04:28:14 +00:00
|
|
|
marker.flags = PG_FICTITIOUS | PG_MARKER;
|
|
|
|
marker.oflags = VPO_BUSY;
|
2005-08-10 00:17:36 +00:00
|
|
|
marker.queue = m->queue;
|
|
|
|
marker.wire_count = 1;
|
|
|
|
|
|
|
|
queue = m->queue;
|
|
|
|
object = m->object;
|
|
|
|
|
|
|
|
TAILQ_INSERT_AFTER(&vm_page_queues[queue].pl,
|
|
|
|
m, &marker, pageq);
|
|
|
|
vm_page_unlock_queues();
|
|
|
|
VM_OBJECT_LOCK(object);
|
|
|
|
vm_page_lock_queues();
|
|
|
|
|
|
|
|
/* Page queue might have changed. */
|
|
|
|
*next = TAILQ_NEXT(&marker, pageq);
|
|
|
|
unchanged = (m->queue == queue &&
|
|
|
|
m->object == object &&
|
|
|
|
&marker == TAILQ_NEXT(m, pageq));
|
|
|
|
TAILQ_REMOVE(&vm_page_queues[queue].pl,
|
|
|
|
&marker, pageq);
|
|
|
|
return (unchanged);
|
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1994-05-25 09:21:21 +00:00
|
|
|
* vm_pageout_clean:
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
*
|
|
|
|
* Clean the page and remove it from the laundry.
|
|
|
|
*
|
|
|
|
* We set the busy bit to cause potential page faults on this page to
|
1999-01-21 08:29:12 +00:00
|
|
|
* block. Note the careful timing, however, the busy bit isn't set till
|
|
|
|
* late and we cannot do anything that will mess with the page.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1995-11-20 12:20:02 +00:00
|
|
|
static int
|
This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
|
|
|
vm_pageout_clean(m)
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
vm_page_t m;
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2001-07-04 19:00:13 +00:00
|
|
|
vm_object_t object;
|
1996-05-31 00:38:04 +00:00
|
|
|
vm_page_t mc[2*vm_pageout_page_count];
|
2003-08-31 00:00:46 +00:00
|
|
|
int pageout_count;
|
1999-09-17 04:56:40 +00:00
|
|
|
int ib, is, page_base;
|
1995-12-11 04:58:34 +00:00
|
|
|
vm_pindex_t pindex = m->pindex;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2002-07-27 23:20:32 +00:00
|
|
|
mtx_assert(&vm_page_queue_mtx, MA_OWNED);
|
2003-08-31 00:00:46 +00:00
|
|
|
VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
|
2001-07-04 16:20:28 +00:00
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
|
|
|
* It doesn't cost us anything to pageout OBJT_DEFAULT or OBJT_SWAP
|
|
|
|
* with the new swapper, but we could have serious problems paging
|
|
|
|
* out other object types if there is insufficient memory.
|
|
|
|
*
|
|
|
|
* Unfortunately, checking free memory here is far too late, so the
|
|
|
|
* check has been moved up a procedural level.
|
|
|
|
*/
|
|
|
|
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
/*
|
2007-06-18 02:04:38 +00:00
|
|
|
* Can't clean the page if it's busy or held.
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
*/
|
This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
|
|
|
if ((m->hold_count != 0) ||
|
2007-06-18 02:04:38 +00:00
|
|
|
((m->busy != 0) || (m->oflags & VPO_BUSY))) {
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
return 0;
|
2000-05-29 22:40:54 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1996-05-31 00:38:04 +00:00
|
|
|
mc[vm_pageout_page_count] = m;
|
1994-05-25 09:21:21 +00:00
|
|
|
pageout_count = 1;
|
1996-05-31 00:38:04 +00:00
|
|
|
page_base = vm_pageout_page_count;
|
1999-09-17 04:56:40 +00:00
|
|
|
ib = 1;
|
|
|
|
is = 1;
|
|
|
|
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
/*
|
|
|
|
* Scan object for clusterable pages.
|
|
|
|
*
|
|
|
|
* We can cluster ONLY if: ->> the page is NOT
|
|
|
|
* clean, wired, busy, held, or mapped into a
|
|
|
|
* buffer, and one of the following:
|
|
|
|
* 1) The page is inactive, or a seldom used
|
|
|
|
* active page.
|
|
|
|
* -or-
|
|
|
|
* 2) we force the issue.
|
1999-09-17 04:56:40 +00:00
|
|
|
*
|
|
|
|
* During heavy mmap/modification loads the pageout
|
|
|
|
* daemon can really fragment the underlying file
|
|
|
|
* due to flushing pages out of order and not trying
|
|
|
|
* align the clusters (which leave sporatic out-of-order
|
|
|
|
* holes). To solve this problem we do the reverse scan
|
|
|
|
* first and attempt to align our cluster, then do a
|
|
|
|
* forward scan if room remains.
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
*/
|
2003-06-28 20:07:54 +00:00
|
|
|
object = m->object;
|
1999-09-17 04:56:40 +00:00
|
|
|
more:
|
|
|
|
while (ib && pageout_count < vm_pageout_page_count) {
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
vm_page_t p;
|
1995-04-09 06:03:56 +00:00
|
|
|
|
1999-09-17 04:56:40 +00:00
|
|
|
if (ib > pindex) {
|
|
|
|
ib = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((p = vm_page_lookup(object, pindex - ib)) == NULL) {
|
|
|
|
ib = 0;
|
|
|
|
break;
|
|
|
|
}
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
if ((p->oflags & VPO_BUSY) || p->busy) {
|
1999-09-17 04:56:40 +00:00
|
|
|
ib = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
vm_page_test_dirty(p);
|
2009-06-24 04:45:03 +00:00
|
|
|
if (p->dirty == 0 ||
|
1999-09-17 04:56:40 +00:00
|
|
|
p->queue != PQ_INACTIVE ||
|
2001-10-21 06:12:06 +00:00
|
|
|
p->wire_count != 0 || /* may be held by buf cache */
|
|
|
|
p->hold_count != 0) { /* may be undergoing I/O */
|
1999-09-17 04:56:40 +00:00
|
|
|
ib = 0;
|
|
|
|
break;
|
1995-04-09 06:03:56 +00:00
|
|
|
}
|
1999-09-17 04:56:40 +00:00
|
|
|
mc[--page_base] = p;
|
|
|
|
++pageout_count;
|
|
|
|
++ib;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
1999-09-17 04:56:40 +00:00
|
|
|
* alignment boundry, stop here and switch directions. Do
|
|
|
|
* not clear ib.
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
1999-09-17 04:56:40 +00:00
|
|
|
if ((pindex - (ib - 1)) % vm_pageout_page_count == 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
while (pageout_count < vm_pageout_page_count &&
|
|
|
|
pindex + is < object->size) {
|
|
|
|
vm_page_t p;
|
|
|
|
|
|
|
|
if ((p = vm_page_lookup(object, pindex + is)) == NULL)
|
|
|
|
break;
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
if ((p->oflags & VPO_BUSY) || p->busy) {
|
1999-09-17 04:56:40 +00:00
|
|
|
break;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1999-09-17 04:56:40 +00:00
|
|
|
vm_page_test_dirty(p);
|
2009-06-24 04:45:03 +00:00
|
|
|
if (p->dirty == 0 ||
|
1999-09-17 04:56:40 +00:00
|
|
|
p->queue != PQ_INACTIVE ||
|
2001-10-21 06:12:06 +00:00
|
|
|
p->wire_count != 0 || /* may be held by buf cache */
|
|
|
|
p->hold_count != 0) { /* may be undergoing I/O */
|
1999-09-17 04:56:40 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
mc[page_base + pageout_count] = p;
|
|
|
|
++pageout_count;
|
|
|
|
++is;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1999-09-17 04:56:40 +00:00
|
|
|
/*
|
|
|
|
* If we exhausted our forward scan, continue with the reverse scan
|
|
|
|
* when possible, even past a page boundry. This catches boundry
|
|
|
|
* conditions.
|
|
|
|
*/
|
|
|
|
if (ib && pageout_count < vm_pageout_page_count)
|
|
|
|
goto more;
|
|
|
|
|
1996-07-30 03:08:57 +00:00
|
|
|
/*
|
|
|
|
* we allow reads during pageouts...
|
|
|
|
*/
|
2003-10-18 21:09:21 +00:00
|
|
|
return (vm_pageout_flush(&mc[page_base], pageout_count, 0));
|
1995-11-05 20:46:03 +00:00
|
|
|
}
|
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
|
|
|
* vm_pageout_flush() - launder the given pages
|
|
|
|
*
|
|
|
|
* The given pages are laundered. Note that we setup for the start of
|
|
|
|
* I/O ( i.e. busy the page ), mark it read-only, and bump the object
|
|
|
|
* reference count all in here rather then in the parent. If we want
|
|
|
|
* the parent to do more sophisticated things we may have to change
|
|
|
|
* the ordering.
|
|
|
|
*/
|
1995-11-05 20:46:03 +00:00
|
|
|
int
|
2003-10-18 21:09:21 +00:00
|
|
|
vm_pageout_flush(vm_page_t *mc, int count, int flags)
|
1995-11-05 20:46:03 +00:00
|
|
|
{
|
2003-10-24 06:43:04 +00:00
|
|
|
vm_object_t object = mc[0]->object;
|
1995-11-05 20:46:03 +00:00
|
|
|
int pageout_status[count];
|
1998-02-05 03:32:49 +00:00
|
|
|
int numpagedout = 0;
|
1995-11-05 20:46:03 +00:00
|
|
|
int i;
|
|
|
|
|
2002-07-27 23:20:32 +00:00
|
|
|
mtx_assert(&vm_page_queue_mtx, MA_OWNED);
|
2003-10-24 06:43:04 +00:00
|
|
|
VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
|
|
|
* Initiate I/O. Bump the vm_page_t->busy counter and
|
|
|
|
* mark the pages read-only.
|
|
|
|
*
|
|
|
|
* We do not have to fixup the clean/dirty bits here... we can
|
|
|
|
* allow the pager to do it after the I/O completes.
|
2000-12-11 07:52:47 +00:00
|
|
|
*
|
|
|
|
* NOTE! mc[i]->dirty may be partial or fragmented due to an
|
|
|
|
* edge case with file fragments.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
|
|
|
for (i = 0; i < count; i++) {
|
2003-10-18 21:09:21 +00:00
|
|
|
KASSERT(mc[i]->valid == VM_PAGE_BITS_ALL,
|
|
|
|
("vm_pageout_flush: partially invalid page %p index %d/%d",
|
|
|
|
mc[i], i, count));
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_io_start(mc[i]);
|
2006-08-01 19:06:06 +00:00
|
|
|
pmap_remove_write(mc[i]);
|
This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
|
|
|
}
|
2002-07-27 23:20:32 +00:00
|
|
|
vm_page_unlock_queues();
|
1998-08-06 08:33:19 +00:00
|
|
|
vm_object_pip_add(object, count);
|
1995-11-05 20:46:03 +00:00
|
|
|
|
2007-06-13 06:10:10 +00:00
|
|
|
vm_pager_put_pages(object, mc, count, flags, pageout_status);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2002-07-20 20:58:46 +00:00
|
|
|
vm_page_lock_queues();
|
1995-11-05 20:46:03 +00:00
|
|
|
for (i = 0; i < count; i++) {
|
|
|
|
vm_page_t mt = mc[i];
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
|
2007-09-15 18:30:28 +00:00
|
|
|
KASSERT(pageout_status[i] == VM_PAGER_PEND ||
|
|
|
|
(mt->flags & PG_WRITEABLE) == 0,
|
2004-02-21 23:32:00 +00:00
|
|
|
("vm_pageout_flush: page %p is not write protected", mt));
|
1994-05-25 09:21:21 +00:00
|
|
|
switch (pageout_status[i]) {
|
|
|
|
case VM_PAGER_OK:
|
|
|
|
case VM_PAGER_PEND:
|
1998-02-05 03:32:49 +00:00
|
|
|
numpagedout++;
|
1994-05-25 09:21:21 +00:00
|
|
|
break;
|
|
|
|
case VM_PAGER_BAD:
|
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Page outside of range of object. Right now we
|
|
|
|
* essentially lose the changes by pretending it
|
|
|
|
* worked.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
1999-09-17 04:56:40 +00:00
|
|
|
vm_page_undirty(mt);
|
1994-05-25 09:21:21 +00:00
|
|
|
break;
|
|
|
|
case VM_PAGER_ERROR:
|
|
|
|
case VM_PAGER_FAIL:
|
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* If page couldn't be paged out, then reactivate the
|
|
|
|
* page so it doesn't clog the inactive list. (We
|
|
|
|
* will try paging out it again later).
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
1998-01-31 11:56:53 +00:00
|
|
|
vm_page_activate(mt);
|
1994-05-25 09:21:21 +00:00
|
|
|
break;
|
|
|
|
case VM_PAGER_AGAIN:
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* If the operation is still going, leave the page busy to
|
|
|
|
* block all other accesses. Also, leave the paging in
|
|
|
|
* progress indicator set so that we don't attempt an object
|
|
|
|
* collapse.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1994-05-25 09:21:21 +00:00
|
|
|
if (pageout_status[i] != VM_PAGER_PEND) {
|
1995-03-01 23:30:04 +00:00
|
|
|
vm_object_pip_wakeup(object);
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_io_finish(mt);
|
2004-02-21 23:32:00 +00:00
|
|
|
if (vm_page_count_severe())
|
|
|
|
vm_page_try_to_cache(mt);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1998-02-05 03:32:49 +00:00
|
|
|
return numpagedout;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
|
|
|
* vm_pageout_object_deactivate_pages
|
|
|
|
*
|
|
|
|
* deactivate enough pages to satisfy the inactive target
|
|
|
|
* requirements or if vm_page_proc_limit is set, then
|
|
|
|
* deactivate all of the pages in the object and its
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
* backing_objects.
|
1994-05-25 09:21:21 +00:00
|
|
|
*
|
|
|
|
* The object and map must be locked.
|
|
|
|
*/
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
static void
|
2003-07-07 07:16:29 +00:00
|
|
|
vm_pageout_object_deactivate_pages(pmap, first_object, desired)
|
|
|
|
pmap_t pmap;
|
|
|
|
vm_object_t first_object;
|
|
|
|
long desired;
|
1994-05-25 09:21:21 +00:00
|
|
|
{
|
2003-07-07 07:16:29 +00:00
|
|
|
vm_object_t backing_object, object;
|
2001-07-04 19:00:13 +00:00
|
|
|
vm_page_t p, next;
|
2002-07-27 06:41:03 +00:00
|
|
|
int actcount, rcount, remove_mode;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2003-07-07 07:16:29 +00:00
|
|
|
VM_OBJECT_LOCK_ASSERT(first_object, MA_OWNED);
|
2009-07-24 13:50:29 +00:00
|
|
|
if (first_object->type == OBJT_DEVICE ||
|
|
|
|
first_object->type == OBJT_SG ||
|
|
|
|
first_object->type == OBJT_PHYS)
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
return;
|
2003-07-07 07:16:29 +00:00
|
|
|
for (object = first_object;; object = backing_object) {
|
|
|
|
if (pmap_resident_count(pmap) <= desired)
|
|
|
|
goto unlock_return;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if (object->paging_in_progress)
|
2003-07-07 07:16:29 +00:00
|
|
|
goto unlock_return;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2003-04-30 03:08:16 +00:00
|
|
|
remove_mode = 0;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if (object->shadow_count > 1)
|
|
|
|
remove_mode = 1;
|
2002-03-10 21:52:48 +00:00
|
|
|
/*
|
|
|
|
* scan the objects entire memory queue
|
|
|
|
*/
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
rcount = object->resident_page_count;
|
|
|
|
p = TAILQ_FIRST(&object->memq);
|
2002-07-27 06:41:03 +00:00
|
|
|
vm_page_lock_queues();
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
while (p && (rcount-- > 0)) {
|
2003-07-07 07:16:29 +00:00
|
|
|
if (pmap_resident_count(pmap) <= desired) {
|
2002-07-27 06:41:03 +00:00
|
|
|
vm_page_unlock_queues();
|
2003-07-07 07:16:29 +00:00
|
|
|
goto unlock_return;
|
2002-07-27 06:41:03 +00:00
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
next = TAILQ_NEXT(p, listq);
|
2007-06-10 21:59:14 +00:00
|
|
|
cnt.v_pdpages++;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if (p->wire_count != 0 ||
|
|
|
|
p->hold_count != 0 ||
|
|
|
|
p->busy != 0 ||
|
2006-10-22 04:28:14 +00:00
|
|
|
(p->oflags & VPO_BUSY) ||
|
|
|
|
(p->flags & PG_UNMANAGED) ||
|
2003-07-07 07:16:29 +00:00
|
|
|
!pmap_page_exists_quick(pmap, p)) {
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
p = next;
|
|
|
|
continue;
|
1996-06-17 03:35:40 +00:00
|
|
|
}
|
2000-05-21 12:50:18 +00:00
|
|
|
actcount = pmap_ts_referenced(p);
|
1997-10-06 02:48:16 +00:00
|
|
|
if (actcount) {
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_set(p, PG_REFERENCED);
|
1996-07-08 02:25:53 +00:00
|
|
|
} else if (p->flags & PG_REFERENCED) {
|
1997-10-06 02:48:16 +00:00
|
|
|
actcount = 1;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if ((p->queue != PQ_ACTIVE) &&
|
|
|
|
(p->flags & PG_REFERENCED)) {
|
|
|
|
vm_page_activate(p);
|
1997-10-06 02:48:16 +00:00
|
|
|
p->act_count += actcount;
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_clear(p, PG_REFERENCED);
|
1996-07-08 02:25:53 +00:00
|
|
|
} else if (p->queue == PQ_ACTIVE) {
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if ((p->flags & PG_REFERENCED) == 0) {
|
1996-07-08 02:25:53 +00:00
|
|
|
p->act_count -= min(p->act_count, ACT_DECLINE);
|
2000-12-26 19:41:38 +00:00
|
|
|
if (!remove_mode && (vm_pageout_algorithm || (p->act_count == 0))) {
|
2002-11-16 07:44:25 +00:00
|
|
|
pmap_remove_all(p);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
vm_page_deactivate(p);
|
1996-07-08 02:25:53 +00:00
|
|
|
} else {
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(p);
|
1996-07-08 02:25:53 +00:00
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
} else {
|
1998-01-31 11:56:53 +00:00
|
|
|
vm_page_activate(p);
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_clear(p, PG_REFERENCED);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if (p->act_count < (ACT_MAX - ACT_ADVANCE))
|
|
|
|
p->act_count += ACT_ADVANCE;
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(p);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
}
|
|
|
|
} else if (p->queue == PQ_INACTIVE) {
|
2002-11-16 07:44:25 +00:00
|
|
|
pmap_remove_all(p);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
}
|
|
|
|
p = next;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2002-07-27 06:41:03 +00:00
|
|
|
vm_page_unlock_queues();
|
2003-07-07 07:16:29 +00:00
|
|
|
if ((backing_object = object->backing_object) == NULL)
|
|
|
|
goto unlock_return;
|
|
|
|
VM_OBJECT_LOCK(backing_object);
|
|
|
|
if (object != first_object)
|
|
|
|
VM_OBJECT_UNLOCK(object);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
2003-07-07 07:16:29 +00:00
|
|
|
unlock_return:
|
|
|
|
if (object != first_object)
|
|
|
|
VM_OBJECT_UNLOCK(object);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* deactivate some number of pages in a map, try to do it fairly, but
|
|
|
|
* that is really hard to do.
|
|
|
|
*/
|
1995-10-07 19:02:56 +00:00
|
|
|
static void
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
vm_pageout_map_deactivate_pages(map, desired)
|
1994-05-25 09:21:21 +00:00
|
|
|
vm_map_t map;
|
2003-07-07 07:16:29 +00:00
|
|
|
long desired;
|
1994-05-25 09:21:21 +00:00
|
|
|
{
|
|
|
|
vm_map_entry_t tmpe;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
vm_object_t obj, bigobj;
|
2001-10-14 20:51:14 +00:00
|
|
|
int nothingwired;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
2002-04-28 06:07:54 +00:00
|
|
|
if (!vm_map_trylock(map))
|
1994-05-25 09:21:21 +00:00
|
|
|
return;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
|
|
|
|
bigobj = NULL;
|
2001-10-14 20:51:14 +00:00
|
|
|
nothingwired = TRUE;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* first, search out the biggest object, and try to free pages from
|
|
|
|
* that.
|
|
|
|
*/
|
|
|
|
tmpe = map->header.next;
|
|
|
|
while (tmpe != &map->header) {
|
1999-02-07 21:48:23 +00:00
|
|
|
if ((tmpe->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) {
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
obj = tmpe->object.vm_object;
|
2003-06-29 19:51:24 +00:00
|
|
|
if (obj != NULL && VM_OBJECT_TRYLOCK(obj)) {
|
|
|
|
if (obj->shadow_count <= 1 &&
|
|
|
|
(bigobj == NULL ||
|
|
|
|
bigobj->resident_page_count < obj->resident_page_count)) {
|
|
|
|
if (bigobj != NULL)
|
|
|
|
VM_OBJECT_UNLOCK(bigobj);
|
|
|
|
bigobj = obj;
|
|
|
|
} else
|
|
|
|
VM_OBJECT_UNLOCK(obj);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
}
|
|
|
|
}
|
2001-10-14 20:51:14 +00:00
|
|
|
if (tmpe->wired_count > 0)
|
|
|
|
nothingwired = FALSE;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
tmpe = tmpe->next;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
|
2003-06-29 19:51:24 +00:00
|
|
|
if (bigobj != NULL) {
|
2003-07-07 07:16:29 +00:00
|
|
|
vm_pageout_object_deactivate_pages(map->pmap, bigobj, desired);
|
2003-06-29 19:51:24 +00:00
|
|
|
VM_OBJECT_UNLOCK(bigobj);
|
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
/*
|
|
|
|
* Next, hunt around for other pages to deactivate. We actually
|
|
|
|
* do this search sort of wrong -- .text first is not the best idea.
|
|
|
|
*/
|
|
|
|
tmpe = map->header.next;
|
|
|
|
while (tmpe != &map->header) {
|
1999-02-19 14:25:37 +00:00
|
|
|
if (pmap_resident_count(vm_map_pmap(map)) <= desired)
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
break;
|
1999-02-07 21:48:23 +00:00
|
|
|
if ((tmpe->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) {
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
obj = tmpe->object.vm_object;
|
2003-06-29 19:51:24 +00:00
|
|
|
if (obj != NULL) {
|
|
|
|
VM_OBJECT_LOCK(obj);
|
2003-07-07 07:16:29 +00:00
|
|
|
vm_pageout_object_deactivate_pages(map->pmap, obj, desired);
|
2003-06-29 19:51:24 +00:00
|
|
|
VM_OBJECT_UNLOCK(obj);
|
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
}
|
|
|
|
tmpe = tmpe->next;
|
2002-12-01 05:40:18 +00:00
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove all mappings if a process is swapped out, this will free page
|
|
|
|
* table pages.
|
|
|
|
*/
|
2002-12-01 05:40:18 +00:00
|
|
|
if (desired == 0 && nothingwired) {
|
2002-09-21 22:07:17 +00:00
|
|
|
pmap_remove(vm_map_pmap(map), vm_map_min(map),
|
|
|
|
vm_map_max(map));
|
2002-12-01 05:40:18 +00:00
|
|
|
}
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
vm_map_unlock(map);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
2002-03-10 21:52:48 +00:00
|
|
|
#endif /* !defined(NO_SWAPPING) */
|
1994-05-25 09:21:21 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* vm_pageout_scan does the dirty work for the pageout daemon.
|
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
static void
|
|
|
|
vm_pageout_scan(int pass)
|
1994-05-25 09:21:21 +00:00
|
|
|
{
|
1996-07-08 03:22:55 +00:00
|
|
|
vm_page_t m, next;
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
struct vm_page marker;
|
1999-01-21 08:29:12 +00:00
|
|
|
int page_shortage, maxscan, pcount;
|
|
|
|
int addl_page_shortage, addl_page_shortage_init;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
vm_object_t object;
|
Enable the new physical memory allocator.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-16 04:57:06 +00:00
|
|
|
int actcount;
|
1995-04-09 06:03:56 +00:00
|
|
|
int vnodes_skipped = 0;
|
2000-12-26 19:41:38 +00:00
|
|
|
int maxlaunder;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
2002-11-21 09:17:56 +00:00
|
|
|
/*
|
|
|
|
* Decrease registered cache sizes.
|
|
|
|
*/
|
|
|
|
EVENTHANDLER_INVOKE(vm_lowmem, 0);
|
|
|
|
/*
|
|
|
|
* We do this explicitly after the caches have been drained above.
|
|
|
|
*/
|
|
|
|
uma_reclaim();
|
1997-10-25 02:41:56 +00:00
|
|
|
|
2003-01-14 06:57:03 +00:00
|
|
|
addl_page_shortage_init = atomic_readandclear_int(&vm_pageout_deficit);
|
1996-05-29 05:15:33 +00:00
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
|
|
|
* Calculate the number of pages we want to either free or move
|
2000-12-26 19:41:38 +00:00
|
|
|
* to the cache.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
page_shortage = vm_paging_target() + addl_page_shortage_init;
|
1996-05-29 05:15:33 +00:00
|
|
|
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
/*
|
|
|
|
* Initialize our marker
|
|
|
|
*/
|
|
|
|
bzero(&marker, sizeof(marker));
|
2006-10-22 04:28:14 +00:00
|
|
|
marker.flags = PG_FICTITIOUS | PG_MARKER;
|
|
|
|
marker.oflags = VPO_BUSY;
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
marker.queue = PQ_INACTIVE;
|
|
|
|
marker.wire_count = 1;
|
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
|
|
|
* Start scanning the inactive queue for pages we can move to the
|
|
|
|
* cache or free. The scan will stop when the target is reached or
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* we have scanned the entire inactive queue. Note that m->act_count
|
|
|
|
* is not used to form decisions for the inactive queue, only for the
|
|
|
|
* active queue.
|
2000-12-26 19:41:38 +00:00
|
|
|
*
|
|
|
|
* maxlaunder limits the number of dirty pages we flush per scan.
|
|
|
|
* For most systems a smaller value (16 or 32) is more robust under
|
|
|
|
* extreme memory and disk pressure because any unnecessary writes
|
|
|
|
* to disk can result in extreme performance degredation. However,
|
|
|
|
* systems with excessive dirty pages (especially when MAP_NOSYNC is
|
|
|
|
* used) will die horribly with limited laundering. If the pageout
|
|
|
|
* daemon cannot clean enough pages in the first pass, we let it go
|
|
|
|
* all out in succeeding passes.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
if ((maxlaunder = vm_max_launder) <= 1)
|
|
|
|
maxlaunder = 1;
|
|
|
|
if (pass)
|
|
|
|
maxlaunder = 10000;
|
2003-08-15 05:13:36 +00:00
|
|
|
vm_page_lock_queues();
|
1999-01-21 08:29:12 +00:00
|
|
|
rescan0:
|
|
|
|
addl_page_shortage = addl_page_shortage_init;
|
2007-05-31 22:52:15 +00:00
|
|
|
maxscan = cnt.v_inactive_count;
|
2001-07-04 23:27:09 +00:00
|
|
|
|
1999-10-30 07:37:14 +00:00
|
|
|
for (m = TAILQ_FIRST(&vm_page_queues[PQ_INACTIVE].pl);
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
|
|
|
m != NULL && maxscan-- > 0 && page_shortage > 0;
|
|
|
|
m = next) {
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-06-10 21:59:14 +00:00
|
|
|
cnt.v_pdpages++;
|
1996-05-29 05:15:33 +00:00
|
|
|
|
2005-12-31 14:39:20 +00:00
|
|
|
if (VM_PAGE_GETQUEUE(m) != PQ_INACTIVE) {
|
1996-07-30 03:08:57 +00:00
|
|
|
goto rescan0;
|
1996-05-31 00:38:04 +00:00
|
|
|
}
|
1996-05-29 05:15:33 +00:00
|
|
|
|
1996-05-18 03:38:05 +00:00
|
|
|
next = TAILQ_NEXT(m, pageq);
|
2004-11-05 06:24:05 +00:00
|
|
|
object = m->object;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
/*
|
|
|
|
* skip marker pages
|
|
|
|
*/
|
|
|
|
if (m->flags & PG_MARKER)
|
|
|
|
continue;
|
|
|
|
|
2001-10-21 06:12:06 +00:00
|
|
|
/*
|
|
|
|
* A held page may be undergoing I/O, so skip it.
|
|
|
|
*/
|
1996-05-29 05:15:33 +00:00
|
|
|
if (m->hold_count) {
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1996-05-29 05:15:33 +00:00
|
|
|
addl_page_shortage++;
|
|
|
|
continue;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2002-03-10 21:52:48 +00:00
|
|
|
* Don't mess with busy pages, keep in the front of the
|
1996-05-18 03:38:05 +00:00
|
|
|
* queue, most likely are being paged out.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2005-08-10 00:17:36 +00:00
|
|
|
if (!VM_OBJECT_TRYLOCK(object) &&
|
|
|
|
(!vm_pageout_fallback_object_lock(m, &next) ||
|
|
|
|
m->hold_count != 0)) {
|
|
|
|
VM_OBJECT_UNLOCK(object);
|
2004-11-05 06:24:05 +00:00
|
|
|
addl_page_shortage++;
|
|
|
|
continue;
|
|
|
|
}
|
2006-10-22 04:28:14 +00:00
|
|
|
if (m->busy || (m->oflags & VPO_BUSY)) {
|
2004-11-05 06:24:05 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
1996-05-29 05:15:33 +00:00
|
|
|
addl_page_shortage++;
|
1994-05-25 09:21:21 +00:00
|
|
|
continue;
|
|
|
|
}
|
1996-01-19 04:00:31 +00:00
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
1999-01-21 08:29:12 +00:00
|
|
|
* If the object is not being used, we ignore previous
|
|
|
|
* references.
|
1997-10-06 02:48:16 +00:00
|
|
|
*/
|
2004-11-05 06:24:05 +00:00
|
|
|
if (object->ref_count == 0) {
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_clear(m, PG_REFERENCED);
|
2009-05-17 20:40:41 +00:00
|
|
|
KASSERT(!pmap_page_is_mapped(m),
|
|
|
|
("vm_pageout_scan: page %p is mapped", m));
|
1997-10-06 02:48:16 +00:00
|
|
|
|
|
|
|
/*
|
1999-01-21 08:29:12 +00:00
|
|
|
* Otherwise, if the page has been referenced while in the
|
|
|
|
* inactive queue, we bump the "activation count" upwards,
|
|
|
|
* making it less likely that the page will be added back to
|
|
|
|
* the inactive queue prematurely again. Here we check the
|
|
|
|
* page tables (or emulated bits, if any), given the upper
|
|
|
|
* level VM system not knowing anything about existing
|
|
|
|
* references.
|
1997-10-06 02:48:16 +00:00
|
|
|
*/
|
1996-07-30 03:08:57 +00:00
|
|
|
} else if (((m->flags & PG_REFERENCED) == 0) &&
|
2000-05-21 12:50:18 +00:00
|
|
|
(actcount = pmap_ts_referenced(m))) {
|
1996-07-30 03:08:57 +00:00
|
|
|
vm_page_activate(m);
|
2004-11-05 06:24:05 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
1997-10-06 02:48:16 +00:00
|
|
|
m->act_count += (actcount + ACT_ADVANCE);
|
1996-07-30 03:08:57 +00:00
|
|
|
continue;
|
|
|
|
}
|
1994-11-06 05:07:53 +00:00
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
1999-01-21 08:29:12 +00:00
|
|
|
* If the upper level VM system knows about any page
|
|
|
|
* references, we activate the page. We also set the
|
|
|
|
* "activation count" higher than normal so that we will less
|
|
|
|
* likely place pages back onto the inactive queue again.
|
1997-10-06 02:48:16 +00:00
|
|
|
*/
|
1996-07-30 03:08:57 +00:00
|
|
|
if ((m->flags & PG_REFERENCED) != 0) {
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_clear(m, PG_REFERENCED);
|
2000-05-21 12:50:18 +00:00
|
|
|
actcount = pmap_ts_referenced(m);
|
1996-07-30 03:08:57 +00:00
|
|
|
vm_page_activate(m);
|
2004-11-05 06:24:05 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
1997-10-06 02:48:16 +00:00
|
|
|
m->act_count += (actcount + ACT_ADVANCE + 1);
|
1996-07-30 03:08:57 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
2009-05-28 06:52:14 +00:00
|
|
|
* If the upper level VM system does not believe that the page
|
|
|
|
* is fully dirty, but it is mapped for write access, then we
|
|
|
|
* consult the pmap to see if the page's dirty status should
|
|
|
|
* be updated.
|
1997-10-06 02:48:16 +00:00
|
|
|
*/
|
2009-05-28 06:52:14 +00:00
|
|
|
if (m->dirty != VM_PAGE_BITS_ALL &&
|
|
|
|
(m->flags & PG_WRITEABLE) != 0) {
|
2004-02-10 18:34:27 +00:00
|
|
|
/*
|
|
|
|
* Avoid a race condition: Unless write access is
|
|
|
|
* removed from the page, another processor could
|
|
|
|
* modify it before all access is removed by the call
|
|
|
|
* to vm_page_cache() below. If vm_page_cache() finds
|
|
|
|
* that the page has been modified when it removes all
|
|
|
|
* access, it panics because it cannot cache dirty
|
|
|
|
* pages. In principle, we could eliminate just write
|
|
|
|
* access here rather than all access. In the expected
|
|
|
|
* case, when there are no last instant modifications
|
|
|
|
* to the page, removing all access will be cheaper
|
|
|
|
* overall.
|
|
|
|
*/
|
2009-05-28 06:52:14 +00:00
|
|
|
if (pmap_is_modified(m))
|
|
|
|
vm_page_dirty(m);
|
|
|
|
else if (m->dirty == 0)
|
2004-02-10 18:34:27 +00:00
|
|
|
pmap_remove_all(m);
|
1996-07-30 03:08:57 +00:00
|
|
|
}
|
2004-03-04 09:36:46 +00:00
|
|
|
|
1996-01-19 04:00:31 +00:00
|
|
|
if (m->valid == 0) {
|
2003-10-17 05:07:17 +00:00
|
|
|
/*
|
|
|
|
* Invalid pages can be easily freed
|
|
|
|
*/
|
|
|
|
vm_page_free(m);
|
2007-06-10 21:59:14 +00:00
|
|
|
cnt.v_dfree++;
|
1999-01-21 08:29:12 +00:00
|
|
|
--page_shortage;
|
1996-01-19 04:00:31 +00:00
|
|
|
} else if (m->dirty == 0) {
|
2003-10-17 05:07:17 +00:00
|
|
|
/*
|
|
|
|
* Clean pages can be placed onto the cache queue.
|
|
|
|
* This effectively frees them.
|
|
|
|
*/
|
1996-01-19 04:00:31 +00:00
|
|
|
vm_page_cache(m);
|
1999-01-21 08:29:12 +00:00
|
|
|
--page_shortage;
|
2000-12-26 19:41:38 +00:00
|
|
|
} else if ((m->flags & PG_WINATCFLS) == 0 && pass == 0) {
|
|
|
|
/*
|
|
|
|
* Dirty pages need to be paged out, but flushing
|
|
|
|
* a page is extremely expensive verses freeing
|
|
|
|
* a clean page. Rather then artificially limiting
|
|
|
|
* the number of pages we can flush, we instead give
|
|
|
|
* dirty pages extra priority on the inactive queue
|
|
|
|
* by forcing them to be cycled through the queue
|
|
|
|
* twice before being flushed, after which the
|
|
|
|
* (now clean) page will cycle through once more
|
|
|
|
* before being freed. This significantly extends
|
|
|
|
* the thrash point for a heavily loaded machine.
|
|
|
|
*/
|
|
|
|
vm_page_flag_set(m, PG_WINATCFLS);
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
} else if (maxlaunder > 0) {
|
2000-12-26 19:41:38 +00:00
|
|
|
/*
|
|
|
|
* We always want to try to flush some dirty pages if
|
|
|
|
* we encounter them, to keep the system stable.
|
|
|
|
* Normally this number is small, but under extreme
|
|
|
|
* pressure where there are insufficient clean pages
|
|
|
|
* on the inactive queue, we may have to go all out.
|
|
|
|
*/
|
2007-06-26 18:24:05 +00:00
|
|
|
int swap_pageouts_ok, vfslocked = 0;
|
1995-04-09 06:03:56 +00:00
|
|
|
struct vnode *vp = NULL;
|
2007-07-02 06:56:37 +00:00
|
|
|
struct mount *mp = NULL;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
1997-12-04 19:00:56 +00:00
|
|
|
if ((object->type != OBJT_SWAP) && (object->type != OBJT_DEFAULT)) {
|
|
|
|
swap_pageouts_ok = 1;
|
|
|
|
} else {
|
|
|
|
swap_pageouts_ok = !(defer_swap_pageouts || disable_swap_pageouts);
|
|
|
|
swap_pageouts_ok |= (!disable_swap_pageouts && defer_swap_pageouts &&
|
1999-09-17 04:56:40 +00:00
|
|
|
vm_page_count_min());
|
1997-12-04 19:00:56 +00:00
|
|
|
|
|
|
|
}
|
1997-12-05 05:41:06 +00:00
|
|
|
|
|
|
|
/*
|
1999-01-21 08:29:12 +00:00
|
|
|
* We don't bother paging objects that are "dead".
|
|
|
|
* Those objects are in a "rundown" state.
|
1997-12-05 05:41:06 +00:00
|
|
|
*/
|
|
|
|
if (!swap_pageouts_ok || (object->flags & OBJ_DEAD)) {
|
2003-08-31 00:00:46 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1997-12-04 19:00:56 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2006-02-17 21:02:39 +00:00
|
|
|
/*
|
|
|
|
* Following operations may unlock
|
|
|
|
* vm_page_queue_mtx, invalidating the 'next'
|
|
|
|
* pointer. To prevent an inordinate number
|
|
|
|
* of restarts we use our marker to remember
|
|
|
|
* our place.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
TAILQ_INSERT_AFTER(&vm_page_queues[PQ_INACTIVE].pl,
|
|
|
|
m, &marker, pageq);
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* The object is already known NOT to be dead. It
|
|
|
|
* is possible for the vget() to block the whole
|
|
|
|
* pageout daemon, but the new low-memory handling
|
|
|
|
* code should prevent it.
|
1999-01-21 08:29:12 +00:00
|
|
|
*
|
2000-12-26 19:41:38 +00:00
|
|
|
* The previous code skipped locked vnodes and, worse,
|
|
|
|
* reordered pages in the queue. This results in
|
|
|
|
* completely non-deterministic operation and, on a
|
|
|
|
* busy system, can lead to extremely non-optimal
|
|
|
|
* pageouts. For example, it can cause clean pages
|
|
|
|
* to be freed and dirty pages to be moved to the end
|
|
|
|
* of the queue. Since dirty pages are also moved to
|
|
|
|
* the end of the queue once-cleaned, this gives
|
|
|
|
* way too large a weighting to defering the freeing
|
|
|
|
* of dirty pages.
|
1999-01-21 08:29:12 +00:00
|
|
|
*
|
2001-12-20 22:42:27 +00:00
|
|
|
* We can't wait forever for the vnode lock, we might
|
|
|
|
* deadlock due to a vn_read() getting stuck in
|
|
|
|
* vm_wait while holding this vnode. We skip the
|
|
|
|
* vnode if we can't get it in a reasonable amount
|
|
|
|
* of time.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
|
|
|
if (object->type == OBJT_VNODE) {
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
vp = object->handle;
|
2006-02-17 18:22:19 +00:00
|
|
|
if (vp->v_type == VREG &&
|
|
|
|
vn_start_write(vp, &mp, V_NOWAIT) != 0) {
|
2008-11-16 21:57:54 +00:00
|
|
|
mp = NULL;
|
2006-02-17 18:22:19 +00:00
|
|
|
++pageout_lock_miss;
|
|
|
|
if (object->flags & OBJ_MIGHTBEDIRTY)
|
|
|
|
vnodes_skipped++;
|
2006-02-17 21:02:39 +00:00
|
|
|
goto unlock_and_continue;
|
2006-02-17 18:22:19 +00:00
|
|
|
}
|
2010-01-17 21:26:14 +00:00
|
|
|
KASSERT(mp != NULL,
|
|
|
|
("vp %p with NULL v_mount", vp));
|
2003-08-15 05:13:36 +00:00
|
|
|
vm_page_unlock_queues();
|
2007-07-02 06:56:37 +00:00
|
|
|
vm_object_reference_locked(object);
|
2003-08-31 00:00:46 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
2007-06-26 18:24:05 +00:00
|
|
|
vfslocked = VFS_LOCK_GIANT(vp->v_mount);
|
|
|
|
if (vget(vp, LK_EXCLUSIVE | LK_TIMELOCK,
|
|
|
|
curthread)) {
|
2003-08-31 00:00:46 +00:00
|
|
|
VM_OBJECT_LOCK(object);
|
2003-08-15 05:13:36 +00:00
|
|
|
vm_page_lock_queues();
|
2001-12-20 22:42:27 +00:00
|
|
|
++pageout_lock_miss;
|
1995-11-05 20:46:03 +00:00
|
|
|
if (object->flags & OBJ_MIGHTBEDIRTY)
|
1998-01-12 01:46:33 +00:00
|
|
|
vnodes_skipped++;
|
2006-02-17 21:02:39 +00:00
|
|
|
vp = NULL;
|
|
|
|
goto unlock_and_continue;
|
1996-05-29 05:15:33 +00:00
|
|
|
}
|
2003-08-31 00:00:46 +00:00
|
|
|
VM_OBJECT_LOCK(object);
|
2003-08-15 05:13:36 +00:00
|
|
|
vm_page_lock_queues();
|
1996-05-31 00:38:04 +00:00
|
|
|
/*
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* The page might have been moved to another
|
|
|
|
* queue during potential blocking in vget()
|
|
|
|
* above. The page might have been freed and
|
2007-07-02 06:56:37 +00:00
|
|
|
* reused for another vnode.
|
1996-05-31 00:38:04 +00:00
|
|
|
*/
|
2005-12-31 14:39:20 +00:00
|
|
|
if (VM_PAGE_GETQUEUE(m) != PQ_INACTIVE ||
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
m->object != object ||
|
2006-02-17 21:02:39 +00:00
|
|
|
TAILQ_NEXT(m, pageq) != &marker) {
|
1996-05-29 05:15:33 +00:00
|
|
|
if (object->flags & OBJ_MIGHTBEDIRTY)
|
1998-01-12 01:46:33 +00:00
|
|
|
vnodes_skipped++;
|
2003-08-31 00:00:46 +00:00
|
|
|
goto unlock_and_continue;
|
1996-05-29 05:15:33 +00:00
|
|
|
}
|
|
|
|
|
1996-05-31 00:38:04 +00:00
|
|
|
/*
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* The page may have been busied during the
|
2007-07-02 06:56:37 +00:00
|
|
|
* blocking in vget(). We don't move the
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* page back onto the end of the queue so that
|
|
|
|
* statistics are more correct if we don't.
|
1996-05-31 00:38:04 +00:00
|
|
|
*/
|
2006-10-22 04:28:14 +00:00
|
|
|
if (m->busy || (m->oflags & VPO_BUSY)) {
|
2003-08-31 00:00:46 +00:00
|
|
|
goto unlock_and_continue;
|
1996-05-29 05:15:33 +00:00
|
|
|
}
|
|
|
|
|
1996-05-31 00:38:04 +00:00
|
|
|
/*
|
2001-10-21 06:12:06 +00:00
|
|
|
* If the page has become held it might
|
|
|
|
* be undergoing I/O, so skip it
|
1996-05-31 00:38:04 +00:00
|
|
|
*/
|
1996-05-29 05:15:33 +00:00
|
|
|
if (m->hold_count) {
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1996-05-29 05:15:33 +00:00
|
|
|
if (object->flags & OBJ_MIGHTBEDIRTY)
|
1998-01-12 01:46:33 +00:00
|
|
|
vnodes_skipped++;
|
2003-08-31 00:00:46 +00:00
|
|
|
goto unlock_and_continue;
|
1995-04-09 06:03:56 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* If a page is dirty, then it is either being washed
|
|
|
|
* (but not yet cleaned) or it is still in the
|
|
|
|
* laundry. If it is still in the laundry, then we
|
2000-12-26 19:41:38 +00:00
|
|
|
* start the cleaning operation.
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
*
|
2000-12-26 19:41:38 +00:00
|
|
|
* decrement page_shortage on success to account for
|
|
|
|
* the (future) cleaned page. Otherwise we could wind
|
|
|
|
* up laundering or cleaning too many pages.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
if (vm_pageout_clean(m) != 0) {
|
|
|
|
--page_shortage;
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
--maxlaunder;
|
2000-12-26 19:41:38 +00:00
|
|
|
}
|
2003-08-31 00:00:46 +00:00
|
|
|
unlock_and_continue:
|
2003-10-17 05:07:17 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
2007-07-02 06:56:37 +00:00
|
|
|
if (mp != NULL) {
|
2003-10-17 05:07:17 +00:00
|
|
|
vm_page_unlock_queues();
|
2007-07-02 06:56:37 +00:00
|
|
|
if (vp != NULL)
|
|
|
|
vput(vp);
|
2007-06-26 18:24:05 +00:00
|
|
|
VFS_UNLOCK_GIANT(vfslocked);
|
2007-07-02 06:56:37 +00:00
|
|
|
vm_object_deallocate(object);
|
2000-07-11 22:07:57 +00:00
|
|
|
vn_finished_write(mp);
|
2003-10-17 05:07:17 +00:00
|
|
|
vm_page_lock_queues();
|
2000-07-11 22:07:57 +00:00
|
|
|
}
|
2006-02-17 21:02:39 +00:00
|
|
|
next = TAILQ_NEXT(&marker, pageq);
|
|
|
|
TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl,
|
|
|
|
&marker, pageq);
|
2003-10-17 05:07:17 +00:00
|
|
|
continue;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2003-10-17 05:07:17 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* Compute the number of pages we want to try to move from the
|
|
|
|
* active queue to the inactive queue.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2007-05-31 22:52:15 +00:00
|
|
|
page_shortage = vm_paging_target() +
|
|
|
|
cnt.v_inactive_target - cnt.v_inactive_count;
|
Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1998-03-16 01:56:03 +00:00
|
|
|
page_shortage += addl_page_shortage;
|
1999-01-21 08:29:12 +00:00
|
|
|
|
|
|
|
/*
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* Scan the active queue for things we can deactivate. We nominally
|
|
|
|
* track the per-page activity counter and use it to locate
|
|
|
|
* deactivation candidates.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2007-05-31 22:52:15 +00:00
|
|
|
pcount = cnt.v_active_count;
|
1999-10-30 07:37:14 +00:00
|
|
|
m = TAILQ_FIRST(&vm_page_queues[PQ_ACTIVE].pl);
|
1999-01-21 08:29:12 +00:00
|
|
|
|
1996-05-18 03:38:05 +00:00
|
|
|
while ((m != NULL) && (pcount-- > 0) && (page_shortage > 0)) {
|
1996-05-31 00:38:04 +00:00
|
|
|
|
2005-12-31 14:39:20 +00:00
|
|
|
KASSERT(VM_PAGE_INQUEUE2(m, PQ_ACTIVE),
|
2003-10-22 03:08:24 +00:00
|
|
|
("vm_pageout_scan: page %p isn't active", m));
|
1996-05-31 00:38:04 +00:00
|
|
|
|
1996-05-18 03:38:05 +00:00
|
|
|
next = TAILQ_NEXT(m, pageq);
|
2004-10-27 18:29:17 +00:00
|
|
|
object = m->object;
|
2005-08-10 00:17:36 +00:00
|
|
|
if ((m->flags & PG_MARKER) != 0) {
|
|
|
|
m = next;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (!VM_OBJECT_TRYLOCK(object) &&
|
|
|
|
!vm_pageout_fallback_object_lock(m, &next)) {
|
|
|
|
VM_OBJECT_UNLOCK(object);
|
2004-10-30 07:09:46 +00:00
|
|
|
m = next;
|
2004-10-27 18:29:17 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Don't deactivate pages that are busy.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
1994-09-06 11:28:46 +00:00
|
|
|
if ((m->busy != 0) ||
|
2006-10-22 04:28:14 +00:00
|
|
|
(m->oflags & VPO_BUSY) ||
|
1995-04-09 06:03:56 +00:00
|
|
|
(m->hold_count != 0)) {
|
2004-10-27 18:29:17 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1994-05-25 09:21:21 +00:00
|
|
|
m = next;
|
|
|
|
continue;
|
|
|
|
}
|
1996-05-18 03:38:05 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The count for pagedaemon pages is done after checking the
|
2000-03-26 15:20:23 +00:00
|
|
|
* page for eligibility...
|
1996-05-18 03:38:05 +00:00
|
|
|
*/
|
2007-06-10 21:59:14 +00:00
|
|
|
cnt.v_pdpages++;
|
1996-06-17 03:35:40 +00:00
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
|
|
|
* Check to see "how much" the page has been used.
|
|
|
|
*/
|
|
|
|
actcount = 0;
|
2004-10-27 18:29:17 +00:00
|
|
|
if (object->ref_count != 0) {
|
1996-06-17 03:35:40 +00:00
|
|
|
if (m->flags & PG_REFERENCED) {
|
1997-10-06 02:48:16 +00:00
|
|
|
actcount += 1;
|
1996-05-18 03:38:05 +00:00
|
|
|
}
|
2000-05-21 12:50:18 +00:00
|
|
|
actcount += pmap_ts_referenced(m);
|
1997-10-06 02:48:16 +00:00
|
|
|
if (actcount) {
|
|
|
|
m->act_count += ACT_ADVANCE + actcount;
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
if (m->act_count > ACT_MAX)
|
|
|
|
m->act_count = ACT_MAX;
|
|
|
|
}
|
1996-05-18 03:38:05 +00:00
|
|
|
}
|
1996-06-17 03:35:40 +00:00
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
|
|
|
* Since we have "tested" this bit, we need to clear it now.
|
|
|
|
*/
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_clear(m, PG_REFERENCED);
|
1996-06-17 03:35:40 +00:00
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
|
|
|
* Only if an object is currently being used, do we use the
|
|
|
|
* page activation count stats.
|
|
|
|
*/
|
2004-10-27 18:29:17 +00:00
|
|
|
if (actcount && (object->ref_count != 0)) {
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1994-05-25 09:21:21 +00:00
|
|
|
} else {
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
m->act_count -= min(m->act_count, ACT_DECLINE);
|
2000-12-26 19:41:38 +00:00
|
|
|
if (vm_pageout_algorithm ||
|
2004-10-27 18:29:17 +00:00
|
|
|
object->ref_count == 0 ||
|
2000-12-26 19:41:38 +00:00
|
|
|
m->act_count == 0) {
|
1998-01-12 01:46:33 +00:00
|
|
|
page_shortage--;
|
2004-10-27 18:29:17 +00:00
|
|
|
if (object->ref_count == 0) {
|
2002-11-16 07:44:25 +00:00
|
|
|
pmap_remove_all(m);
|
1997-01-11 07:22:24 +00:00
|
|
|
if (m->dirty == 0)
|
|
|
|
vm_page_cache(m);
|
|
|
|
else
|
|
|
|
vm_page_deactivate(m);
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
} else {
|
|
|
|
vm_page_deactivate(m);
|
|
|
|
}
|
1996-06-17 03:35:40 +00:00
|
|
|
} else {
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
}
|
2004-10-27 18:29:17 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
1994-05-25 09:21:21 +00:00
|
|
|
m = next;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2002-07-23 02:42:25 +00:00
|
|
|
vm_page_unlock_queues();
|
1997-12-06 02:23:36 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
|
|
|
/*
|
|
|
|
* Idle process swapout -- run once per second.
|
|
|
|
*/
|
|
|
|
if (vm_swap_idle_enabled) {
|
|
|
|
static long lsec;
|
1998-03-30 09:56:58 +00:00
|
|
|
if (time_second != lsec) {
|
2007-06-26 18:24:05 +00:00
|
|
|
vm_req_vmdaemon(VM_SWAP_IDLE);
|
1998-03-30 09:56:58 +00:00
|
|
|
lsec = time_second;
|
1997-12-06 02:23:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
1995-04-09 06:03:56 +00:00
|
|
|
/*
|
|
|
|
* If we didn't get enough free pages, and we have skipped a vnode
|
1995-04-17 10:00:55 +00:00
|
|
|
* in a writeable object, wakeup the sync daemon. And kick swapout
|
|
|
|
* if we did not get enough free pages.
|
1995-04-09 06:03:56 +00:00
|
|
|
*/
|
1999-09-17 04:56:40 +00:00
|
|
|
if (vm_paging_target() > 0) {
|
|
|
|
if (vnodes_skipped && vm_page_count_min())
|
1999-06-26 14:56:58 +00:00
|
|
|
(void) speedup_syncer();
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
2007-06-26 18:24:05 +00:00
|
|
|
if (vm_swap_enabled && vm_page_count_target())
|
|
|
|
vm_req_vmdaemon(VM_SWAP_NORMAL);
|
1996-02-22 10:57:37 +00:00
|
|
|
#endif
|
1995-04-09 06:03:56 +00:00
|
|
|
}
|
|
|
|
|
1994-10-22 02:18:03 +00:00
|
|
|
/*
|
2003-05-19 00:51:07 +00:00
|
|
|
* If we are critically low on one of RAM or swap and low on
|
|
|
|
* the other, kill the largest process. However, we avoid
|
|
|
|
* doing this on the first pass in order to give ourselves a
|
|
|
|
* chance to flush out dirty vnode-backed pages and to allow
|
|
|
|
* active pages to be moved to the inactive queue and reclaimed.
|
2008-09-29 19:45:12 +00:00
|
|
|
*/
|
|
|
|
if (pass != 0 &&
|
|
|
|
((swap_pager_avail < 64 && vm_page_count_min()) ||
|
|
|
|
(swap_pager_full && vm_paging_target() > 0)))
|
|
|
|
vm_pageout_oom(VM_OOM_MEM);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
void
|
|
|
|
vm_pageout_oom(int shortage)
|
|
|
|
{
|
|
|
|
struct proc *p, *bigproc;
|
|
|
|
vm_offset_t size, bigsize;
|
|
|
|
struct thread *td;
|
2009-04-19 20:53:47 +00:00
|
|
|
struct vmspace *vm;
|
2008-09-29 19:45:12 +00:00
|
|
|
|
|
|
|
/*
|
2001-05-17 22:49:03 +00:00
|
|
|
* We keep the process bigproc locked once we find it to keep anyone
|
|
|
|
* from messing with it; however, there is a possibility of
|
|
|
|
* deadlock if process B is bigproc and one of it's child processes
|
|
|
|
* attempts to propagate a signal to B while we are waiting for A's
|
|
|
|
* lock while walking this list. To avoid this, we don't block on
|
|
|
|
* the process lock but just skip a process if it is already locked.
|
1994-10-22 02:18:03 +00:00
|
|
|
*/
|
2008-09-29 19:45:12 +00:00
|
|
|
bigproc = NULL;
|
|
|
|
bigsize = 0;
|
|
|
|
sx_slock(&allproc_lock);
|
|
|
|
FOREACH_PROC_IN_SYSTEM(p) {
|
|
|
|
int breakout;
|
|
|
|
|
|
|
|
if (PROC_TRYLOCK(p) == 0)
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* If this is a system or protected process, skip it.
|
|
|
|
*/
|
2009-04-19 20:53:47 +00:00
|
|
|
if ((p->p_flag & (P_INEXEC | P_PROTECTED | P_SYSTEM)) ||
|
|
|
|
(p->p_pid == 1) ||
|
2008-09-29 19:45:12 +00:00
|
|
|
((p->p_pid < 48) && (swap_pager_avail != 0))) {
|
|
|
|
PROC_UNLOCK(p);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If the process is in a non-running type state,
|
|
|
|
* don't touch it. Check all the threads individually.
|
|
|
|
*/
|
|
|
|
breakout = 0;
|
|
|
|
FOREACH_THREAD_IN_PROC(p, td) {
|
|
|
|
thread_lock(td);
|
|
|
|
if (!TD_ON_RUNQ(td) &&
|
|
|
|
!TD_IS_RUNNING(td) &&
|
|
|
|
!TD_IS_SLEEPING(td)) {
|
Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
|
|
|
thread_unlock(td);
|
2008-09-29 19:45:12 +00:00
|
|
|
breakout = 1;
|
|
|
|
break;
|
Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
2002-06-29 17:26:22 +00:00
|
|
|
}
|
2008-09-29 19:45:12 +00:00
|
|
|
thread_unlock(td);
|
1994-10-22 02:18:03 +00:00
|
|
|
}
|
2008-09-29 19:45:12 +00:00
|
|
|
if (breakout) {
|
|
|
|
PROC_UNLOCK(p);
|
|
|
|
continue;
|
1994-10-22 02:18:03 +00:00
|
|
|
}
|
2008-09-29 19:45:12 +00:00
|
|
|
/*
|
|
|
|
* get the process size
|
|
|
|
*/
|
2009-04-19 20:53:47 +00:00
|
|
|
vm = vmspace_acquire_ref(p);
|
|
|
|
if (vm == NULL) {
|
|
|
|
PROC_UNLOCK(p);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (!vm_map_trylock_read(&vm->vm_map)) {
|
|
|
|
vmspace_free(vm);
|
2008-09-29 19:45:12 +00:00
|
|
|
PROC_UNLOCK(p);
|
|
|
|
continue;
|
|
|
|
}
|
2009-04-28 11:45:36 +00:00
|
|
|
size = vmspace_swap_count(vm);
|
2009-04-19 20:53:47 +00:00
|
|
|
vm_map_unlock_read(&vm->vm_map);
|
2008-09-29 19:45:12 +00:00
|
|
|
if (shortage == VM_OOM_MEM)
|
2009-04-19 20:53:47 +00:00
|
|
|
size += vmspace_resident_count(vm);
|
|
|
|
vmspace_free(vm);
|
2008-09-29 19:45:12 +00:00
|
|
|
/*
|
|
|
|
* if the this process is bigger than the biggest one
|
|
|
|
* remember it.
|
|
|
|
*/
|
|
|
|
if (size > bigsize) {
|
|
|
|
if (bigproc != NULL)
|
|
|
|
PROC_UNLOCK(bigproc);
|
|
|
|
bigproc = p;
|
|
|
|
bigsize = size;
|
|
|
|
} else
|
|
|
|
PROC_UNLOCK(p);
|
|
|
|
}
|
|
|
|
sx_sunlock(&allproc_lock);
|
|
|
|
if (bigproc != NULL) {
|
|
|
|
killproc(bigproc, "out of swap space");
|
|
|
|
sched_nice(bigproc, PRIO_MIN);
|
|
|
|
PROC_UNLOCK(bigproc);
|
|
|
|
wakeup(&cnt.v_free_count);
|
1994-10-22 02:18:03 +00:00
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-07-27 04:49:19 +00:00
|
|
|
/*
|
|
|
|
* This routine tries to maintain the pseudo LRU active queue,
|
|
|
|
* so that during long periods of time where there is no paging,
|
2000-03-26 15:20:23 +00:00
|
|
|
* that some statistic accumulation still occurs. This code
|
1997-07-27 04:49:19 +00:00
|
|
|
* helps the situation where paging just starts to occur.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vm_pageout_page_stats()
|
|
|
|
{
|
2004-10-30 23:30:53 +00:00
|
|
|
vm_object_t object;
|
1997-07-27 04:49:19 +00:00
|
|
|
vm_page_t m,next;
|
|
|
|
int pcount,tpcount; /* Number of pages to check */
|
|
|
|
static int fullintervalcount = 0;
|
Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1998-03-16 01:56:03 +00:00
|
|
|
int page_shortage;
|
|
|
|
|
2004-06-24 04:08:43 +00:00
|
|
|
mtx_assert(&vm_page_queue_mtx, MA_OWNED);
|
1999-09-17 04:56:40 +00:00
|
|
|
page_shortage =
|
2007-05-31 22:52:15 +00:00
|
|
|
(cnt.v_inactive_target + cnt.v_cache_max + cnt.v_free_min) -
|
|
|
|
(cnt.v_free_count + cnt.v_inactive_count + cnt.v_cache_count);
|
1999-09-17 04:56:40 +00:00
|
|
|
|
Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1998-03-16 01:56:03 +00:00
|
|
|
if (page_shortage <= 0)
|
|
|
|
return;
|
1997-07-27 04:49:19 +00:00
|
|
|
|
2007-05-31 22:52:15 +00:00
|
|
|
pcount = cnt.v_active_count;
|
1997-07-27 04:49:19 +00:00
|
|
|
fullintervalcount += vm_pageout_stats_interval;
|
|
|
|
if (fullintervalcount < vm_pageout_full_stats_interval) {
|
2008-09-21 18:01:34 +00:00
|
|
|
tpcount = (int64_t)vm_pageout_stats_max * cnt.v_active_count /
|
|
|
|
cnt.v_page_count;
|
1997-07-27 04:49:19 +00:00
|
|
|
if (pcount > tpcount)
|
|
|
|
pcount = tpcount;
|
2000-05-29 02:31:55 +00:00
|
|
|
} else {
|
|
|
|
fullintervalcount = 0;
|
1997-07-27 04:49:19 +00:00
|
|
|
}
|
|
|
|
|
1999-10-30 07:37:14 +00:00
|
|
|
m = TAILQ_FIRST(&vm_page_queues[PQ_ACTIVE].pl);
|
1997-07-27 04:49:19 +00:00
|
|
|
while ((m != NULL) && (pcount-- > 0)) {
|
1997-10-06 02:48:16 +00:00
|
|
|
int actcount;
|
1997-07-27 04:49:19 +00:00
|
|
|
|
2005-12-31 14:39:20 +00:00
|
|
|
KASSERT(VM_PAGE_INQUEUE2(m, PQ_ACTIVE),
|
2003-10-22 18:41:32 +00:00
|
|
|
("vm_pageout_page_stats: page %p isn't active", m));
|
1997-07-27 04:49:19 +00:00
|
|
|
|
|
|
|
next = TAILQ_NEXT(m, pageq);
|
2004-10-30 23:30:53 +00:00
|
|
|
object = m->object;
|
2005-08-10 00:17:36 +00:00
|
|
|
|
|
|
|
if ((m->flags & PG_MARKER) != 0) {
|
|
|
|
m = next;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (!VM_OBJECT_TRYLOCK(object) &&
|
|
|
|
!vm_pageout_fallback_object_lock(m, &next)) {
|
|
|
|
VM_OBJECT_UNLOCK(object);
|
2004-10-30 23:30:53 +00:00
|
|
|
m = next;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
1997-07-27 04:49:19 +00:00
|
|
|
/*
|
|
|
|
* Don't deactivate pages that are busy.
|
|
|
|
*/
|
|
|
|
if ((m->busy != 0) ||
|
2006-10-22 04:28:14 +00:00
|
|
|
(m->oflags & VPO_BUSY) ||
|
1997-07-27 04:49:19 +00:00
|
|
|
(m->hold_count != 0)) {
|
2004-10-30 23:30:53 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1997-07-27 04:49:19 +00:00
|
|
|
m = next;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
1997-10-06 02:48:16 +00:00
|
|
|
actcount = 0;
|
1997-07-27 04:49:19 +00:00
|
|
|
if (m->flags & PG_REFERENCED) {
|
1998-09-04 08:06:57 +00:00
|
|
|
vm_page_flag_clear(m, PG_REFERENCED);
|
1997-10-06 02:48:16 +00:00
|
|
|
actcount += 1;
|
1997-07-27 04:49:19 +00:00
|
|
|
}
|
|
|
|
|
2000-05-21 12:50:18 +00:00
|
|
|
actcount += pmap_ts_referenced(m);
|
1997-10-06 02:48:16 +00:00
|
|
|
if (actcount) {
|
|
|
|
m->act_count += ACT_ADVANCE + actcount;
|
1997-07-27 04:49:19 +00:00
|
|
|
if (m->act_count > ACT_MAX)
|
|
|
|
m->act_count = ACT_MAX;
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1997-07-27 04:49:19 +00:00
|
|
|
} else {
|
|
|
|
if (m->act_count == 0) {
|
1997-10-06 02:48:16 +00:00
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* We turn off page access, so that we have
|
|
|
|
* more accurate RSS stats. We don't do this
|
|
|
|
* in the normal page deactivation when the
|
|
|
|
* system is loaded VM wise, because the
|
|
|
|
* cost of the large number of page protect
|
|
|
|
* operations would be higher than the value
|
|
|
|
* of doing the operation.
|
1997-10-06 02:48:16 +00:00
|
|
|
*/
|
2002-11-16 07:44:25 +00:00
|
|
|
pmap_remove_all(m);
|
1997-07-27 04:49:19 +00:00
|
|
|
vm_page_deactivate(m);
|
|
|
|
} else {
|
|
|
|
m->act_count -= min(m->act_count, ACT_DECLINE);
|
2008-03-19 20:24:35 +00:00
|
|
|
vm_page_requeue(m);
|
1997-07-27 04:49:19 +00:00
|
|
|
}
|
|
|
|
}
|
2004-10-30 23:30:53 +00:00
|
|
|
VM_OBJECT_UNLOCK(object);
|
1997-07-27 04:49:19 +00:00
|
|
|
m = next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_pageout is the high level pageout daemon.
|
|
|
|
*/
|
1995-08-28 09:19:25 +00:00
|
|
|
static void
|
1994-05-25 09:21:21 +00:00
|
|
|
vm_pageout()
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2004-06-24 03:13:30 +00:00
|
|
|
int error, pass;
|
2000-09-07 01:33:02 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Initialize some paging parameters.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2007-05-31 22:52:15 +00:00
|
|
|
cnt.v_interrupt_free_min = 2;
|
|
|
|
if (cnt.v_page_count < 2000)
|
1996-05-31 00:38:04 +00:00
|
|
|
vm_pageout_page_count = 8;
|
1995-04-09 06:03:56 +00:00
|
|
|
|
2003-09-19 05:03:45 +00:00
|
|
|
/*
|
|
|
|
* v_free_reserved needs to include enough for the largest
|
|
|
|
* swap pager structures plus enough for any pv_entry structs
|
|
|
|
* when paging.
|
|
|
|
*/
|
2007-05-31 22:52:15 +00:00
|
|
|
if (cnt.v_page_count > 1024)
|
|
|
|
cnt.v_free_min = 4 + (cnt.v_page_count - 1024) / 200;
|
|
|
|
else
|
|
|
|
cnt.v_free_min = 4;
|
|
|
|
cnt.v_pageout_free_min = (2*MAXBSIZE)/PAGE_SIZE +
|
|
|
|
cnt.v_interrupt_free_min;
|
|
|
|
cnt.v_free_reserved = vm_pageout_page_count +
|
Enable the new physical memory allocator.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-16 04:57:06 +00:00
|
|
|
cnt.v_pageout_free_min + (cnt.v_page_count / 768);
|
2007-05-31 22:52:15 +00:00
|
|
|
cnt.v_free_severe = cnt.v_free_min / 2;
|
|
|
|
cnt.v_free_min += cnt.v_free_reserved;
|
|
|
|
cnt.v_free_severe += cnt.v_free_reserved;
|
2003-09-19 05:03:45 +00:00
|
|
|
|
1994-09-12 11:31:36 +00:00
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* v_free_target and v_cache_min control pageout hysteresis. Note
|
|
|
|
* that these are more a measure of the VM cache queue hysteresis
|
|
|
|
* then the VM free queue. Specifically, v_free_target is the
|
|
|
|
* high water mark (free+cache pages).
|
|
|
|
*
|
|
|
|
* v_free_reserved + v_cache_min (mostly means v_cache_min) is the
|
|
|
|
* low water mark, while v_free_min is the stop. v_cache_min must
|
|
|
|
* be big enough to handle memory needs while the pageout daemon
|
|
|
|
* is signalled and run to free more pages.
|
1994-09-12 11:31:36 +00:00
|
|
|
*/
|
2007-05-31 22:52:15 +00:00
|
|
|
if (cnt.v_free_count > 6144)
|
|
|
|
cnt.v_free_target = 4 * cnt.v_free_min + cnt.v_free_reserved;
|
|
|
|
else
|
|
|
|
cnt.v_free_target = 2 * cnt.v_free_min + cnt.v_free_reserved;
|
|
|
|
|
|
|
|
if (cnt.v_free_count > 2048) {
|
|
|
|
cnt.v_cache_min = cnt.v_free_target;
|
|
|
|
cnt.v_cache_max = 2 * cnt.v_cache_min;
|
|
|
|
cnt.v_inactive_target = (3 * cnt.v_free_target) / 2;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
} else {
|
2007-05-31 22:52:15 +00:00
|
|
|
cnt.v_cache_min = 0;
|
|
|
|
cnt.v_cache_max = 0;
|
|
|
|
cnt.v_inactive_target = cnt.v_free_count / 4;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2007-05-31 22:52:15 +00:00
|
|
|
if (cnt.v_inactive_target > cnt.v_free_count / 3)
|
|
|
|
cnt.v_inactive_target = cnt.v_free_count / 3;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
/* XXX does not really belong here */
|
1994-05-24 10:09:53 +00:00
|
|
|
if (vm_page_max_wired == 0)
|
2007-05-31 22:52:15 +00:00
|
|
|
vm_page_max_wired = cnt.v_free_count / 3;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-07-27 04:49:19 +00:00
|
|
|
if (vm_pageout_stats_max == 0)
|
2007-05-31 22:52:15 +00:00
|
|
|
vm_pageout_stats_max = cnt.v_free_target;
|
1997-07-27 04:49:19 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Set interval in seconds for stats scan.
|
|
|
|
*/
|
|
|
|
if (vm_pageout_stats_interval == 0)
|
Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1998-03-16 01:56:03 +00:00
|
|
|
vm_pageout_stats_interval = 5;
|
1997-07-27 04:49:19 +00:00
|
|
|
if (vm_pageout_full_stats_interval == 0)
|
|
|
|
vm_pageout_full_stats_interval = vm_pageout_stats_interval * 4;
|
|
|
|
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
swap_pager_swap_init();
|
2000-12-26 19:41:38 +00:00
|
|
|
pass = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* The pageout daemon is never done, so loop forever.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
while (TRUE) {
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
/*
|
|
|
|
* If we have enough free memory, wakeup waiters. Do
|
|
|
|
* not clear vm_pages_needed until we reach our target,
|
|
|
|
* otherwise we may be woken up over and over again and
|
|
|
|
* waste a lot of cpu.
|
|
|
|
*/
|
2007-02-07 06:37:30 +00:00
|
|
|
mtx_lock(&vm_page_queue_free_mtx);
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
if (vm_pages_needed && !vm_page_count_min()) {
|
2003-02-02 07:16:40 +00:00
|
|
|
if (!vm_paging_needed())
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
vm_pages_needed = 0;
|
2007-05-31 22:52:15 +00:00
|
|
|
wakeup(&cnt.v_free_count);
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
}
|
|
|
|
if (vm_pages_needed) {
|
1999-09-17 04:56:40 +00:00
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* Still not done, take a second pass without waiting
|
|
|
|
* (unlimited dirty cleaning), otherwise sleep a bit
|
|
|
|
* and try again.
|
1999-09-17 04:56:40 +00:00
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
++pass;
|
|
|
|
if (pass > 1)
|
2007-02-07 06:37:30 +00:00
|
|
|
msleep(&vm_pages_needed,
|
|
|
|
&vm_page_queue_free_mtx, PVM, "psleep",
|
|
|
|
hz / 2);
|
1999-09-17 04:56:40 +00:00
|
|
|
} else {
|
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* Good enough, sleep & handle stats. Prime the pass
|
|
|
|
* for the next run.
|
1999-09-17 04:56:40 +00:00
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
if (pass > 1)
|
|
|
|
pass = 1;
|
|
|
|
else
|
|
|
|
pass = 0;
|
2007-02-07 06:37:30 +00:00
|
|
|
error = msleep(&vm_pages_needed,
|
|
|
|
&vm_page_queue_free_mtx, PVM, "psleep",
|
|
|
|
vm_pageout_stats_interval * hz);
|
1997-07-27 04:49:19 +00:00
|
|
|
if (error && !vm_pages_needed) {
|
2007-02-07 06:37:30 +00:00
|
|
|
mtx_unlock(&vm_page_queue_free_mtx);
|
2000-12-26 19:41:38 +00:00
|
|
|
pass = 0;
|
2007-02-07 06:37:30 +00:00
|
|
|
vm_page_lock_queues();
|
1997-07-27 04:49:19 +00:00
|
|
|
vm_pageout_page_stats();
|
2004-06-24 04:08:43 +00:00
|
|
|
vm_page_unlock_queues();
|
1997-07-27 04:49:19 +00:00
|
|
|
continue;
|
|
|
|
}
|
1995-03-01 23:30:04 +00:00
|
|
|
}
|
1996-05-18 03:38:05 +00:00
|
|
|
if (vm_pages_needed)
|
2007-06-10 21:59:14 +00:00
|
|
|
cnt.v_pdwakeups++;
|
2007-02-07 06:37:30 +00:00
|
|
|
mtx_unlock(&vm_page_queue_free_mtx);
|
2000-12-26 19:41:38 +00:00
|
|
|
vm_pageout_scan(pass);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2003-02-09 20:40:36 +00:00
|
|
|
/*
|
2007-02-07 06:37:30 +00:00
|
|
|
* Unless the free page queue lock is held by the caller, this function
|
2003-02-09 20:40:36 +00:00
|
|
|
* should be regarded as advisory. Specifically, the caller should
|
|
|
|
* not msleep() on &cnt.v_free_count following this function unless
|
2007-02-07 06:37:30 +00:00
|
|
|
* the free page queue lock is held until the msleep() is performed.
|
2003-02-09 20:40:36 +00:00
|
|
|
*/
|
1996-11-28 23:15:07 +00:00
|
|
|
void
|
|
|
|
pagedaemon_wakeup()
|
|
|
|
{
|
2003-02-02 07:16:40 +00:00
|
|
|
|
2001-09-12 08:38:13 +00:00
|
|
|
if (!vm_pages_needed && curthread->td_proc != pageproc) {
|
2003-02-02 07:16:40 +00:00
|
|
|
vm_pages_needed = 1;
|
1996-11-28 23:15:07 +00:00
|
|
|
wakeup(&vm_pages_needed);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1996-06-26 05:39:27 +00:00
|
|
|
#if !defined(NO_SWAPPING)
|
1996-02-22 10:57:37 +00:00
|
|
|
static void
|
2007-06-26 18:24:05 +00:00
|
|
|
vm_req_vmdaemon(int req)
|
1996-02-22 10:57:37 +00:00
|
|
|
{
|
|
|
|
static int lastrun = 0;
|
|
|
|
|
2007-06-26 18:24:05 +00:00
|
|
|
mtx_lock(&vm_daemon_mtx);
|
|
|
|
vm_pageout_req_swapout |= req;
|
1996-05-18 03:38:05 +00:00
|
|
|
if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
|
1996-02-22 10:57:37 +00:00
|
|
|
wakeup(&vm_daemon_needed);
|
|
|
|
lastrun = ticks;
|
|
|
|
}
|
2007-06-26 18:24:05 +00:00
|
|
|
mtx_unlock(&vm_daemon_mtx);
|
1996-02-22 10:57:37 +00:00
|
|
|
}
|
|
|
|
|
1995-08-28 09:19:25 +00:00
|
|
|
static void
|
1995-02-25 18:39:04 +00:00
|
|
|
vm_daemon()
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
{
|
2004-03-04 09:36:46 +00:00
|
|
|
struct rlimit rsslim;
|
1994-11-06 05:07:53 +00:00
|
|
|
struct proc *p;
|
Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
2002-06-29 17:26:22 +00:00
|
|
|
struct thread *td;
|
2009-04-19 20:53:47 +00:00
|
|
|
struct vmspace *vm;
|
2007-06-26 18:24:05 +00:00
|
|
|
int breakout, swapout_flags;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
|
|
|
while (TRUE) {
|
2007-06-26 18:24:05 +00:00
|
|
|
mtx_lock(&vm_daemon_mtx);
|
|
|
|
msleep(&vm_daemon_needed, &vm_daemon_mtx, PPAUSE, "psleep", 0);
|
|
|
|
swapout_flags = vm_pageout_req_swapout;
|
|
|
|
vm_pageout_req_swapout = 0;
|
|
|
|
mtx_unlock(&vm_daemon_mtx);
|
|
|
|
if (swapout_flags)
|
|
|
|
swapout_procs(swapout_flags);
|
|
|
|
|
1994-11-06 05:07:53 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* scan the processes for exceeding their rlimits or if
|
|
|
|
* process is swapped out -- deactivate pages
|
1994-11-06 05:07:53 +00:00
|
|
|
*/
|
2001-03-28 11:52:56 +00:00
|
|
|
sx_slock(&allproc_lock);
|
2007-01-17 15:05:52 +00:00
|
|
|
FOREACH_PROC_IN_SYSTEM(p) {
|
1999-02-19 19:14:48 +00:00
|
|
|
vm_pindex_t limit, size;
|
1994-11-06 05:07:53 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* if this is a system process or if we have already
|
|
|
|
* looked at this process, skip it.
|
|
|
|
*/
|
2003-04-22 20:03:08 +00:00
|
|
|
PROC_LOCK(p);
|
2009-04-19 20:53:47 +00:00
|
|
|
if (p->p_flag & (P_INEXEC | P_SYSTEM | P_WEXIT)) {
|
2003-04-22 20:03:08 +00:00
|
|
|
PROC_UNLOCK(p);
|
1994-11-06 05:07:53 +00:00
|
|
|
continue;
|
|
|
|
}
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
|
|
|
* if the process is in a non-running type state,
|
|
|
|
* don't touch it.
|
|
|
|
*/
|
Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
2002-06-29 17:26:22 +00:00
|
|
|
breakout = 0;
|
|
|
|
FOREACH_THREAD_IN_PROC(p, td) {
|
Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
|
|
|
thread_lock(td);
|
2002-09-11 08:13:56 +00:00
|
|
|
if (!TD_ON_RUNQ(td) &&
|
|
|
|
!TD_IS_RUNNING(td) &&
|
|
|
|
!TD_IS_SLEEPING(td)) {
|
Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
|
|
|
thread_unlock(td);
|
Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
2002-06-29 17:26:22 +00:00
|
|
|
breakout = 1;
|
|
|
|
break;
|
|
|
|
}
|
Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
|
|
|
thread_unlock(td);
|
Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
2002-06-29 17:26:22 +00:00
|
|
|
}
|
|
|
|
if (breakout) {
|
2003-04-22 20:03:08 +00:00
|
|
|
PROC_UNLOCK(p);
|
1994-11-06 05:07:53 +00:00
|
|
|
continue;
|
|
|
|
}
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
|
|
|
* get a limit
|
|
|
|
*/
|
2004-03-04 09:36:46 +00:00
|
|
|
lim_rlimit(p, RLIMIT_RSS, &rsslim);
|
1999-02-19 19:14:48 +00:00
|
|
|
limit = OFF_TO_IDX(
|
2004-02-04 21:52:57 +00:00
|
|
|
qmin(rsslim.rlim_cur, rsslim.rlim_max));
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* let processes that are swapped out really be
|
|
|
|
* swapped out set the limit to nothing (will force a
|
|
|
|
* swap-out.)
|
|
|
|
*/
|
2007-09-17 05:31:39 +00:00
|
|
|
if ((p->p_flag & P_INMEM) == 0)
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
limit = 0; /* XXX */
|
2009-04-19 20:53:47 +00:00
|
|
|
vm = vmspace_acquire_ref(p);
|
2003-04-22 20:03:08 +00:00
|
|
|
PROC_UNLOCK(p);
|
2009-04-19 20:53:47 +00:00
|
|
|
if (vm == NULL)
|
|
|
|
continue;
|
1994-11-06 05:07:53 +00:00
|
|
|
|
2009-04-19 20:53:47 +00:00
|
|
|
size = vmspace_resident_count(vm);
|
1994-11-06 05:07:53 +00:00
|
|
|
if (limit >= 0 && size >= limit) {
|
1999-02-19 19:14:48 +00:00
|
|
|
vm_pageout_map_deactivate_pages(
|
2009-04-19 20:53:47 +00:00
|
|
|
&vm->vm_map, limit);
|
1994-11-06 05:07:53 +00:00
|
|
|
}
|
2009-04-19 20:53:47 +00:00
|
|
|
vmspace_free(vm);
|
1994-11-06 05:07:53 +00:00
|
|
|
}
|
2001-03-28 11:52:56 +00:00
|
|
|
sx_sunlock(&allproc_lock);
|
1994-11-06 05:07:53 +00:00
|
|
|
}
|
|
|
|
}
|
2002-03-10 21:52:48 +00:00
|
|
|
#endif /* !defined(NO_SWAPPING) */
|