/*-
 * SPDX-License-Identifier: BSD-4-Clause
 *
 * Copyright (c) 1990 University of Utah.
 * Copyright (c) 1991 The Regents of the University of California.
 * All rights reserved.
 * Copyright (c) 1993, 1994 John S. Dyson
 * Copyright (c) 1995, David Greenman
 *
 * This code is derived from software contributed to Berkeley by
 * the Systems Programming Group of the University of Utah Computer
 * Science Department.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *	This product includes software developed by the University of
 *	California, Berkeley and its contributors.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	from: @(#)vnode_pager.c	7.5 (Berkeley) 4/20/91
 */

/*
 * Page to/from files (vnodes).
 */

/*
 * TODO:
 *	Implement VOP_GETPAGES/PUTPAGES interface for filesystems. Will
 *	greatly re-simplify the vnode_pager.
 */

#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include "opt_vm.h"

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/systm.h>
#include <sys/sysctl.h>
#include <sys/proc.h>
#include <sys/vnode.h>
#include <sys/mount.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/vmmeter.h>
#include <sys/ktr.h>
#include <sys/limits.h>
#include <sys/conf.h>
#include <sys/rwlock.h>
#include <sys/sf_buf.h>
#include <sys/domainset.h>

#include <machine/atomic.h>

#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>
#include <vm/vm_pager.h>
#include <vm/vm_map.h>
#include <vm/vnode_pager.h>
#include <vm/vm_extern.h>
#include <vm/uma.h>

static int vnode_pager_addr(struct vnode *vp, vm_ooffset_t address,
    daddr_t *rtaddress, int *run);
static int vnode_pager_input_smlfs(vm_object_t object, vm_page_t m);
static int vnode_pager_input_old(vm_object_t object, vm_page_t m);
static void vnode_pager_dealloc(vm_object_t);
static int vnode_pager_getpages(vm_object_t, vm_page_t *, int, int *, int *);
static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int, int *,
    int *, vop_getpages_iodone_t, void *);
static void vnode_pager_putpages(vm_object_t, vm_page_t *, int, int, int *);
static boolean_t vnode_pager_haspage(vm_object_t, vm_pindex_t, int *, int *);
static vm_object_t vnode_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
    vm_ooffset_t, struct ucred *cred);
static int vnode_pager_generic_getpages_done(struct buf *);
static void vnode_pager_generic_getpages_done_async(struct buf *);
static void vnode_pager_update_writecount(vm_object_t, vm_offset_t,
    vm_offset_t);
static void vnode_pager_release_writecount(vm_object_t, vm_offset_t,
    vm_offset_t);
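
/*
 * Ops vector through which the generic pager layer dispatches into the
 * vnode pager for vnode-backed (OBJT_VNODE) objects.
 */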
struct pagerops vnodepagerops = {
	.pgo_alloc =	vnode_pager_alloc,
	.pgo_dealloc =	vnode_pager_dealloc,
	.pgo_getpages =	vnode_pager_getpages,
	.pgo_getpages_async = vnode_pager_getpages_async,
	.pgo_putpages =	vnode_pager_putpages,
	.pgo_haspage =	vnode_pager_haspage,
	.pgo_update_writecount = vnode_pager_update_writecount,
	.pgo_release_writecount = vnode_pager_release_writecount,
};

static struct domainset *vnode_domainset = NULL;

SYSCTL_PROC(_debug, OID_AUTO, vnode_domainset, CTLTYPE_STRING | CTLFLAG_RW,
    &vnode_domainset, 0, sysctl_handle_domainset, "A",
    "Default vnode NUMA policy");
static int nvnpbufs;
SYSCTL_INT(_vm, OID_AUTO, vnode_pbufs, CTLFLAG_RDTUN | CTLFLAG_NOFETCH,
    &nvnpbufs, 0, "number of physical buffers allocated for vnode pager");

static uma_zone_t vnode_pbuf_zone;

static void
vnode_pager_init(void *dummy)
{
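
	/*
	 * Scale the private pbuf zone off nswbuf; 64-bit kernels have
	 * address space to spare, 32-bit kernels do not.
	 */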
#ifdef __LP64__
	nvnpbufs = nswbuf * 2;
#else
	nvnpbufs = nswbuf / 2;
#endif
	TUNABLE_INT_FETCH("vm.vnode_pbufs", &nvnpbufs);
	vnode_pbuf_zone = pbuf_zsecond_create("vnpbuf", nvnpbufs);
}
SYSINIT(vnode_pager, SI_SUB_CPU, SI_ORDER_ANY, vnode_pager_init, NULL);

/* Create the VM system backing object for this vnode */
int
vnode_create_vobject(struct vnode *vp, off_t isize, struct thread *td)
{
	vm_object_t object;
	vm_ooffset_t size = isize;
	struct vattr va;

	if (!vn_isdisk(vp, NULL) && vn_canvmio(vp) == FALSE)
		return (0);

	object = vp->v_object;
	if (object != NULL)
		return (0);
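
	/*
	 * If no size was passed in, derive one: disk devices get the
	 * largest size the pager can address, regular files report
	 * their size through VOP_GETATTR().
	 */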
	if (size == 0) {
		if (vn_isdisk(vp, NULL)) {
			size = IDX_TO_OFF(INT_MAX);
		} else {
			if (VOP_GETATTR(vp, &va, td->td_ucred))
				return (0);
			size = va.va_size;
		}
	}

	object = vnode_pager_alloc(vp, size, 0, 0, td->td_ucred);
	/*
	 * Dereference the reference we just created.  This assumes
	 * that the object is associated with the vp.
	 */
	VM_OBJECT_WLOCK(object);
	object->ref_count--;
	VM_OBJECT_WUNLOCK(object);
	vrele(vp);

	KASSERT(vp->v_object != NULL, ("vnode_create_vobject: NULL object"));

	return (0);
}
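
/*
 * Tear down the VM object backing this vnode, typically from the
 * vnode reclamation path.
 */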
void
vnode_destroy_vobject(struct vnode *vp)
{
	struct vm_object *obj;

	obj = vp->v_object;
	if (obj == NULL || obj->handle != vp)
		return;
	ASSERT_VOP_ELOCKED(vp, "vnode_destroy_vobject");
	VM_OBJECT_WLOCK(obj);
	MPASS(obj->type == OBJT_VNODE);
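	/*
	 * Give the umtx shared-memory code a chance to drop any state
	 * it keyed to this object.
	 */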
	umtx_shm_object_terminated(obj);
	if (obj->ref_count == 0) {
		/*
		 * don't double-terminate the object
		 */
		if ((obj->flags & OBJ_DEAD) == 0) {
			vm_object_set_flag(obj, OBJ_DEAD);

			/*
			 * Clean pages and flush buffers.
			 */
			vm_object_page_clean(obj, 0, 0, OBJPC_SYNC);
			VM_OBJECT_WUNLOCK(obj);

			vinvalbuf(vp, V_SAVE, 0, 0);

			BO_LOCK(&vp->v_bufobj);
			vp->v_bufobj.bo_flag |= BO_DEAD;
			BO_UNLOCK(&vp->v_bufobj);

			VM_OBJECT_WLOCK(obj);
			vm_object_terminate(obj);
		} else {
			/*
			 * Waiters were already handled during object
			 * termination.  The exclusive vnode lock hopefully
			 * prevented new waiters from referencing the dying
			 * object.
			 */
			vp->v_object = NULL;
			VM_OBJECT_WUNLOCK(obj);
		}
	} else {
		/*
		 * Woe to the process that tries to page now :-).
		 */
		vm_pager_deallocate(obj);
		VM_OBJECT_WUNLOCK(obj);
	}
	KASSERT(vp->v_object == NULL, ("vp %p obj %p", vp, vp->v_object));
}

/*
 * Allocate (or lookup) pager for a vnode.
 * Handle is a vnode pointer.
 */
vm_object_t
vnode_pager_alloc(void *handle, vm_ooffset_t size, vm_prot_t prot,
    vm_ooffset_t offset, struct ucred *cred)
{
	vm_object_t object;
	struct vnode *vp;

	/*
	 * Pageout to vnode, no can do yet.
	 */
	if (handle == NULL)
		return (NULL);

	vp = (struct vnode *)handle;
	ASSERT_VOP_LOCKED(vp, "vnode_pager_alloc");
	KASSERT(vp->v_usecount != 0, ("vnode_pager_alloc: no vnode reference"));
retry:
	object = vp->v_object;

	if (object == NULL) {
		/*
		 * Add an object of the appropriate size
		 */
		object = vm_object_allocate(OBJT_VNODE,
		    OFF_TO_IDX(round_page(size)));

		object->un_pager.vnp.vnp_size = size;
		object->un_pager.vnp.writemappings = 0;
		object->domain.dr_policy = vnode_domainset;
|
1994-08-04 03:06:48 +00:00
|
|
|
|
		object->handle = handle;
		VI_LOCK(vp);
		if (vp->v_object != NULL) {
			/*
			 * Object has been created while we were allocating.
			 */
			VI_UNLOCK(vp);
			VM_OBJECT_WLOCK(object);
			KASSERT(object->ref_count == 1,
			    ("leaked ref %p %d", object, object->ref_count));
			object->type = OBJT_DEAD;
			object->ref_count = 0;
			VM_OBJECT_WUNLOCK(object);
			vm_object_destroy(object);
			goto retry;
		}
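		/*
		 * (Expository note: the branch above handles the lost
		 * race where another thread installed vp->v_object while
		 * this one was allocating without the vnode interlock;
		 * the loser marks its fresh object OBJT_DEAD, destroys
		 * it, and retries the lookup from scratch.)
		 */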
		vp->v_object = object;
		VI_UNLOCK(vp);
	} else {
		VM_OBJECT_WLOCK(object);
Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_objects are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, the underlying
vnode reference count is incremented as well. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics as the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.
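(A toy userland model of the invariant described above: every reference
on the VM object also holds a reference on its vnode, so the vnode
cannot be reclaimed out from under a mapped file. Names such as
toy_vnode and toy_object are invented for illustration and are not part
of the kernel source.)

#include <assert.h>
#include <stdio.h>

struct toy_vnode { int v_usecount; };
struct toy_object { int ref_count; struct toy_vnode *handle; };

/* Taking an object reference also takes a vnode reference. */
static void
toy_object_ref(struct toy_object *obj)
{
	obj->ref_count++;
	obj->handle->v_usecount++;
}

static void
toy_object_unref(struct toy_object *obj)
{
	assert(obj->ref_count > 0);
	obj->ref_count--;
	obj->handle->v_usecount--;
}

int
main(void)
{
	struct toy_vnode vn = { .v_usecount = 1 };	/* caller's reference */
	struct toy_object obj = { 0, &vn };

	toy_object_ref(&obj);
	toy_object_ref(&obj);
	printf("object refs %d, vnode usecount %d\n", obj.ref_count,
	    vn.v_usecount);
	toy_object_unref(&obj);
	toy_object_unref(&obj);
	assert(vn.v_usecount == 1);	/* only the caller's reference left */
	return (0);
}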
		object->ref_count++;
#if VM_NRESERVLEVEL > 0
		vm_object_color(object, 0);
#endif
		VM_OBJECT_WUNLOCK(object);
	}
	vrefact(vp);
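	/*
	 * (Expository note: vrefact(9) takes a reference on a vnode that
	 * is already known to be referenced, which holds here because
	 * the caller passed in a live vnode.)
	 */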
	return (object);
}

/*
 * The object must be locked.
 */
static void
vnode_pager_dealloc(vm_object_t object)
{
	struct vnode *vp;
	int refs;

	vp = object->handle;
	if (vp == NULL)
		panic("vnode_pager_dealloc: pager already dealloced");
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc.)
to support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support the merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it; it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with the merged VM/buffer cache scheme.
pmap.c, vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft that supported the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for the merged VM/buffer cache. Add "bypass" support for sneaking in
on busy buffers.
Submitted by: John Dyson and David Greenman

	VM_OBJECT_ASSERT_WLOCKED(object);
	vm_object_pip_wait(object, "vnpdea");
	refs = object->ref_count;

	object->handle = NULL;
	object->type = OBJT_DEAD;
	ASSERT_VOP_ELOCKED(vp, "vnode_pager_dealloc");
	if (object->un_pager.vnp.writemappings > 0) {
		object->un_pager.vnp.writemappings = 0;
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks the text vnode exclusive to be able to set and clear
the VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as the MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.
Operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around the VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so the nullfs
vnode has its own v_writecount correct, and the lower vnode gets all
references, since object->handle is always the lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
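(A minimal userland sketch of the v_writecount protocol described
above, assuming the simplified rule that writers drive the counter
positive while text mappings drive it negative, and the two states
exclude each other; text_ref() and writer_ref() are invented names,
not kernel interfaces.)

#include <errno.h>
#include <stdio.h>

/* Take a text (executable-image) reference: fails if writers exist. */
static int
text_ref(int *v_writecount)
{
	if (*v_writecount > 0)
		return (ETXTBSY);
	(*v_writecount)--;
	return (0);
}

/* Take a writer reference: fails if the vnode backs executable text. */
static int
writer_ref(int *v_writecount)
{
	if (*v_writecount < 0)
		return (ETXTBSY);
	(*v_writecount)++;
	return (0);
}

int
main(void)
{
	int wc = 0;

	text_ref(&wc);			/* exec maps the file: wc == -1 */
	if (writer_ref(&wc) == ETXTBSY)
		printf("write open refused, text busy (wc=%d)\n", wc);
	return (0);
}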
		VOP_ADD_WRITECOUNT_CHECKED(vp, -1);
		CTR3(KTR_VFS, "%s: vp %p v_writecount decreased to %d",
		    __func__, vp, vp->v_writecount);
	}
	vp->v_object = NULL;
	VI_LOCK(vp);

	/*
	 * vm_map_entry_set_vnode_text() cannot reach this vnode by
	 * following object->handle.  Clear all text references now.
	 * This also clears the transient references from
	 * kern_execve(), which is fine because dead_vnodeops uses nop
	 * for VOP_UNSET_TEXT().
	 */
	if (vp->v_writecount < 0)
		vp->v_writecount = 0;
	VI_UNLOCK(vp);
	VM_OBJECT_WUNLOCK(object);
	while (refs-- > 0)
		vunref(vp);
	VM_OBJECT_WLOCK(object);
}

static boolean_t
vnode_pager_haspage(vm_object_t object, vm_pindex_t pindex, int *before,
    int *after)
{
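	/*
	 * (Expository note: a pager's haspage method reports whether the
	 * page at pindex is backed by the file and, via the optional
	 * "before"/"after" pointers, how many contiguous pages precede
	 * and follow it, i.e. the clustering hint discussed in TODO
	 * item 2 of the overhaul notes above.)
	 */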
	struct vnode *vp = object->handle;
	daddr_t bn;
	uintptr_t lockstate;
	int err;
	daddr_t reqblock;
	int poff;
	int bsize;
	int pagesperblock, blocksperpage;

	VM_OBJECT_ASSERT_LOCKED(object);
	/*
	 * If no vp or vp is doomed or marked transparent to VM, we do not
	 * have the page.
	 */
	if (vp == NULL || vp->v_iflag & VI_DOOMED)
		return FALSE;
	/*
	 * If the offset is beyond end of file we do
	 * not have the page.
	 */
	if (IDX_TO_OFF(pindex) >= object->un_pager.vnp.vnp_size)
		return FALSE;

	bsize = vp->v_mount->mnt_stat.f_iosize;
	pagesperblock = bsize / PAGE_SIZE;
	blocksperpage = 0;
	if (pagesperblock > 0) {
		reqblock = pindex / pagesperblock;
	} else {
		blocksperpage = (PAGE_SIZE / bsize);
		reqblock = pindex * blocksperpage;
	}
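	/*
	 * (Worked example with illustrative values: with PAGE_SIZE 4096
	 * and a 32KB filesystem block, pagesperblock == 8, so page index
	 * 21 maps to block 21 / 8 == 2.  With a 512-byte block instead,
	 * pagesperblock == 0 and blocksperpage == 8, so page index 21
	 * maps to block 21 * 8 == 168.)
	 */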
	lockstate = VM_OBJECT_DROP(object);
	err = VOP_BMAP(vp, reqblock, NULL, &bn, after, before);
	VM_OBJECT_PICKUP(object, lockstate);
	if (err)
		return TRUE;
	if (bn == -1)
		return FALSE;
	if (pagesperblock > 0) {
		poff = pindex - (reqblock * pagesperblock);
		if (before) {
			*before *= pagesperblock;
			*before += poff;
		}
		if (after) {
			/*
			 * The BMAP vop can report a partial block in the
			 * 'after', but must not report blocks after EOF.
			 * Assert the latter, and truncate 'after' in case
			 * of the former.
			 */
			KASSERT((reqblock + *after) * pagesperblock <
			    roundup2(object->size, pagesperblock),
			    ("%s: reqblock %jd after %d size %ju", __func__,
			    (intmax_t)reqblock, *after,
			    (uintmax_t)object->size));
			*after *= pagesperblock;
			*after += pagesperblock - (poff + 1);
			if (pindex + *after >= object->size)
				*after = object->size - 1 - pindex;
		}
	} else {
		if (before) {
			*before /= blocksperpage;
		}

		if (after) {
			*after /= blocksperpage;
		}
	}
	return TRUE;
}
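
/*
 * Worked example (illustrative only, assuming bsize = 32768 and
 * PAGE_SIZE = 4096, so pagesperblock = 8): for pindex = 21,
 * reqblock = 21 / 8 = 2 and poff = 21 - 2 * 8 = 5.  If VOP_BMAP()
 * reports before = 1 and after = 1 (one contiguous block on each side),
 * the conversion above yields *before = 1 * 8 + 5 = 13 pages and
 * *after = 1 * 8 + (8 - 6) = 10 pages, with *after then clipped so that
 * pindex + *after stays below object->size.
 */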

/*
 * Lets the VM system know about a change in size for a file.
 * We adjust our own internal size and flush any cached pages in
 * the associated object that are affected by the size change.
 *
 * Note: this routine may be invoked as a result of a pager put
 * operation (possibly at object termination time), so we must be careful.
 */
void
vnode_pager_setsize(struct vnode *vp, vm_ooffset_t nsize)
{
	vm_object_t object;
	vm_page_t m;
	vm_pindex_t nobjsize;

	if ((object = vp->v_object) == NULL)
		return;
	/* ASSERT_VOP_ELOCKED(vp, "vnode_pager_setsize and not locked vnode"); */
	VM_OBJECT_WLOCK(object);
	if (object->type == OBJT_DEAD) {
		VM_OBJECT_WUNLOCK(object);
		return;
	}
	KASSERT(object->type == OBJT_VNODE,
	    ("not vnode-backed object %p", object));
	if (nsize == object->un_pager.vnp.vnp_size) {
		/*
		 * Hasn't changed size
		 */
		VM_OBJECT_WUNLOCK(object);
		return;
	}
	nobjsize = OFF_TO_IDX(nsize + PAGE_MASK);
	if (nsize < object->un_pager.vnp.vnp_size) {
		/*
		 * File has shrunk. Toss any cached pages beyond the new EOF.
		 */
		if (nobjsize < object->size)
			vm_object_page_remove(object, nobjsize, object->size,
			    0);
		/*
		 * this gets rid of garbage at the end of a page that is now
		 * only partially backed by the vnode.
		 *
		 * XXX for some reason (I don't know yet), if we take a
		 * completely invalid page and mark it partially valid
		 * it can screw up NFS reads, so we don't allow the case.
		 */
		if (!(nsize & PAGE_MASK))
			goto out;
		m = vm_page_grab(object, OFF_TO_IDX(nsize), VM_ALLOC_NOCREAT);
		if (m == NULL)
			goto out;
		if (!vm_page_none_valid(m)) {
			int base = (int)nsize & PAGE_MASK;
			int size = PAGE_SIZE - base;

			/*
			 * Clear out partial-page garbage in case
			 * the page has been mapped.
			 */
			pmap_zero_page_area(m, base, size);

			/*
			 * Update the valid bits to reflect the blocks that
			 * have been zeroed.  Some of these valid bits may
			 * have already been set.
			 */
			vm_page_set_valid_range(m, base, size);

			/*
			 * Round "base" to the next block boundary so that the
			 * dirty bit for a partially zeroed block is not
			 * cleared.
			 */
			base = roundup2(base, DEV_BSIZE);

			/*
			 * Clear out partial-page dirty bits.
			 *
			 * note that we do not clear out the valid
			 * bits.  This would prevent bogus_page
			 * replacement from working properly.
			 */
			vm_page_clear_dirty(m, base, PAGE_SIZE - base);
		}
		vm_page_xunbusy(m);
	}
out:
	object->un_pager.vnp.vnp_size = nsize;
	object->size = nobjsize;
	VM_OBJECT_WUNLOCK(object);
}
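
/*
 * Worked example (illustrative only, assuming PAGE_SIZE = 4096 and
 * DEV_BSIZE = 512): truncating a 10000-byte file to nsize = 6000 gives
 * nobjsize = OFF_TO_IDX(6000 + 4095) = 2, so pages [2, 3) are removed.
 * Since 6000 & PAGE_MASK = 1904 is non-zero, page 1 is grabbed and bytes
 * [1904, 4096) are zeroed and marked valid; base is then rounded up to
 * 2048 so the dirty bits of the partially zeroed 512-byte block holding
 * byte 1904 survive, and dirty bits for [2048, 4096) are cleared.
 */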

/*
 * Calculate the linear (byte) disk address of the specified virtual
 * file address.
 */
static int
vnode_pager_addr(struct vnode *vp, vm_ooffset_t address, daddr_t *rtaddress,
    int *run)
{
	int bsize;
	int err;
	daddr_t vblock;
	daddr_t voffset;

	if (address < 0)
		return -1;

	if (vp->v_iflag & VI_DOOMED)
		return -1;

	bsize = vp->v_mount->mnt_stat.f_iosize;
	vblock = address / bsize;
	voffset = address % bsize;

	err = VOP_BMAP(vp, vblock, NULL, rtaddress, run, NULL);
	if (err == 0) {
		if (*rtaddress != -1)
			*rtaddress += voffset / DEV_BSIZE;
		if (run) {
			*run += 1;
			*run *= bsize / PAGE_SIZE;
			*run -= voffset / PAGE_SIZE;
		}
	}

	return (err);
}
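
/*
 * Worked example (illustrative only, assuming bsize = 8192,
 * PAGE_SIZE = 4096 and DEV_BSIZE = 512): for address = 20000,
 * vblock = 2 and voffset = 3616.  If VOP_BMAP() maps logical block 2 to
 * disk block bn with run = 3 contiguous blocks following it, then
 * *rtaddress = bn + 3616 / 512 = bn + 7, and the run is rescaled to
 * pages: (3 + 1) * 2 - 3616 / 4096 = 8 pages readable from *rtaddress.
 */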

/*
 * Small-block filesystem vnode pager input.
 */
static int
vnode_pager_input_smlfs(vm_object_t object, vm_page_t m)
{
	struct vnode *vp;
	struct bufobj *bo;
	struct buf *bp;
	struct sf_buf *sf;
	daddr_t fileaddr;
	vm_offset_t bsize;
	vm_page_bits_t bits;
	int error, i;

	error = 0;
	vp = object->handle;
	if (vp->v_iflag & VI_DOOMED)
		return VM_PAGER_BAD;

	bsize = vp->v_mount->mnt_stat.f_iosize;

	VOP_BMAP(vp, 0, &bo, 0, NULL, NULL);

	sf = sf_buf_alloc(m, 0);

	for (i = 0; i < PAGE_SIZE / bsize; i++) {
		vm_ooffset_t address;

		bits = vm_page_bits(i * bsize, bsize);
		if (m->valid & bits)
			continue;

		address = IDX_TO_OFF(m->pindex) + i * bsize;
		if (address >= object->un_pager.vnp.vnp_size) {
			fileaddr = -1;
		} else {
			error = vnode_pager_addr(vp, address, &fileaddr, NULL);
			if (error)
				break;
		}
		if (fileaddr != -1) {
			bp = uma_zalloc(vnode_pbuf_zone, M_WAITOK);

			/* build a minimal buffer header */
			bp->b_iocmd = BIO_READ;
			bp->b_iodone = bdone;
			KASSERT(bp->b_rcred == NOCRED, ("leaking read ucred"));
			KASSERT(bp->b_wcred == NOCRED, ("leaking write ucred"));
			bp->b_rcred = crhold(curthread->td_ucred);
			bp->b_wcred = crhold(curthread->td_ucred);
			bp->b_data = (caddr_t)sf_buf_kva(sf) + i * bsize;
			bp->b_blkno = fileaddr;
			pbgetbo(bo, bp);
			bp->b_vp = vp;
			bp->b_bcount = bsize;
			bp->b_bufsize = bsize;
			bp->b_runningbufspace = bp->b_bufsize;
			atomic_add_long(&runningbufspace, bp->b_runningbufspace);

			/* do the input */
			bp->b_iooffset = dbtob(bp->b_blkno);
			bstrategy(bp);

			bwait(bp, PVM, "vnsrd");

			if ((bp->b_ioflags & BIO_ERROR) != 0)
				error = EIO;

			/*
			 * free the buffer header back to the swap buffer pool
			 */
			bp->b_vp = NULL;
			pbrelbo(bp);
			uma_zfree(vnode_pbuf_zone, bp);
			if (error)
				break;
		} else
			bzero((caddr_t)sf_buf_kva(sf) + i * bsize, bsize);
		KASSERT((m->dirty & bits) == 0,
		    ("vnode_pager_input_smlfs: page %p is dirty", m));
		VM_OBJECT_WLOCK(object);
		m->valid |= bits;
		VM_OBJECT_WUNLOCK(object);
	}
	sf_buf_free(sf);
	if (error) {
		return VM_PAGER_ERROR;
	}
	return VM_PAGER_OK;
}
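
/*
 * Illustrative note: with bsize = 1024 and PAGE_SIZE = 4096 the loop above
 * runs i = 0..3, and vm_page_bits(i * 1024, 1024) selects the valid/dirty
 * bits covering one 1024-byte block.  A block that is already valid is
 * skipped, a block past vnp_size is zero-filled via bzero(), and every
 * other block is read with its own minimal buf, so a single page may mix
 * real I/O with zero-fill.
 */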

/*
 * Old-style vnode pager input routine.
 */
static int
vnode_pager_input_old(vm_object_t object, vm_page_t m)
{
	struct uio auio;
	struct iovec aiov;
	int error;
	int size;
	struct sf_buf *sf;
	struct vnode *vp;

	VM_OBJECT_ASSERT_WLOCKED(object);
	error = 0;

	/*
	 * Return failure if beyond current EOF
	 */
	if (IDX_TO_OFF(m->pindex) >= object->un_pager.vnp.vnp_size) {
		return VM_PAGER_BAD;
	} else {
		size = PAGE_SIZE;
		if (IDX_TO_OFF(m->pindex) + size > object->un_pager.vnp.vnp_size)
			size = object->un_pager.vnp.vnp_size - IDX_TO_OFF(m->pindex);
		vp = object->handle;
		VM_OBJECT_WUNLOCK(object);
		/*
		 * Allocate a kernel virtual address and initialize so that
		 * we can use VOP_READ/WRITE routines.
		 */
		sf = sf_buf_alloc(m, 0);

		aiov.iov_base = (caddr_t)sf_buf_kva(sf);
		aiov.iov_len = size;
		auio.uio_iov = &aiov;
		auio.uio_iovcnt = 1;
		auio.uio_offset = IDX_TO_OFF(m->pindex);
		auio.uio_segflg = UIO_SYSSPACE;
		auio.uio_rw = UIO_READ;
		auio.uio_resid = size;
		auio.uio_td = curthread;

		error = VOP_READ(vp, &auio, 0, curthread->td_ucred);
		if (!error) {
			int count = size - auio.uio_resid;

			if (count == 0)
				error = EINVAL;
			else if (count != PAGE_SIZE)
				bzero((caddr_t)sf_buf_kva(sf) + count,
				    PAGE_SIZE - count);
		}
		sf_buf_free(sf);

		VM_OBJECT_WLOCK(object);
	}
	KASSERT(m->dirty == 0, ("vnode_pager_input_old: page %p is dirty", m));
	if (!error)
		vm_page_valid(m);
	return error ? VM_PAGER_ERROR : VM_PAGER_OK;
}
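
/*
 * Illustrative note: if the VOP_READ() above comes back short, e.g.
 * auio.uio_resid leaves count = 3000 with PAGE_SIZE = 4096, the remaining
 * bytes [3000, 4096) are bzero()ed so the page never exposes stale data;
 * only a zero-length transfer (count == 0) is treated as an error.
 */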

/*
 * generic vnode pager input routine
 */

/*
 * Local media VFS's that do not implement their own VOP_GETPAGES
 * should have their VOP_GETPAGES call vnode_pager_generic_getpages()
 * to implement the previous behaviour.
 *
 * All other FS's should use the bypass to get to the local media
 * backing vp's VOP_GETPAGES.
 */
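
/*
 * Sketch (illustrative; "xxfs" is a hypothetical filesystem and the exact
 * vop-vector glue varies): a local-media filesystem without a private
 * getpages implementation can forward its VOP_GETPAGES to
 * vnode_pager_generic_getpages(), e.g.
 *
 *	static int
 *	xxfs_getpages(struct vop_getpages_args *ap)
 *	{
 *
 *		return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m,
 *		    ap->a_count, ap->a_rbehind, ap->a_rahead, NULL, NULL));
 *	}
 */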
static int
vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count, int *rbehind,
    int *rahead)
{
	struct vnode *vp;
	int rtval;

	vp = object->handle;
	VM_OBJECT_WUNLOCK(object);
|
|
|
rtval = VOP_GETPAGES(vp, m, count, rbehind, rahead);
|
2001-05-19 01:28:09 +00:00
|
|
|
KASSERT(rtval != EOPNOTSUPP,
|
|
|
|
("vnode_pager: FS getpages not implemented\n"));
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
1998-02-26 06:39:59 +00:00
|
|
|
return (rtval);
|
1995-09-04 04:44:26 +00:00
|
|
|
}
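To illustrate the consumer side of the KPI change quoted above, here is a
minimal sketch (not code from this file): "obj" and "ma" are assumed to be
a write-locked, vnode-backed object and an array of "count" busied,
consecutive pages, and error handling is elided.

/*
 * Sketch of an arrayed request under the new KPI; assumptions as
 * noted above.  Per the commit message, an arrayed request is
 * preceded by a vm_pager_haspage() check.
 */
int before, after, rbehind, rahead, rv;

if (!vm_pager_haspage(obj, ma[0]->pindex, &before, &after) ||
    after + 1 < count)
	return (VM_PAGER_FAIL);		/* the range cannot be satisfied */

rbehind = before;			/* willing to read this far behind */
rahead = after - (count - 1);		/* ... and this far ahead */
rv = vm_pager_get_pages(obj, ma, count, &rbehind, &rahead);
/*
 * On VM_PAGER_OK all "count" pages are valid and still busied, and
 * rbehind/rahead report back what the pager actually did; those extra
 * pages are owned by the pager, not by the caller.
 */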
|
|
|
|
|
2014-11-23 12:01:52 +00:00
|
|
|
static int
|
|
|
|
vnode_pager_getpages_async(vm_object_t object, vm_page_t *m, int count,
|
2015-12-16 21:30:45 +00:00
|
|
|
int *rbehind, int *rahead, vop_getpages_iodone_t iodone, void *arg)
|
2014-11-23 12:01:52 +00:00
|
|
|
{
|
|
|
|
struct vnode *vp;
|
|
|
|
int rtval;
|
|
|
|
|
|
|
|
vp = object->handle;
|
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2015-12-16 21:30:45 +00:00
|
|
|
rtval = VOP_GETPAGES_ASYNC(vp, m, count, rbehind, rahead, iodone, arg);
|
2014-11-23 12:01:52 +00:00
|
|
|
KASSERT(rtval != EOPNOTSUPP,
|
|
|
|
("vnode_pager: FS getpages_async not implemented\n"));
|
|
|
|
VM_OBJECT_WLOCK(object);
|
|
|
|
return (rtval);
|
|
|
|
}
|
|
|
|
|
2014-09-15 12:28:29 +00:00
|
|
|
/*
|
2014-11-23 12:01:52 +00:00
|
|
|
* The implementation of VOP_GETPAGES() and VOP_GETPAGES_ASYNC() for
|
|
|
|
* local filesystems, where partially valid pages can only occur at
|
|
|
|
* the end of file.
|
2014-09-15 12:28:29 +00:00
|
|
|
*/
|
|
|
|
int
|
|
|
|
vnode_pager_local_getpages(struct vop_getpages_args *ap)
|
2014-11-23 12:01:52 +00:00
|
|
|
{
|
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count,
|
|
|
|
ap->a_rbehind, ap->a_rahead, NULL, NULL));
|
2014-11-23 12:01:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
vnode_pager_local_getpages_async(struct vop_getpages_async_args *ap)
|
|
|
|
{
|
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count,
|
|
|
|
ap->a_rbehind, ap->a_rahead, ap->a_iodone, ap->a_arg));
|
2014-09-15 12:28:29 +00:00
|
|
|
}
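For illustration, a local filesystem can point its VOP table at these
defaults. struct vop_vector and the .vop_getpages/.vop_getpages_async
fields are the real interface; "myfs" is a hypothetical filesystem name.

/* Sketch only; "myfs" is made up. */
static struct vop_vector myfs_vnodeops = {
	.vop_default		= &default_vnodeops,
	.vop_getpages		= vnode_pager_local_getpages,
	.vop_getpages_async	= vnode_pager_local_getpages_async,
	/* ... remaining VOPs ... */
};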
|
|
|
|
|
1998-02-26 06:39:59 +00:00
|
|
|
/*
|
|
|
|
* This is now called from local media FS's to operate against their
|
|
|
|
* own vnodes if they fail to implement VOP_GETPAGES.
|
|
|
|
*/
|
|
|
|
int
|
2015-12-16 21:30:45 +00:00
|
|
|
vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m, int count,
|
|
|
|
int *a_rbehind, int *a_rahead, vop_getpages_iodone_t iodone, void *arg)
|
1994-05-25 09:21:21 +00:00
|
|
|
{
|
1998-02-26 06:39:59 +00:00
|
|
|
vm_object_t object;
|
2004-11-15 09:18:27 +00:00
|
|
|
struct bufobj *bo;
|
1995-03-19 23:46:25 +00:00
|
|
|
struct buf *bp;
|
2015-12-16 21:30:45 +00:00
|
|
|
off_t foff;
|
2016-11-17 20:32:32 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
off_t blkno0;
|
|
|
|
#endif
|
Allocate pager bufs from UMA instead of an 80-ish mutex-protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting for how
many pbufs we are going to have.
In the various subsystems that are going to utilize pbufs, create private
zones via a call to pbuf_zsecond_create(). The latter calls uma_zsecond_create()
and sets a limit on the created zone. After startup, preallocate pbufs
according to the requirements of all pbuf zones.
Subsystems that used to have a private limit with the old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use the shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their own private limits, but changing that is out
of scope of this commit.
o Fetch the tunable value of kern.nswbuf from init_param2() and, while here,
move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, which was holding
only this option.
Default values aren't touched by this commit, but they probably should be
reviewed with respect to modern hardware.
This change removes a tight bottleneck from the sendfile(2) operation, which
uses pbufs in the vnode pager. Other pagers would also benefit from faster
allocation.
Together with: gallatin
Tested by: pho
2019-01-15 01:02:16 +00:00
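A minimal sketch of the private pbuf zone pattern described above, for a
hypothetical subsystem: "mysubsys" and the limit of 128 are illustrative,
while pbuf_zsecond_create(), uma_zalloc() and uma_zfree() are the real KPI.

static uma_zone_t mysubsys_pbuf_zone;

static void
mysubsys_init(void)
{
	/* Second-level pbuf zone with a private allocation limit. */
	mysubsys_pbuf_zone = pbuf_zsecond_create("mysubsyspbuf", 128);
}

static void
mysubsys_io(void)
{
	struct buf *bp;

	bp = uma_zalloc(mysubsys_pbuf_zone, M_WAITOK);
	/* ... set up and issue the I/O using bp ... */
	uma_zfree(mysubsys_pbuf_zone, bp);
}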
|
|
|
int bsize, pagesperblock;
|
2015-12-16 21:30:45 +00:00
|
|
|
int error, before, after, rbehind, rahead, poff, i;
|
|
|
|
int bytecount, secmask;
|
1998-02-26 06:39:59 +00:00
|
|
|
|
2004-11-15 09:18:27 +00:00
|
|
|
KASSERT(vp->v_type != VCHR && vp->v_type != VBLK,
|
2015-12-16 21:30:45 +00:00
|
|
|
("%s does not support devices", __func__));
|
|
|
|
|
2006-02-06 10:14:12 +00:00
|
|
|
if (vp->v_iflag & VI_DOOMED)
|
2015-10-24 21:59:22 +00:00
|
|
|
return (VM_PAGER_BAD);
|
1995-10-23 02:23:29 +00:00
|
|
|
|
2015-10-24 21:59:22 +00:00
|
|
|
object = vp->v_object;
|
2015-12-16 21:30:45 +00:00
|
|
|
foff = IDX_TO_OFF(m[0]->pindex);
|
1994-05-25 09:21:21 +00:00
|
|
|
bsize = vp->v_mount->mnt_stat.f_iosize;
|
2015-12-16 21:30:45 +00:00
|
|
|
pagesperblock = bsize / PAGE_SIZE;
|
|
|
|
|
|
|
|
KASSERT(foff < object->un_pager.vnp.vnp_size,
|
|
|
|
("%s: page %p offset beyond vp %p size", __func__, m[0], vp));
|
2019-02-26 04:50:46 +00:00
|
|
|
KASSERT(count <= nitems(bp->b_pages),
|
2015-12-16 21:30:45 +00:00
|
|
|
("%s: requested %d pages", __func__, count));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The last page has valid blocks. Invalid part can only
|
|
|
|
* exist at the end of file, and the page is made fully valid
|
|
|
|
* by zeroing in vm_pager_get_pages().
|
|
|
|
*/
|
2019-10-15 03:45:41 +00:00
|
|
|
if (!vm_page_none_valid(m[count - 1]) && --count == 0) {
|
2015-12-16 21:30:45 +00:00
|
|
|
if (iodone != NULL)
|
|
|
|
iodone(arg, m, 1, 0);
|
|
|
|
return (VM_PAGER_OK);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2019-01-15 01:02:16 +00:00
|
|
|
bp = uma_zalloc(vnode_pbuf_zone, M_WAITOK);
|
2015-03-06 14:15:30 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2014-11-18 13:38:07 +00:00
|
|
|
* Get the underlying device blocks for the file with VOP_BMAP().
|
|
|
|
* If the file system doesn't support VOP_BMAP, use old way of
|
|
|
|
* getting pages via VOP_READ.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2015-12-16 21:30:45 +00:00
|
|
|
error = VOP_BMAP(vp, foff / bsize, &bo, &bp->b_blkno, &after, &before);
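/*
 * Per vop_bmap(9): the call above maps the logical block foff / bsize
 * to a device-relative block number in bp->b_blkno (with the backing
 * bufobj returned in bo), while "after" and "before" report how many
 * contiguous logical blocks follow and precede it.  The code below
 * scales those block counts into page counts for read ahead/behind.
 */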
|
2006-10-10 18:26:18 +00:00
|
|
|
if (error == EOPNOTSUPP) {
|
2019-01-15 01:02:16 +00:00
|
|
|
uma_zfree(vnode_pbuf_zone, bp);
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2015-12-16 21:30:45 +00:00
|
|
|
for (i = 0; i < count; i++) {
|
- Remove 'struct vmmeter' from 'struct pcpu', leaving only the global vmmeter
in place. To do per-CPU stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before UMA is set up and counter_u64_alloc() can be used, provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point the counter(9) fields of vmmeter at pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated, we already
point at a counter that can safely be written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h in pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match the kernel representation.
- struct vmmeter is hidden under _KERNEL, and only vmstat(1) is an exception.
This is based on benno@'s 4-year-old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
|
|
|
VM_CNT_INC(v_vnodein);
|
|
|
|
VM_CNT_INC(v_vnodepgsin);
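/*
 * VM_CNT_INC() bumps a counter(9)-backed vmmeter field, per the
 * vmmeter commit quoted above.  Sketch of the underlying counter(9)
 * pattern ("my_events" is an illustrative name):
 *
 *	static counter_u64_t my_events;
 *	my_events = counter_u64_alloc(M_WAITOK);   allocate per-CPU storage
 *	counter_u64_add(my_events, 1);             lockless per-CPU increment
 */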
|
2015-12-16 21:30:45 +00:00
|
|
|
error = vnode_pager_input_old(object, m[i]);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
}
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2003-10-25 05:21:16 +00:00
|
|
|
return (error);
|
2006-10-10 18:26:18 +00:00
|
|
|
} else if (error != 0) {
|
2019-01-15 01:02:16 +00:00
|
|
|
uma_zfree(vnode_pbuf_zone, bp);
|
2006-10-10 18:26:18 +00:00
|
|
|
return (VM_PAGER_ERROR);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1999-04-05 19:38:30 +00:00
|
|
|
|
2014-09-15 17:14:09 +00:00
|
|
|
/*
|
2015-12-16 21:30:45 +00:00
|
|
|
* If the file system supports BMAP, but blocksize is smaller
|
|
|
|
* than a page size, then use special small filesystem code.
|
2014-09-15 17:14:09 +00:00
|
|
|
*/
|
2015-12-16 21:30:45 +00:00
|
|
|
if (pagesperblock == 0) {
|
2019-01-15 01:02:16 +00:00
|
|
|
uma_zfree(vnode_pbuf_zone, bp);
|
2015-12-16 21:30:45 +00:00
|
|
|
for (i = 0; i < count; i++) {
|
2017-04-17 17:34:47 +00:00
|
|
|
VM_CNT_INC(v_vnodein);
|
|
|
|
VM_CNT_INC(v_vnodepgsin);
|
2015-12-16 21:30:45 +00:00
|
|
|
error = vnode_pager_input_smlfs(object, m[i]);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
2014-09-15 17:14:09 +00:00
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
2015-12-16 21:30:45 +00:00
|
|
|
* A sparse file can be encountered only for a single page request,
|
2016-05-02 20:16:29 +00:00
|
|
|
* which may not be preceded by a call to vm_pager_haspage().
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
2015-12-16 21:30:45 +00:00
|
|
|
if (bp->b_blkno == -1) {
|
|
|
|
KASSERT(count == 1,
|
|
|
|
("%s: array[%d] request to a sparse file %p", __func__,
|
|
|
|
count, vp));
|
2019-01-15 01:02:16 +00:00
|
|
|
uma_zfree(vnode_pbuf_zone, bp);
|
2015-12-16 21:30:45 +00:00
|
|
|
pmap_zero_page(m[0]);
|
|
|
|
KASSERT(m[0]->dirty == 0, ("%s: page %p is dirty",
|
|
|
|
__func__, m[0]));
|
2014-09-15 17:14:09 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2019-10-15 03:45:41 +00:00
|
|
|
vm_page_valid(m[0]);
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2006-10-08 20:26:16 +00:00
|
|
|
return (VM_PAGER_OK);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1995-03-19 23:46:25 +00:00
|
|
|
|
2016-11-17 20:32:32 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
blkno0 = bp->b_blkno;
|
|
|
|
#endif
|
2015-12-16 21:30:45 +00:00
|
|
|
bp->b_blkno += (foff % bsize) / DEV_BSIZE;
|
|
|
|
|
|
|
|
/* Recalculate blocks available after/before to pages. */
|
|
|
|
poff = (foff % bsize) / PAGE_SIZE;
|
|
|
|
before *= pagesperblock;
|
|
|
|
before += poff;
|
|
|
|
after *= pagesperblock;
|
|
|
|
after += pagesperblock - (poff + 1);
|
|
|
|
if (m[0]->pindex + after >= object->size)
|
|
|
|
after = object->size - 1 - m[0]->pindex;
|
|
|
|
KASSERT(count <= after + 1, ("%s: %d pages asked, can do only %d",
|
|
|
|
__func__, count, after + 1));
|
|
|
|
after -= count - 1;
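/*
 * Example: with bsize = 32K (pagesperblock = 8), a request whose first
 * page sits at poff = 3 within its block, and VOP_BMAP() reporting
 * before = 1 and after = 2 contiguous blocks, the scaling above yields
 * before = 1 * 8 + 3 = 11 pages and after = 2 * 8 + (8 - 4) = 20 pages,
 * after which "after" is clipped to the object size and reduced by
 * count - 1.
 */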
|
|
|
|
|
|
|
|
/* Trim requested rbehind/rahead to possible values. */
|
|
|
|
rbehind = a_rbehind ? *a_rbehind : 0;
|
|
|
|
rahead = a_rahead ? *a_rahead : 0;
|
|
|
|
rbehind = min(rbehind, before);
|
|
|
|
rbehind = min(rbehind, m[0]->pindex);
|
|
|
|
rahead = min(rahead, after);
|
|
|
|
rahead = min(rahead, object->size - m[count - 1]->pindex);
|
2016-11-17 20:32:32 +00:00
|
|
|
/*
|
|
|
|
* Check that total amount of pages fit into buf. Trim rbehind and
|
|
|
|
* rahead evenly if not.
|
|
|
|
*/
|
|
|
|
if (rbehind + rahead + count > nitems(bp->b_pages)) {
|
|
|
|
int trim, sum;
|
|
|
|
|
|
|
|
trim = rbehind + rahead + count - nitems(bp->b_pages) + 1;
|
|
|
|
sum = rbehind + rahead;
|
|
|
|
if (rbehind == before) {
|
|
|
|
/* Roundup rbehind trim to block size. */
|
|
|
|
rbehind -= roundup(trim * rbehind / sum, pagesperblock);
|
|
|
|
if (rbehind < 0)
|
|
|
|
rbehind = 0;
|
|
|
|
} else
|
|
|
|
rbehind -= trim * rbehind / sum;
|
|
|
|
rahead -= trim * rahead / sum;
|
|
|
|
}
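/*
 * Worked example (assuming nitems(bp->b_pages) == btoc(MAXPHYS) == 32
 * with 4K pages, and taking the non-roundup branch for rbehind):
 * count = 8 and rbehind = rahead = 20 give trim = 48 - 32 + 1 = 17 and
 * sum = 40, so each side is cut by 17 * 20 / 40 = 8 pages (integer
 * division), leaving 12 + 8 + 12 = 32 pages, which exactly fits the buf.
 */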
|
|
|
|
KASSERT(rbehind + rahead + count <= nitems(bp->b_pages),
|
2015-12-16 21:30:45 +00:00
|
|
|
("%s: behind %d ahead %d count %d", __func__,
|
|
|
|
rbehind, rahead, count));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fill in the bp->b_pages[] array with requested and optional
|
|
|
|
* read behind or read ahead pages. Read behind pages are looked
|
|
|
|
* up in a backward direction, down to a first cached page. Same
|
|
|
|
* for read ahead pages, but there is no need to shift the array
|
|
|
|
* in case of encountering a cached page.
|
|
|
|
*/
|
|
|
|
i = bp->b_npages = 0;
|
|
|
|
if (rbehind) {
|
|
|
|
vm_pindex_t startpindex, tpindex;
|
|
|
|
vm_page_t p;
|
|
|
|
|
2015-10-24 21:59:22 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2015-12-16 21:30:45 +00:00
|
|
|
startpindex = m[0]->pindex - rbehind;
|
|
|
|
if ((p = TAILQ_PREV(m[0], pglist, listq)) != NULL &&
|
|
|
|
p->pindex >= startpindex)
|
|
|
|
startpindex = p->pindex + 1;
|
|
|
|
|
|
|
|
/* tpindex is unsigned; beware of numeric underflow. */
|
|
|
|
for (tpindex = m[0]->pindex - 1;
|
|
|
|
tpindex >= startpindex && tpindex < m[0]->pindex;
|
|
|
|
tpindex--, i++) {
|
2016-11-15 18:22:50 +00:00
|
|
|
p = vm_page_alloc(object, tpindex, VM_ALLOC_NORMAL);
|
2015-12-16 21:30:45 +00:00
|
|
|
if (p == NULL) {
|
|
|
|
/* Shift the array. */
|
|
|
|
for (int j = 0; j < i; j++)
|
|
|
|
bp->b_pages[j] = bp->b_pages[j +
|
|
|
|
tpindex + 1 - startpindex];
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
bp->b_pages[tpindex - startpindex] = p;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
2015-12-16 21:30:45 +00:00
|
|
|
|
|
|
|
bp->b_pgbefore = i;
|
|
|
|
bp->b_npages += i;
|
|
|
|
bp->b_blkno -= IDX_TO_OFF(i) / DEV_BSIZE;
|
|
|
|
} else
|
|
|
|
bp->b_pgbefore = 0;
|
|
|
|
|
|
|
|
/* Requested pages. */
|
|
|
|
for (int j = 0; j < count; j++, i++)
|
|
|
|
bp->b_pages[i] = m[j];
|
|
|
|
bp->b_npages += count;
|
|
|
|
|
|
|
|
if (rahead) {
|
|
|
|
vm_pindex_t endpindex, tpindex;
|
|
|
|
vm_page_t p;
|
|
|
|
|
|
|
|
if (!VM_OBJECT_WOWNED(object))
|
|
|
|
VM_OBJECT_WLOCK(object);
|
|
|
|
endpindex = m[count - 1]->pindex + rahead + 1;
|
|
|
|
if ((p = TAILQ_NEXT(m[count - 1], listq)) != NULL &&
|
|
|
|
p->pindex < endpindex)
|
|
|
|
endpindex = p->pindex;
|
|
|
|
if (endpindex > object->size)
|
|
|
|
endpindex = object->size;
|
|
|
|
|
|
|
|
for (tpindex = m[count - 1]->pindex + 1;
|
|
|
|
tpindex < endpindex; i++, tpindex++) {
|
2016-11-15 18:22:50 +00:00
|
|
|
p = vm_page_alloc(object, tpindex, VM_ALLOC_NORMAL);
|
2015-12-16 21:30:45 +00:00
|
|
|
if (p == NULL)
|
|
|
|
break;
|
|
|
|
bp->b_pages[i] = p;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
bp->b_pgafter = i - bp->b_npages;
|
|
|
|
bp->b_npages = i;
|
|
|
|
} else
|
|
|
|
bp->b_pgafter = 0;
|
2015-10-24 21:59:22 +00:00
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
if (VM_OBJECT_WOWNED(object))
|
|
|
|
VM_OBJECT_WUNLOCK(object);
|
1995-02-03 06:46:28 +00:00
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
/* Report back actual behind/ahead read. */
|
|
|
|
if (a_rbehind)
|
|
|
|
*a_rbehind = bp->b_pgbefore;
|
|
|
|
if (a_rahead)
|
|
|
|
*a_rahead = bp->b_pgafter;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2016-11-17 20:32:32 +00:00
|
|
|
#ifdef INVARIANTS
|
2016-10-19 19:50:09 +00:00
|
|
|
KASSERT(bp->b_npages <= nitems(bp->b_pages),
|
2015-12-16 21:30:45 +00:00
|
|
|
("%s: buf %p overflowed", __func__, bp));
|
2017-01-12 20:26:02 +00:00
|
|
|
for (int j = 1, prev = 0; j < bp->b_npages; j++) {
|
2017-01-04 22:31:09 +00:00
|
|
|
if (bp->b_pages[j] == bogus_page)
|
|
|
|
continue;
|
|
|
|
KASSERT(bp->b_pages[j]->pindex - bp->b_pages[prev]->pindex ==
|
|
|
|
j - prev, ("%s: pages array not consecutive, bp %p",
|
|
|
|
__func__, bp));
|
|
|
|
prev = j;
|
|
|
|
}
|
2016-11-17 20:32:32 +00:00
|
|
|
#endif
|
1994-05-25 09:21:21 +00:00
|
|
|
|
|
|
|
/*
|
2015-12-16 21:30:45 +00:00
|
|
|
* Recalculate first offset and bytecount with regards to read behind.
|
|
|
|
* Truncate bytecount to vnode real size and round up physical size
|
|
|
|
* for real devices.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
2015-12-16 21:30:45 +00:00
|
|
|
foff = IDX_TO_OFF(bp->b_pages[0]->pindex);
|
|
|
|
bytecount = bp->b_npages << PAGE_SHIFT;
|
|
|
|
if ((foff + bytecount) > object->un_pager.vnp.vnp_size)
|
|
|
|
bytecount = object->un_pager.vnp.vnp_size - foff;
|
2015-10-24 21:59:22 +00:00
|
|
|
secmask = bo->bo_bsize - 1;
|
|
|
|
KASSERT(secmask < PAGE_SIZE && secmask > 0,
|
2015-12-16 21:30:45 +00:00
|
|
|
("%s: sector size %d too large", __func__, secmask + 1));
|
|
|
|
bytecount = (bytecount + secmask) & ~secmask;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
|
|
|
/*
|
2015-12-16 21:30:45 +00:00
|
|
|
* And map the pages to be read into the kva, if the filesystem
|
2013-03-19 14:36:28 +00:00
|
|
|
* requires mapped buffers.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
2014-11-19 15:17:19 +00:00
|
|
|
if ((vp->v_mount->mnt_kern_flag & MNTK_UNMAPPED_BUFS) != 0 &&
|
2013-03-19 14:36:28 +00:00
|
|
|
unmapped_buf_allowed) {
|
|
|
|
bp->b_data = unmapped_buf;
|
|
|
|
bp->b_offset = 0;
|
2015-07-23 19:13:41 +00:00
|
|
|
} else {
|
|
|
|
bp->b_data = bp->b_kvabase;
|
2015-12-16 21:30:45 +00:00
|
|
|
pmap_qenter((vm_offset_t)bp->b_data, bp->b_pages, bp->b_npages);
|
2015-07-23 19:13:41 +00:00
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
/* Build a minimal buffer header. */
|
2000-03-20 10:44:49 +00:00
|
|
|
bp->b_iocmd = BIO_READ;
|
2001-10-11 23:38:17 +00:00
|
|
|
KASSERT(bp->b_rcred == NOCRED, ("leaking read ucred"));
|
|
|
|
KASSERT(bp->b_wcred == NOCRED, ("leaking write ucred"));
|
2002-02-27 18:32:23 +00:00
|
|
|
bp->b_rcred = crhold(curthread->td_ucred);
|
|
|
|
bp->b_wcred = crhold(curthread->td_ucred);
|
2004-11-15 09:18:27 +00:00
|
|
|
pbgetbo(bo, bp);
|
2012-03-28 20:49:11 +00:00
|
|
|
bp->b_vp = vp;
|
2015-12-16 21:30:45 +00:00
|
|
|
bp->b_bcount = bp->b_bufsize = bp->b_runningbufspace = bytecount;
|
|
|
|
bp->b_iooffset = dbtob(bp->b_blkno);
|
2016-11-17 20:32:32 +00:00
|
|
|
KASSERT(IDX_TO_OFF(m[0]->pindex - bp->b_pages[0]->pindex) ==
|
|
|
|
(blkno0 - bp->b_blkno) * DEV_BSIZE +
|
|
|
|
IDX_TO_OFF(m[0]->pindex) % bsize,
|
|
|
|
("wrong offsets bsize %d m[0] %ju b_pages[0] %ju "
|
|
|
|
"blkno0 %ju b_blkno %ju", bsize,
|
|
|
|
(uintmax_t)m[0]->pindex, (uintmax_t)bp->b_pages[0]->pindex,
|
|
|
|
(uintmax_t)blkno0, (uintmax_t)bp->b_blkno));
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2015-12-16 21:30:45 +00:00
|
|
|
atomic_add_long(&runningbufspace, bp->b_runningbufspace);
|
- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-CPU stats, convert all fields that were previously
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before UMA is set up and counter_u64_alloc() can be used, provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated, we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter is hidden under _KERNEL, with only vmstat(1) as an exception.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
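As a reminder of the counter(9) model the vmmeter fields move to (and
which the VM_CNT_* macros below wrap), a hedged sketch follows. The
counter name and the functions around it are illustrative; only
counter_u64_alloc(), counter_u64_add() and counter_u64_fetch() are the
actual counter(9) primitives:

static counter_u64_t example_cnt;	/* illustrative counter */

static void
example_init(void)
{
	/* Per-CPU storage; only possible once UMA is up. */
	example_cnt = counter_u64_alloc(M_WAITOK);
}

static void
example_hit(void)
{
	/* Lockless per-CPU increment: the cheap fast path. */
	counter_u64_add(example_cnt, 1);
}

static uint64_t
example_read(void)
{
	/* Summation across CPUs happens only when the value is read. */
	return (counter_u64_fetch(example_cnt));
}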
|
|
|
VM_CNT_INC(v_vnodein);
|
|
|
|
VM_CNT_ADD(v_vnodepgsin, bp->b_npages);
|
1994-10-15 13:33:09 +00:00
|
|
|
|
2014-11-23 12:01:52 +00:00
|
|
|
if (iodone != NULL) { /* async */
|
2015-12-16 21:30:45 +00:00
|
|
|
bp->b_pgiodone = iodone;
|
2014-11-23 12:01:52 +00:00
|
|
|
bp->b_caller1 = arg;
|
|
|
|
bp->b_iodone = vnode_pager_generic_getpages_done_async;
|
|
|
|
bp->b_flags |= B_ASYNC;
|
|
|
|
BUF_KERNPROC(bp);
|
|
|
|
bstrategy(bp);
|
2015-12-16 21:30:45 +00:00
|
|
|
return (VM_PAGER_OK);
|
2014-11-23 12:01:52 +00:00
|
|
|
} else {
|
|
|
|
bp->b_iodone = bdone;
|
|
|
|
bstrategy(bp);
|
|
|
|
bwait(bp, PVM, "vnread");
|
|
|
|
error = vnode_pager_generic_getpages_done(bp);
|
2014-11-24 07:57:20 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++)
|
2014-11-23 12:01:52 +00:00
|
|
|
bp->b_pages[i] = NULL;
|
|
|
|
bp->b_vp = NULL;
|
|
|
|
pbrelbo(bp);
|
Allocate pager bufs from UMA instead of an 80-ish mutex-protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting for how many
pbufs we are going to have.
In various subsystems that are going to utilize pbufs, create private zones
via a call to pbuf_zsecond_create(). The latter calls uma_zsecond_create()
and sets a limit on the created zone. After startup, pbufs are preallocated
according to the requirements of all pbuf zones.
Subsystems that used to have a private limit with the old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch the tunable value of kern.nswbuf from init_param2() and, while here,
move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, which was holding
only this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from the sendfile(2) operation, which
uses pbufs in the vnode pager. Other pagers would also benefit from faster
allocation.
Together with: gallatin
Tested by: pho
2019-01-15 01:02:16 +00:00
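A hedged sketch of the allocation pattern this commit introduces, matching
the uma_zfree() call below; the zone limit and the example functions are
illustrative assumptions:

static uma_zone_t vnode_pbuf_zone;

static void
example_bufferinit(void)
{
	/* Private pbuf zone with its own limit (value illustrative). */
	vnode_pbuf_zone = pbuf_zsecond_create("vnpbuf", nswbuf / 2);
}

static void
example_io(void)
{
	struct buf *bp;

	/* Pbufs now come from UMA instead of a mutex-protected list. */
	bp = uma_zalloc(vnode_pbuf_zone, M_WAITOK);
	/* ... set up and issue the I/O ... */
	uma_zfree(vnode_pbuf_zone, bp);
}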
|
|
|
uma_zfree(vnode_pbuf_zone, bp);
|
2015-12-16 21:30:45 +00:00
|
|
|
return (error != 0 ? VM_PAGER_ERROR : VM_PAGER_OK);
|
2014-11-23 12:01:52 +00:00
|
|
|
}
|
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2014-11-23 12:01:52 +00:00
|
|
|
static void
|
|
|
|
vnode_pager_generic_getpages_done_async(struct buf *bp)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
error = vnode_pager_generic_getpages_done(bp);
|
2015-12-16 21:30:45 +00:00
|
|
|
/* Run the iodone upon the requested range. */
|
|
|
|
bp->b_pgiodone(bp->b_caller1, bp->b_pages + bp->b_pgbefore,
|
|
|
|
bp->b_npages - bp->b_pgbefore - bp->b_pgafter, error);
|
2014-11-23 12:01:52 +00:00
|
|
|
for (int i = 0; i < bp->b_npages; i++)
|
|
|
|
bp->b_pages[i] = NULL;
|
|
|
|
bp->b_vp = NULL;
|
|
|
|
pbrelbo(bp);
|
2019-01-15 01:02:16 +00:00
|
|
|
uma_zfree(vnode_pbuf_zone, bp);
|
2014-11-23 12:01:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
vnode_pager_generic_getpages_done(struct buf *bp)
|
|
|
|
{
|
|
|
|
vm_object_t object;
|
|
|
|
off_t tfoff, nextoff;
|
|
|
|
int i, error;
|
|
|
|
|
|
|
|
error = (bp->b_ioflags & BIO_ERROR) != 0 ? EIO : 0;
|
|
|
|
object = bp->b_vp->v_object;
|
|
|
|
|
|
|
|
if (error == 0 && bp->b_bcount != bp->b_npages * PAGE_SIZE) {
|
2015-07-23 19:13:41 +00:00
|
|
|
if (!buf_mapped(bp)) {
|
|
|
|
bp->b_data = bp->b_kvabase;
|
|
|
|
pmap_qenter((vm_offset_t)bp->b_data, bp->b_pages,
|
2014-11-23 12:01:52 +00:00
|
|
|
bp->b_npages);
|
2013-03-19 14:36:28 +00:00
|
|
|
}
|
2015-07-23 19:13:41 +00:00
|
|
|
bzero(bp->b_data + bp->b_bcount,
|
2014-11-23 12:01:52 +00:00
|
|
|
PAGE_SIZE * bp->b_npages - bp->b_bcount);
|
2013-03-19 14:36:28 +00:00
|
|
|
}
|
2015-07-23 19:13:41 +00:00
|
|
|
if (buf_mapped(bp)) {
|
|
|
|
pmap_qremove((vm_offset_t)bp->b_data, bp->b_npages);
|
|
|
|
bp->b_data = unmapped_buf;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2014-11-23 12:01:52 +00:00
|
|
|
for (i = 0, tfoff = IDX_TO_OFF(bp->b_pages[0]->pindex);
|
|
|
|
i < bp->b_npages; i++, tfoff = nextoff) {
|
This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worst on NFS-type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will soon be needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always).
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
|
|
|
vm_page_t mt;
|
|
|
|
|
|
|
|
nextoff = tfoff + PAGE_SIZE;
|
2014-11-23 12:01:52 +00:00
|
|
|
mt = bp->b_pages[i];
|
1998-03-07 21:37:31 +00:00
|
|
|
|
1999-05-15 23:42:39 +00:00
|
|
|
if (nextoff <= object->un_pager.vnp.vnp_size) {
|
1999-04-05 19:38:30 +00:00
|
|
|
/*
|
|
|
|
* Read filled up entire page.
|
|
|
|
*/
|
2019-10-15 03:45:41 +00:00
|
|
|
vm_page_valid(mt);
|
2009-04-25 02:59:06 +00:00
|
|
|
KASSERT(mt->dirty == 0,
|
2014-11-19 16:29:39 +00:00
|
|
|
("%s: page %p is dirty", __func__, mt));
|
2009-04-25 02:59:06 +00:00
|
|
|
KASSERT(!pmap_page_is_mapped(mt),
|
2014-11-19 16:29:39 +00:00
|
|
|
("%s: page %p is mapped", __func__, mt));
|
1998-03-07 21:37:31 +00:00
|
|
|
} else {
|
1999-04-05 19:38:30 +00:00
|
|
|
/*
|
2009-05-15 04:33:35 +00:00
|
|
|
* Read did not fill up entire page.
|
1999-04-05 19:38:30 +00:00
|
|
|
*
|
|
|
|
* Currently we do not set the entire page valid,
|
|
|
|
* we just try to clear the piece that we couldn't
|
|
|
|
* read.
|
|
|
|
*/
|
2011-11-30 17:39:00 +00:00
|
|
|
vm_page_set_valid_range(mt, 0,
|
1999-05-15 23:42:39 +00:00
|
|
|
object->un_pager.vnp.vnp_size - tfoff);
|
2009-05-15 04:33:35 +00:00
|
|
|
KASSERT((mt->dirty & vm_page_bits(0,
|
|
|
|
object->un_pager.vnp.vnp_size - tfoff)) == 0,
|
2014-11-19 16:29:39 +00:00
|
|
|
("%s: page %p is dirty", __func__, mt));
|
1998-03-07 21:37:31 +00:00
|
|
|
}
|
2015-12-16 21:30:45 +00:00
|
|
|
|
|
|
|
if (i < bp->b_pgbefore || i >= bp->b_npages - bp->b_pgafter)
|
2012-08-14 11:45:47 +00:00
|
|
|
vm_page_readahead_finish(mt);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2014-11-23 12:01:52 +00:00
|
|
|
if (error != 0)
|
|
|
|
printf("%s: I/O read error %d\n", __func__, error);
|
|
|
|
|
|
|
|
return (error);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
|
1998-02-26 06:39:59 +00:00
|
|
|
/*
|
|
|
|
* EOPNOTSUPP is no longer legal. For local media VFS's that do not
|
|
|
|
* implement their own VOP_PUTPAGES, their VOP_PUTPAGES should call to
|
|
|
|
* vnode_pager_generic_putpages() to implement the previous behaviour.
|
|
|
|
*
|
|
|
|
* All other FS's should use the bypass to get to the local media
|
|
|
|
* backing vp's VOP_PUTPAGES.
|
|
|
|
*/
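A hedged sketch of the bypass pattern this comment describes; the
'examplefs_putpages' name and the assumption that the vop_putpages_args
fields are a_vp, a_m, a_count, a_sync and a_rtvals are illustrative:

static int
examplefs_putpages(struct vop_putpages_args *ap)
{
	/* Local-media FS: route VOP_PUTPAGES to the generic code. */
	return (vnode_pager_generic_putpages(ap->a_vp, ap->a_m,
	    ap->a_count, ap->a_sync, ap->a_rtvals));
}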
|
1999-01-24 02:32:15 +00:00
|
|
|
static void
|
2014-01-20 18:47:56 +00:00
|
|
|
vnode_pager_putpages(vm_object_t object, vm_page_t *m, int count,
|
2014-09-14 10:27:36 +00:00
|
|
|
int flags, int *rtvals)
|
1995-09-04 04:44:26 +00:00
|
|
|
{
|
|
|
|
int rtval;
|
|
|
|
struct vnode *vp;
|
1998-03-09 08:58:53 +00:00
|
|
|
int bytes = count * PAGE_SIZE;
|
1996-10-17 02:49:35 +00:00
|
|
|
|
1999-02-27 23:39:28 +00:00
|
|
|
/*
|
|
|
|
* Force synchronous operation if we are extremely low on memory
|
|
|
|
* to prevent a low-memory deadlock. VOP operations often need to
|
2015-11-22 09:50:13 +00:00
|
|
|
* allocate more memory to initiate the I/O ( i.e. do a BMAP
|
1999-02-27 23:39:28 +00:00
|
|
|
* operation ). The swapper handles the case by limiting the amount
|
|
|
|
* of asynchronous I/O, but that sort of solution doesn't scale well
|
|
|
|
* for the vnode pager without a lot of work.
|
|
|
|
*
|
|
|
|
* Also, the backing vnode's iodone routine may not wake the pageout
|
|
|
|
* daemon up. This should probably be addressed. XXX
|
|
|
|
*/
|
|
|
|
|
2018-02-06 22:10:07 +00:00
|
|
|
if (vm_page_count_min())
|
2014-09-14 10:27:36 +00:00
|
|
|
flags |= VM_PAGER_PUT_SYNC;
|
1999-02-27 23:39:28 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Call device-specific putpages function
|
|
|
|
*/
|
1995-09-04 04:44:26 +00:00
|
|
|
vp = object->handle;
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2014-09-14 10:27:36 +00:00
|
|
|
rtval = VOP_PUTPAGES(vp, m, bytes, flags, rtvals);
|
2001-05-19 01:28:09 +00:00
|
|
|
KASSERT(rtval != EOPNOTSUPP,
|
|
|
|
("vnode_pager: stale FS putpages\n"));
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
1995-09-04 04:44:26 +00:00
|
|
|
}
|
|
|
|
|
Do not overwrite clean blocks on pageout.
If the filesystem block size is less than the page size, it is possible
that the page-out run contains partially clean pages. E.g., the chunk
of the page might be bdwrite()-ed, or some thread performed bwrite()
on a buffer which references a chunk of the paged-out page. As a
result, the assertion added in r319975, which checked that all pages
in the run are dirty, does not hold on such filesystems.
One solution is to remove the assert, but it is undesirable, because
we do overwrite the valid on-disk content. I cannot provide a scenario
where such a write would corrupt the file data, but I do not like it on
principle. Another, in my opinion proper, solution is to only write
the parts of the pages still marked dirty. The patch implements this: it
skips clean blocks and only writes the dirty block runs.
Note that due to clustering, writing one page might clean other pages in
the run, so the next write range must be calculated only after the
current range is written out.
Moreover, due to a possible invalidation, and the fact that the object
lock is dropped and reacquired before the checks, it is possible that
the whole page-out run appears to consist of only clean pages.
For this reason, it is impossible to assert that there is some work
for the pageout method to do (i.e. assert that there is at least one
dirty page in the run). But such clearing can only occur due to
invalidation, and not due to a parallel write, because we own the
vnode lock exclusive.
Reported by: fsu
In collaboration with: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12668
2017-10-20 08:32:37 +00:00
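The dirty-run scan described above can be pictured as follows. This is a
hedged sketch built on the vn_dirty_blk() helper defined below, with
write_range() standing in for the clustered VOP_WRITE actually issued by
the putpages path; the function and its callers are illustrative:

static void
example_write_dirty_runs(vm_page_t *ma, vm_ooffset_t poffset,
    vm_ooffset_t maxblksz)
{
	vm_ooffset_t off, run_start;
	bool in_hole;

	run_start = 0;
	in_hole = true;
	for (off = poffset; off < maxblksz; off += DEV_BSIZE) {
		vm_page_t m = ma[OFF_TO_IDX(off - poffset)];

		if (vn_dirty_blk(m, off)) {
			if (in_hole) {
				run_start = off;	/* a dirty run begins */
				in_hole = false;
			}
		} else if (!in_hole) {
			/* A clean block ends the run: write it out. */
			write_range(run_start, off - run_start);
			in_hole = true;
		}
	}
	if (!in_hole)
		write_range(run_start, maxblksz - run_start);
}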
|
|
|
static int
|
|
|
|
vn_off2bidx(vm_ooffset_t offset)
|
|
|
|
{
|
|
|
|
|
|
|
|
return ((offset & PAGE_MASK) / DEV_BSIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool
|
|
|
|
vn_dirty_blk(vm_page_t m, vm_ooffset_t offset)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(IDX_TO_OFF(m->pindex) <= offset &&
|
|
|
|
offset < IDX_TO_OFF(m->pindex + 1),
|
|
|
|
("page %p pidx %ju offset %ju", m, (uintmax_t)m->pindex,
|
|
|
|
(uintmax_t)offset));
|
|
|
|
return ((m->dirty & ((vm_page_bits_t)1 << vn_off2bidx(offset))) != 0);
|
|
|
|
}
|
1998-02-26 06:39:59 +00:00
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
1998-02-26 06:39:59 +00:00
|
|
|
* This is now called from local media FS's to operate against their
|
1999-03-27 02:39:01 +00:00
|
|
|
* own vnodes if they fail to implement VOP_PUTPAGES.
|
2000-12-26 19:41:38 +00:00
|
|
|
*
|
|
|
|
* This is typically called indirectly via the pageout daemon and
|
2016-05-02 20:16:29 +00:00
|
|
|
* clustering has already typically occurred, so in general we ask the
|
2000-12-26 19:41:38 +00:00
|
|
|
* underlying filesystem to write the data out asynchronously rather
|
|
|
|
* than delayed.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
1998-02-26 06:39:59 +00:00
|
|
|
int
|
2010-05-26 18:00:44 +00:00
|
|
|
vnode_pager_generic_putpages(struct vnode *vp, vm_page_t *ma, int bytecount,
|
|
|
|
int flags, int *rtvals)
|
1994-05-25 09:21:21 +00:00
|
|
|
{
|
1998-02-26 06:39:59 +00:00
|
|
|
vm_object_t object;
|
2010-05-26 18:00:44 +00:00
|
|
|
vm_page_t m;
|
2017-10-20 08:32:37 +00:00
|
|
|
vm_ooffset_t maxblksz, next_offset, poffset, prev_offset;
|
1995-04-09 06:03:56 +00:00
|
|
|
struct uio auio;
|
|
|
|
struct iovec aiov;
|
2017-10-20 08:32:37 +00:00
|
|
|
off_t prev_resid, wrsz;
|
2017-06-15 14:34:33 +00:00
|
|
|
int count, error, i, maxsize, ncount, pgoff, ppscheck;
|
2017-10-20 08:32:37 +00:00
|
|
|
bool in_hole;
|
2005-11-01 23:00:24 +00:00
|
|
|
static struct timeval lastfail;
|
|
|
|
static int curfail;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
1998-02-26 06:39:59 +00:00
|
|
|
object = vp->v_object;
|
|
|
|
count = bytecount / PAGE_SIZE;
|
|
|
|
|
1994-08-04 03:06:48 +00:00
|
|
|
for (i = 0; i < count; i++)
|
2011-06-01 21:00:28 +00:00
|
|
|
rtvals[i] = VM_PAGER_ERROR;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2010-05-26 18:00:44 +00:00
|
|
|
if ((int64_t)ma[0]->pindex < 0) {
|
2017-06-15 14:34:33 +00:00
|
|
|
printf("vnode_pager_generic_putpages: "
|
|
|
|
"attempt to write meta-data 0x%jx(%lx)\n",
|
|
|
|
(uintmax_t)ma[0]->pindex, (u_long)ma[0]->dirty);
|
1995-04-09 06:03:56 +00:00
|
|
|
rtvals[0] = VM_PAGER_BAD;
|
2017-06-15 14:34:33 +00:00
|
|
|
return (VM_PAGER_BAD);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. This work
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
1995-03-19 23:46:25 +00:00
|
|
|
|
1995-04-09 06:03:56 +00:00
|
|
|
maxsize = count * PAGE_SIZE;
|
|
|
|
ncount = count;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2010-05-26 18:00:44 +00:00
|
|
|
poffset = IDX_TO_OFF(ma[0]->pindex);
|
2001-10-12 18:17:34 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the page-aligned write is larger than the actual file we
|
2016-05-02 20:16:29 +00:00
|
|
|
* have to invalidate pages occurring beyond the file EOF. However,
|
2001-10-12 18:17:34 +00:00
|
|
|
* there is an edge case where a file may not be page-aligned where
|
|
|
|
* the last page is partially invalid. In this case the filesystem
|
|
|
|
* may not properly clear the dirty bits for the entire page (which
|
|
|
|
* could be VM_PAGE_BITS_ALL due to the page having been mmap()d).
|
|
|
|
* With the page locked we are free to fix-up the dirty bits here.
|
2001-12-14 01:16:57 +00:00
|
|
|
*
|
|
|
|
* We do not under any circumstances truncate the valid bits, as
|
|
|
|
* this will screw up bogus page replacement.
|
2001-10-12 18:17:34 +00:00
|
|
|
*/
|
2017-10-20 18:40:29 +00:00
|
|
|
VM_OBJECT_RLOCK(object);
|
1995-12-11 04:58:34 +00:00
|
|
|
if (maxsize + poffset > object->un_pager.vnp.vnp_size) {
|
2017-10-20 18:40:29 +00:00
|
|
|
if (!VM_OBJECT_TRYUPGRADE(object)) {
|
|
|
|
VM_OBJECT_RUNLOCK(object);
|
|
|
|
VM_OBJECT_WLOCK(object);
|
|
|
|
if (maxsize + poffset <= object->un_pager.vnp.vnp_size)
|
|
|
|
goto downgrade;
|
|
|
|
}
|
2001-10-12 18:17:34 +00:00
|
|
|
if (object->un_pager.vnp.vnp_size > poffset) {
|
1995-12-11 04:58:34 +00:00
|
|
|
maxsize = object->un_pager.vnp.vnp_size - poffset;
|
2001-10-12 18:17:34 +00:00
|
|
|
ncount = btoc(maxsize);
|
|
|
|
if ((pgoff = (int)maxsize & PAGE_MASK) != 0) {
|
2018-02-02 11:56:30 +00:00
|
|
|
pgoff = roundup2(pgoff, DEV_BSIZE);
|
|
|
|
|
2010-05-26 18:00:44 +00:00
|
|
|
/*
|
|
|
|
* If the object is locked and the following
|
|
|
|
* conditions hold, then the page's dirty
|
|
|
|
* field cannot be concurrently changed by a
|
|
|
|
* pmap operation.
|
|
|
|
*/
|
|
|
|
m = ma[ncount - 1];
|
2013-08-09 11:11:11 +00:00
|
|
|
vm_page_assert_sbusied(m);
|
2012-06-16 18:56:19 +00:00
|
|
|
KASSERT(!pmap_page_is_write_mapped(m),
|
2010-05-26 18:00:44 +00:00
|
|
|
("vnode_pager_generic_putpages: page %p is not read-only", m));
|
2017-06-15 14:34:33 +00:00
|
|
|
MPASS(m->dirty != 0);
|
2010-05-26 18:00:44 +00:00
|
|
|
vm_page_clear_dirty(m, pgoff, PAGE_SIZE -
|
|
|
|
pgoff);
|
2001-10-12 18:17:34 +00:00
|
|
|
}
|
|
|
|
} else {
|
1995-05-18 02:59:26 +00:00
|
|
|
maxsize = 0;
|
2001-10-12 18:17:34 +00:00
|
|
|
ncount = 0;
|
|
|
|
}
|
2017-06-15 14:34:33 +00:00
|
|
|
for (i = ncount; i < count; i++)
|
|
|
|
rtvals[i] = VM_PAGER_BAD;
|
2017-10-20 18:40:29 +00:00
|
|
|
downgrade:
|
|
|
|
VM_OBJECT_LOCK_DOWNGRADE(object);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
|
1995-04-09 06:03:56 +00:00
|
|
|
auio.uio_iov = &aiov;
|
|
|
|
auio.uio_segflg = UIO_NOCOPY;
|
|
|
|
auio.uio_rw = UIO_WRITE;
|
2017-06-15 14:34:33 +00:00
|
|
|
auio.uio_td = NULL;
|
Do not overwrite clean blocks on pageout.
If the filesystem block size is less than the page size, it is possible
that the page-out run contains partially clean pages. E.g., a chunk
of the page might be bdwrite()-ed, or some thread performed bwrite()
on a buffer which references a chunk of the paged-out page. As a
result, the assertion added in r319975, which checked that all pages
in the run are dirty, does not hold on such filesystems.
One solution is to remove the assert, but that is undesirable, because
we would then overwrite valid on-disk content. I cannot provide a
scenario where such a write would corrupt the file data, but I do not
like it on principle. Another, in my opinion proper, solution is to
write only the parts of the pages still marked dirty. The patch
implements this: it skips clean blocks and writes only the dirty block
runs.
Note that due to clustering, writing one page might clean other pages
in the run, so the next write range must be calculated only after the
current range is written out.
Moreover, due to a possible invalidation, and the fact that the object
lock is dropped and reacquired before the checks, it is possible that
the whole page-out run appears to consist of only clean pages. For
this reason, it is impossible to assert that there is some work for
the pageout method to do (i.e., assert that there is at least one
dirty page in the run). But such clearing can only occur due to
invalidation, and not due to a parallel write, because we own the
vnode lock exclusively.
Reported by: fsu
In collaboration with: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12668
2017-10-20 08:32:37 +00:00
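To make the dirty-run scan below concrete, here is a minimal userspace
sketch, not the kernel code: the flat bitmask layout, the function names,
and the single-pass structure are simplifying assumptions. It mirrors the
two loops that follow: skip clean DEV_BSIZE blocks, then extend over the
longest run of dirty ones before issuing a write.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define	DEV_BSIZE	512
#define	PAGE_SIZE	4096

/* One bit per DEV_BSIZE block; a set bit means the block is dirty. */
static bool
blk_dirty(const uint8_t *mask, size_t blk)
{
	return ((mask[blk / 8] & (1u << (blk % 8))) != 0);
}

static void
write_dirty_runs(const uint8_t *mask, size_t nblks)
{
	size_t prev, next;

	for (prev = 0; prev < nblks; prev = next) {
		/* Skip clean blocks. */
		while (prev < nblks && !blk_dirty(mask, prev))
			prev++;
		if (prev == nblks)
			break;
		/* Find the longest run of dirty blocks. */
		for (next = prev; next < nblks && blk_dirty(mask, next);)
			next++;
		/* Stand-in for the VOP_WRITE() of blocks [prev, next). */
		printf("write blocks [%zu, %zu), %d bytes each\n",
		    prev, next, DEV_BSIZE);
	}
}

int
main(void)
{
	/* Blocks 1-2 and 5 dirty in an 8-block (one-page) run. */
	uint8_t mask[1] = { 0x26 };

	write_dirty_runs(mask, PAGE_SIZE / DEV_BSIZE);
	return (0);
}

The kernel version recomputes the next range only after the current
write completes, because clustering may clean later pages in the run;
the sketch omits that refinement.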
|
|
|
maxblksz = roundup2(poffset + maxsize, DEV_BSIZE);
|
|
|
|
|
|
|
|
for (prev_offset = poffset; prev_offset < maxblksz;) {
|
|
|
|
/* Skip clean blocks. */
|
|
|
|
for (in_hole = true; in_hole && prev_offset < maxblksz;) {
|
|
|
|
m = ma[OFF_TO_IDX(prev_offset - poffset)];
|
|
|
|
for (i = vn_off2bidx(prev_offset);
|
|
|
|
i < sizeof(vm_page_bits_t) * NBBY &&
|
|
|
|
prev_offset < maxblksz; i++) {
|
|
|
|
if (vn_dirty_blk(m, prev_offset)) {
|
|
|
|
in_hole = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
prev_offset += DEV_BSIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (in_hole)
|
|
|
|
goto write_done;
|
|
|
|
|
|
|
|
/* Find longest run of dirty blocks. */
|
|
|
|
for (next_offset = prev_offset; next_offset < maxblksz;) {
|
|
|
|
m = ma[OFF_TO_IDX(next_offset - poffset)];
|
|
|
|
for (i = vn_off2bidx(next_offset);
|
|
|
|
i < sizeof(vm_page_bits_t) * NBBY &&
|
|
|
|
next_offset < maxblksz; i++) {
|
|
|
|
if (!vn_dirty_blk(m, next_offset))
|
|
|
|
goto start_write;
|
|
|
|
next_offset += DEV_BSIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
start_write:
|
|
|
|
if (next_offset > poffset + maxsize)
|
|
|
|
next_offset = poffset + maxsize;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Getting here requires finding a dirty block in the
|
|
|
|
* 'skip clean blocks' loop.
|
|
|
|
*/
|
|
|
|
MPASS(prev_offset < next_offset);
|
|
|
|
|
2017-10-20 18:40:29 +00:00
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2017-10-20 08:32:37 +00:00
|
|
|
aiov.iov_base = NULL;
|
|
|
|
auio.uio_iovcnt = 1;
|
|
|
|
auio.uio_offset = prev_offset;
|
|
|
|
prev_resid = auio.uio_resid = aiov.iov_len = next_offset -
|
|
|
|
prev_offset;
|
|
|
|
error = VOP_WRITE(vp, &auio,
|
|
|
|
vnode_pager_putpages_ioflags(flags), curthread->td_ucred);
|
|
|
|
|
|
|
|
wrsz = prev_resid - auio.uio_resid;
|
|
|
|
if (wrsz == 0) {
|
|
|
|
if (ppsratecheck(&lastfail, &curfail, 1) != 0) {
|
|
|
|
vn_printf(vp, "vnode_pager_putpages: "
|
|
|
|
"zero-length write at %ju resid %zd\n",
|
|
|
|
(uintmax_t)auio.uio_offset, auio.uio_resid);
|
|
|
|
}
|
2017-10-20 18:40:29 +00:00
|
|
|
VM_OBJECT_RLOCK(object);
|
2017-10-20 08:32:37 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Adjust the starting offset for next iteration. */
|
|
|
|
prev_offset += wrsz;
|
|
|
|
MPASS(auio.uio_offset == prev_offset);
|
|
|
|
|
|
|
|
ppscheck = 0;
|
|
|
|
if (error != 0 && (ppscheck = ppsratecheck(&lastfail,
|
|
|
|
&curfail, 1)) != 0)
|
|
|
|
vn_printf(vp, "vnode_pager_putpages: I/O error %d\n",
|
|
|
|
error);
|
|
|
|
if (auio.uio_resid != 0 && (ppscheck != 0 ||
|
|
|
|
ppsratecheck(&lastfail, &curfail, 1) != 0))
|
|
|
|
vn_printf(vp, "vnode_pager_putpages: residual I/O %zd "
|
|
|
|
"at %ju\n", auio.uio_resid,
|
|
|
|
(uintmax_t)ma[0]->pindex);
|
2017-10-20 18:40:29 +00:00
|
|
|
VM_OBJECT_RLOCK(object);
|
2017-10-20 08:32:37 +00:00
|
|
|
if (error != 0 || auio.uio_resid != 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
write_done:
|
|
|
|
/* Mark completely processed pages. */
|
|
|
|
for (i = 0; i < OFF_TO_IDX(prev_offset - poffset); i++)
|
1998-03-01 04:18:54 +00:00
|
|
|
rtvals[i] = VM_PAGER_OK;
|
2017-10-20 08:32:37 +00:00
|
|
|
/* Mark partial EOF page. */
|
|
|
|
if (prev_offset == poffset + maxsize && (prev_offset & PAGE_MASK) != 0)
|
|
|
|
rtvals[i++] = VM_PAGER_OK;
|
|
|
|
/* Unwritten pages in the range; as a bonus, report clean ones as written. */
|
|
|
|
for (; i < ncount; i++)
|
|
|
|
rtvals[i] = ma[i]->dirty == 0 ? VM_PAGER_OK : VM_PAGER_ERROR;
|
2017-10-20 18:40:29 +00:00
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2017-10-20 08:32:37 +00:00
|
|
|
VM_CNT_ADD(v_vnodepgsout, i);
|
|
|
|
VM_CNT_INC(v_vnodeout);
|
2017-06-15 14:34:33 +00:00
|
|
|
return (rtvals[0]);
|
1995-04-09 06:03:56 +00:00
|
|
|
}
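For orientation, a hypothetical outcome of the function above (assumed
values, not taken from the source): with count = 4 pages, a file that
ends inside the third page (so ncount = 3), and a fully successful
write of the dirty range, the caller gets back

	rtvals[] = { VM_PAGER_OK, VM_PAGER_OK, VM_PAGER_OK, VM_PAGER_BAD }

where the trailing VM_PAGER_BAD simply marks the page that lies
entirely beyond EOF.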
|
2011-06-01 21:00:28 +00:00
|
|
|
|
2017-04-05 16:56:04 +00:00
|
|
|
int
|
|
|
|
vnode_pager_putpages_ioflags(int pager_flags)
|
|
|
|
{
|
|
|
|
int ioflags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pageouts are already clustered, use IO_ASYNC to force a
|
|
|
|
* bawrite() rather than a bdwrite() to prevent paging I/O
|
|
|
|
* from saturating the buffer cache. Dummy-up the sequential
|
|
|
|
* heuristic to cause large ranges to cluster. If neither
|
|
|
|
* IO_SYNC nor IO_ASYNC is set, the system decides how to
|
|
|
|
* cluster.
|
|
|
|
*/
|
|
|
|
ioflags = IO_VMIO;
|
|
|
|
if ((pager_flags & (VM_PAGER_PUT_SYNC | VM_PAGER_PUT_INVAL)) != 0)
|
|
|
|
ioflags |= IO_SYNC;
|
|
|
|
else if ((pager_flags & VM_PAGER_CLUSTER_OK) == 0)
|
|
|
|
ioflags |= IO_ASYNC;
|
|
|
|
ioflags |= (pager_flags & VM_PAGER_PUT_INVAL) != 0 ? IO_INVAL : 0;
|
|
|
|
ioflags |= (pager_flags & VM_PAGER_PUT_NOREUSE) != 0 ? IO_NOREUSE : 0;
|
|
|
|
ioflags |= IO_SEQMAX << IO_SEQSHIFT;
|
|
|
|
return (ioflags);
|
|
|
|
}
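A quick check of the mapping above, with hypothetical caller flags:

/*
 * VM_PAGER_PUT_SYNC    -> IO_VMIO | IO_SYNC | (IO_SEQMAX << IO_SEQSHIFT)
 * VM_PAGER_PUT_INVAL   -> as above, plus IO_INVAL
 * no flags             -> IO_VMIO | IO_ASYNC | (IO_SEQMAX << IO_SEQSHIFT)
 * VM_PAGER_CLUSTER_OK  -> IO_VMIO | (IO_SEQMAX << IO_SEQSHIFT),
 *                         leaving the clustering decision to the system
 */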
|
|
|
|
|
2017-07-26 20:07:05 +00:00
|
|
|
/*
|
|
|
|
* vnode_pager_undirty_pages().
|
|
|
|
*
|
|
|
|
* A helper to mark pages as clean after pageout that was possibly
|
|
|
|
* done with a short write. The lpos argument specifies the page run
|
|
|
|
* length in bytes, and the written argument specifies how many bytes
|
|
|
|
* were actually written. eof is the offset past the last valid byte
|
|
|
|
* in the vnode, using the absolute file position of the first byte in
|
|
|
|
* the run as the base from which it is computed.
|
|
|
|
*/
|
2011-06-01 21:00:28 +00:00
|
|
|
void
|
2017-07-26 20:07:05 +00:00
|
|
|
vnode_pager_undirty_pages(vm_page_t *ma, int *rtvals, int written, off_t eof,
|
|
|
|
int lpos)
|
2011-06-01 21:00:28 +00:00
|
|
|
{
|
2011-06-11 20:13:28 +00:00
|
|
|
vm_object_t obj;
|
2017-07-26 20:07:05 +00:00
|
|
|
int i, pos, pos_devb;
|
2011-06-01 21:00:28 +00:00
|
|
|
|
2017-07-26 20:07:05 +00:00
|
|
|
if (written == 0 && eof >= lpos)
|
2011-06-11 20:13:28 +00:00
|
|
|
return;
|
|
|
|
obj = ma[0]->object;
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(obj);
|
2011-06-01 21:00:28 +00:00
|
|
|
for (i = 0, pos = 0; pos < written; i++, pos += PAGE_SIZE) {
|
|
|
|
if (pos < trunc_page(written)) {
|
|
|
|
rtvals[i] = VM_PAGER_OK;
|
|
|
|
vm_page_undirty(ma[i]);
|
|
|
|
} else {
|
|
|
|
/* Partially written page. */
|
|
|
|
rtvals[i] = VM_PAGER_AGAIN;
|
|
|
|
vm_page_clear_dirty(ma[i], 0, written & PAGE_MASK);
|
|
|
|
}
|
|
|
|
}
|
2017-07-26 20:07:05 +00:00
|
|
|
if (eof >= lpos) /* avoid truncation */
|
|
|
|
goto done;
|
|
|
|
for (pos = eof, i = OFF_TO_IDX(trunc_page(pos)); pos < lpos; i++) {
|
|
|
|
if (pos != trunc_page(pos)) {
|
|
|
|
/*
|
|
|
|
* The page contains the last valid byte in
|
|
|
|
* the vnode, mark the rest of the page as
|
|
|
|
* clean, potentially making the whole page
|
|
|
|
* clean.
|
|
|
|
*/
|
|
|
|
pos_devb = roundup2(pos & PAGE_MASK, DEV_BSIZE);
|
|
|
|
vm_page_clear_dirty(ma[i], pos_devb, PAGE_SIZE -
|
|
|
|
pos_devb);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the page was cleaned, report the pageout
|
|
|
|
* on it as successful. msync() no longer
|
|
|
|
* needs to write out the page, endlessly
|
|
|
|
* creating write requests and dirty buffers.
|
|
|
|
*/
|
|
|
|
if (ma[i]->dirty == 0)
|
|
|
|
rtvals[i] = VM_PAGER_OK;
|
|
|
|
|
|
|
|
pos = round_page(pos);
|
|
|
|
} else {
|
|
|
|
/* vm_pageout_flush() clears dirty */
|
|
|
|
rtvals[i] = VM_PAGER_BAD;
|
|
|
|
pos += PAGE_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
done:
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(obj);
|
2011-06-01 21:00:28 +00:00
|
|
|
}
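A worked example with hypothetical numbers (PAGE_SIZE = 4096,
DEV_BSIZE = 512): take a two-page run (lpos = 8192) where the pageout
wrote only written = 5000 bytes and eof = 6000. The first loop
undirties page 0 and reports it VM_PAGER_OK, then marks page 1
VM_PAGER_AGAIN while clearing the dirty bits covering the 904 bytes
written into it. Since eof < lpos, the second loop clears page 1's
dirty bits from roundup2(6000 & PAGE_MASK, DEV_BSIZE) = 2048 up to
4096; if that leaves the page with no dirty bits, its pageout is
upgraded to VM_PAGER_OK.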
|
2012-02-23 21:07:16 +00:00
|
|
|
|
2019-09-03 20:31:48 +00:00
|
|
|
static void
|
2012-02-23 21:07:16 +00:00
|
|
|
vnode_pager_update_writecount(vm_object_t object, vm_offset_t start,
|
|
|
|
vm_offset_t end)
|
|
|
|
{
|
|
|
|
struct vnode *vp;
|
|
|
|
vm_ooffset_t old_wm;
|
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
if (object->type != OBJT_VNODE) {
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
old_wm = object->un_pager.vnp.writemappings;
|
|
|
|
object->un_pager.vnp.writemappings += (vm_ooffset_t)end - start;
|
|
|
|
vp = object->handle;
|
|
|
|
if (old_wm == 0 && object->un_pager.vnp.writemappings != 0) {
|
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks the text vnode exclusive to be able to set and
clear the VV_TEXT flag. VV_TEXT is mutually exclusive with the
v_writecount > 0 condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as the MAP_ENTRY_VN_TEXT flag,
and v_writecount is incremented on the map entry removal.
Operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy, and its use was eliminated everywhere except access().
The atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around the VOP_SETATTR() call; the lack of this is arguably a bug on
its own.
nullfs always bypasses v_writecount to the lower vnode, so the nullfs
vnode has its own v_writecount correct, and the lower vnode gets all
references, since object->handle is always the lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is
reset to zero, and deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
|
|
|
ASSERT_VOP_LOCKED(vp, "v_writecount inc");
|
|
|
|
VOP_ADD_WRITECOUNT_CHECKED(vp, 1);
|
2012-03-08 20:27:20 +00:00
|
|
|
CTR3(KTR_VFS, "%s: vp %p v_writecount increased to %d",
|
|
|
|
__func__, vp, vp->v_writecount);
|
2012-02-23 21:07:16 +00:00
|
|
|
} else if (old_wm != 0 && object->un_pager.vnp.writemappings == 0) {
|
2019-05-05 11:20:43 +00:00
|
|
|
ASSERT_VOP_LOCKED(vp, "v_writecount dec");
|
|
|
|
VOP_ADD_WRITECOUNT_CHECKED(vp, -1);
|
2012-03-08 20:27:20 +00:00
|
|
|
CTR3(KTR_VFS, "%s: vp %p v_writecount decreased to %d",
|
|
|
|
__func__, vp, vp->v_writecount);
|
2012-02-23 21:07:16 +00:00
|
|
|
}
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
}
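A minimal userspace sketch of the v_writecount protocol described in
the commit message above (all names here are hypothetical stand-ins,
not the kernel API): text references drive the counter negative, write
mappings drive it positive, and each operation refuses a change that
would contradict the other side.

#include <errno.h>
#include <stdio.h>

static int
vn_add_writer(int *v_writecount)
{
	if (*v_writecount < 0)
		return (ETXTBSY);	/* active text mappings exist */
	(*v_writecount)++;
	return (0);
}

static int
vn_set_text(int *v_writecount)
{
	if (*v_writecount > 0)
		return (EBUSY);		/* file is open for writing */
	(*v_writecount)--;		/* each text reference decrements */
	return (0);
}

int
main(void)
{
	int wc = 0;

	(void)vn_set_text(&wc);		/* wc == -1: mapped as text */
	printf("add_writer -> %d\n", vn_add_writer(&wc));	/* ETXTBSY */
	return (0);
}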
|
|
|
|
|
2019-09-03 20:31:48 +00:00
|
|
|
static void
|
2012-02-23 21:07:16 +00:00
|
|
|
vnode_pager_release_writecount(vm_object_t object, vm_offset_t start,
|
|
|
|
vm_offset_t end)
|
|
|
|
{
|
|
|
|
struct vnode *vp;
|
|
|
|
struct mount *mp;
|
|
|
|
vm_offset_t inc;
|
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* First, recheck the object type to account for the race when
|
|
|
|
* the vnode is reclaimed.
|
|
|
|
*/
|
|
|
|
if (object->type != OBJT_VNODE) {
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Optimize for the case when writemappings is not going to
|
|
|
|
* zero.
|
|
|
|
*/
|
|
|
|
inc = end - start;
|
|
|
|
if (object->un_pager.vnp.writemappings != inc) {
|
|
|
|
object->un_pager.vnp.writemappings -= inc;
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
vp = object->handle;
|
|
|
|
vhold(vp);
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2012-02-23 21:07:16 +00:00
|
|
|
mp = NULL;
|
|
|
|
vn_start_write(vp, &mp, V_WAIT);
|
2019-05-05 11:20:43 +00:00
|
|
|
vn_lock(vp, LK_SHARED | LK_RETRY);
|
2012-02-23 21:07:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Decrement the object's writemappings, by swapping the start
|
|
|
|
* and end arguments for vnode_pager_update_writecount(). If
|
|
|
|
* there was not a race with vnode reclamation, then the
|
|
|
|
* vnode's v_writecount is decremented.
|
|
|
|
*/
|
|
|
|
vnode_pager_update_writecount(object, end, start);
|
|
|
|
VOP_UNLOCK(vp, 0);
|
|
|
|
vdrop(vp);
|
|
|
|
if (mp != NULL)
|
|
|
|
vn_finished_write(mp);
|
|
|
|
}
|