2005-01-07 02:29:27 +00:00
|
|
|
/*-
|
1994-05-24 10:09:53 +00:00
|
|
|
* Copyright (c) 1990 University of Utah.
|
|
|
|
* Copyright (c) 1991, 1993
|
|
|
|
* The Regents of the University of California. All rights reserved.
|
|
|
|
*
|
|
|
|
* This code is derived from software contributed to Berkeley by
|
|
|
|
* the Systems Programming Group of the University of Utah Computer
|
|
|
|
* Science Department.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 10:09:53 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1994-05-25 09:21:21 +00:00
|
|
|
* @(#)device_pager.c 8.1 (Berkeley) 6/11/93
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
|
2003-06-11 23:50:51 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
|
|
|
#include <sys/conf.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/lock.h>
|
2001-07-04 16:20:28 +00:00
|
|
|
#include <sys/proc.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/mutex.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/mman.h>
|
2013-03-09 02:32:23 +00:00
|
|
|
#include <sys/rwlock.h>
|
2001-04-18 20:24:16 +00:00
|
|
|
#include <sys/sx.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <vm/vm.h>
|
2012-08-05 14:11:42 +00:00
|
|
|
#include <vm/vm_param.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_object.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <vm/vm_page.h>
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
#include <vm/vm_pager.h>
|
2012-11-16 05:55:56 +00:00
|
|
|
#include <vm/vm_phys.h>
|
2002-03-20 04:02:59 +00:00
|
|
|
#include <vm/uma.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2002-03-19 22:20:14 +00:00
|
|
|
static void dev_pager_init(void);
|
|
|
|
static vm_object_t dev_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
|
|
|
vm_ooffset_t, struct ucred *);
|
2002-03-19 22:20:14 +00:00
|
|
|
static void dev_pager_dealloc(vm_object_t);
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
static int dev_pager_getpages(vm_object_t, vm_page_t *, int, int *, int *);
|
2015-11-29 11:37:25 +00:00
|
|
|
static void dev_pager_putpages(vm_object_t, vm_page_t *, int, int, int *);
|
|
|
|
static boolean_t dev_pager_haspage(vm_object_t, vm_pindex_t, int *, int *);
|
2012-05-12 20:49:58 +00:00
|
|
|
static void dev_pager_free_page(vm_object_t object, vm_page_t m);
|
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
|
|
|
static int dev_pager_populate(vm_object_t object, vm_pindex_t pidx,
|
|
|
|
int fault_type, vm_prot_t, vm_pindex_t *first, vm_pindex_t *last);
|
1995-12-14 09:55:16 +00:00
|
|
|
|
|
|
|
/* list of device pager objects */
|
|
|
|
static struct pagerlst dev_pager_object_list;
|
2001-04-18 20:24:16 +00:00
|
|
|
/* protect list manipulation */
|
|
|
|
static struct mtx dev_pager_mtx;
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
struct pagerops devicepagerops = {
|
2003-08-05 06:51:26 +00:00
|
|
|
.pgo_init = dev_pager_init,
|
|
|
|
.pgo_alloc = dev_pager_alloc,
|
|
|
|
.pgo_dealloc = dev_pager_dealloc,
|
|
|
|
.pgo_getpages = dev_pager_getpages,
|
|
|
|
.pgo_putpages = dev_pager_putpages,
|
|
|
|
.pgo_haspage = dev_pager_haspage,
|
1994-05-24 10:09:53 +00:00
|
|
|
};
|
|
|
|
|
2012-05-12 20:49:58 +00:00
|
|
|
struct pagerops mgtdevicepagerops = {
|
|
|
|
.pgo_alloc = dev_pager_alloc,
|
|
|
|
.pgo_dealloc = dev_pager_dealloc,
|
|
|
|
.pgo_getpages = dev_pager_getpages,
|
|
|
|
.pgo_putpages = dev_pager_putpages,
|
|
|
|
.pgo_haspage = dev_pager_haspage,
|
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
|
|
|
.pgo_populate = dev_pager_populate,
|
2012-05-12 20:49:58 +00:00
|
|
|
};
|
|
|
|
|
2011-11-15 14:40:00 +00:00
|
|
|
static int old_dev_pager_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
|
|
|
|
vm_ooffset_t foff, struct ucred *cred, u_short *color);
|
|
|
|
static void old_dev_pager_dtor(void *handle);
|
|
|
|
static int old_dev_pager_fault(vm_object_t object, vm_ooffset_t offset,
|
|
|
|
int prot, vm_page_t *mres);
|
|
|
|
|
|
|
|
static struct cdev_pager_ops old_dev_pager_ops = {
|
|
|
|
.cdev_pg_ctor = old_dev_pager_ctor,
|
|
|
|
.cdev_pg_dtor = old_dev_pager_dtor,
|
|
|
|
.cdev_pg_fault = old_dev_pager_fault
|
|
|
|
};
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
2015-11-29 11:37:25 +00:00
|
|
|
dev_pager_init(void)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2015-11-29 11:37:25 +00:00
|
|
|
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
TAILQ_INIT(&dev_pager_object_list);
|
2002-04-04 21:03:38 +00:00
|
|
|
mtx_init(&dev_pager_mtx, "dev_pager list", NULL, MTX_DEF);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2011-11-15 14:40:00 +00:00
|
|
|
vm_object_t
|
|
|
|
cdev_pager_lookup(void *handle)
|
|
|
|
{
|
|
|
|
vm_object_t object;
|
|
|
|
|
|
|
|
mtx_lock(&dev_pager_mtx);
|
|
|
|
object = vm_pager_object_lookup(&dev_pager_object_list, handle);
|
|
|
|
mtx_unlock(&dev_pager_mtx);
|
|
|
|
return (object);
|
|
|
|
}
|
|
|
|
|
|
|
|
vm_object_t
|
|
|
|
cdev_pager_allocate(void *handle, enum obj_type tp, struct cdev_pager_ops *ops,
|
|
|
|
vm_ooffset_t size, vm_prot_t prot, vm_ooffset_t foff, struct ucred *cred)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2007-08-07 15:36:25 +00:00
|
|
|
vm_object_t object, object1;
|
2004-01-04 20:55:15 +00:00
|
|
|
vm_pindex_t pindex;
|
2011-11-15 14:40:00 +00:00
|
|
|
u_short color;
|
|
|
|
|
2012-05-12 20:49:58 +00:00
|
|
|
if (tp != OBJT_DEVICE && tp != OBJT_MGTDEVICE)
|
2011-11-15 14:40:00 +00:00
|
|
|
return (NULL);
|
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
|
|
|
KASSERT(tp == OBJT_MGTDEVICE || ops->cdev_pg_populate == NULL,
|
|
|
|
("populate on unmanaged device pager"));
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2002-06-22 18:36:51 +00:00
|
|
|
/*
|
|
|
|
* Offset should be page aligned.
|
|
|
|
*/
|
|
|
|
if (foff & PAGE_MASK)
|
|
|
|
return (NULL);
|
|
|
|
|
2017-02-12 21:05:44 +00:00
|
|
|
/*
|
|
|
|
* Treat the mmap(2) file offset as an unsigned value for a
|
|
|
|
* device mapping. This, in effect, allows a user to pass all
|
|
|
|
* possible off_t values as the mapping cookie to the driver. At
|
|
|
|
* this point, we know that both foff and size are a multiple
|
|
|
|
* of the page size. Do a check to avoid wrap.
|
|
|
|
*/
|
2002-06-22 18:36:51 +00:00
|
|
|
size = round_page(size);
|
2017-02-12 21:05:44 +00:00
|
|
|
pindex = UOFF_TO_IDX(foff) + UOFF_TO_IDX(size);
|
|
|
|
if (pindex > OBJ_MAX_SIZE || pindex < UOFF_TO_IDX(foff) ||
|
|
|
|
pindex < UOFF_TO_IDX(size))
|
|
|
|
return (NULL);
|
2002-06-22 18:36:51 +00:00
|
|
|
|
2011-11-15 14:40:00 +00:00
|
|
|
if (ops->cdev_pg_ctor(handle, size, prot, foff, cred, &color) != 0)
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
return (NULL);
|
2007-08-07 15:36:25 +00:00
|
|
|
mtx_lock(&dev_pager_mtx);
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Look up pager, creating as necessary.
|
|
|
|
*/
|
2007-08-07 15:36:25 +00:00
|
|
|
object1 = NULL;
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
object = vm_pager_object_lookup(&dev_pager_object_list, handle);
|
|
|
|
if (object == NULL) {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2008-05-17 16:26:34 +00:00
|
|
|
* Allocate object and associate it with the pager. Initialize
|
|
|
|
* the object's pg_color based upon the physical address of the
|
|
|
|
* device's memory.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2001-04-18 20:24:16 +00:00
|
|
|
mtx_unlock(&dev_pager_mtx);
|
2011-11-15 14:40:00 +00:00
|
|
|
object1 = vm_object_allocate(tp, pindex);
|
2008-05-17 16:26:34 +00:00
|
|
|
object1->flags |= OBJ_COLORED;
|
2011-11-15 14:40:00 +00:00
|
|
|
object1->pg_color = color;
|
|
|
|
object1->handle = handle;
|
|
|
|
object1->un_pager.devp.ops = ops;
|
Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.
This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.
The object handle is cast to a struct cdev *, and passed into
dev_rel().
That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device. The loser of the race
deallocates its object at the end of the function.
The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address. This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.
The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor(). That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel(). dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.
The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.
vm_object.h: Add a struct cdev pointer to the VM object
structure.
device_pager.c: In cdev_pager_allocate(), populate the new cdev
pointer.
In dev_pager_dealloc(), use the new cdev pointer
when calling the object's cdev_pg_dtor() routine.
Reviewed by: kib
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
2013-01-09 16:48:38 +00:00
|
|
|
object1->un_pager.devp.dev = handle;
|
2011-07-30 14:13:57 +00:00
|
|
|
TAILQ_INIT(&object1->un_pager.devp.devp_pglist);
|
2007-08-07 15:36:25 +00:00
|
|
|
mtx_lock(&dev_pager_mtx);
|
|
|
|
object = vm_pager_object_lookup(&dev_pager_object_list, handle);
|
|
|
|
if (object != NULL) {
|
|
|
|
/*
|
|
|
|
* We raced with other thread while allocating object.
|
|
|
|
*/
|
|
|
|
if (pindex > object->size)
|
|
|
|
object->size = pindex;
|
2014-02-11 22:05:21 +00:00
|
|
|
KASSERT(object->type == tp,
|
2016-10-30 18:04:11 +00:00
|
|
|
("Inconsistent device pager type %p %d",
|
|
|
|
object, tp));
|
|
|
|
KASSERT(object->un_pager.devp.ops == ops,
|
|
|
|
("Inconsistent devops %p %p", object, ops));
|
2007-08-07 15:36:25 +00:00
|
|
|
} else {
|
|
|
|
object = object1;
|
|
|
|
object1 = NULL;
|
|
|
|
object->handle = handle;
|
|
|
|
TAILQ_INSERT_TAIL(&dev_pager_object_list, object,
|
|
|
|
pager_object_list);
|
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
|
|
|
if (ops->cdev_pg_populate != NULL)
|
|
|
|
vm_object_set_flag(object, OBJ_POPULATE);
|
2007-08-07 15:36:25 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
} else {
|
2004-01-04 20:55:15 +00:00
|
|
|
if (pindex > object->size)
|
|
|
|
object->size = pindex;
|
2014-02-11 22:05:21 +00:00
|
|
|
KASSERT(object->type == tp,
|
|
|
|
("Inconsistent device pager type %p %d", object, tp));
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2007-08-07 15:36:25 +00:00
|
|
|
mtx_unlock(&dev_pager_mtx);
|
2011-07-30 14:13:57 +00:00
|
|
|
if (object1 != NULL) {
|
|
|
|
object1->handle = object1;
|
|
|
|
mtx_lock(&dev_pager_mtx);
|
|
|
|
TAILQ_INSERT_TAIL(&dev_pager_object_list, object1,
|
|
|
|
pager_object_list);
|
|
|
|
mtx_unlock(&dev_pager_mtx);
|
|
|
|
vm_object_deallocate(object1);
|
|
|
|
}
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
return (object);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2011-11-15 14:40:00 +00:00
|
|
|
static vm_object_t
|
|
|
|
dev_pager_alloc(void *handle, vm_ooffset_t size, vm_prot_t prot,
|
|
|
|
vm_ooffset_t foff, struct ucred *cred)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (cdev_pager_allocate(handle, OBJT_DEVICE, &old_dev_pager_ops,
|
|
|
|
size, prot, foff, cred));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
cdev_pager_free_page(vm_object_t object, vm_page_t m)
|
|
|
|
{
|
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_ASSERT_WLOCKED(object);
|
2012-05-12 20:49:58 +00:00
|
|
|
if (object->type == OBJT_MGTDEVICE) {
|
|
|
|
KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("unmanaged %p", m));
|
|
|
|
pmap_remove_all(m);
|
|
|
|
vm_page_lock(m);
|
|
|
|
vm_page_remove(m);
|
|
|
|
vm_page_unlock(m);
|
|
|
|
} else if (object->type == OBJT_DEVICE)
|
|
|
|
dev_pager_free_page(object, m);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dev_pager_free_page(vm_object_t object, vm_page_t m)
|
|
|
|
{
|
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_ASSERT_WLOCKED(object);
|
2012-05-12 20:49:58 +00:00
|
|
|
KASSERT((object->type == OBJT_DEVICE &&
|
|
|
|
(m->oflags & VPO_UNMANAGED) != 0),
|
|
|
|
("Managed device or page obj %p m %p", object, m));
|
2013-08-10 17:36:42 +00:00
|
|
|
TAILQ_REMOVE(&object->un_pager.devp.devp_pglist, m, plinks.q);
|
2011-11-15 14:40:00 +00:00
|
|
|
vm_page_putfake(m);
|
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
2015-11-29 11:37:25 +00:00
|
|
|
dev_pager_dealloc(vm_object_t object)
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
{
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_page_t m;
|
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.
This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.
The object handle is cast to a struct cdev *, and passed into
dev_rel().
That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device. The loser of the race
deallocates its object at the end of the function.
The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address. This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.
The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor(). That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel(). dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.
The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.
vm_object.h: Add a struct cdev pointer to the VM object
structure.
device_pager.c: In cdev_pager_allocate(), populate the new cdev
pointer.
In dev_pager_dealloc(), use the new cdev pointer
when calling the object's cdev_pg_dtor() routine.
Reviewed by: kib
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
2013-01-09 16:48:38 +00:00
|
|
|
object->un_pager.devp.ops->cdev_pg_dtor(object->un_pager.devp.dev);
|
2011-11-15 14:40:00 +00:00
|
|
|
|
2001-04-18 20:24:16 +00:00
|
|
|
mtx_lock(&dev_pager_mtx);
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
TAILQ_REMOVE(&dev_pager_object_list, object, pager_object_list);
|
2001-04-18 20:24:16 +00:00
|
|
|
mtx_unlock(&dev_pager_mtx);
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2012-05-12 20:49:58 +00:00
|
|
|
|
|
|
|
if (object->type == OBJT_DEVICE) {
|
|
|
|
/*
|
|
|
|
* Free up our fake pages.
|
|
|
|
*/
|
|
|
|
while ((m = TAILQ_FIRST(&object->un_pager.devp.devp_pglist))
|
|
|
|
!= NULL)
|
|
|
|
dev_pager_free_page(object, m);
|
|
|
|
}
|
2015-05-08 19:43:37 +00:00
|
|
|
object->handle = NULL;
|
|
|
|
object->type = OBJT_DEAD;
|
2011-11-15 14:40:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int *rbehind,
|
|
|
|
int *rahead)
|
2011-11-15 14:40:00 +00:00
|
|
|
{
|
2015-11-29 11:37:25 +00:00
|
|
|
int error;
|
2011-11-15 14:40:00 +00:00
|
|
|
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
/* Since our haspage reports zero after/before, the count is 1. */
|
|
|
|
KASSERT(count == 1, ("%s: count %d", __func__, count));
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_ASSERT_WLOCKED(object);
|
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
|
|
|
if (object->un_pager.devp.ops->cdev_pg_fault == NULL)
|
|
|
|
return (VM_PAGER_FAIL);
|
2011-11-15 14:40:00 +00:00
|
|
|
error = object->un_pager.devp.ops->cdev_pg_fault(object,
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
IDX_TO_OFF(ma[0]->pindex), PROT_READ, &ma[0]);
|
2011-11-15 14:40:00 +00:00
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_ASSERT_WLOCKED(object);
|
2011-11-15 14:40:00 +00:00
|
|
|
|
|
|
|
if (error == VM_PAGER_OK) {
|
2012-05-12 20:49:58 +00:00
|
|
|
KASSERT((object->type == OBJT_DEVICE &&
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
(ma[0]->oflags & VPO_UNMANAGED) != 0) ||
|
2012-05-12 20:49:58 +00:00
|
|
|
(object->type == OBJT_MGTDEVICE &&
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
(ma[0]->oflags & VPO_UNMANAGED) == 0),
|
|
|
|
("Wrong page type %p %p", ma[0], object));
|
2012-05-12 20:49:58 +00:00
|
|
|
if (object->type == OBJT_DEVICE) {
|
|
|
|
TAILQ_INSERT_TAIL(&object->un_pager.devp.devp_pglist,
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
ma[0], plinks.q);
|
2012-05-12 20:49:58 +00:00
|
|
|
}
|
A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
2015-12-16 21:30:45 +00:00
|
|
|
if (rbehind)
|
|
|
|
*rbehind = 0;
|
|
|
|
if (rahead)
|
|
|
|
*rahead = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2011-11-15 14:40:00 +00:00
|
|
|
|
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
|
|
|
static int
|
|
|
|
dev_pager_populate(vm_object_t object, vm_pindex_t pidx, int fault_type,
|
|
|
|
vm_prot_t max_prot, vm_pindex_t *first, vm_pindex_t *last)
|
|
|
|
{
|
|
|
|
|
|
|
|
VM_OBJECT_ASSERT_WLOCKED(object);
|
|
|
|
if (object->un_pager.devp.ops->cdev_pg_populate == NULL)
|
|
|
|
return (VM_PAGER_FAIL);
|
|
|
|
return (object->un_pager.devp.ops->cdev_pg_populate(object, pidx,
|
|
|
|
fault_type, max_prot, first, last));
|
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static int
|
2011-11-15 14:40:00 +00:00
|
|
|
old_dev_pager_fault(vm_object_t object, vm_ooffset_t offset, int prot,
|
|
|
|
vm_page_t *mres)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2003-03-25 00:07:06 +00:00
|
|
|
vm_paddr_t paddr;
|
2009-07-12 23:31:20 +00:00
|
|
|
vm_page_t m_paddr, page;
|
2004-06-16 09:47:26 +00:00
|
|
|
struct cdev *dev;
|
2004-09-24 05:59:11 +00:00
|
|
|
struct cdevsw *csw;
|
2008-09-26 14:50:49 +00:00
|
|
|
struct file *fpop;
|
2011-11-15 14:40:00 +00:00
|
|
|
struct thread *td;
|
2016-05-01 17:48:43 +00:00
|
|
|
vm_memattr_t memattr, memattr1;
|
2011-11-15 14:40:00 +00:00
|
|
|
int ref, ret;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-07-12 23:31:20 +00:00
|
|
|
memattr = object->memattr;
|
2011-11-15 14:40:00 +00:00
|
|
|
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2011-11-15 14:40:00 +00:00
|
|
|
|
|
|
|
dev = object->handle;
|
2010-08-06 09:42:15 +00:00
|
|
|
csw = dev_refthread(dev, &ref);
|
2011-07-06 15:09:52 +00:00
|
|
|
if (csw == NULL) {
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2011-07-06 15:09:52 +00:00
|
|
|
return (VM_PAGER_FAIL);
|
|
|
|
}
|
2008-09-26 14:50:49 +00:00
|
|
|
td = curthread;
|
|
|
|
fpop = td->td_fpop;
|
|
|
|
td->td_fpop = NULL;
|
2011-11-15 14:40:00 +00:00
|
|
|
ret = csw->d_mmap(dev, offset, &paddr, prot, &memattr);
|
2008-09-26 14:50:49 +00:00
|
|
|
td->td_fpop = fpop;
|
2010-08-06 09:42:15 +00:00
|
|
|
dev_relthread(dev, ref);
|
2011-11-15 14:40:00 +00:00
|
|
|
if (ret != 0) {
|
|
|
|
printf(
|
|
|
|
"WARNING: dev_pager_getpage: map function returns error %d", ret);
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2011-11-15 14:40:00 +00:00
|
|
|
return (VM_PAGER_FAIL);
|
|
|
|
}
|
|
|
|
|
2009-07-12 23:31:20 +00:00
|
|
|
/* If "paddr" is a real page, perform a sanity check on "memattr". */
|
|
|
|
if ((m_paddr = vm_phys_paddr_to_vm_page(paddr)) != NULL &&
|
2016-05-01 17:48:43 +00:00
|
|
|
(memattr1 = pmap_page_get_memattr(m_paddr)) != memattr) {
|
|
|
|
/*
|
|
|
|
* For the /dev/mem d_mmap routine to return the
|
|
|
|
* correct memattr, pmap_page_get_memattr() needs to
|
|
|
|
* be called, which we do there.
|
|
|
|
*/
|
|
|
|
if ((csw->d_flags & D_MEM) == 0) {
|
|
|
|
printf("WARNING: Device driver %s has set "
|
|
|
|
"\"memattr\" inconsistently (drv %u pmap %u).\n",
|
|
|
|
csw->d_name, memattr, memattr1);
|
|
|
|
}
|
|
|
|
memattr = memattr1;
|
2009-07-12 23:31:20 +00:00
|
|
|
}
|
2011-11-15 14:40:00 +00:00
|
|
|
if (((*mres)->flags & PG_FICTITIOUS) != 0) {
|
2004-07-30 11:09:18 +00:00
|
|
|
/*
|
2011-11-15 14:40:00 +00:00
|
|
|
* If the passed in result page is a fake page, update it with
|
2004-07-30 11:09:18 +00:00
|
|
|
* the new physical address.
|
|
|
|
*/
|
2011-11-15 14:40:00 +00:00
|
|
|
page = *mres;
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2011-03-11 07:07:48 +00:00
|
|
|
vm_page_updatefake(page, paddr, memattr);
|
2004-07-30 11:09:18 +00:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Replace the passed in reqpage page with our own fake page and
|
|
|
|
* free up the all of the original pages.
|
|
|
|
*/
|
2011-03-11 07:07:48 +00:00
|
|
|
page = vm_page_getfake(paddr, memattr);
|
2013-03-09 02:32:23 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
2015-12-17 17:48:57 +00:00
|
|
|
vm_page_replace_checked(page, object, (*mres)->pindex, *mres);
|
2011-11-15 14:40:00 +00:00
|
|
|
vm_page_lock(*mres);
|
|
|
|
vm_page_free(*mres);
|
|
|
|
vm_page_unlock(*mres);
|
|
|
|
*mres = page;
|
2004-07-30 11:09:18 +00:00
|
|
|
}
|
2009-06-22 19:09:48 +00:00
|
|
|
page->valid = VM_PAGE_BITS_ALL;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
return (VM_PAGER_OK);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1999-01-24 02:32:15 +00:00
|
|
|
static void
|
2015-11-29 11:37:25 +00:00
|
|
|
dev_pager_putpages(vm_object_t object, vm_page_t *m, int count, int flags,
|
|
|
|
int *rtvals)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2011-11-15 14:40:00 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
panic("dev_pager_putpage called");
|
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static boolean_t
|
2015-11-29 11:37:25 +00:00
|
|
|
dev_pager_haspage(vm_object_t object, vm_pindex_t pindex, int *before,
|
|
|
|
int *after)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2015-11-29 11:37:25 +00:00
|
|
|
|
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
1995-07-13 08:48:48 +00:00
|
|
|
if (before != NULL)
|
|
|
|
*before = 0;
|
|
|
|
if (after != NULL)
|
|
|
|
*after = 0;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
return (TRUE);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2011-11-15 14:40:00 +00:00
|
|
|
|
|
|
|
static int
|
|
|
|
old_dev_pager_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
|
|
|
|
vm_ooffset_t foff, struct ucred *cred, u_short *color)
|
|
|
|
{
|
|
|
|
struct cdev *dev;
|
|
|
|
struct cdevsw *csw;
|
|
|
|
vm_memattr_t dummy;
|
|
|
|
vm_ooffset_t off;
|
|
|
|
vm_paddr_t paddr;
|
|
|
|
unsigned int npages;
|
|
|
|
int ref;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure this device can be mapped.
|
|
|
|
*/
|
|
|
|
dev = handle;
|
|
|
|
csw = dev_refthread(dev, &ref);
|
|
|
|
if (csw == NULL)
|
|
|
|
return (ENXIO);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that the specified range of the device allows the desired
|
|
|
|
* protection.
|
|
|
|
*
|
|
|
|
* XXX assumes VM_PROT_* == PROT_*
|
|
|
|
*/
|
|
|
|
npages = OFF_TO_IDX(size);
|
2014-03-12 16:38:55 +00:00
|
|
|
paddr = 0; /* Make paddr initialized for the case of size == 0. */
|
2011-11-15 14:40:00 +00:00
|
|
|
for (off = foff; npages--; off += PAGE_SIZE) {
|
|
|
|
if (csw->d_mmap(dev, off, &paddr, (int)prot, &dummy) != 0) {
|
|
|
|
dev_relthread(dev, ref);
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
dev_ref(dev);
|
|
|
|
dev_relthread(dev, ref);
|
|
|
|
*color = atop(paddr) - OFF_TO_IDX(off - PAGE_SIZE);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
old_dev_pager_dtor(void *handle)
|
|
|
|
{
|
|
|
|
|
|
|
|
dev_rel(handle);
|
|
|
|
}
|