/*-
 * SPDX-License-Identifier: (BSD-3-Clause AND MIT-CMU)
 *
 * Copyright (c) 1991, 1993
 *	The Regents of the University of California.  All rights reserved.
 *
 * This code is derived from software contributed to Berkeley by
 * The Mach Operating System project at Carnegie-Mellon University.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	from: @(#)vm_map.c	8.3 (Berkeley) 1/12/94
 *
 *
 * Copyright (c) 1987, 1990 Carnegie-Mellon University.
 * All rights reserved.
 *
 * Authors: Avadis Tevanian, Jr., Michael Wayne Young
 *
 * Permission to use, copy, modify and distribute this software and
 * its documentation is hereby granted, provided that both the copyright
 * notice and this permission notice appear in all copies of the
 * software, derivative works or modified versions, and any portions
 * thereof, and that both notices appear in supporting documentation.
 *
 * CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
 * CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
 * FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
 *
 * Carnegie Mellon requests users of this software to return to
 *
 *  Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
 *  School of Computer Science
 *  Carnegie Mellon University
 *  Pittsburgh PA 15213-3890
 *
 * any improvements or extensions that they make and grant Carnegie the
 * rights to redistribute these changes.
 */

/*
 *	Virtual memory mapping module.
 */

#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/elf.h>
#include <sys/kernel.h>
#include <sys/ktr.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/vmmeter.h>
#include <sys/mman.h>
#include <sys/vnode.h>
#include <sys/racct.h>
#include <sys/resourcevar.h>
#include <sys/rwlock.h>
#include <sys/file.h>
#include <sys/sysctl.h>
#include <sys/sysent.h>
#include <sys/shm.h>

#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
#include <vm/vm_page.h>
#include <vm/vm_pageout.h>
#include <vm/vm_object.h>
#include <vm/vm_pager.h>
#include <vm/vm_kern.h>
#include <vm/vm_extern.h>
#include <vm/vnode_pager.h>
#include <vm/swap_pager.h>
#include <vm/uma.h>

/*
 *	Virtual memory maps provide for the mapping, protection,
 *	and sharing of virtual memory objects.  In addition,
 *	this module provides for an efficient virtual copy of
 *	memory from one map to another.
 *
 *	Synchronization is required prior to most operations.
 *
 *	Maps consist of an ordered doubly-linked list of simple
 *	entries; a self-adjusting binary search tree of these
 *	entries is used to speed up lookups.
 *
 *	Since portions of maps are specified by start/end addresses,
 *	which may not align with existing map entries, all
 *	routines merely "clip" entries to these start/end values.
 *	[That is, an entry is split into two, bordering at a
 *	start or end value.]  Note that these clippings may not
 *	always be necessary (as the two resulting entries are then
 *	not changed); however, the clipping is done for convenience.
 *
 *	As mentioned above, virtual copy operations are performed
 *	by copying VM object references from one map to
 *	another, and then marking both regions as copy-on-write.
 */

static struct mtx map_sleep_mtx;
static uma_zone_t mapentzone;
static uma_zone_t kmapentzone;
static uma_zone_t vmspace_zone;
static int vmspace_zinit(void *mem, int size, int flags);
static void _vm_map_init(vm_map_t map, pmap_t pmap, vm_offset_t min,
    vm_offset_t max);
static void vm_map_entry_deallocate(vm_map_entry_t entry, boolean_t system_map);
static void vm_map_entry_dispose(vm_map_t map, vm_map_entry_t entry);
static void vm_map_entry_unwire(vm_map_t map, vm_map_entry_t entry);
static int vm_map_growstack(vm_map_t map, vm_offset_t addr,
    vm_map_entry_t gap_entry);
static void vm_map_pmap_enter(vm_map_t map, vm_offset_t addr, vm_prot_t prot,
    vm_object_t object, vm_pindex_t pindex, vm_size_t size, int flags);
#ifdef INVARIANTS
static void vmspace_zdtor(void *mem, int size, void *arg);
#endif
static int vm_map_stack_locked(vm_map_t map, vm_offset_t addrbos,
    vm_size_t max_ssize, vm_size_t growsize, vm_prot_t prot, vm_prot_t max,
    int cow);
static void vm_map_wire_entry_failure(vm_map_t map, vm_map_entry_t entry,
    vm_offset_t failed_addr);

#define	ENTRY_CHARGED(e) ((e)->cred != NULL || \
    ((e)->object.vm_object != NULL && (e)->object.vm_object->cred != NULL && \
    !((e)->eflags & MAP_ENTRY_NEEDS_COPY)))

/*
 * PROC_VMSPACE_{UN,}LOCK() can be a noop as long as vmspaces are type
 * stable.
 */
#define PROC_VMSPACE_LOCK(p) do { } while (0)
#define PROC_VMSPACE_UNLOCK(p) do { } while (0)

/*
 *	VM_MAP_RANGE_CHECK:	[ internal use only ]
 *
 *	Asserts that the starting and ending region
 *	addresses fall within the valid range of the map.
 */
#define	VM_MAP_RANGE_CHECK(map, start, end)		\
		{					\
		if (start < vm_map_min(map))		\
			start = vm_map_min(map);	\
		if (end > vm_map_max(map))		\
			end = vm_map_max(map);		\
		if (start > end)			\
			start = end;			\
		}

#ifndef UMA_MD_SMALL_ALLOC

/*
 * Allocate a new slab for kernel map entries.  The kernel map may be locked or
 * unlocked, depending on whether the request is coming from the kernel map or a
 * submap.  This function allocates a virtual address range directly from the
 * kernel map instead of the kmem_* layer to avoid recursion on the kernel map
 * lock and also to avoid triggering allocator recursion in the vmem boundary
 * tag allocator.
 */
static void *
kmapent_alloc(uma_zone_t zone, vm_size_t bytes, int domain, uint8_t *pflag,
    int wait)
{
	vm_offset_t addr;
	int error, locked;

	*pflag = UMA_SLAB_PRIV;

	if (!(locked = vm_map_locked(kernel_map)))
		vm_map_lock(kernel_map);
	addr = vm_map_findspace(kernel_map, vm_map_min(kernel_map), bytes);
	if (addr + bytes < addr || addr + bytes > vm_map_max(kernel_map))
		panic("%s: kernel map is exhausted", __func__);
	error = vm_map_insert(kernel_map, NULL, 0, addr, addr + bytes,
	    VM_PROT_RW, VM_PROT_RW, MAP_NOFAULT);
	if (error != KERN_SUCCESS)
		panic("%s: vm_map_insert() failed: %d", __func__, error);
	if (!locked)
		vm_map_unlock(kernel_map);
	error = kmem_back_domain(domain, kernel_object, addr, bytes, M_NOWAIT |
	    M_USE_RESERVE | (wait & M_ZERO));
	if (error == KERN_SUCCESS) {
		return ((void *)addr);
	} else {
		if (!locked)
			vm_map_lock(kernel_map);
		vm_map_delete(kernel_map, addr, bytes);
		if (!locked)
			vm_map_unlock(kernel_map);
		return (NULL);
	}
}

static void
kmapent_free(void *item, vm_size_t size, uint8_t pflag)
{
	vm_offset_t addr;
	int error;

	if ((pflag & UMA_SLAB_PRIV) == 0)
		/* XXX leaked */
		return;

	addr = (vm_offset_t)item;
	kmem_unback(kernel_object, addr, size);
	error = vm_map_remove(kernel_map, addr, addr + size);
	KASSERT(error == KERN_SUCCESS,
	    ("%s: vm_map_remove failed: %d", __func__, error));
}

/*
 * The worst-case upper bound on the number of kernel map entries that may be
 * created before the zone must be replenished in _vm_map_unlock().
 */
#define	KMAPENT_RESERVE		1

#endif /* !UMA_MD_SMALL_ALLOC */

/*
 *	vm_map_startup:
 *
 *	Initialize the vm_map module.  Must be called before any other vm_map
 *	routines.
 *
 *	User map and entry structures are allocated from the general purpose
 *	memory pool.  Kernel maps are statically defined.  Kernel map entries
 *	require special handling to avoid recursion; see the comments above
 *	kmapent_alloc() and in vm_map_entry_create().
 */
void
vm_map_startup(void)
{
	mtx_init(&map_sleep_mtx, "vm map sleep mutex", NULL, MTX_DEF);

	/*
	 * Disable the use of per-CPU buckets: map entry allocation is
	 * serialized by the kernel map lock.
	 */
	kmapentzone = uma_zcreate("KMAP ENTRY", sizeof(struct vm_map_entry),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
	    UMA_ZONE_VM | UMA_ZONE_NOBUCKET);
#ifndef UMA_MD_SMALL_ALLOC
	/* Reserve an extra map entry for use when replenishing the reserve. */
	uma_zone_reserve(kmapentzone, KMAPENT_RESERVE + 1);
	uma_prealloc(kmapentzone, KMAPENT_RESERVE + 1);
	uma_zone_set_allocf(kmapentzone, kmapent_alloc);
	uma_zone_set_freef(kmapentzone, kmapent_free);
#endif

	mapentzone = uma_zcreate("MAP ENTRY", sizeof(struct vm_map_entry),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	vmspace_zone = uma_zcreate("VMSPACE", sizeof(struct vmspace), NULL,
#ifdef INVARIANTS
	    vmspace_zdtor,
#else
	    NULL,
#endif
	    vmspace_zinit, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
}

static int
vmspace_zinit(void *mem, int size, int flags)
{
	struct vmspace *vm;
	vm_map_t map;

	vm = (struct vmspace *)mem;
	map = &vm->vm_map;

	memset(map, 0, sizeof(*map));
	mtx_init(&map->system_mtx, "vm map (system)", NULL,
	    MTX_DEF | MTX_DUPOK);
	sx_init(&map->lock, "vm map (user)");
	PMAP_LOCK_INIT(vmspace_pmap(vm));
	return (0);
}

#ifdef INVARIANTS
static void
vmspace_zdtor(void *mem, int size, void *arg)
{
	struct vmspace *vm;

	vm = (struct vmspace *)mem;
	KASSERT(vm->vm_map.nentries == 0,
	    ("vmspace %p nentries == %d on free", vm, vm->vm_map.nentries));
	KASSERT(vm->vm_map.size == 0,
	    ("vmspace %p size == %ju on free", vm, (uintmax_t)vm->vm_map.size));
}
#endif	/* INVARIANTS */

/*
 * Allocate a vmspace structure, including a vm_map and pmap,
 * and initialize those structures.  The refcnt is set to 1.
 */
struct vmspace *
vmspace_alloc(vm_offset_t min, vm_offset_t max, pmap_pinit_t pinit)
{
	struct vmspace *vm;

	vm = uma_zalloc(vmspace_zone, M_WAITOK);
	KASSERT(vm->vm_map.pmap == NULL, ("vm_map.pmap must be NULL"));
	if (!pinit(vmspace_pmap(vm))) {
		uma_zfree(vmspace_zone, vm);
		return (NULL);
	}
	CTR1(KTR_VM, "vmspace_alloc: %p", vm);
	_vm_map_init(&vm->vm_map, vmspace_pmap(vm), min, max);
	refcount_init(&vm->vm_refcnt, 1);
	vm->vm_shm = NULL;
	vm->vm_swrss = 0;
	vm->vm_tsize = 0;
	vm->vm_dsize = 0;
	vm->vm_ssize = 0;
	vm->vm_taddr = 0;
	vm->vm_daddr = 0;
	vm->vm_maxsaddr = 0;
	return (vm);
}

2015-04-29 10:23:02 +00:00
|
|
|
#ifdef RACCT
|
2011-04-05 20:23:59 +00:00
|
|
|
static void
|
|
|
|
vmspace_container_reset(struct proc *p)
|
|
|
|
{
|
|
|
|
|
|
|
|
PROC_LOCK(p);
|
|
|
|
racct_set(p, RACCT_DATA, 0);
|
|
|
|
racct_set(p, RACCT_STACK, 0);
|
|
|
|
racct_set(p, RACCT_RSS, 0);
|
|
|
|
racct_set(p, RACCT_MEMLOCK, 0);
|
|
|
|
racct_set(p, RACCT_VMEM, 0);
|
|
|
|
PROC_UNLOCK(p);
|
|
|
|
}
|
2015-04-29 10:23:02 +00:00
|
|
|
#endif
|
2011-04-05 20:23:59 +00:00
|
|
|
|
2006-03-08 06:31:46 +00:00
|
|
|
static inline void
|
2002-03-10 21:52:48 +00:00
|
|
|
vmspace_dofree(struct vmspace *vm)
|
2002-02-05 21:23:05 +00:00
|
|
|
{
|
2010-04-03 16:20:22 +00:00
|
|
|
|
2002-02-05 21:23:05 +00:00
|
|
|
CTR1(KTR_VM, "vmspace_free: %p", vm);
|
2003-01-13 23:04:32 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure any SysV shm is freed, it might not have been in
|
|
|
|
* exit1().
|
|
|
|
*/
|
|
|
|
shmexit(vm);
|
|
|
|
|
2002-02-05 21:23:05 +00:00
|
|
|
/*
|
|
|
|
* Lock the map, to wait out all other references to it.
|
|
|
|
* Delete all of the mappings and pages they hold, then call
|
|
|
|
* the pmap module to reclaim anything left.
|
|
|
|
*/
|
2018-08-29 12:24:19 +00:00
|
|
|
(void)vm_map_remove(&vm->vm_map, vm_map_min(&vm->vm_map),
|
|
|
|
vm_map_max(&vm->vm_map));
|
2002-03-19 09:11:49 +00:00
|
|
|
|
2010-04-03 16:20:22 +00:00
|
|
|
pmap_release(vmspace_pmap(vm));
|
|
|
|
vm->vm_map.pmap = NULL;
|
2002-03-19 09:11:49 +00:00
|
|
|
uma_zfree(vmspace_zone, vm);
|
2002-02-05 21:23:05 +00:00
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
void
|
2001-07-04 20:15:18 +00:00
|
|
|
vmspace_free(struct vmspace *vm)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
|
|
|
|
2015-01-24 16:59:38 +00:00
|
|
|
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
|
2016-04-27 21:51:24 +00:00
|
|
|
"vmspace_free() called");
|
2015-01-24 16:59:38 +00:00
|
|
|
|
2020-11-04 16:30:56 +00:00
|
|
|
if (refcount_release(&vm->vm_refcnt))
|
2002-02-05 21:23:05 +00:00
|
|
|
vmspace_dofree(vm);
|
|
|
|
}
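vmspace_free() above follows the standard last-reference pattern: only the caller whose release drops the count from 1 to 0 performs the teardown. A minimal user-space sketch of that pattern, using C11 atomics as hypothetical stand-ins for the kernel's refcount(9) primitives (not the kernel implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for refcount_init/acquire/release(9),
 * sketching the last-reference protocol vmspace_free() relies on. */
static void refcount_init(atomic_uint *count, unsigned value)
{
	atomic_store(count, value);
}

static void refcount_acquire(atomic_uint *count)
{
	atomic_fetch_add(count, 1);
}

/* Returns true iff this call released the last reference, i.e. the
 * caller observed the 1 -> 0 transition and owns the teardown. */
static bool refcount_release(atomic_uint *count)
{
	return (atomic_fetch_sub(count, 1) == 1);
}
```

Under this sketch, when two holders release concurrently, exactly one of them sees `refcount_release()` return true and would run the vmspace_dofree() equivalent.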
|
|
|
|
|
|
|
|
void
|
|
|
|
vmspace_exitfree(struct proc *p)
|
|
|
|
{
|
2002-04-17 05:26:42 +00:00
|
|
|
struct vmspace *vm;
|
1996-01-19 04:00:31 +00:00
|
|
|
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_LOCK(p);
|
2002-12-15 18:50:04 +00:00
|
|
|
vm = p->p_vmspace;
|
|
|
|
p->p_vmspace = NULL;
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
KASSERT(vm == &vmspace0, ("vmspace_exitfree: wrong vmspace"));
|
|
|
|
vmspace_free(vm);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vmspace_exit(struct thread *td)
|
|
|
|
{
|
|
|
|
struct vmspace *vm;
|
|
|
|
struct proc *p;
|
2020-11-04 16:30:56 +00:00
|
|
|
bool released;
|
2006-05-29 21:28:56 +00:00
|
|
|
|
|
|
|
p = td->td_proc;
|
|
|
|
vm = p->p_vmspace;
|
2020-11-04 16:30:56 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Prepare to release the vmspace reference. The thread that releases
|
|
|
|
* the last reference is responsible for tearing down the vmspace.
|
|
|
|
* However, threads not releasing the final reference must switch to the
|
|
|
|
* kernel's vmspace0 before the decrement so that the subsequent pmap
|
|
|
|
* deactivation does not modify a freed vmspace.
|
|
|
|
*/
|
|
|
|
refcount_acquire(&vmspace0.vm_refcnt);
|
|
|
|
if (!(released = refcount_release_if_last(&vm->vm_refcnt))) {
|
|
|
|
if (p->p_vmspace != &vmspace0) {
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_LOCK(p);
|
|
|
|
p->p_vmspace = &vmspace0;
|
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
pmap_activate(td);
|
|
|
|
}
|
2020-11-04 16:30:56 +00:00
|
|
|
released = refcount_release(&vm->vm_refcnt);
|
|
|
|
}
|
|
|
|
if (released) {
|
|
|
|
/*
|
|
|
|
* pmap_remove_pages() expects the pmap to be active, so switch
|
|
|
|
* back first if necessary.
|
|
|
|
*/
|
2006-05-29 21:28:56 +00:00
|
|
|
if (p->p_vmspace != vm) {
|
|
|
|
PROC_VMSPACE_LOCK(p);
|
|
|
|
p->p_vmspace = vm;
|
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
pmap_activate(td);
|
|
|
|
}
|
|
|
|
pmap_remove_pages(vmspace_pmap(vm));
|
|
|
|
PROC_VMSPACE_LOCK(p);
|
|
|
|
p->p_vmspace = &vmspace0;
|
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
pmap_activate(td);
|
2002-04-17 05:26:42 +00:00
|
|
|
vmspace_dofree(vm);
|
2006-05-29 21:28:56 +00:00
|
|
|
}
|
2015-04-29 10:23:02 +00:00
|
|
|
#ifdef RACCT
|
|
|
|
if (racct_enable)
|
|
|
|
vmspace_container_reset(p);
|
|
|
|
#endif
|
2006-05-29 21:28:56 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Acquire reference to vmspace owned by another process. */
|
|
|
|
|
|
|
|
struct vmspace *
|
|
|
|
vmspace_acquire_ref(struct proc *p)
|
|
|
|
{
|
|
|
|
struct vmspace *vm;
|
|
|
|
|
|
|
|
PROC_VMSPACE_LOCK(p);
|
|
|
|
vm = p->p_vmspace;
|
2020-11-04 16:30:56 +00:00
|
|
|
if (vm == NULL || !refcount_acquire_if_not_zero(&vm->vm_refcnt)) {
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
if (vm != p->p_vmspace) {
|
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
vmspace_free(vm);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
|
|
|
return (vm);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
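vmspace_acquire_ref() only takes a reference if the count is already non-zero, then re-reads p->p_vmspace and backs out if the process switched address spaces between the load and the increment. A hedged sketch of the conditional-acquire primitive, with the name borrowed from the kernel API but the CAS loop written as plain C11 (an illustration, not the real implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of refcount_acquire_if_not_zero(): take a reference only if
 * the object still holds one, so a concurrently-freed vmspace is
 * never revived. */
static bool refcount_acquire_if_not_zero(atomic_uint *count)
{
	unsigned old = atomic_load(count);

	while (old != 0) {
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return (true);
		/* Failed CAS reloads 'old'; retry until zero or success. */
	}
	return (false);
}
```

The second guard in the function above is just as important: after a successful acquire, the pointer is compared against p->p_vmspace again, and the fresh reference is dropped if it no longer matches.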
|
|
|
|
|
2016-01-19 21:37:51 +00:00
|
|
|
/*
|
|
|
|
* Switch between vmspaces in an AIO kernel process.
|
|
|
|
*
|
2019-06-10 19:01:54 +00:00
|
|
|
* The new vmspace is either the vmspace of a user process obtained
|
|
|
|
* from an active AIO request or the initial vmspace of the AIO kernel
|
|
|
|
* process (when it is idling). Because user processes will block to
|
|
|
|
* drain any active AIO requests before proceeding in exit() or
|
|
|
|
* execve(), the reference count for vmspaces from AIO requests can
|
|
|
|
* never be 0. Similarly, AIO kernel processes hold an extra
|
|
|
|
* reference on their initial vmspace for the life of the process. As
|
|
|
|
* a result, the 'newvm' vmspace always has a non-zero reference
|
|
|
|
* count. This permits an additional reference on 'newvm' to be
|
|
|
|
* acquired via a simple atomic increment rather than the loop in
|
|
|
|
* vmspace_acquire_ref() above.
|
2016-01-19 21:37:51 +00:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
vmspace_switch_aio(struct vmspace *newvm)
|
|
|
|
{
|
|
|
|
struct vmspace *oldvm;
|
|
|
|
|
|
|
|
/* XXX: Need some way to assert that this is an aio daemon. */
|
|
|
|
|
2020-11-04 16:30:56 +00:00
|
|
|
KASSERT(refcount_load(&newvm->vm_refcnt) > 0,
|
2016-01-19 21:37:51 +00:00
|
|
|
("vmspace_switch_aio: newvm unreferenced"));
|
|
|
|
|
|
|
|
oldvm = curproc->p_vmspace;
|
|
|
|
if (oldvm == newvm)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Point to the new address space and refer to it.
|
|
|
|
*/
|
|
|
|
curproc->p_vmspace = newvm;
|
2020-11-04 16:30:56 +00:00
|
|
|
refcount_acquire(&newvm->vm_refcnt);
|
2016-01-19 21:37:51 +00:00
|
|
|
|
|
|
|
/* Activate the new mapping. */
|
|
|
|
pmap_activate(curthread);
|
|
|
|
|
|
|
|
vmspace_free(oldvm);
|
|
|
|
}
|
|
|
|
|
2001-07-04 20:15:18 +00:00
|
|
|
void
|
2002-04-28 23:12:52 +00:00
|
|
|
_vm_map_lock(vm_map_t map, const char *file, int line)
|
2001-07-04 20:15:18 +00:00
|
|
|
{
|
2002-05-02 17:32:27 +00:00
|
|
|
|
2002-07-12 23:20:06 +00:00
|
|
|
if (map->system_map)
|
2011-11-20 16:33:09 +00:00
|
|
|
mtx_lock_flags_(&map->system_mtx, 0, file, line);
|
2004-07-30 09:10:28 +00:00
|
|
|
else
|
2011-11-21 12:59:52 +00:00
|
|
|
sx_xlock_(&map->lock, file, line);
|
2001-07-04 20:15:18 +00:00
|
|
|
map->timestamp++;
|
|
|
|
}
|
|
|
|
|
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
|
|
|
void
|
|
|
|
vm_map_entry_set_vnode_text(vm_map_entry_t entry, bool add)
|
|
|
|
{
|
2019-12-01 20:43:04 +00:00
|
|
|
vm_object_t object;
|
2019-05-05 11:20:43 +00:00
|
|
|
struct vnode *vp;
|
2019-12-01 20:43:04 +00:00
|
|
|
bool vp_held;
|
2019-05-05 11:20:43 +00:00
|
|
|
|
|
|
|
if ((entry->eflags & MAP_ENTRY_VN_EXEC) == 0)
|
|
|
|
return;
|
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0,
|
|
|
|
("Submap with execs"));
|
|
|
|
object = entry->object.vm_object;
|
|
|
|
KASSERT(object != NULL, ("No object for text, entry %p", entry));
|
2019-12-01 20:43:04 +00:00
|
|
|
if ((object->flags & OBJ_ANON) != 0)
|
|
|
|
object = object->handle;
|
|
|
|
else
|
|
|
|
KASSERT(object->backing_object == NULL,
|
|
|
|
("non-anon object %p shadows", object));
|
|
|
|
KASSERT(object != NULL, ("No content object for text, entry %p obj %p",
|
|
|
|
entry, entry->object.vm_object));
|
2019-05-05 11:20:43 +00:00
|
|
|
|
2019-12-01 20:43:04 +00:00
|
|
|
/*
|
|
|
|
* Mostly, we do not lock the backing object. It is
|
|
|
|
* referenced by the entry we are processing, so it cannot go
|
|
|
|
* away.
|
|
|
|
*/
|
2019-06-05 20:21:17 +00:00
|
|
|
vp = NULL;
|
2019-12-01 20:43:04 +00:00
|
|
|
vp_held = false;
|
2019-06-05 20:21:17 +00:00
|
|
|
if (object->type == OBJT_DEAD) {
|
|
|
|
/*
|
|
|
|
* For OBJT_DEAD objects, v_writecount was handled in
|
|
|
|
* vnode_pager_dealloc().
|
|
|
|
*/
|
|
|
|
} else if (object->type == OBJT_VNODE) {
|
|
|
|
vp = object->handle;
|
|
|
|
} else if (object->type == OBJT_SWAP) {
|
|
|
|
KASSERT((object->flags & OBJ_TMPFS_NODE) != 0,
|
|
|
|
("vm_map_entry_set_vnode_text: swap and !TMPFS "
|
|
|
|
"entry %p, object %p, add %d", entry, object, add));
|
|
|
|
/*
|
|
|
|
* Tmpfs VREG node, which was reclaimed, has
|
|
|
|
* OBJ_TMPFS_NODE flag set, but not OBJ_TMPFS. In
|
|
|
|
* this case there is no v_writecount to adjust.
|
|
|
|
*/
|
2019-12-01 20:43:04 +00:00
|
|
|
VM_OBJECT_RLOCK(object);
|
|
|
|
if ((object->flags & OBJ_TMPFS) != 0) {
|
2019-06-05 20:21:17 +00:00
|
|
|
vp = object->un_pager.swp.swp_tmpfs;
|
2019-12-01 20:43:04 +00:00
|
|
|
if (vp != NULL) {
|
|
|
|
vhold(vp);
|
|
|
|
vp_held = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2019-06-05 20:21:17 +00:00
|
|
|
} else {
|
|
|
|
KASSERT(0,
|
2019-05-05 11:20:43 +00:00
|
|
|
("vm_map_entry_set_vnode_text: wrong object type, "
|
|
|
|
"entry %p, object %p, add %d", entry, object, add));
|
2019-06-05 20:21:17 +00:00
|
|
|
}
|
|
|
|
if (vp != NULL) {
|
2019-08-18 20:24:52 +00:00
|
|
|
if (add) {
|
2019-05-05 11:20:43 +00:00
|
|
|
VOP_SET_TEXT_CHECKED(vp);
|
2019-08-18 20:24:52 +00:00
|
|
|
} else {
|
|
|
|
vn_lock(vp, LK_SHARED | LK_RETRY);
|
2019-05-05 11:20:43 +00:00
|
|
|
VOP_UNSET_TEXT_CHECKED(vp);
|
2020-01-03 22:29:58 +00:00
|
|
|
VOP_UNLOCK(vp);
|
2019-08-18 20:24:52 +00:00
|
|
|
}
|
2019-12-01 20:43:04 +00:00
|
|
|
if (vp_held)
|
|
|
|
vdrop(vp);
|
2019-05-05 11:20:43 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-11-13 15:56:07 +00:00
|
|
|
/*
|
|
|
|
* Use a different name for this vm_map_entry field when its use
|
|
|
|
* is not consistent with its use as part of an ordered search tree.
|
|
|
|
*/
|
|
|
|
#define defer_next right
|
|
|
|
|
2010-09-18 15:03:31 +00:00
|
|
|
static void
|
|
|
|
vm_map_process_deferred(void)
|
2001-07-04 20:15:18 +00:00
|
|
|
{
|
2010-09-18 15:03:31 +00:00
|
|
|
struct thread *td;
|
2012-06-20 18:00:26 +00:00
|
|
|
vm_map_entry_t entry, next;
|
2012-02-23 21:07:16 +00:00
|
|
|
vm_object_t object;
|
2009-02-24 20:57:43 +00:00
|
|
|
|
2010-09-18 15:03:31 +00:00
|
|
|
td = curthread;
|
2012-06-20 18:00:26 +00:00
|
|
|
entry = td->td_map_def_user;
|
|
|
|
td->td_map_def_user = NULL;
|
|
|
|
while (entry != NULL) {
|
2019-11-13 15:56:07 +00:00
|
|
|
next = entry->defer_next;
|
2019-09-03 20:31:48 +00:00
|
|
|
MPASS((entry->eflags & (MAP_ENTRY_WRITECNT |
|
|
|
|
MAP_ENTRY_VN_EXEC)) != (MAP_ENTRY_WRITECNT |
|
2019-05-05 11:20:43 +00:00
|
|
|
MAP_ENTRY_VN_EXEC));
|
2019-09-03 20:31:48 +00:00
|
|
|
if ((entry->eflags & MAP_ENTRY_WRITECNT) != 0) {
|
2012-02-23 21:07:16 +00:00
|
|
|
/*
|
|
|
|
* Decrement the object's writemappings and
|
|
|
|
* possibly the vnode's v_writecount.
|
|
|
|
*/
|
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0,
|
|
|
|
("Submap with writecount"));
|
|
|
|
object = entry->object.vm_object;
|
|
|
|
KASSERT(object != NULL, ("No object for writecount"));
|
2019-09-03 20:31:48 +00:00
|
|
|
vm_pager_release_writecount(object, entry->start,
|
2012-02-23 21:07:16 +00:00
|
|
|
entry->end);
|
|
|
|
}
|
2019-05-05 11:20:43 +00:00
|
|
|
vm_map_entry_set_vnode_text(entry, false);
|
2010-09-18 15:03:31 +00:00
|
|
|
vm_map_entry_deallocate(entry, FALSE);
|
2012-06-20 18:00:26 +00:00
|
|
|
entry = next;
|
2010-09-18 15:03:31 +00:00
|
|
|
}
|
|
|
|
}
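vm_map_process_deferred() drains a per-thread singly linked list by detaching the whole list up front and reading each entry's next link before deallocating the entry it lives in. A small stand-alone sketch of that detach-then-walk idiom, with a hypothetical node type in place of vm_map_entry and its defer_next link:

```c
#include <stddef.h>

/* Hypothetical entry standing in for vm_map_entry + defer_next. */
struct defer_entry {
	struct defer_entry *defer_next;
	int deallocated;
};

/* Detach the whole deferred list first, so anything queued while we
 * run lands on a fresh list; read each next pointer before touching
 * the entry, mirroring vm_map_process_deferred(). */
static void
process_deferred(struct defer_entry **headp)
{
	struct defer_entry *entry, *next;

	entry = *headp;
	*headp = NULL;
	while (entry != NULL) {
		next = entry->defer_next;	/* read before release */
		entry->deallocated = 1;		/* deallocate() stand-in */
		entry = next;
	}
}
```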
|
|
|
|
|
2019-11-09 17:08:27 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
static void
|
|
|
|
_vm_map_assert_locked(vm_map_t map, const char *file, int line)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (map->system_map)
|
|
|
|
mtx_assert_(&map->system_mtx, MA_OWNED, file, line);
|
|
|
|
else
|
|
|
|
sx_assert_(&map->lock, SA_XLOCKED, file, line);
|
|
|
|
}
|
|
|
|
|
|
|
|
#define VM_MAP_ASSERT_LOCKED(map) \
|
|
|
|
_vm_map_assert_locked(map, LOCK_FILE, LOCK_LINE)
|
|
|
|
|
|
|
|
enum { VMMAP_CHECK_NONE, VMMAP_CHECK_UNLOCK, VMMAP_CHECK_ALL };
|
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
static int enable_vmmap_check = VMMAP_CHECK_UNLOCK;
|
|
|
|
#else
|
|
|
|
static int enable_vmmap_check = VMMAP_CHECK_NONE;
|
|
|
|
#endif
|
|
|
|
SYSCTL_INT(_debug, OID_AUTO, vmmap_check, CTLFLAG_RWTUN,
|
|
|
|
&enable_vmmap_check, 0, "Enable vm map consistency checking");
|
|
|
|
|
|
|
|
static void _vm_map_assert_consistent(vm_map_t map, int check);
|
|
|
|
|
|
|
|
#define VM_MAP_ASSERT_CONSISTENT(map) \
|
|
|
|
_vm_map_assert_consistent(map, VMMAP_CHECK_ALL)
|
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
#define VM_MAP_UNLOCK_CONSISTENT(map) do { \
|
|
|
|
if (map->nupdates > map->nentries) { \
|
|
|
|
_vm_map_assert_consistent(map, VMMAP_CHECK_UNLOCK); \
|
|
|
|
map->nupdates = 0; \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
#else
|
|
|
|
#define VM_MAP_UNLOCK_CONSISTENT(map)
|
|
|
|
#endif
|
|
|
|
#else
|
|
|
|
#define VM_MAP_ASSERT_LOCKED(map)
|
|
|
|
#define VM_MAP_ASSERT_CONSISTENT(map)
|
|
|
|
#define VM_MAP_UNLOCK_CONSISTENT(map)
|
|
|
|
#endif /* INVARIANTS */
|
|
|
|
|
2010-09-18 15:03:31 +00:00
|
|
|
void
|
|
|
|
_vm_map_unlock(vm_map_t map, const char *file, int line)
|
|
|
|
{
|
2009-02-24 20:57:43 +00:00
|
|
|
|
2019-11-09 17:08:27 +00:00
|
|
|
VM_MAP_UNLOCK_CONSISTENT(map);
|
2020-11-11 17:16:39 +00:00
|
|
|
if (map->system_map) {
|
|
|
|
#ifndef UMA_MD_SMALL_ALLOC
|
|
|
|
if (map == kernel_map && (map->flags & MAP_REPLENISH) != 0) {
|
|
|
|
uma_prealloc(kmapentzone, 1);
|
|
|
|
map->flags &= ~MAP_REPLENISH;
|
|
|
|
}
|
|
|
|
#endif
|
2011-11-20 16:33:09 +00:00
|
|
|
mtx_unlock_flags_(&map->system_mtx, 0, file, line);
|
2020-11-11 17:16:39 +00:00
|
|
|
} else {
|
2011-11-21 12:59:52 +00:00
|
|
|
sx_xunlock_(&map->lock, file, line);
|
2010-09-18 15:03:31 +00:00
|
|
|
vm_map_process_deferred();
|
2009-02-24 20:57:43 +00:00
|
|
|
}
|
2001-07-04 20:15:18 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2002-04-28 23:12:52 +00:00
|
|
|
_vm_map_lock_read(vm_map_t map, const char *file, int line)
|
2001-07-04 20:15:18 +00:00
|
|
|
{
|
2002-05-02 17:32:27 +00:00
|
|
|
|
2002-07-12 23:20:06 +00:00
|
|
|
if (map->system_map)
|
2011-11-20 16:33:09 +00:00
|
|
|
mtx_lock_flags_(&map->system_mtx, 0, file, line);
|
2004-07-30 09:10:28 +00:00
|
|
|
else
|
2011-11-21 12:59:52 +00:00
|
|
|
sx_slock_(&map->lock, file, line);
|
2001-07-04 20:15:18 +00:00
|
|
|
}

void
_vm_map_unlock_read(vm_map_t map, const char *file, int line)
{

	if (map->system_map) {
		KASSERT((map->flags & MAP_REPLENISH) == 0,
		    ("%s: MAP_REPLENISH leaked", __func__));
		mtx_unlock_flags_(&map->system_mtx, 0, file, line);
	} else {
		sx_sunlock_(&map->lock, file, line);
		vm_map_process_deferred();
	}
}

int
_vm_map_trylock(vm_map_t map, const char *file, int line)
{
	int error;

	error = map->system_map ?
	    !mtx_trylock_flags_(&map->system_mtx, 0, file, line) :
	    !sx_try_xlock_(&map->lock, file, line);
	if (error == 0)
		map->timestamp++;
	return (error == 0);
}

int
_vm_map_trylock_read(vm_map_t map, const char *file, int line)
{
	int error;

	error = map->system_map ?
	    !mtx_trylock_flags_(&map->system_mtx, 0, file, line) :
	    !sx_try_slock_(&map->lock, file, line);
	return (error == 0);
}

/*
 *	_vm_map_lock_upgrade:	[ internal use only ]
 *
 *	Tries to upgrade a read (shared) lock on the specified map to a write
 *	(exclusive) lock.  Returns the value "0" if the upgrade succeeds and a
 *	non-zero value if the upgrade fails.  If the upgrade fails, the map is
 *	returned without a read or write lock held.
 *
 *	Requires that the map be read locked.
 */
int
_vm_map_lock_upgrade(vm_map_t map, const char *file, int line)
{
	unsigned int last_timestamp;

	if (map->system_map) {
		mtx_assert_(&map->system_mtx, MA_OWNED, file, line);
	} else {
		if (!sx_try_upgrade_(&map->lock, file, line)) {
			last_timestamp = map->timestamp;
			sx_sunlock_(&map->lock, file, line);
			vm_map_process_deferred();
			/*
			 * If the map's timestamp does not change while the
			 * map is unlocked, then the upgrade succeeds.
			 */
			sx_xlock_(&map->lock, file, line);
			if (last_timestamp != map->timestamp) {
				sx_xunlock_(&map->lock, file, line);
				return (1);
			}
		}
	}
	map->timestamp++;
	return (0);
}
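The restart-on-change idiom above can be modeled outside the kernel. The sketch below is a minimal single-threaded toy (hypothetical `toy_map` type and field names, not the kernel's): when a shared lock cannot be upgraded in place, record the map generation, drop the shared lock, take the exclusive lock, and fail the upgrade if the generation moved while the map was unlocked.

```c
#include <assert.h>

/*
 * Toy model of the upgrade idiom in _vm_map_lock_upgrade().  The lock
 * itself is simulated with plain counters; only the timestamp-validation
 * logic is faithful to the original.
 */
struct toy_map {
	int readers;		/* shared holders */
	int writer;		/* exclusive holder */
	unsigned int timestamp;	/* bumped by every successful writer */
};

/* Returns 0 if the upgrade succeeded, 1 if the caller must retry. */
static int
toy_lock_upgrade(struct toy_map *m)
{
	unsigned int last_timestamp;

	last_timestamp = m->timestamp;
	m->readers--;			/* drop the shared lock */
	m->writer = 1;			/* reacquire exclusively */
	if (last_timestamp != m->timestamp) {
		m->writer = 0;		/* map changed underneath us */
		return (1);
	}
	m->timestamp++;
	return (0);
}
```

In the real function a failed upgrade leaves the map entirely unlocked, which is why callers must be prepared to revalidate everything they looked up under the shared lock.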

void
_vm_map_lock_downgrade(vm_map_t map, const char *file, int line)
{

	if (map->system_map) {
		KASSERT((map->flags & MAP_REPLENISH) == 0,
		    ("%s: MAP_REPLENISH leaked", __func__));
		mtx_assert_(&map->system_mtx, MA_OWNED, file, line);
	} else {
		VM_MAP_UNLOCK_CONSISTENT(map);
		sx_downgrade_(&map->lock, file, line);
	}
}

/*
 *	vm_map_locked:
 *
 *	Returns a non-zero value if the caller holds a write (exclusive) lock
 *	on the specified map and the value "0" otherwise.
 */
int
vm_map_locked(vm_map_t map)
{

	if (map->system_map)
		return (mtx_owned(&map->system_mtx));
	else
		return (sx_xlocked(&map->lock));
}

/*
 *	_vm_map_unlock_and_wait:
 *
 *	Atomically releases the lock on the specified map and puts the calling
 *	thread to sleep.  The calling thread will remain asleep until either
 *	vm_map_wakeup() is performed on the map or the specified timeout is
 *	exceeded.
 *
 *	WARNING!  This function does not perform deferred deallocations of
 *	objects and map entries.  Therefore, the calling thread is expected to
 *	reacquire the map lock after reawakening and later perform an ordinary
 *	unlock operation, such as vm_map_unlock(), before completing its
 *	operation on the map.
 */
int
_vm_map_unlock_and_wait(vm_map_t map, int timo, const char *file, int line)
{

	VM_MAP_UNLOCK_CONSISTENT(map);
	mtx_lock(&map_sleep_mtx);
	if (map->system_map) {
		KASSERT((map->flags & MAP_REPLENISH) == 0,
		    ("%s: MAP_REPLENISH leaked", __func__));
		mtx_unlock_flags_(&map->system_mtx, 0, file, line);
	} else {
		sx_xunlock_(&map->lock, file, line);
	}
	return (msleep(&map->root, &map_sleep_mtx, PDROP | PVM, "vmmaps",
	    timo));
}

/*
 *	vm_map_wakeup:
 *
 *	Awaken any threads that have slept on the map using
 *	vm_map_unlock_and_wait().
 */
void
vm_map_wakeup(vm_map_t map)
{

	/*
	 * Acquire and release map_sleep_mtx to prevent a wakeup()
	 * from being performed (and lost) between the map unlock
	 * and the msleep() in _vm_map_unlock_and_wait().
	 */
	mtx_lock(&map_sleep_mtx);
	mtx_unlock(&map_sleep_mtx);
	wakeup(&map->root);
}

void
vm_map_busy(vm_map_t map)
{

	VM_MAP_ASSERT_LOCKED(map);
	map->busy++;
}

void
vm_map_unbusy(vm_map_t map)
{

	VM_MAP_ASSERT_LOCKED(map);
	KASSERT(map->busy, ("vm_map_unbusy: not busy"));
	if (--map->busy == 0 && (map->flags & MAP_BUSY_WAKEUP)) {
		vm_map_modflags(map, 0, MAP_BUSY_WAKEUP);
		wakeup(&map->busy);
	}
}

void
vm_map_wait_busy(vm_map_t map)
{

	VM_MAP_ASSERT_LOCKED(map);
	while (map->busy) {
		vm_map_modflags(map, MAP_BUSY_WAKEUP, 0);
		if (map->system_map)
			msleep(&map->busy, &map->system_mtx, 0, "mbusy", 0);
		else
			sx_sleep(&map->busy, &map->lock, 0, "mbusy", 0);
	}
	map->timestamp++;
}

long
vmspace_resident_count(struct vmspace *vmspace)
{
	return pmap_resident_count(vmspace_pmap(vmspace));
}

/*
 * Initialize an existing vm_map structure
 * such as that in the vmspace structure.
 */
static void
_vm_map_init(vm_map_t map, pmap_t pmap, vm_offset_t min, vm_offset_t max)
{

	map->header.eflags = MAP_ENTRY_HEADER;
	map->needs_wakeup = FALSE;
	map->system_map = 0;
	map->pmap = pmap;
	map->header.end = min;
	map->header.start = max;
	map->flags = 0;
	map->header.left = map->header.right = &map->header;
	map->root = NULL;
	map->timestamp = 0;
	map->busy = 0;
	map->anon_loc = 0;
#ifdef DIAGNOSTIC
	map->nupdates = 0;
#endif
}

void
vm_map_init(vm_map_t map, pmap_t pmap, vm_offset_t min, vm_offset_t max)
{

	_vm_map_init(map, pmap, min, max);
	mtx_init(&map->system_mtx, "vm map (system)", NULL,
	    MTX_DEF | MTX_DUPOK);
	sx_init(&map->lock, "vm map (user)");
}

/*
 *	vm_map_entry_dispose:	[ internal use only ]
 *
 *	Inverse of vm_map_entry_create.
 */
static void
vm_map_entry_dispose(vm_map_t map, vm_map_entry_t entry)
{
	uma_zfree(map->system_map ? kmapentzone : mapentzone, entry);
}

/*
 *	vm_map_entry_create:	[ internal use only ]
 *
 *	Allocates a VM map entry for insertion.
 *	No entry fields are filled in.
 */
static vm_map_entry_t
vm_map_entry_create(vm_map_t map)
{
	vm_map_entry_t new_entry;

#ifndef UMA_MD_SMALL_ALLOC
	if (map == kernel_map) {
		VM_MAP_ASSERT_LOCKED(map);

		/*
		 * A new slab of kernel map entries cannot be allocated at this
		 * point because the kernel map has not yet been updated to
		 * reflect the caller's request.  Therefore, we allocate a new
		 * map entry, dipping into the reserve if necessary, and set a
		 * flag indicating that the reserve must be replenished before
		 * the map is unlocked.
		 */
		new_entry = uma_zalloc(kmapentzone, M_NOWAIT | M_NOVM);
		if (new_entry == NULL) {
			new_entry = uma_zalloc(kmapentzone,
			    M_NOWAIT | M_NOVM | M_USE_RESERVE);
			kernel_map->flags |= MAP_REPLENISH;
		}
	} else
#endif
	if (map->system_map) {
		new_entry = uma_zalloc(kmapentzone, M_NOWAIT);
	} else {
		new_entry = uma_zalloc(mapentzone, M_WAITOK);
	}
	KASSERT(new_entry != NULL,
	    ("vm_map_entry_create: kernel resources exhausted"));
	return (new_entry);
}
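The try-then-reserve policy in the comment above can be sketched in isolation. This is an illustrative model with hypothetical names (`entry_alloc`, `RESERVE_SLOTS`), not kernel code: a first attempt that refuses to grow the backing store, then a fallback into a preallocated reserve that sets a flag so the reserve is refilled once it is safe to do so.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Toy model of the kernel-map entry allocation policy: while the map is
 * locked we may not sleep or allocate new slabs, so dig into a fixed
 * reserve when the ordinary path fails and remember to replenish later.
 */
#define	RESERVE_SLOTS	2

static int reserve_used;	/* slots consumed from the reserve */
static int need_replenish;	/* set when the reserve was dipped into */
static int normal_exhausted;	/* simulates the zone having no free items */

static void *
entry_alloc(void)
{
	void *p;

	/* Fast path: ordinary allocation, no new slabs, no sleeping. */
	p = normal_exhausted ? NULL : malloc(64);
	if (p == NULL && reserve_used < RESERVE_SLOTS) {
		/* Fall back to the preallocated emergency reserve. */
		reserve_used++;
		need_replenish = 1;
		p = malloc(64);
	}
	return (p);
}
```

In the real code the replenish step is the `uma_prealloc()` call in `_vm_map_unlock()`, which runs only after the kernel map lock has been dropped.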

/*
 *	vm_map_entry_set_behavior:
 *
 *	Set the expected access behavior, either normal, random, or
 *	sequential.
 */
static inline void
vm_map_entry_set_behavior(vm_map_entry_t entry, u_char behavior)
{
	entry->eflags = (entry->eflags & ~MAP_ENTRY_BEHAV_MASK) |
	    (behavior & MAP_ENTRY_BEHAV_MASK);
}

/*
 *	vm_map_entry_max_free_{left,right}:
 *
 *	Compute the size of the largest free gap between two entries,
 *	one the root of a tree and the other the ancestor of that root
 *	that is the least or greatest ancestor found on the search path.
 */
static inline vm_size_t
vm_map_entry_max_free_left(vm_map_entry_t root, vm_map_entry_t left_ancestor)
{

	return (root->left != left_ancestor ?
	    root->left->max_free : root->start - left_ancestor->end);
}

static inline vm_size_t
vm_map_entry_max_free_right(vm_map_entry_t root, vm_map_entry_t right_ancestor)
{

	return (root->right != right_ancestor ?
	    root->right->max_free : right_ancestor->start - root->end);
}
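The two helpers above encode one rule: a real child on that side already caches the subtree's largest gap in `max_free`, and a child pointer that instead threads back to an ancestor means the only gap on that side is the distance to that ancestor. A standalone toy version (simplified `toy_entry` type, not the kernel's) of the left-hand case:

```c
#include <stddef.h>

/*
 * Toy model of vm_map_entry_max_free_left(): entries cover [start, end),
 * max_free caches the largest gap in a subtree, and a left pointer equal
 * to the ancestor stands in for a threaded (absent) child.
 */
struct toy_entry {
	unsigned long start, end;	/* address range of the entry */
	unsigned long max_free;		/* largest gap in this subtree */
	struct toy_entry *left, *right;
};

static unsigned long
toy_max_free_left(struct toy_entry *root, struct toy_entry *left_ancestor)
{
	/* Real child: its cached maximum.  Thread: gap to the ancestor. */
	return (root->left != left_ancestor ?
	    root->left->max_free : root->start - left_ancestor->end);
}
```

This is what lets the splay code maintain `max_free` while touching only entries on the search path.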

/*
 *	vm_map_entry_{pred,succ}:
 *
 *	Find the {predecessor, successor} of the entry by taking one step
 *	in the appropriate direction and backtracking as much as necessary.
 *	vm_map_entry_succ is defined in vm_map.h.
 */
static inline vm_map_entry_t
vm_map_entry_pred(vm_map_entry_t entry)
{
	vm_map_entry_t prior;

	prior = entry->left;
	if (prior->right->start < entry->start) {
		do
			prior = prior->right;
		while (prior->right != entry);
	}
	return (prior);
}
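Predecessor lookup works because child pointers that would otherwise be NULL thread back to the in-order neighbor: step left once, then walk down the right spine until a right pointer threads back to the starting entry. A standalone toy version (simplified stand-in type, not the kernel's) demonstrating the walk:

```c
#include <stddef.h>

/*
 * Toy model of vm_map_entry_pred() on a threaded tree: a right pointer
 * that "passes" the starting entry's address is a real child to descend
 * through; one that threads back to the entry ends the walk.
 */
struct toy_node {
	unsigned long start;
	struct toy_node *left, *right;
};

static struct toy_node *
toy_entry_pred(struct toy_node *entry)
{
	struct toy_node *prior;

	prior = entry->left;
	if (prior->right->start < entry->start) {
		do
			prior = prior->right;
		while (prior->right != entry);
	}
	return (prior);
}
```

For example, with entries at starts 10, 20, 30 arranged so that 30's left child is 10 and 10's right child is 20 (whose right pointer threads back to 30), the predecessor of 30 is found by descending from 10 to 20.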

static inline vm_size_t
vm_size_max(vm_size_t a, vm_size_t b)
{

	return (a > b ? a : b);
}

/*
 * Eliminate adj_free field from vm_map_entry.
 *
 * Drop the adj_free field from vm_map_entry_t.  Refine the max_free field
 * so that p->max_free is the size of the largest gap with one endpoint in
 * the subtree rooted at p.  Change vm_map_findspace so that, first, the
 * address-based splay is restricted to tree nodes with large-enough
 * max_free values, to avoid searching for the right starting point in a
 * subtree where all the gaps are too small.  Second, when the address
 * search leads to a tree search for the first large-enough gap, that gap
 * is the subject of a splay-search that brings it to the top of the tree,
 * so that an immediate insertion will take constant time.
 *
 * Break up the splay code into separate components, one for searching and
 * breaking up the tree and another for reassembling it.  Use these
 * components, and not splay itself, for linking and unlinking.  Drop the
 * after-where parameter to link, as it is computed as a side effect of
 * the splay search.
 *
 * Submitted by:	Doug Moore <dougm@rice.edu>
 * Reviewed by:	markj
 * Tested by:	pho
 * MFC after:	2 weeks
 * Differential Revision:	https://reviews.freebsd.org/D17794
 *
 * The computations of vm_map_splay_split and vm_map_splay_merge touch both
 * children of every entry on the search path as part of updating the
 * max_free field.  By comparing the max_free values of an entry and its
 * child on the search path, the code can avoid accessing the child off the
 * path whenever the max_free value decreases along the path.  Specifically,
 * splay_split replaces the max_free field of every entry on the search
 * path, temporarily, with the max_free field of its child not on the
 * search path or, if that child is NULL, with a difference between start
 * and end values of two pointers already available in the split code,
 * without following any next or prev pointers.  Finding that max_free
 * value does not require looking toward the other child when either the
 * child on the search path has a lower max_free value or the current
 * max_free value is zero, because in either case the other child's
 * max_free is the value already in hand.  So vm_map_splay_split ensures
 * that all the off-search-path entries needed to complete the splay are
 * known without looking at all of them.  The one exception is at the
 * bottom of the search path, where the max_free value in the direction of
 * the NULL pointer that ends the search cannot be relied on, because of
 * the behavior of the entry-clipping code.  The corresponding change makes
 * vm_map_splay_merge simpler, since it just reverses pointers and updates
 * running maxima.  In a test intended to exercise the vm_map
 * implementation vigorously, this change reduced the data cache miss rate
 * by 10-14% and the running time by 5-7%.
 *
 * Tested by:	pho
 * Reviewed by:	alc
 * Approved by:	kib (mentor)
 * MFC after:	1 month
 * Differential Revision:	https://reviews.freebsd.org/D19826
 */
#define	SPLAY_LEFT_STEP(root, y, llist, rlist, test) do {		\
	vm_map_entry_t z;						\
	vm_size_t max_free;						\
									\
	/*								\
	 * Infer root->right->max_free == root->max_free when		\
	 * y->max_free < root->max_free || root->max_free == 0.		\
	 * Otherwise, look right to find it.				\
	 */								\
	y = root->left;							\
	max_free = root->max_free;					\
	KASSERT(max_free == vm_size_max(				\
	    vm_map_entry_max_free_left(root, llist),			\
	    vm_map_entry_max_free_right(root, rlist)),			\
	    ("%s: max_free invariant fails", __func__));		\
	if (max_free - 1 < vm_map_entry_max_free_left(root, llist))	\
		max_free = vm_map_entry_max_free_right(root, rlist);	\
	if (y != llist && (test)) {					\
		/* Rotate right and make y root. */			\
		z = y->right;						\
		if (z != root) {					\
			root->left = z;					\
			y->right = root;				\
			if (max_free < y->max_free)			\
				root->max_free = max_free =		\
				    vm_size_max(max_free, z->max_free);	\
		} else if (max_free < y->max_free)			\
			root->max_free = max_free =			\
			    vm_size_max(max_free, root->start - y->end);\
		root = y;						\
		y = root->left;						\
	}								\
	/* Copy right->max_free.  Put root on rlist. */			\
	root->max_free = max_free;					\
	KASSERT(max_free == vm_map_entry_max_free_right(root, rlist),	\
	    ("%s: max_free not copied from right", __func__));		\
	root->left = rlist;						\
	rlist = root;							\
	root = y != llist ? y : NULL;					\
} while (0)

#define	SPLAY_RIGHT_STEP(root, y, llist, rlist, test) do {		\
	vm_map_entry_t z;						\
	vm_size_t max_free;						\
									\
	/*								\
	 * Infer root->left->max_free == root->max_free when		\
	 * y->max_free < root->max_free || root->max_free == 0.		\
	 * Otherwise, look left to find it.				\
	 */								\
	y = root->right;						\
	max_free = root->max_free;					\
	KASSERT(max_free == vm_size_max(				\
	    vm_map_entry_max_free_left(root, llist),			\
	    vm_map_entry_max_free_right(root, rlist)),			\
	    ("%s: max_free invariant fails", __func__));		\
	if (max_free - 1 < vm_map_entry_max_free_right(root, rlist))	\
		max_free = vm_map_entry_max_free_left(root, llist);	\
	if (y != rlist && (test)) {					\
		/* Rotate left and make y root. */			\
		z = y->left;						\
		if (z != root) {					\
			root->right = z;				\
			y->left = root;					\
			if (max_free < y->max_free)			\
				root->max_free = max_free =		\
				    vm_size_max(max_free, z->max_free);	\
		} else if (max_free < y->max_free)			\
			root->max_free = max_free =			\
			    vm_size_max(max_free, y->start - root->end);\
		root = y;						\
		y = root->right;					\
	}								\
	/* Copy left->max_free.  Put root on llist. */			\
	root->max_free = max_free;					\
	KASSERT(max_free == vm_map_entry_max_free_left(root, llist),	\
	    ("%s: max_free not copied from left", __func__));		\
	root->right = llist;						\
	llist = root;							\
	root = y != rlist ? y : NULL;					\
} while (0)

/*
 * Walk down the tree until we find addr or a gap where addr would go, breaking
 * off left and right subtrees of nodes less than, or greater than addr.  Treat
 * subtrees with root->max_free < length as empty trees.  llist and rlist are
 * the two sides in reverse order (bottom-up), with llist linked by the right
 * pointer and rlist linked by the left pointer in the vm_map_entry, and both
 * lists terminated by &map->header.  This function, and the subsequent call to
 * vm_map_splay_merge_{left,right,pred,succ}, rely on the start and end address
 * values in &map->header.
 */
2019-11-27 21:00:44 +00:00
|
|
|
static __always_inline vm_map_entry_t
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, to find that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_entry_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.
The corresponding change to vm_splay_entry_merge makes it simpler, since it's
just reversing pointers and updating running maxima.
In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
|
|
|
vm_map_splay_split(vm_map_t map, vm_offset_t addr, vm_size_t length,
|
2019-11-27 21:00:44 +00:00
|
|
|
vm_map_entry_t *llist, vm_map_entry_t *rlist)
|
2002-05-24 01:33:24 +00:00
|
|
|
{
|
2019-12-07 17:14:33 +00:00
|
|
|
vm_map_entry_t left, right, root, y;
|
2002-05-24 01:33:24 +00:00
|
|
|
|
2019-12-07 17:14:33 +00:00
|
|
|
left = right = &map->header;
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, to find that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_entry_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.
The corresponding change to vm_splay_entry_merge makes it simpler, since it's
just reversing pointers and updating running maxima.
In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
|
|
|
root = map->root;
|
Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
|
|
|
while (root != NULL && root->max_free >= length) {
|
2019-12-07 17:14:33 +00:00
|
|
|
KASSERT(left->end <= root->start &&
|
|
|
|
root->end <= right->start,
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, finding that max_free
value does not require looking toward the other child if either the
child on the search path has a lower max_free value or the current
max_free value is zero, because in either case the value already at
hand is the max_free value of the other child. So, the changes to
vm_entry_splay_split ensure that we know all the off-search-path
entries needed to complete the splay without examining all of them.
There is an exception at the bottom of the search path, where we cannot
rely on the max_free value in the direction of the NULL pointer that
ends the search, because of the behavior of the entry-clipping code.
The corresponding change to vm_splay_entry_merge makes it simpler: it
just reverses pointers and updates running maxima.
In a test intended to vigorously exercise the vm_map implementation,
this change reduced the data cache miss rate by 10-14% and the running
time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
		    ("%s: root not within tree bounds", __func__));
		if (addr < root->start) {
			SPLAY_LEFT_STEP(root, y, left, right,
			    y->max_free >= length && addr < y->start);
		} else if (addr >= root->end) {
			SPLAY_RIGHT_STEP(root, y, left, right,
			    y->max_free >= length && addr >= y->end);
		} else
			break;
	}
	*llist = left;
	*rlist = right;
	return (root);
}

static __always_inline void
vm_map_splay_findnext(vm_map_entry_t root, vm_map_entry_t *rlist)
{
	vm_map_entry_t hi, right, y;

	right = *rlist;
	hi = root->right == right ? NULL : root->right;
	if (hi == NULL)
		return;
	do
		SPLAY_LEFT_STEP(hi, y, root, right, true);
	while (hi != NULL);
	*rlist = right;
}

static __always_inline void
vm_map_splay_findprev(vm_map_entry_t root, vm_map_entry_t *llist)
{
	vm_map_entry_t left, lo, y;

	left = *llist;
	lo = root->left == left ? NULL : root->left;
	if (lo == NULL)
		return;
	do
		SPLAY_RIGHT_STEP(lo, y, left, root, true);
	while (lo != NULL);
	*llist = left;
}

static inline void
vm_map_entry_swap(vm_map_entry_t *a, vm_map_entry_t *b)
{
	vm_map_entry_t tmp;

	tmp = *b;
	*b = *a;
	*a = tmp;
}

/*
 * Walk back up the two spines, flip the pointers and set max_free.  The
 * subtrees of the root go at the bottom of llist and rlist.
 */
static vm_size_t
vm_map_splay_merge_left_walk(vm_map_entry_t header, vm_map_entry_t root,
    vm_map_entry_t tail, vm_size_t max_free, vm_map_entry_t llist)
{
	do {
		/*
		 * The max_free values of the children of llist are in
		 * llist->max_free and max_free.  Update with the
		 * max value.
		 */
		llist->max_free = max_free =
		    vm_size_max(llist->max_free, max_free);
		vm_map_entry_swap(&llist->right, &tail);
		vm_map_entry_swap(&tail, &llist);
	} while (llist != header);
	root->left = tail;
	return (max_free);
}
|
2004-08-13 08:06:34 +00:00
|
|
|
|
2019-11-29 02:06:45 +00:00
|
|
|
/*
|
|
|
|
* When llist is known to be the predecessor of root.
|
|
|
|
*/
|
|
|
|
static inline vm_size_t
|
|
|
|
vm_map_splay_merge_pred(vm_map_entry_t header, vm_map_entry_t root,
|
|
|
|
vm_map_entry_t llist)
|
|
|
|
{
|
|
|
|
vm_size_t max_free;
|
|
|
|
|
|
|
|
max_free = root->start - llist->end;
|
|
|
|
if (llist != header) {
|
|
|
|
max_free = vm_map_splay_merge_left_walk(header, root,
|
2019-12-07 17:14:33 +00:00
|
|
|
root, max_free, llist);
|
2019-11-29 02:06:45 +00:00
|
|
|
} else {
|
2019-12-07 17:14:33 +00:00
|
|
|
root->left = header;
|
|
|
|
header->right = root;
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, to find that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_entry_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.
The corresponding change to vm_splay_entry_merge makes it simpler, since it's
just reversing pointers and updating running maxima.
In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
|
|
|
}
|
2019-11-29 02:06:45 +00:00
|
|
|
return (max_free);
|
|
|
|
}

/*
 * When llist may or may not be the predecessor of root.
 */
static inline vm_size_t
vm_map_splay_merge_left(vm_map_entry_t header, vm_map_entry_t root,
    vm_map_entry_t llist)
{
	vm_size_t max_free;

	max_free = vm_map_entry_max_free_left(root, llist);
	if (llist != header) {
		max_free = vm_map_splay_merge_left_walk(header, root,
		    root->left == llist ? root : root->left,
		    max_free, llist);
	}
	return (max_free);
}

static vm_size_t
vm_map_splay_merge_right_walk(vm_map_entry_t header, vm_map_entry_t root,
    vm_map_entry_t tail, vm_size_t max_free, vm_map_entry_t rlist)
{
	do {
		/*
		 * The max_free values of the children of rlist are in
		 * rlist->max_free and max_free.  Update with the
		 * max value.
		 */
		rlist->max_free = max_free =
		    vm_size_max(rlist->max_free, max_free);
		vm_map_entry_swap(&rlist->left, &tail);
		vm_map_entry_swap(&tail, &rlist);
	} while (rlist != header);
	root->right = tail;
	return (max_free);
}

/*
 * When rlist is known to be the successor of root.
 */
static inline vm_size_t
vm_map_splay_merge_succ(vm_map_entry_t header, vm_map_entry_t root,
    vm_map_entry_t rlist)
{
	vm_size_t max_free;

	max_free = rlist->start - root->end;
	if (rlist != header) {
		max_free = vm_map_splay_merge_right_walk(header, root,
		    root, max_free, rlist);
	} else {
		root->right = header;
		header->left = root;
	}
	return (max_free);
}

/*
 * When rlist may or may not be the successor of root.
 */
static inline vm_size_t
vm_map_splay_merge_right(vm_map_entry_t header, vm_map_entry_t root,
    vm_map_entry_t rlist)
{
	vm_size_t max_free;

	max_free = vm_map_entry_max_free_right(root, rlist);
	if (rlist != header) {
		max_free = vm_map_splay_merge_right_walk(header, root,
		    root->right == rlist ? root : root->right,
		    max_free, rlist);
	}
	return (max_free);
}

/*
 * vm_map_splay:
 *
 * The Sleator and Tarjan top-down splay algorithm with the
 * following variation.  Max_free must be computed bottom-up, so
 * on the downward pass, maintain the left and right spines in
 * reverse order.  Then, make a second pass up each side to fix
 * the pointers and compute max_free.  The time bound is O(log n)
 * amortized.
 *
 * The tree is threaded, which means that there are no null pointers.
 * When a node has no left child, its left pointer points to its
 * predecessor, which is the last ancestor on the search path from the
 * root where the search branched right.  Likewise, when a node has no
 * right child, its right pointer points to its successor.  The map
 * header node is the predecessor of the first map entry, and the
 * successor of the last.
 *
 * The new root is the vm_map_entry containing "addr", or else an
 * adjacent entry (lower if possible) if addr is not in the tree.
 *
 * The map must be locked, and leaves it so.
 *
 * Returns:	the new root.
 */
static vm_map_entry_t
vm_map_splay(vm_map_t map, vm_offset_t addr)
{
	vm_map_entry_t header, llist, rlist, root;
	vm_size_t max_free_left, max_free_right;

	header = &map->header;
	root = vm_map_splay_split(map, addr, 0, &llist, &rlist);
	if (root != NULL) {
		max_free_left = vm_map_splay_merge_left(header, root, llist);
		max_free_right = vm_map_splay_merge_right(header, root, rlist);
	} else if (llist != header) {
		/*
		 * Recover the greatest node in the left
		 * subtree and make it the root.
		 */
		root = llist;
		llist = root->right;
		max_free_left = vm_map_splay_merge_left(header, root, llist);
		max_free_right = vm_map_splay_merge_succ(header, root, rlist);
	} else if (rlist != header) {
		/*
		 * Recover the least node in the right
		 * subtree and make it the root.
		 */
		root = rlist;
		rlist = root->left;
		max_free_left = vm_map_splay_merge_pred(header, root, llist);
		max_free_right = vm_map_splay_merge_right(header, root, rlist);
	} else {
		/* There is no root. */
		return (NULL);
	}
	root->max_free = vm_size_max(max_free_left, max_free_right);
	map->root = root;
	VM_MAP_ASSERT_CONSISTENT(map);
	return (root);
}

/*
 *	vm_map_entry_{un,}link:
 *
 *	Insert/remove entries from maps.  On linking, if new entry clips
 *	existing entry, trim existing entry to avoid overlap, and manage
 *	offsets.  On unlinking, merge disappearing entry with neighbor, if
 *	called for, and manage offsets.  Callers should not modify fields in
 *	entries already mapped.
 */
static void
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, to find that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_entry_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.
The corresponding change to vm_splay_entry_merge makes it simpler, since it's
just reversing pointers and updating running maxima.
In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
|
|
|
vm_map_entry_link(vm_map_t map, vm_map_entry_t entry)
|
1999-03-21 23:37:00 +00:00
|
|
|
{
|
2019-11-29 02:06:45 +00:00
|
|
|
vm_map_entry_t header, llist, rlist, root;
|
2019-12-31 22:20:54 +00:00
|
|
|
vm_size_t max_free_left, max_free_right;
|
2001-05-23 22:38:00 +00:00
|
|
|
|
Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
|
|
|
CTR3(KTR_VM,
|
|
|
|
"vm_map_entry_link: map %p, nentries %d, entry %p", map,
|
|
|
|
map->nentries, entry);
|
2009-02-24 20:43:29 +00:00
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
1999-03-21 23:37:00 +00:00
|
|
|
map->nentries++;
|
2019-11-29 02:06:45 +00:00
|
|
|
header = &map->header;
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating the
max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.

Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is temporarily replaced by the
max_free field of its child off the search path or, if that child is
NULL, by the difference between start and end values of two pointers
already available in the split code, without following any next or prev
pointers. Finding that max_free value does not require looking toward
the off-path child if either the on-path child has a lower max_free
value or the current max_free value is zero, since in either case the
off-path child's max_free is the value already at hand. The changes to
vm_entry_splay_split thus ensure that all the off-search-path entries
needed to complete the splay are known without examining all of them.
The one exception is at the bottom of the search path, where the
max_free value in the direction of the NULL pointer that ends the
search cannot be relied upon, because of the behavior of the
entry-clipping code.

The corresponding change makes vm_splay_entry_merge simpler, since it is
now just reversing pointers and updating running maxima.

In a test intended to vigorously exercise the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14%
and the running time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
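The pruning argument above, that the off-path child need not be read when the on-path child's max_free is below the current entry's, or when the current max_free is zero, can be checked against a toy model. This sketch uses a deliberately simplified invariant (each entry charged a single gap, with max_free(p) the maximum of p's own gap and its children's max_free values); the `gnode` layout and function names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the max_free invariant: each entry is charged one gap,
 * and max_free caches the largest gap in its subtree.
 */
struct gnode {
	struct gnode *left, *right;
	size_t gap;		/* free space charged to this entry */
	size_t max_free;	/* largest gap in this subtree */
};

static size_t
subtree_max_free(struct gnode *n)
{
	return (n != NULL ? n->max_free : 0);
}

/* Recompute the cached invariant bottom-up. */
static size_t
fix_max_free(struct gnode *n)
{
	size_t m;

	if (n == NULL)
		return (0);
	m = n->gap;
	if (fix_max_free(n->left) > m)
		m = n->left->max_free;
	if (fix_max_free(n->right) > m)
		m = n->right->max_free;
	return (n->max_free = m);
}

/*
 * Value the merge phase needs when the splay descends from n to
 * `onpath`: max(n->gap, max_free(off-path child)).  When the on-path
 * child's max_free is below n->max_free, or n->max_free is zero, that
 * value equals n->max_free, so the off-path child is never touched.
 */
static size_t
merge_contribution(struct gnode *n, struct gnode *onpath,
    struct gnode *offpath)
{
	size_t m;

	if (subtree_max_free(onpath) < n->max_free || n->max_free == 0)
		return (n->max_free);	/* off-path child not accessed */
	m = subtree_max_free(offpath);	/* must look off the path */
	return (n->gap > m ? n->gap : m);
}
```

The shortcut branch is the model of the cache-miss saving the commit message reports: an entire subtree off the search path stays untouched whenever the maximum provably lies elsewhere.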
	root = vm_map_splay_split(map, entry->start, 0, &llist, &rlist);
	if (root == NULL) {
		/*
		 * The new entry does not overlap any existing entry in the
		 * map, so it becomes the new root of the map tree.
		 */
		max_free_left = vm_map_splay_merge_pred(header, entry, llist);
		max_free_right = vm_map_splay_merge_succ(header, entry, rlist);
	} else if (entry->start == root->start) {
		/*
		 * The new entry is a clone of root, with only the end field
		 * changed.  The root entry will be shrunk to abut the new
		 * entry, and will be the right child of the new root entry in
		 * the modified map.
		 */
		KASSERT(entry->end < root->end,
		    ("%s: clip_start not within entry", __func__));
		vm_map_splay_findprev(root, &llist);
		root->offset += entry->end - root->start;
		root->start = entry->end;
		max_free_left = vm_map_splay_merge_pred(header, entry, llist);
		max_free_right = root->max_free = vm_size_max(
		    vm_map_splay_merge_pred(entry, root, entry),
		    vm_map_splay_merge_right(header, root, rlist));
	} else {
		/*
		 * The new entry is a clone of root, with only the start field
		 * changed.  The root entry will be shrunk to abut the new
		 * entry, and will be the left child of the new root entry in
		 * the modified map.
		 */
		KASSERT(entry->end == root->end,
		    ("%s: clip_start not within entry", __func__));
		vm_map_splay_findnext(root, &rlist);
		entry->offset += entry->start - root->start;
		root->end = entry->start;
		max_free_left = root->max_free = vm_size_max(
		    vm_map_splay_merge_left(header, root, llist),
		    vm_map_splay_merge_succ(entry, root, entry));
		max_free_right = vm_map_splay_merge_succ(header, entry, rlist);
	}
	entry->max_free = vm_size_max(max_free_left, max_free_right);
	map->root = entry;
	VM_MAP_ASSERT_CONSISTENT(map);
}

enum unlink_merge_type {
	UNLINK_MERGE_NONE,
	UNLINK_MERGE_NEXT
};

static void
vm_map_entry_unlink(vm_map_t map, vm_map_entry_t entry,
    enum unlink_merge_type op)
{
	vm_map_entry_t header, llist, rlist, root;
	vm_size_t max_free_left, max_free_right;

	VM_MAP_ASSERT_LOCKED(map);
	header = &map->header;
	root = vm_map_splay_split(map, entry->start, 0, &llist, &rlist);
	KASSERT(root != NULL,
	    ("vm_map_entry_unlink: unlink object not mapped"));

	vm_map_splay_findprev(root, &llist);
	vm_map_splay_findnext(root, &rlist);
	if (op == UNLINK_MERGE_NEXT) {
		rlist->start = root->start;
		rlist->offset = root->offset;
	}
	if (llist != header) {
		root = llist;
		llist = root->right;
		max_free_left = vm_map_splay_merge_left(header, root, llist);
		max_free_right = vm_map_splay_merge_succ(header, root, rlist);
	} else if (rlist != header) {
		root = rlist;
		rlist = root->left;
		max_free_left = vm_map_splay_merge_pred(header, root, llist);
		max_free_right = vm_map_splay_merge_right(header, root, rlist);
	} else {
		header->left = header->right = header;
		root = NULL;
	}
	if (root != NULL)
		root->max_free = vm_size_max(max_free_left, max_free_right);
	map->root = root;
	VM_MAP_ASSERT_CONSISTENT(map);
	map->nentries--;
	CTR3(KTR_VM, "vm_map_entry_unlink: map %p, nentries %d, entry %p", map,
	    map->nentries, entry);
}

/*
 *	vm_map_entry_resize:
 *
 *	Resize a vm_map_entry, recompute the amount of free space that
 *	follows it and propagate that value up the tree.
 *
 *	The map must be locked, and leaves it so.
 */
static void
vm_map_entry_resize(vm_map_t map, vm_map_entry_t entry, vm_size_t grow_amount)
{
	vm_map_entry_t header, llist, rlist, root;

	VM_MAP_ASSERT_LOCKED(map);
	header = &map->header;
	root = vm_map_splay_split(map, entry->start, 0, &llist, &rlist);
	KASSERT(root != NULL, ("%s: resize object not mapped", __func__));
	vm_map_splay_findnext(root, &rlist);
	entry->end += grow_amount;
	root->max_free = vm_size_max(
	    vm_map_splay_merge_left(header, root, llist),
	    vm_map_splay_merge_succ(header, root, rlist));
	map->root = root;
	VM_MAP_ASSERT_CONSISTENT(map);
	CTR4(KTR_VM, "%s: map %p, nentries %d, entry %p",
	    __func__, map, map->nentries, entry);
}

/*
 *	vm_map_lookup_entry:	[ internal use only ]
 *
 *	Finds the map entry containing (or
 *	immediately preceding) the specified address
 *	in the given map; the entry is returned
 *	in the "entry" parameter.  The boolean
 *	result indicates whether the address is
 *	actually contained in the map.
 */
boolean_t
vm_map_lookup_entry(
	vm_map_t map,
	vm_offset_t address,
	vm_map_entry_t *entry)	/* OUT */
{
	vm_map_entry_t cur, header, lbound, ubound;
	boolean_t locked;

	/*
	 * If the map is empty, then the map entry immediately preceding
	 * "address" is the map's header.
	 */
	header = &map->header;
	cur = map->root;
	if (cur == NULL) {
		*entry = header;
		return (FALSE);
	}
	if (address >= cur->start && cur->end > address) {
		*entry = cur;
		return (TRUE);
	}
	if ((locked = vm_map_locked(map)) ||
	    sx_try_upgrade(&map->lock)) {
		/*
		 * Splay requires a write lock on the map.  However, it only
		 * restructures the binary search tree; it does not otherwise
		 * change the map.  Thus, the map's timestamp need not change
		 * on a temporary upgrade.
		 */
		cur = vm_map_splay(map, address);
		if (!locked) {
			VM_MAP_UNLOCK_CONSISTENT(map);
			sx_downgrade(&map->lock);
		}

		/*
		 * If "address" is contained within a map entry, the new root
		 * is that map entry.  Otherwise, the new root is a map entry
		 * immediately before or after "address".
		 */
		if (address < cur->start) {
			*entry = header;
			return (FALSE);
		}
		*entry = cur;
		return (address < cur->end);
	}
	/*
	 * Since the map is only locked for read access, perform a
	 * standard binary search tree lookup for "address".
	 */
	lbound = ubound = header;
	for (;;) {
		if (address < cur->start) {
			ubound = cur;
			cur = cur->left;
			if (cur == lbound)
				break;
		} else if (cur->end <= address) {
			lbound = cur;
			cur = cur->right;
			if (cur == ubound)
				break;
		} else {
			*entry = cur;
			return (TRUE);
		}
	}
	*entry = lbound;
	return (FALSE);
}
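The containing-or-preceding contract of this lookup can be sketched with a plain, unthreaded BST of half-open [start, end) entries; the `entry` layout and `lookup_entry` name below are illustrative only. (The kernel's tree instead threads fringe child pointers back to the bounding entries, which is why the read-locked loop above terminates on `cur == lbound` / `cur == ubound` rather than on NULL.)

```c
#include <assert.h>
#include <stddef.h>

struct entry {
	struct entry *left, *right;
	unsigned long start, end;	/* half-open range [start, end) */
};

/*
 * Returns 1 and sets *out to the entry containing `addr`, or returns 0
 * and sets *out to the nearest entry preceding `addr` (NULL if none).
 */
static int
lookup_entry(struct entry *root, unsigned long addr, struct entry **out)
{
	struct entry *cur, *lbound;

	lbound = NULL;
	for (cur = root; cur != NULL;) {
		if (addr < cur->start)
			cur = cur->left;
		else if (cur->end <= addr) {
			lbound = cur;	/* best "preceding" entry so far */
			cur = cur->right;
		} else {
			*out = cur;
			return (1);
		}
	}
	*out = lbound;
	return (0);
}
```

Every right turn in the descent records a new best predecessor, so when the search falls off the tree the last one recorded is the entry immediately preceding the address.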

/*
 *	vm_map_insert:
 *
 *	Inserts the given whole VM object into the target
 *	map at the specified address range.  The object's
 *	size should match that of the address range.
 *
 *	Requires that the map be locked, and leaves it so.
 *
 *	If object is non-NULL, ref count must be bumped by caller
 *	prior to making call to account for the new entry.
 */
int
vm_map_insert(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
    vm_offset_t start, vm_offset_t end, vm_prot_t prot, vm_prot_t max, int cow)
{
	vm_map_entry_t new_entry, next_entry, prev_entry;
	struct ucred *cred;
	vm_eflags_t protoeflags;
	vm_inherit_t inheritance;
	u_long bdry;
	u_int bidx;

	VM_MAP_ASSERT_LOCKED(map);
	KASSERT(object != kernel_object ||
	    (cow & MAP_COPY_ON_WRITE) == 0,
	    ("vm_map_insert: kernel object and COW"));
	KASSERT(object == NULL || (cow & MAP_NOFAULT) == 0 ||
	    (cow & MAP_SPLIT_BOUNDARY_MASK) != 0,
	    ("vm_map_insert: paradoxical MAP_NOFAULT request, obj %p cow %#x",
	    object, cow));
	KASSERT((prot & ~max) == 0,
	    ("prot %#x is not subset of max_prot %#x", prot, max));

	/*
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Check that the start and end points are not bogus.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2020-06-19 04:18:20 +00:00
|
|
|
if (start == end || !vm_map_range_valid(map, start, end))
|
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_INVALID_ADDRESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2021-01-08 22:40:04 +00:00
|
|
|
if ((map->flags & MAP_WXORX) != 0 && (prot & (VM_PROT_WRITE |
|
|
|
|
VM_PROT_EXECUTE)) == (VM_PROT_WRITE | VM_PROT_EXECUTE))
|
|
|
|
return (KERN_PROTECTION_FAILURE);
|
|
|
|
|
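The MAP_WXORX test above is a pure bitmask check: reject only when the map enforces W^X and the caller asks for write and execute together. A minimal standalone sketch, with invented stand-ins for the kernel's flag values:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for the kernel's flag values. */
#define VM_PROT_WRITE	0x02
#define VM_PROT_EXECUTE	0x04
#define MAP_WXORX	0x01

/*
 * Reject a mapping only when the map enforces W^X and the request
 * asks for both write and execute permission at once.
 */
static bool
wxorx_allows(int map_flags, int prot)
{
	if ((map_flags & MAP_WXORX) != 0 &&
	    (prot & (VM_PROT_WRITE | VM_PROT_EXECUTE)) ==
	    (VM_PROT_WRITE | VM_PROT_EXECUTE))
		return (false);
	return (true);
}
```

Note that write-only or execute-only requests pass even on a W^X map; only the combination is refused.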
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Find the entry prior to the proposed starting address; if it's part
|
|
|
|
* of an existing entry, this range is bogus.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-07-04 18:28:49 +00:00
|
|
|
if (vm_map_lookup_entry(map, start, &prev_entry))
|
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_NO_SPACE);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Assert that the next entry doesn't overlap the end point.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-11-20 16:06:48 +00:00
|
|
|
next_entry = vm_map_entry_succ(prev_entry);
|
|
|
|
if (next_entry->start < end)
|
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_NO_SPACE);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitly track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As a result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
if ((cow & MAP_CREATE_GUARD) != 0 && (object != NULL ||
|
|
|
|
max != VM_PROT_NONE))
|
|
|
|
return (KERN_INVALID_ARGUMENT);
|
|
|
|
|
1997-01-16 04:16:22 +00:00
|
|
|
protoeflags = 0;
|
|
|
|
if (cow & MAP_COPY_ON_WRITE)
|
2014-06-16 16:37:41 +00:00
|
|
|
protoeflags |= MAP_ENTRY_COW | MAP_ENTRY_NEEDS_COPY;
|
|
|
|
if (cow & MAP_NOFAULT)
|
1997-01-16 04:16:22 +00:00
|
|
|
protoeflags |= MAP_ENTRY_NOFAULT;
|
1999-12-12 03:19:33 +00:00
|
|
|
if (cow & MAP_DISABLE_SYNCER)
|
|
|
|
protoeflags |= MAP_ENTRY_NOSYNC;
|
2000-02-28 04:10:35 +00:00
|
|
|
if (cow & MAP_DISABLE_COREDUMP)
|
|
|
|
protoeflags |= MAP_ENTRY_NOCOREDUMP;
|
2014-06-19 16:26:16 +00:00
|
|
|
if (cow & MAP_STACK_GROWS_DOWN)
|
|
|
|
protoeflags |= MAP_ENTRY_GROWS_DOWN;
|
|
|
|
if (cow & MAP_STACK_GROWS_UP)
|
|
|
|
protoeflags |= MAP_ENTRY_GROWS_UP;
|
2019-09-03 20:31:48 +00:00
|
|
|
if (cow & MAP_WRITECOUNT)
|
|
|
|
protoeflags |= MAP_ENTRY_WRITECNT;
|
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
|
|
|
if (cow & MAP_VN_EXEC)
|
|
|
|
protoeflags |= MAP_ENTRY_VN_EXEC;
|
2017-06-24 17:01:11 +00:00
|
|
|
if ((cow & MAP_CREATE_GUARD) != 0)
|
|
|
|
protoeflags |= MAP_ENTRY_GUARD;
|
|
|
|
if ((cow & MAP_CREATE_STACK_GAP_DN) != 0)
|
|
|
|
protoeflags |= MAP_ENTRY_STACK_GAP_DN;
|
|
|
|
if ((cow & MAP_CREATE_STACK_GAP_UP) != 0)
|
|
|
|
protoeflags |= MAP_ENTRY_STACK_GAP_UP;
|
2012-02-11 17:29:07 +00:00
|
|
|
if (cow & MAP_INHERIT_SHARE)
|
|
|
|
inheritance = VM_INHERIT_SHARE;
|
|
|
|
else
|
|
|
|
inheritance = VM_INHERIT_DEFAULT;
|
2020-09-09 22:02:30 +00:00
|
|
|
if ((cow & MAP_SPLIT_BOUNDARY_MASK) != 0) {
|
|
|
|
/* This magically ignores index 0, for usual page size. */
|
|
|
|
bidx = (cow & MAP_SPLIT_BOUNDARY_MASK) >>
|
|
|
|
MAP_SPLIT_BOUNDARY_SHIFT;
|
|
|
|
if (bidx >= MAXPAGESIZES)
|
|
|
|
return (KERN_INVALID_ARGUMENT);
|
|
|
|
bdry = pagesizes[bidx] - 1;
|
|
|
|
if ((start & bdry) != 0 || (end & bdry) != 0)
|
|
|
|
return (KERN_INVALID_ARGUMENT);
|
|
|
|
protoeflags |= bidx << MAP_ENTRY_SPLIT_BOUNDARY_SHIFT;
|
|
|
|
}
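The boundary-index decoding above reduces to extracting a small field from `cow` and checking that both endpoints are aligned to the selected page size. A self-contained sketch, using invented stand-in values for the shift, mask, and page-size table:

```c
#include <assert.h>

/* Invented stand-ins for the kernel's constants. */
#define MAP_SPLIT_BOUNDARY_SHIFT	19
#define MAP_SPLIT_BOUNDARY_MASK		(0x3 << MAP_SPLIT_BOUNDARY_SHIFT)
#define MAXPAGESIZES			3

/* Illustrative 4K / 2M / 1G page sizes. */
static const unsigned long pagesizes[MAXPAGESIZES] = {
	4096, 2097152, 1073741824
};

/*
 * Return 0 on success, -1 if the requested boundary index is out of
 * range or start/end are not aligned to that page size.  When no
 * boundary is requested, any range passes.
 */
static int
check_split_boundary(int cow, unsigned long start, unsigned long end)
{
	unsigned int bidx;
	unsigned long bdry;

	if ((cow & MAP_SPLIT_BOUNDARY_MASK) == 0)
		return (0);
	bidx = (cow & MAP_SPLIT_BOUNDARY_MASK) >> MAP_SPLIT_BOUNDARY_SHIFT;
	if (bidx >= MAXPAGESIZES)
		return (-1);
	bdry = pagesizes[bidx] - 1;	/* alignment mask for this size */
	if ((start & bdry) != 0 || (end & bdry) != 0)
		return (-1);
	return (0);
}
```

Since page sizes are powers of two, `pagesizes[bidx] - 1` is a mask of the low bits, so a single AND per endpoint performs the alignment test.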
|
1999-12-12 03:19:33 +00:00
|
|
|
|
2010-12-02 17:37:16 +00:00
|
|
|
cred = NULL;
|
2017-06-24 17:01:11 +00:00
|
|
|
if ((cow & (MAP_ACC_NO_CHARGE | MAP_NOFAULT | MAP_CREATE_GUARD)) != 0)
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when allocation is performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
|
|
|
goto charged;
|
|
|
|
if ((cow & MAP_ACC_CHARGED) || ((prot & VM_PROT_WRITE) &&
|
|
|
|
((protoeflags & MAP_ENTRY_NEEDS_COPY) || object == NULL))) {
|
|
|
|
if (!(cow & MAP_ACC_CHARGED) && !swap_reserve(end - start))
|
|
|
|
return (KERN_RESOURCE_SHORTAGE);
|
2017-01-01 18:49:46 +00:00
|
|
|
KASSERT(object == NULL ||
|
|
|
|
(protoeflags & MAP_ENTRY_NEEDS_COPY) != 0 ||
|
2010-12-02 17:37:16 +00:00
|
|
|
object->cred == NULL,
|
2017-01-01 18:49:46 +00:00
|
|
|
("overcommit: vm_map_insert o %p", object));
|
2010-12-02 17:37:16 +00:00
|
|
|
cred = curthread->td_ucred;
|
2009-06-23 20:45:22 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
charged:
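The accounting logic leading to the `charged:` label can be summarized as one predicate: guard, nofault, and explicitly uncharged mappings never carry a charge, while writable anonymous or copy-on-write memory (or a pre-charged range) does. A simplified sketch with invented stand-in flag values, mirroring the spirit of the checks rather than their exact kernel form:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in flag values; not the kernel's definitions. */
#define VM_PROT_READ		0x01
#define VM_PROT_WRITE		0x02
#define MAP_NOFAULT		0x0004
#define MAP_ACC_CHARGED		0x0008
#define MAP_ACC_NO_CHARGE	0x0010
#define MAP_CREATE_GUARD	0x0020
#define MAP_ENTRY_NEEDS_COPY	0x0002

/*
 * Decide whether a new mapping must carry a swap-accounting charge:
 * guard, nofault, and explicitly uncharged mappings never do;
 * otherwise a charge is needed when the caller pre-charged it, or
 * when the range is writable and either anonymous (no backing
 * object) or copy-on-write.
 */
static bool
needs_charge(int cow, int prot, int protoeflags, bool have_object)
{
	if ((cow & (MAP_ACC_NO_CHARGE | MAP_NOFAULT | MAP_CREATE_GUARD)) != 0)
		return (false);
	return ((cow & MAP_ACC_CHARGED) != 0 ||
	    ((prot & VM_PROT_WRITE) != 0 &&
	    ((protoeflags & MAP_ENTRY_NEEDS_COPY) != 0 || !have_object)));
}
```

In the real code, a needed charge that was not pre-reserved additionally calls swap_reserve() and fails with KERN_RESOURCE_SHORTAGE when the reservation cannot be made.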
|
2010-10-04 16:49:40 +00:00
|
|
|
/* Expand the kernel pmap, if necessary. */
|
|
|
|
if (map == kernel_map && end > kernel_vm_end)
|
|
|
|
pmap_growkernel(end);
|
2003-04-20 21:56:40 +00:00
|
|
|
if (object != NULL) {
|
1999-02-12 09:51:43 +00:00
|
|
|
/*
|
2003-04-20 21:56:40 +00:00
|
|
|
* OBJ_ONEMAPPING must be cleared unless this mapping
|
|
|
|
* is trivially proven to be the only mapping for any
|
|
|
|
* of the object's pages. (Object granularity
|
|
|
|
* reference counting is insufficient to recognize
|
2003-11-03 16:14:45 +00:00
|
|
|
* aliases with precision.)
|
1999-02-12 09:51:43 +00:00
|
|
|
*/
|
2019-11-19 23:19:43 +00:00
|
|
|
if ((object->flags & OBJ_ANON) != 0) {
|
|
|
|
VM_OBJECT_WLOCK(object);
|
|
|
|
if (object->ref_count > 1 || object->shadow_count != 0)
|
|
|
|
vm_object_clear_flag(object, OBJ_ONEMAPPING);
|
|
|
|
VM_OBJECT_WUNLOCK(object);
|
|
|
|
}
|
2018-11-02 16:26:44 +00:00
|
|
|
} else if ((prev_entry->eflags & ~MAP_ENTRY_USER_WIRED) ==
|
|
|
|
protoeflags &&
|
2019-05-05 11:20:43 +00:00
|
|
|
(cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP |
|
|
|
|
MAP_VN_EXEC)) == 0 &&
|
2018-07-28 04:06:33 +00:00
|
|
|
prev_entry->end == start && (prev_entry->cred == cred ||
|
2017-01-01 18:49:46 +00:00
|
|
|
(prev_entry->object.vm_object != NULL &&
|
|
|
|
prev_entry->object.vm_object->cred == cred)) &&
|
|
|
|
vm_object_coalesce(prev_entry->object.vm_object,
|
|
|
|
prev_entry->offset,
|
|
|
|
(vm_size_t)(prev_entry->end - prev_entry->start),
|
|
|
|
(vm_size_t)(end - prev_entry->end), cred != NULL &&
|
|
|
|
(protoeflags & MAP_ENTRY_NEEDS_COPY) == 0)) {
|
1999-05-18 05:38:48 +00:00
|
|
|
/*
|
|
|
|
* We were able to extend the object. Determine if we
|
2003-11-03 16:14:45 +00:00
|
|
|
* can extend the previous map entry to include the
|
1999-05-18 05:38:48 +00:00
|
|
|
* new range as well.
|
|
|
|
*/
|
2017-01-01 18:49:46 +00:00
|
|
|
if (prev_entry->inheritance == inheritance &&
|
|
|
|
prev_entry->protection == prot &&
|
2018-07-28 04:06:33 +00:00
|
|
|
prev_entry->max_protection == max &&
|
|
|
|
prev_entry->wired_count == 0) {
|
|
|
|
KASSERT((prev_entry->eflags & MAP_ENTRY_USER_WIRED) ==
|
|
|
|
0, ("prev_entry %p has incoherent wiring",
|
|
|
|
prev_entry));
|
2017-06-24 17:01:11 +00:00
|
|
|
if ((prev_entry->eflags & MAP_ENTRY_GUARD) == 0)
|
|
|
|
map->size += end - prev_entry->end;
|
2019-05-22 23:11:16 +00:00
|
|
|
vm_map_entry_resize(map, prev_entry,
|
2019-05-22 17:40:54 +00:00
|
|
|
end - prev_entry->end);
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, next_entry);
|
1999-05-18 05:38:48 +00:00
|
|
|
return (KERN_SUCCESS);
|
1996-05-18 03:38:05 +00:00
|
|
|
}
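The fast path above extends the previous map entry in place when the new range is adjacent and every attribute matches. The adjacency-and-attributes test can be sketched on its own, with a deliberately simplified, hypothetical entry type (the real check also involves credentials and vm_object_coalesce()):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical map entry: only the fields the check needs. */
struct toy_entry {
	unsigned long start, end;
	int eflags, inheritance, protection, max_protection;
	int wired_count;
};

/*
 * Can a new [start, end) range with the given attributes be absorbed
 * into prev by just bumping prev->end?  Mirrors the spirit of the
 * vm_map_insert() fast path, not its exact logic.
 */
static bool
can_extend(const struct toy_entry *prev, unsigned long start,
    int protoeflags, int inheritance, int prot, int max)
{
	return (prev->end == start &&
	    prev->eflags == protoeflags &&
	    prev->inheritance == inheritance &&
	    prev->protection == prot &&
	    prev->max_protection == max &&
	    prev->wired_count == 0);
}
```

When the object can be extended but an attribute differs, the code instead creates a new entry sharing the extended object, bumping its reference count.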
|
1999-05-18 05:38:48 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we can extend the object but cannot extend the
|
|
|
|
* map entry, we have to create a new map entry. We
|
|
|
|
* must bump the ref count on the extended object to
|
2001-02-04 06:19:28 +00:00
|
|
|
* account for it. object may be NULL.
|
1999-05-18 05:38:48 +00:00
|
|
|
*/
|
|
|
|
object = prev_entry->object.vm_object;
|
|
|
|
offset = prev_entry->offset +
|
2017-01-01 18:49:46 +00:00
|
|
|
(prev_entry->end - prev_entry->start);
|
1999-05-18 05:38:48 +00:00
|
|
|
vm_object_reference(object);
|
2010-12-02 17:37:16 +00:00
|
|
|
if (cred != NULL && object != NULL && object->cred != NULL &&
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when allocation is performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
|
|
|
!(prev_entry->eflags & MAP_ENTRY_NEEDS_COPY)) {
|
|
|
|
/* Object already accounts for this uid. */
|
2010-12-02 17:37:16 +00:00
|
|
|
cred = NULL;
|
2009-06-23 20:45:22 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
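The accounting commit described above charges a swap reservation when anonymous memory is mapped, releases it on unmap, and fails the operation when a limit would be exceeded. A minimal sketch of that bookkeeping, with a single hypothetical global pool and an assumed 1 MiB cap standing in for the kernel's per-uid counters:

```c
#include <stddef.h>

/*
 * Toy model of swap-reservation accounting: a global pool is charged
 * when anonymous memory is mapped and released when the region is
 * unmapped.  The names and the single global are hypothetical
 * simplifications of the kernel's per-uid bookkeeping.
 */
static unsigned long toy_swap_reserved;
static unsigned long toy_swap_limit = 1UL << 20;	/* assumed 1 MiB cap */

/* Try to reserve `bytes` of swap; return 1 on success, 0 on failure
 * (the analogue of the ENOMEM the commit message mentions). */
int
toy_swap_reserve(unsigned long bytes)
{
	if (bytes > toy_swap_limit - toy_swap_reserved)
		return (0);
	toy_swap_reserved += bytes;
	return (1);
}

/* Release a previous reservation, e.g. when the region is unmapped. */
void
toy_swap_release(unsigned long bytes)
{
	toy_swap_reserved -= bytes;
}
```

In the real kernel the charge additionally migrates between the map entry and the backing object, as the commit message explains; this sketch models only the reserve/release pair.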
|
2014-06-26 16:04:03 +00:00
|
|
|
if (cred != NULL)
|
|
|
|
crhold(cred);
|
1996-12-31 16:23:38 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Create a new entry
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
new_entry = vm_map_entry_create(map);
|
|
|
|
new_entry->start = start;
|
|
|
|
new_entry->end = end;
|
2010-12-02 17:37:16 +00:00
|
|
|
new_entry->cred = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-01-16 04:16:22 +00:00
|
|
|
new_entry->eflags = protoeflags;
|
1994-05-24 10:09:53 +00:00
|
|
|
new_entry->object.vm_object = object;
|
|
|
|
new_entry->offset = offset;
|
1999-01-06 23:05:42 +00:00
|
|
|
|
2012-02-11 17:29:07 +00:00
|
|
|
new_entry->inheritance = inheritance;
|
1999-03-02 05:43:18 +00:00
|
|
|
new_entry->protection = prot;
|
|
|
|
new_entry->max_protection = max;
|
|
|
|
new_entry->wired_count = 0;
|
2014-03-21 13:55:57 +00:00
|
|
|
new_entry->wiring_thread = NULL;
|
2012-05-10 15:16:42 +00:00
|
|
|
new_entry->read_ahead = VM_FAULT_READ_AHEAD_INIT;
|
2016-07-07 20:58:16 +00:00
|
|
|
new_entry->next_read = start;
|
1999-03-02 05:43:18 +00:00
|
|
|
|
2010-12-02 17:37:16 +00:00
|
|
|
KASSERT(cred == NULL || !ENTRY_CHARGED(new_entry),
|
2017-01-01 18:49:46 +00:00
|
|
|
("overcommit: vm_map_insert leaks vm_map %p", new_entry));
|
2010-12-02 17:37:16 +00:00
|
|
|
new_entry->cred = cred;
|
2009-06-23 20:45:22 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Insert the new entry into the list
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_link(map, new_entry);
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitly track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As a result, it is impossible for a random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
if ((new_entry->eflags & MAP_ENTRY_GUARD) == 0)
|
|
|
|
map->size += new_entry->end - new_entry->start;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2001-02-04 06:19:28 +00:00
|
|
|
/*
|
2014-06-25 03:30:03 +00:00
|
|
|
* Try to coalesce the new entry with both the previous and next
|
|
|
|
* entries in the list. Previously, we only attempted to coalesce
|
|
|
|
* with the previous entry when object is NULL. Here, we handle the
|
|
|
|
* other cases, which are less common.
|
2001-02-04 06:19:28 +00:00
|
|
|
*/
|
2019-08-25 07:06:51 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, new_entry);
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_try_merge_entries(map, new_entry, next_entry);
|
2001-02-04 06:19:28 +00:00
|
|
|
|
2017-01-01 18:49:46 +00:00
|
|
|
if ((cow & (MAP_PREFAULT | MAP_PREFAULT_PARTIAL)) != 0) {
|
|
|
|
vm_map_pmap_enter(map, start, prot, object, OFF_TO_IDX(offset),
|
|
|
|
end - start, cow & MAP_PREFAULT_PARTIAL);
|
1999-12-12 03:19:33 +00:00
|
|
|
}
|
1999-05-17 00:53:56 +00:00
|
|
|
|
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_SUCCESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2004-08-13 08:06:34 +00:00
|
|
|
* vm_map_findspace:
|
|
|
|
*
|
|
|
|
* Find the first fit (lowest VM address) for "length" free bytes
|
|
|
|
* beginning at address >= start in the given map.
|
|
|
|
*
|
2019-03-29 16:53:46 +00:00
|
|
|
* In a vm_map_entry, "max_free" is the maximum amount of
|
|
|
|
* contiguous free space between an entry in its subtree and a
|
|
|
|
* neighbor of that entry. This allows finding a free region in
|
|
|
|
* one path down the tree, so O(log n) amortized with splay
|
|
|
|
* trees.
|
2004-08-13 08:06:34 +00:00
|
|
|
*
|
|
|
|
* The map must be locked, and this function leaves it locked.
|
|
|
|
*
|
2019-03-29 16:53:46 +00:00
|
|
|
* Returns: starting address if sufficient space,
|
|
|
|
* vm_map_max(map)-length+1 if insufficient space.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
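The max_free invariant described in the comment above is what lets the first-fit search skip whole subtrees. A toy illustration on a plain binary search tree (all names here are hypothetical; the kernel's tree is a splay tree and its entries carry more state):

```c
#include <stddef.h>

/*
 * Each node describes an allocated [start, end) range; `gap` is the
 * free space between this entry and its address-order successor, and
 * `max_free` is the largest gap anywhere in the node's subtree.
 */
struct toy_entry {
	unsigned long start, end;	/* allocated range */
	unsigned long gap;		/* free space after this range */
	unsigned long max_free;		/* largest gap in this subtree */
	struct toy_entry *left, *right;
};

/*
 * Return the end of the first (lowest-address) allocated range that is
 * followed by a gap of at least `length`, or 0 if no such gap exists.
 * The max_free check prunes subtrees that cannot contain a fit, which
 * is the property that keeps the real search to one pass down the tree.
 */
unsigned long
toy_first_fit(struct toy_entry *p, unsigned long length)
{
	unsigned long addr;

	if (p == NULL || p->max_free < length)
		return (0);	/* prune: nothing big enough below here */
	addr = toy_first_fit(p->left, length);	/* prefer lowest address */
	if (addr != 0)
		return (addr);
	if (p->gap >= length)
		return (p->end);
	return (toy_first_fit(p->right, length));
}
```

For example, with entries [0,100) gap 50, [150,300) gap 100, and [400,500) gap 100 arranged as a BST, a request of 40 bytes fits at 100, while a request of 60 bytes skips the left subtree (max_free 50) and fits at 300.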
|
2019-03-29 16:53:46 +00:00
|
|
|
vm_offset_t
|
|
|
|
vm_map_findspace(vm_map_t map, vm_offset_t start, vm_size_t length)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2019-11-29 02:06:45 +00:00
|
|
|
vm_map_entry_t header, llist, rlist, root, y;
|
|
|
|
vm_size_t left_length, max_free_left, max_free_right;
|
2019-06-11 22:41:39 +00:00
|
|
|
vm_offset_t gap_end;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2020-11-11 17:16:39 +00:00
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
|
2005-01-18 19:50:09 +00:00
|
|
|
/*
|
|
|
|
* Request must fit within min/max VM address and must avoid
|
|
|
|
* address wrap.
|
|
|
|
*/
|
2018-08-29 12:24:19 +00:00
|
|
|
start = MAX(start, vm_map_min(map));
|
2019-06-11 22:41:39 +00:00
|
|
|
if (start >= vm_map_max(map) || length > vm_map_max(map) - start)
|
2019-03-29 16:53:46 +00:00
|
|
|
return (vm_map_max(map) - length + 1);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-08-13 08:06:34 +00:00
|
|
|
/* Empty tree means wide open address space. */
|
2019-03-29 16:53:46 +00:00
|
|
|
if (map->root == NULL)
|
|
|
|
return (start);
|
2004-08-13 08:06:34 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2019-06-11 22:41:39 +00:00
|
|
|
* After splay_split, if start is within an entry, push it to the start
|
|
|
|
* of the following gap. If rlist is at the end of the gap containing
|
|
|
|
* start, save the end of that gap in gap_end to see if the gap is big
|
|
|
|
* enough; otherwise set gap_end to start to skip gap-checking and move
|
|
|
|
* directly to a search of the right subtree.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-11-29 02:06:45 +00:00
|
|
|
header = &map->header;
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, finding that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_map_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.
The corresponding change to vm_map_splay_merge makes it simpler, since it's
just reversing pointers and updating running maxima.
In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
|
|
|
root = vm_map_splay_split(map, start, length, &llist, &rlist);
|
2019-06-11 22:41:39 +00:00
|
|
|
gap_end = rlist->start;
|
|
|
|
if (root != NULL) {
|
2019-03-29 16:53:46 +00:00
|
|
|
start = root->end;
|
2019-12-07 17:14:33 +00:00
|
|
|
if (root->right != rlist)
|
2019-06-11 22:41:39 +00:00
|
|
|
gap_end = start;
|
2019-11-29 02:06:45 +00:00
|
|
|
max_free_left = vm_map_splay_merge_left(header, root, llist);
|
|
|
|
max_free_right = vm_map_splay_merge_right(header, root, rlist);
|
|
|
|
} else if (rlist != header) {
|
2019-03-29 16:53:46 +00:00
|
|
|
root = rlist;
|
|
|
|
rlist = root->left;
|
2019-11-29 02:06:45 +00:00
|
|
|
max_free_left = vm_map_splay_merge_pred(header, root, llist);
|
|
|
|
max_free_right = vm_map_splay_merge_right(header, root, rlist);
|
2019-03-29 16:53:46 +00:00
|
|
|
} else {
|
|
|
|
root = llist;
|
|
|
|
llist = root->right;
|
2019-11-29 02:06:45 +00:00
|
|
|
max_free_left = vm_map_splay_merge_left(header, root, llist);
|
|
|
|
max_free_right = vm_map_splay_merge_succ(header, root, rlist);
|
2004-08-13 08:06:34 +00:00
|
|
|
}
|
2019-11-29 02:06:45 +00:00
|
|
|
root->max_free = vm_size_max(max_free_left, max_free_right);
|
|
|
|
map->root = root;
|
2019-03-29 16:53:46 +00:00
|
|
|
VM_MAP_ASSERT_CONSISTENT(map);
|
2019-06-11 22:41:39 +00:00
|
|
|
if (length <= gap_end - start)
|
2019-03-29 16:53:46 +00:00
|
|
|
return (start);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-08-13 08:06:34 +00:00
|
|
|
/* With max_free, can immediately tell if no solution. */
|
2019-12-07 17:14:33 +00:00
|
|
|
if (root->right == header || length > root->right->max_free)
|
2019-03-29 16:53:46 +00:00
|
|
|
return (vm_map_max(map) - length + 1);
|
2004-08-13 08:06:34 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2019-03-29 16:53:46 +00:00
|
|
|
* Splay for the least large-enough gap in the right subtree.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-11-29 02:06:45 +00:00
|
|
|
llist = rlist = header;
|
The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.
Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, finding that max_free
value does not require looking toward the other child if either the
child on the search path has a lower max_free value or the current
max_free value is zero, because in either case the other child's
max_free must equal the value we already have. So, the changes to
vm_entry_splay_split ensure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them.
There is an exception at the bottom of the search path, where we cannot
rely on the max_free value in the direction of the NULL pointer that
ends the search, because of the behavior of the entry-clipping code.
The corresponding change to vm_splay_entry_merge simplifies that
function, since the merge is just reversing pointers and updating
running maxima.
In a test intended to vigorously exercise the vm_map implementation,
this change reduced the data cache miss rate by 10-14% and the running
time by 5-7%.
Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
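The shortcut in the preceding description condenses to a small predicate. The helper below is a hypothetical illustration of the decision, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * While descending toward one child during the splay split, decide
 * whether the off-path child's max_free contribution is already known.
 * If the on-path child's max_free is smaller than the node's, or the
 * node's max_free is zero, the off-path side must account for the
 * node's max_free, so that child need not be dereferenced.
 */
static bool
off_path_value_known(size_t node_max_free, size_t on_path_child_max_free)
{
	return (on_path_child_max_free < node_max_free ||
	    node_max_free == 0);
}
```

Only when the on-path child carries the same maximum as its parent does the split have to read the other child to learn its value.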
|
|
|
for (left_length = 0;;
|
|
|
|
left_length = vm_map_entry_max_free_left(root, llist)) {
|
2019-03-29 16:53:46 +00:00
|
|
|
if (length <= left_length)
|
2019-12-07 17:14:33 +00:00
|
|
|
SPLAY_LEFT_STEP(root, y, llist, rlist,
|
2019-06-10 21:34:07 +00:00
|
|
|
length <= vm_map_entry_max_free_left(y, llist));
|
2019-03-29 16:53:46 +00:00
|
|
|
else
|
2019-12-07 17:14:33 +00:00
|
|
|
SPLAY_RIGHT_STEP(root, y, llist, rlist,
|
2019-06-10 21:34:07 +00:00
|
|
|
length > vm_map_entry_max_free_left(y, root));
|
2019-03-29 16:53:46 +00:00
|
|
|
if (root == NULL)
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2019-03-29 16:53:46 +00:00
|
|
|
root = llist;
|
|
|
|
llist = root->right;
|
2019-11-29 02:06:45 +00:00
|
|
|
max_free_left = vm_map_splay_merge_left(header, root, llist);
|
|
|
|
if (rlist == header) {
|
|
|
|
root->max_free = vm_size_max(max_free_left,
|
|
|
|
vm_map_splay_merge_succ(header, root, rlist));
|
|
|
|
} else {
|
2019-06-10 21:34:07 +00:00
|
|
|
y = rlist;
|
2019-03-29 16:53:46 +00:00
|
|
|
rlist = y->left;
|
2019-11-29 02:06:45 +00:00
|
|
|
y->max_free = vm_size_max(
|
|
|
|
vm_map_splay_merge_pred(root, y, root),
|
|
|
|
vm_map_splay_merge_right(header, y, rlist));
|
|
|
|
root->max_free = vm_size_max(max_free_left, y->max_free);
|
2019-03-29 16:53:46 +00:00
|
|
|
}
|
2019-11-29 02:06:45 +00:00
|
|
|
map->root = root;
|
2019-03-29 16:53:46 +00:00
|
|
|
VM_MAP_ASSERT_CONSISTENT(map);
|
|
|
|
return (root->end);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2007-08-20 12:05:45 +00:00
|
|
|
int
|
|
|
|
vm_map_fixed(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
|
2008-04-28 05:30:23 +00:00
|
|
|
vm_offset_t start, vm_size_t length, vm_prot_t prot,
|
2007-08-20 12:05:45 +00:00
|
|
|
vm_prot_t max, int cow)
|
|
|
|
{
|
2008-04-28 05:30:23 +00:00
|
|
|
vm_offset_t end;
|
2007-08-20 12:05:45 +00:00
|
|
|
int result;
|
|
|
|
|
|
|
|
end = start + length;
|
2014-06-09 03:37:41 +00:00
|
|
|
KASSERT((cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) == 0 ||
|
|
|
|
object == NULL,
|
|
|
|
("vm_map_fixed: non-NULL backing object for stack"));
|
2009-02-08 20:39:17 +00:00
|
|
|
vm_map_lock(map);
|
2007-08-20 12:05:45 +00:00
|
|
|
VM_MAP_RANGE_CHECK(map, start, end);
|
2020-09-09 21:34:31 +00:00
|
|
|
if ((cow & MAP_CHECK_EXCL) == 0) {
|
|
|
|
result = vm_map_delete(map, start, end);
|
|
|
|
if (result != KERN_SUCCESS)
|
|
|
|
goto out;
|
|
|
|
}
|
2014-06-09 03:37:41 +00:00
|
|
|
if ((cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) != 0) {
|
|
|
|
result = vm_map_stack_locked(map, start, length, sgrowsiz,
|
|
|
|
prot, max, cow);
|
|
|
|
} else {
|
|
|
|
result = vm_map_insert(map, object, offset, start, end,
|
|
|
|
prot, max, cow);
|
|
|
|
}
|
2020-09-09 21:34:31 +00:00
|
|
|
out:
|
2007-08-20 12:05:45 +00:00
|
|
|
vm_map_unlock(map);
|
|
|
|
return (result);
|
|
|
|
}
|
|
|
|
|
Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (the default for now), or
to cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not alter the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on a per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
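The best-effort placement described above can be sketched in userspace. This is a hypothetical illustration: `rand()` stands in for the kernel's entropy source, the window size plays the role of `aslr_pages_rnd`, and on failure the real allocator retries and eventually falls back to plain first fit:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE	4096UL	/* illustrative; the kernel uses the MD value */

/*
 * Shift a first-fit base forward by a random number of pages drawn
 * from a fixed window, keeping the result page-aligned.  rand() is a
 * stand-in for the kernel entropy source.
 */
static uintptr_t
randomize_base(uintptr_t first_fit, unsigned long pages_rnd)
{
	return (first_fit + ((unsigned long)rand() % pages_rnd) * PAGE_SIZE);
}
```

Because only the page offset is randomized, coalescing and superpage alignment constraints can still be applied to the shifted base afterward.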
|
|
|
static const int aslr_pages_rnd_64[2] = {0x1000, 0x10};
|
|
|
|
static const int aslr_pages_rnd_32[2] = {0x100, 0x4};
|
|
|
|
|
|
|
|
static int cluster_anon = 1;
|
|
|
|
SYSCTL_INT(_vm, OID_AUTO, cluster_anon, CTLFLAG_RW,
|
|
|
|
&cluster_anon, 0,
|
2019-02-14 15:45:53 +00:00
|
|
|
"Cluster anonymous mappings: 0 = no, 1 = yes if no hint, 2 = always");
|
|
|
|
|
|
|
|
static bool
|
|
|
|
clustering_anon_allowed(vm_offset_t addr)
|
|
|
|
{
|
|
|
|
|
|
|
|
switch (cluster_anon) {
|
|
|
|
case 0:
|
|
|
|
return (false);
|
|
|
|
case 1:
|
|
|
|
return (addr == 0);
|
|
|
|
case 2:
|
|
|
|
default:
|
|
|
|
return (true);
|
|
|
|
}
|
|
|
|
}
|
2019-02-10 17:19:45 +00:00
|
|
|
|
|
|
|
static long aslr_restarts;
|
|
|
|
SYSCTL_LONG(_vm, OID_AUTO, aslr_restarts, CTLFLAG_RD,
|
|
|
|
&aslr_restarts, 0,
|
|
|
|
"Number of aslr failures");
|
|
|
|
|
2017-12-26 17:59:37 +00:00
|
|
|
/*
|
|
|
|
* Searches for the specified amount of free space in the given map with the
|
|
|
|
* specified alignment. Performs an address-ordered, first-fit search from
|
|
|
|
* the given address "*addr", with an optional upper bound "max_addr". If the
|
|
|
|
* parameter "alignment" is zero, then the alignment is computed from the
|
|
|
|
* given (object, offset) pair so as to enable the greatest possible use of
|
|
|
|
* superpage mappings. Returns KERN_SUCCESS and the address of the free space
|
|
|
|
* in "*addr" if successful. Otherwise, returns KERN_NO_SPACE.
|
|
|
|
*
|
|
|
|
* The map must be locked. Initially, there must be at least "length" bytes
|
|
|
|
* of free space at the given address.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
vm_map_alignspace(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
|
|
|
|
vm_offset_t *addr, vm_size_t length, vm_offset_t max_addr,
|
|
|
|
vm_offset_t alignment)
|
|
|
|
{
|
|
|
|
vm_offset_t aligned_addr, free_addr;
|
|
|
|
|
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
free_addr = *addr;
|
2019-03-29 16:53:46 +00:00
|
|
|
KASSERT(free_addr == vm_map_findspace(map, free_addr, length),
|
2019-06-11 22:41:39 +00:00
|
|
|
("caller failed to provide space %#jx at address %p",
|
|
|
|
(uintmax_t)length, (void *)free_addr));
|
2017-12-26 17:59:37 +00:00
|
|
|
for (;;) {
|
|
|
|
/*
|
|
|
|
* At the start of every iteration, the free space at address
|
|
|
|
* "*addr" is at least "length" bytes.
|
|
|
|
*/
|
|
|
|
if (alignment == 0)
|
|
|
|
pmap_align_superpage(object, offset, addr, length);
|
|
|
|
else if ((*addr & (alignment - 1)) != 0) {
|
|
|
|
*addr &= ~(alignment - 1);
|
|
|
|
*addr += alignment;
|
|
|
|
}
|
|
|
|
aligned_addr = *addr;
|
|
|
|
if (aligned_addr == free_addr) {
|
|
|
|
/*
|
|
|
|
* Alignment did not change "*addr", so "*addr" must
|
|
|
|
* still provide sufficient free space.
|
|
|
|
*/
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Test for address wrap on "*addr". A wrapped "*addr" could
|
|
|
|
* be a valid address, in which case vm_map_findspace() cannot
|
|
|
|
* be relied upon to fail.
|
|
|
|
*/
|
2019-03-29 16:53:46 +00:00
		if (aligned_addr < free_addr)
			return (KERN_NO_SPACE);
		*addr = vm_map_findspace(map, aligned_addr, length);
		if (*addr + length > vm_map_max(map) ||
		    (max_addr != 0 && *addr + length > max_addr))
			return (KERN_NO_SPACE);
		free_addr = *addr;
		if (free_addr == aligned_addr) {
			/*
			 * If a successful call to vm_map_findspace() did not
			 * change "*addr", then "*addr" must still be aligned
			 * and provide sufficient free space.
			 */
			return (KERN_SUCCESS);
		}
	}
}
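The align-then-test-for-wrap step performed by the loop above can be sketched in isolation. This is a hypothetical userspace toy (the `align_up` name is invented), assuming a power-of-two alignment: rounding up must never move the address down, so a result below the input signals overflow, just as `aligned_addr < free_addr` does in the kernel routine.

```c
#include <assert.h>
#include <stdint.h>

/* Round *addr up to the next multiple of a power-of-two "alignment".
 * Returns 1 on success, 0 if the rounding wrapped past the top of the
 * address space (the rounded address compares below the original). */
static int
align_up(uintptr_t *addr, uintptr_t alignment)
{
	uintptr_t old = *addr;

	if ((*addr & (alignment - 1)) != 0) {
		*addr &= ~(alignment - 1);	/* truncate down ... */
		*addr += alignment;		/* ... then step up */
	}
	/* Wrap check: rounding up must not move the address down. */
	return (*addr >= old);
}
```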

int
vm_map_find_aligned(vm_map_t map, vm_offset_t *addr, vm_size_t length,
    vm_offset_t max_addr, vm_offset_t alignment)
{
	/* XXXKIB ASLR eh ? */
	*addr = vm_map_findspace(map, *addr, length);
	if (*addr + length > vm_map_max(map) ||
	    (max_addr != 0 && *addr + length > max_addr))
		return (KERN_NO_SPACE);
	return (vm_map_alignspace(map, NULL, 0, addr, length, max_addr,
	    alignment));
}

/*
 *	vm_map_find finds an unallocated region in the target address
 *	map with the given length.  The search is defined to be
 *	first-fit from the specified address; the region found is
 *	returned in the same parameter.
 *
 *	If object is non-NULL, ref count must be bumped by caller
 *	prior to making call to account for the new entry.
 */
int
vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
    vm_offset_t *addr,			/* IN/OUT */
    vm_size_t length, vm_offset_t max_addr, int find_space,
    vm_prot_t prot, vm_prot_t max, int cow)
{
Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on a per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
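The page-granular entropy injection described in this commit message reduces to simple arithmetic: reserve `gap` extra pages beyond the requested length, then advance the chosen start by a random whole number of pages within that slack. A minimal userspace sketch, with the `randomize_start` helper invented for illustration and `rnd` standing in for the kernel's arc4random() so the example stays deterministic:

```c
#include <assert.h>
#include <stdint.h>

#define PGSZ	4096u	/* toy page size; the kernel uses pagesizes[pidx] */

/* Given the start of a free run large enough for the request plus
 * "gap_pages" extra pages, pick a page-aligned start at a random page
 * offset inside the slack.  The result always stays within the
 * reserved run, so the mapping still fits. */
static uintptr_t
randomize_start(uintptr_t free_start, uint32_t gap_pages, uint32_t rnd)
{
	return (free_start + (uintptr_t)(rnd % gap_pages) * PGSZ);
}
```

Because the offset is `rnd % gap_pages` pages, the start gains up to log2(gap_pages) bits of entropy while remaining page-aligned, mirroring the `*addr += (arc4random() % gap) * pagesizes[pidx]` step later in this function.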
	vm_offset_t alignment, curr_min_addr, min_addr;
	int gap, pidx, rv, try;
	bool cluster, en_aslr, update_anon;

	KASSERT((cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) == 0 ||
	    object == NULL,
	    ("vm_map_find: non-NULL backing object for stack"));
	MPASS((cow & MAP_REMAP) == 0 || (find_space == VMFS_NO_SPACE &&
	    (cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) == 0));
	if (find_space == VMFS_OPTIMAL_SPACE && (object == NULL ||
	    (object->flags & OBJ_COLORED) == 0))
		find_space = VMFS_ANY_SPACE;
	if (find_space >> 8 != 0) {
		KASSERT((find_space & 0xff) == 0, ("bad VMFS flags"));
		alignment = (vm_offset_t)1 << (find_space >> 8);
	} else
		alignment = 0;
	en_aslr = (map->flags & MAP_ASLR) != 0;
	update_anon = cluster = clustering_anon_allowed(*addr) &&
	    (map->flags & MAP_IS_SUB_MAP) == 0 && max_addr == 0 &&
	    find_space != VMFS_NO_SPACE && object == NULL &&
	    (cow & (MAP_INHERIT_SHARE | MAP_STACK_GROWS_UP |
	    MAP_STACK_GROWS_DOWN)) == 0 && prot != PROT_NONE;
	curr_min_addr = min_addr = *addr;
	if (en_aslr && min_addr == 0 && !cluster &&
	    find_space != VMFS_NO_SPACE &&
	    (map->flags & MAP_ASLR_IGNSTART) != 0)
		curr_min_addr = min_addr = vm_map_min(map);
	try = 0;
	vm_map_lock(map);
	if (cluster) {
		curr_min_addr = map->anon_loc;
		if (curr_min_addr == 0)
			cluster = false;
	}
	if (find_space != VMFS_NO_SPACE) {
		KASSERT(find_space == VMFS_ANY_SPACE ||
		    find_space == VMFS_OPTIMAL_SPACE ||
		    find_space == VMFS_SUPER_SPACE ||
		    alignment != 0, ("unexpected VMFS flag"));
again:
		/*
		 * When creating an anonymous mapping, try clustering
		 * with an existing anonymous mapping first.
		 *
		 * We make up to two attempts to find address space
		 * for a given find_space value.  The first attempt may
		 * apply randomization or may cluster with an existing
		 * anonymous mapping.  If this first attempt fails,
		 * perform a first-fit search of the available address
		 * space.
		 *
		 * If all tries failed, and find_space is
		 * VMFS_OPTIMAL_SPACE, fallback to VMFS_ANY_SPACE.
		 * Again enable clustering and randomization.
		 */
		try++;
		MPASS(try <= 2);

		if (try == 2) {
			/*
			 * Second try: we failed either to find a
			 * suitable region for randomizing the
			 * allocation, or to cluster with an existing
			 * mapping.  Retry with free run.
			 */
			curr_min_addr = (map->flags & MAP_ASLR_IGNSTART) != 0 ?
			    vm_map_min(map) : min_addr;
			atomic_add_long(&aslr_restarts, 1);
		}

		if (try == 1 && en_aslr && !cluster) {
			/*
			 * Find space for allocation, including
			 * gap needed for later randomization.
			 */
			pidx = MAXPAGESIZES > 1 && pagesizes[1] != 0 &&
			    (find_space == VMFS_SUPER_SPACE || find_space ==
			    VMFS_OPTIMAL_SPACE) ? 1 : 0;
			gap = vm_map_max(map) > MAP_32BIT_MAX_ADDR &&
			    (max_addr == 0 || max_addr > MAP_32BIT_MAX_ADDR) ?
			    aslr_pages_rnd_64[pidx] : aslr_pages_rnd_32[pidx];
			*addr = vm_map_findspace(map, curr_min_addr,
			    length + gap * pagesizes[pidx]);
			if (*addr + length + gap * pagesizes[pidx] >
			    vm_map_max(map))
				goto again;

			/* And randomize the start address. */
			*addr += (arc4random() % gap) * pagesizes[pidx];
			if (max_addr != 0 && *addr + length > max_addr)
				goto again;
		} else {
			*addr = vm_map_findspace(map, curr_min_addr, length);
			if (*addr + length > vm_map_max(map) ||
			    (max_addr != 0 && *addr + length > max_addr)) {
				if (cluster) {
					cluster = false;
					MPASS(try == 1);
					goto again;
				}
				rv = KERN_NO_SPACE;
				goto done;
			}
		}
2019-02-10 17:19:45 +00:00
|
|
|
|
2017-12-26 17:59:37 +00:00
|
|
|
if (find_space != VMFS_ANY_SPACE &&
|
|
|
|
(rv = vm_map_alignspace(map, object, offset, addr, length,
|
|
|
|
max_addr, alignment)) != KERN_SUCCESS) {
|
|
|
|
if (find_space == VMFS_OPTIMAL_SPACE) {
|
|
|
|
find_space = VMFS_ANY_SPACE;
|
2019-02-10 17:19:45 +00:00
|
|
|
curr_min_addr = min_addr;
|
|
|
|
cluster = update_anon;
|
|
|
|
try = 0;
|
2017-12-26 17:59:37 +00:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
goto done;
|
2014-06-09 03:37:41 +00:00
|
|
|
}
|
2019-01-16 05:15:57 +00:00
|
|
|
} else if ((cow & MAP_REMAP) != 0) {
|
2020-06-19 03:32:04 +00:00
|
|
|
if (!vm_map_range_valid(map, *addr, *addr + length)) {
|
2019-01-16 05:15:57 +00:00
|
|
|
rv = KERN_INVALID_ADDRESS;
|
|
|
|
goto done;
|
|
|
|
}
|
2020-09-09 21:34:31 +00:00
|
|
|
rv = vm_map_delete(map, *addr, *addr + length);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
goto done;
|
2017-12-26 17:59:37 +00:00
|
|
|
}
|
|
|
|
if ((cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) != 0) {
|
|
|
|
rv = vm_map_stack_locked(map, *addr, length, sgrowsiz, prot,
|
|
|
|
max, cow);
|
|
|
|
} else {
|
|
|
|
rv = vm_map_insert(map, object, offset, *addr, *addr + length,
|
|
|
|
prot, max, cow);
|
|
|
|
}
|
2019-02-10 17:19:45 +00:00
|
|
|
if (rv == KERN_SUCCESS && update_anon)
|
|
|
|
map->anon_loc = *addr + length;
|
2017-12-26 17:59:37 +00:00
|
|
|
done:
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_map_unlock(map);
|
2017-12-26 17:59:37 +00:00
|
|
|
return (rv);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2017-12-01 10:53:08 +00:00
|
|
|
/*
|
|
|
|
* vm_map_find_min() is a variant of vm_map_find() that takes an
|
|
|
|
* additional parameter (min_addr) and treats the given address
|
|
|
|
* (*addr) differently. Specifically, it treats *addr as a hint
|
|
|
|
* and not as the minimum address where the mapping is created.
|
|
|
|
*
|
|
|
|
* This function works in two phases. First, it tries to
|
|
|
|
* allocate above the hint. If that fails and the hint is
|
|
|
|
* greater than min_addr, it performs a second pass, replacing
|
|
|
|
* the hint with min_addr as the minimum address for the
|
|
|
|
* allocation.
|
|
|
|
*/
|
Treat the addr argument for mmap(2) request without MAP_FIXED flag as
a hint.
Right now, for non-fixed mmap(2) calls, addr is de facto interpreted
as the absolute minimal address of the range where the mapping is
created. The VA allocator only allocates in the range [addr,
VM_MAXUSER_ADDRESS]. This is too restrictive: the mmap(2) call might
unduly fail if there are no free addresses above addr but plenty of
usable space below it.
Lift this implementation limitation by allocating VA in two passes.
First, try to allocate above addr, as before. If that fails, do the
second pass with less restrictive constraints for the start of
allocation by specifying minimal allocation address at the max bss
end, if this limit is less than addr.
One important case where this change makes a difference is the
allocation of the stacks for new threads in libthr. Under some
configuration conditions, libthr tries to hint the kernel to reuse the
main thread stack grow area for the new stacks. This cannot work by
design now that the grow area has been converted to a stack, and there
is no unallocated VA above the main stack. Interpreting the requested
stack base address as a hint provides compatibility with old libthr
and with (mis-)configured current libthr.
Reviewed by: alc
Tested by: dim (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
2017-06-28 04:02:36 +00:00
|
|
|
int
|
|
|
|
vm_map_find_min(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
|
|
|
|
vm_offset_t *addr, vm_size_t length, vm_offset_t min_addr,
|
|
|
|
vm_offset_t max_addr, int find_space, vm_prot_t prot, vm_prot_t max,
|
|
|
|
int cow)
|
|
|
|
{
|
|
|
|
vm_offset_t hint;
|
|
|
|
int rv;
|
|
|
|
|
|
|
|
hint = *addr;
|
|
|
|
for (;;) {
|
|
|
|
rv = vm_map_find(map, object, offset, addr, length, max_addr,
|
|
|
|
find_space, prot, max, cow);
|
|
|
|
if (rv == KERN_SUCCESS || min_addr >= hint)
|
|
|
|
return (rv);
|
2017-07-09 15:41:49 +00:00
|
|
|
*addr = hint = min_addr;
|
2017-06-28 04:02:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-11-18 01:27:17 +00:00
|
|
|
/*
|
|
|
|
* A map entry with any of the following flags set must not be merged with
|
|
|
|
* another entry.
|
|
|
|
*/
|
|
|
|
#define MAP_ENTRY_NOMERGE_MASK (MAP_ENTRY_GROWS_DOWN | MAP_ENTRY_GROWS_UP | \
|
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and the deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
|
|
|
MAP_ENTRY_IN_TRANSITION | MAP_ENTRY_IS_SUB_MAP | MAP_ENTRY_VN_EXEC)
|
2018-11-18 01:27:17 +00:00
|
|
|
|
2018-10-20 23:08:04 +00:00
|
|
|
static bool
|
|
|
|
vm_map_mergeable_neighbors(vm_map_entry_t prev, vm_map_entry_t entry)
|
|
|
|
{
|
|
|
|
|
2018-11-18 01:27:17 +00:00
|
|
|
KASSERT((prev->eflags & MAP_ENTRY_NOMERGE_MASK) == 0 ||
|
|
|
|
(entry->eflags & MAP_ENTRY_NOMERGE_MASK) == 0,
|
|
|
|
("vm_map_mergeable_neighbors: neither %p nor %p are mergeable",
|
|
|
|
prev, entry));
|
2018-10-20 23:08:04 +00:00
|
|
|
return (prev->end == entry->start &&
|
|
|
|
prev->object.vm_object == entry->object.vm_object &&
|
|
|
|
(prev->object.vm_object == NULL ||
|
2018-11-18 01:27:17 +00:00
|
|
|
prev->offset + (prev->end - prev->start) == entry->offset) &&
|
2018-10-20 23:08:04 +00:00
|
|
|
prev->eflags == entry->eflags &&
|
|
|
|
prev->protection == entry->protection &&
|
|
|
|
prev->max_protection == entry->max_protection &&
|
|
|
|
prev->inheritance == entry->inheritance &&
|
|
|
|
prev->wired_count == entry->wired_count &&
|
|
|
|
prev->cred == entry->cred);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vm_map_merged_neighbor_dispose(vm_map_t map, vm_map_entry_t entry)
|
|
|
|
{
|
|
|
|
|
|
|
|
/*
|
2018-11-18 01:27:17 +00:00
|
|
|
* If the backing object is a vnode object, vm_object_deallocate()
|
|
|
|
* calls vrele(). However, vrele() does not lock the vnode because
|
|
|
|
* the vnode has additional references. Thus, the map lock can be
|
|
|
|
* kept without causing a lock-order reversal with the vnode lock.
|
2018-10-20 23:08:04 +00:00
|
|
|
*
|
2018-11-18 01:27:17 +00:00
|
|
|
* Since we count the number of virtual page mappings in
|
|
|
|
* object->un_pager.vnp.writemappings, the writemappings value
|
|
|
|
* should not be adjusted when the entry is disposed of.
|
2018-10-20 23:08:04 +00:00
|
|
|
*/
|
|
|
|
if (entry->object.vm_object != NULL)
|
|
|
|
vm_object_deallocate(entry->object.vm_object);
|
|
|
|
if (entry->cred != NULL)
|
|
|
|
crfree(entry->cred);
|
|
|
|
vm_map_entry_dispose(map, entry);
|
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2019-08-25 07:06:51 +00:00
|
|
|
* vm_map_try_merge_entries:
|
1996-07-30 03:08:57 +00:00
|
|
|
*
|
2019-08-25 07:06:51 +00:00
|
|
|
 * Compare the given map entry to its predecessor, and merge its predecessor
|
|
|
|
* into it if possible. The entry remains valid, and may be extended.
|
|
|
|
* The predecessor may be deleted.
|
2001-02-04 06:19:28 +00:00
|
|
|
*
|
|
|
|
* The map must be locked.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2014-09-08 02:25:01 +00:00
|
|
|
void
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_try_merge_entries(vm_map_t map, vm_map_entry_t prev_entry,
|
|
|
|
vm_map_entry_t entry)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1996-07-30 03:08:57 +00:00
|
|
|
|
2019-08-25 07:06:51 +00:00
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
if ((entry->eflags & MAP_ENTRY_NOMERGE_MASK) == 0 &&
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_mergeable_neighbors(prev_entry, entry)) {
|
|
|
|
vm_map_entry_unlink(map, prev_entry, UNLINK_MERGE_NEXT);
|
|
|
|
vm_map_merged_neighbor_dispose(map, prev_entry);
|
1996-03-13 01:18:14 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2018-11-18 01:27:17 +00:00
|
|
|
|
2019-06-13 20:09:07 +00:00
|
|
|
/*
|
|
|
|
* vm_map_entry_back:
|
|
|
|
*
|
|
|
|
* Allocate an object to back a map entry.
|
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
vm_map_entry_back(vm_map_entry_t entry)
|
|
|
|
{
|
|
|
|
vm_object_t object;
|
|
|
|
|
|
|
|
KASSERT(entry->object.vm_object == NULL,
|
|
|
|
("map entry %p has backing object", entry));
|
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0,
|
|
|
|
("map entry %p is a submap", entry));
|
2019-12-01 20:43:04 +00:00
|
|
|
object = vm_object_allocate_anon(atop(entry->end - entry->start), NULL,
|
|
|
|
entry->cred, entry->end - entry->start);
|
2019-06-13 20:09:07 +00:00
|
|
|
entry->object.vm_object = object;
|
|
|
|
entry->offset = 0;
|
2019-12-01 20:43:04 +00:00
|
|
|
entry->cred = NULL;
|
2019-06-13 20:09:07 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* vm_map_entry_charge_object
|
|
|
|
*
|
|
|
|
* If there is no object backing this entry, create one. Otherwise, if
|
|
|
|
* the entry has cred, give it to the backing object.
|
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
vm_map_entry_charge_object(vm_map_t map, vm_map_entry_t entry)
|
|
|
|
{
|
|
|
|
|
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0,
|
|
|
|
("map entry %p is a submap", entry));
|
|
|
|
if (entry->object.vm_object == NULL && !map->system_map &&
|
|
|
|
(entry->eflags & MAP_ENTRY_GUARD) == 0)
|
|
|
|
vm_map_entry_back(entry);
|
|
|
|
else if (entry->object.vm_object != NULL &&
|
|
|
|
((entry->eflags & MAP_ENTRY_NEEDS_COPY) == 0) &&
|
|
|
|
entry->cred != NULL) {
|
|
|
|
VM_OBJECT_WLOCK(entry->object.vm_object);
|
|
|
|
KASSERT(entry->object.vm_object->cred == NULL,
|
|
|
|
("OVERCOMMIT: %s: both cred e %p", __func__, entry));
|
|
|
|
entry->object.vm_object->cred = entry->cred;
|
|
|
|
entry->object.vm_object->charge = entry->end - entry->start;
|
|
|
|
VM_OBJECT_WUNLOCK(entry->object.vm_object);
|
|
|
|
entry->cred = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-12-11 16:09:57 +00:00
|
|
|
/*
|
|
|
|
* vm_map_entry_clone
|
|
|
|
*
|
|
|
|
* Create a duplicate map entry for clipping.
|
|
|
|
*/
|
|
|
|
static vm_map_entry_t
|
|
|
|
vm_map_entry_clone(vm_map_t map, vm_map_entry_t entry)
|
|
|
|
{
|
|
|
|
vm_map_entry_t new_entry;
|
|
|
|
|
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a backing object now, if none exists, so that more individual
|
|
|
|
* objects won't be created after the map entry is split.
|
|
|
|
*/
|
|
|
|
vm_map_entry_charge_object(map, entry);
|
|
|
|
|
|
|
|
/* Clone the entry. */
|
|
|
|
new_entry = vm_map_entry_create(map);
|
|
|
|
*new_entry = *entry;
|
|
|
|
if (new_entry->cred != NULL)
|
|
|
|
crhold(entry->cred);
|
|
|
|
if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) {
|
|
|
|
vm_object_reference(new_entry->object.vm_object);
|
|
|
|
vm_map_entry_set_vnode_text(new_entry, true);
|
|
|
|
/*
|
|
|
|
* The object->un_pager.vnp.writemappings for the object of
|
|
|
|
* MAP_ENTRY_WRITECNT type entry shall be kept as is here. The
|
|
|
|
* virtual pages are re-distributed among the clipped entries,
|
|
|
|
* so the sum is left the same.
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
return (new_entry);
|
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_clip_start: [ internal use only ]
|
|
|
|
*
|
|
|
|
* Asserts that the given entry begins at or after
|
|
|
|
* the specified address; if necessary,
|
|
|
|
* it splits the entry into two.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
static int
|
|
|
|
vm_map_clip_start(vm_map_t map, vm_map_entry_t entry, vm_offset_t startaddr)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1998-04-29 04:28:22 +00:00
|
|
|
vm_map_entry_t new_entry;
|
2020-09-09 22:02:30 +00:00
|
|
|
int bdry_idx;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2020-06-29 16:54:00 +00:00
|
|
|
if (!map->system_map)
|
|
|
|
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
|
|
|
|
"%s: map %p entry %p start 0x%jx", __func__, map, entry,
|
2020-09-09 22:02:30 +00:00
|
|
|
(uintmax_t)startaddr);
|
2020-06-29 16:54:00 +00:00
|
|
|
|
2020-09-09 22:02:30 +00:00
|
|
|
if (startaddr <= entry->start)
|
|
|
|
return (KERN_SUCCESS);
|
2020-06-16 22:53:56 +00:00
|
|
|
|
2009-02-24 20:43:29 +00:00
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
2020-09-09 22:02:30 +00:00
|
|
|
KASSERT(entry->end > startaddr && entry->start < startaddr,
|
2020-06-16 22:53:56 +00:00
|
|
|
("%s: invalid clip of entry %p", __func__, entry));
|
2009-02-24 20:43:29 +00:00
|
|
|
|
2020-09-09 22:02:30 +00:00
|
|
|
bdry_idx = (entry->eflags & MAP_ENTRY_SPLIT_BOUNDARY_MASK) >>
|
|
|
|
MAP_ENTRY_SPLIT_BOUNDARY_SHIFT;
|
|
|
|
if (bdry_idx != 0) {
|
|
|
|
if ((startaddr & (pagesizes[bdry_idx] - 1)) != 0)
|
|
|
|
return (KERN_INVALID_ARGUMENT);
|
|
|
|
}
|
|
|
|
|
2019-12-11 16:09:57 +00:00
|
|
|
new_entry = vm_map_entry_clone(map, entry);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2019-06-15 04:30:13 +00:00
|
|
|
/*
|
|
|
|
* Split off the front portion. Insert the new entry BEFORE this one,
|
|
|
|
* so that this entry has the specified starting address.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
new_entry->end = startaddr;
|
Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_link(map, new_entry);
|
2020-09-09 22:02:30 +00:00
|
|
|
return (KERN_SUCCESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2020-01-24 07:48:11 +00:00
|
|
|
/*
|
|
|
|
* vm_map_lookup_clip_start:
|
|
|
|
*
|
|
|
|
* Find the entry at or just after 'start', and clip it if 'start' is in
|
|
|
|
* the interior of the entry. Return entry after 'start', and in
|
|
|
|
* prev_entry set the entry before 'start'.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
static int
|
2020-01-24 07:48:11 +00:00
|
|
|
vm_map_lookup_clip_start(vm_map_t map, vm_offset_t start,
|
2020-09-09 22:02:30 +00:00
|
|
|
vm_map_entry_t *res_entry, vm_map_entry_t *prev_entry)
|
2020-01-24 07:48:11 +00:00
|
|
|
{
|
|
|
|
vm_map_entry_t entry;
|
2020-09-09 22:02:30 +00:00
|
|
|
int rv;
|
2020-01-24 07:48:11 +00:00
|
|
|
|
2020-06-29 16:54:00 +00:00
|
|
|
if (!map->system_map)
|
|
|
|
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
|
|
|
|
"%s: map %p start 0x%jx prev %p", __func__, map,
|
|
|
|
(uintmax_t)start, prev_entry);
|
|
|
|
|
2020-01-24 07:48:11 +00:00
|
|
|
if (vm_map_lookup_entry(map, start, prev_entry)) {
|
|
|
|
entry = *prev_entry;
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_map_clip_start(map, entry, start);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
return (rv);
|
2020-01-24 07:48:11 +00:00
|
|
|
*prev_entry = vm_map_entry_pred(entry);
|
|
|
|
} else
|
|
|
|
entry = vm_map_entry_succ(*prev_entry);
|
2020-09-09 22:02:30 +00:00
|
|
|
*res_entry = entry;
|
|
|
|
return (KERN_SUCCESS);
|
2020-01-24 07:48:11 +00:00
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_clip_end: [ internal use only ]
|
|
|
|
*
|
|
|
|
* Asserts that the given entry ends at or before
|
|
|
|
* the specified address; if necessary,
|
|
|
|
* it splits the entry into two.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
static int
|
|
|
|
vm_map_clip_end(vm_map_t map, vm_map_entry_t entry, vm_offset_t endaddr)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1998-04-29 04:28:22 +00:00
|
|
|
vm_map_entry_t new_entry;
|
2020-09-09 22:02:30 +00:00
|
|
|
int bdry_idx;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2020-06-29 16:54:00 +00:00
|
|
|
if (!map->system_map)
|
|
|
|
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
|
|
|
|
"%s: map %p entry %p end 0x%jx", __func__, map, entry,
|
2020-09-09 22:02:30 +00:00
|
|
|
(uintmax_t)endaddr);
|
2020-06-29 16:54:00 +00:00
|
|
|
|
2020-09-09 22:02:30 +00:00
|
|
|
if (endaddr >= entry->end)
|
|
|
|
return (KERN_SUCCESS);
|
2020-06-16 22:53:56 +00:00
|
|
|
|
2009-02-24 20:43:29 +00:00
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
2020-09-09 22:02:30 +00:00
|
|
|
KASSERT(entry->start < endaddr && entry->end > endaddr,
|
2020-06-16 22:53:56 +00:00
|
|
|
("%s: invalid clip of entry %p", __func__, entry));
|
2009-02-24 20:43:29 +00:00
|
|
|
|
2020-09-09 22:02:30 +00:00
|
|
|
bdry_idx = (entry->eflags & MAP_ENTRY_SPLIT_BOUNDARY_MASK) >>
|
|
|
|
MAP_ENTRY_SPLIT_BOUNDARY_SHIFT;
|
|
|
|
if (bdry_idx != 0) {
|
|
|
|
if ((endaddr & (pagesizes[bdry_idx] - 1)) != 0)
|
|
|
|
return (KERN_INVALID_ARGUMENT);
|
|
|
|
}
|
|
|
|
|
2019-12-11 16:09:57 +00:00
|
|
|
new_entry = vm_map_entry_clone(map, entry);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2019-06-15 04:30:13 +00:00
|
|
|
/*
|
|
|
|
* Split off the back portion. Insert the new entry AFTER this one,
|
|
|
|
* so that this entry has the specified ending address.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
new_entry->start = endaddr;
|
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_link(map, new_entry);
|
2020-09-09 22:02:30 +00:00
|
|
|
|
|
|
|
return (KERN_SUCCESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* vm_map_submap: [ kernel use only ]
|
|
|
|
*
|
|
|
|
* Mark the given range as handled by a subordinate map.
|
|
|
|
*
|
|
|
|
* This range must have been created with vm_map_find,
|
|
|
|
* and no other operations may have been performed on this
|
|
|
|
* range prior to calling vm_map_submap.
|
|
|
|
*
|
|
|
|
* Only a limited number of operations can be performed
|
|
|
|
 * within this range after calling vm_map_submap:
|
|
|
|
* vm_fault
|
|
|
|
* [Don't try vm_map_copy!]
|
|
|
|
*
|
|
|
|
* To remove a submapping, one must first remove the
|
|
|
|
* range from the superior map, and then destroy the
|
|
|
|
* submap (if desired). [Better yet, don't try it.]
|
|
|
|
*/
|
|
|
|
int
|
2001-07-04 20:15:18 +00:00
|
|
|
vm_map_submap(
|
|
|
|
vm_map_t map,
|
|
|
|
vm_offset_t start,
|
|
|
|
vm_offset_t end,
|
|
|
|
vm_map_t submap)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
vm_map_entry_t entry;
|
Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on a per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
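The best-effort policy described above (inject entropy only when the gap can absorb it, otherwise fall back to first fit) can be sketched in isolation. This is a minimal user-space illustration, not the kernel's allocator; the function name `pick_base` and the fixed 4 KB page size are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/*
 * Pick a page-aligned base for a mapping of 'len' bytes inside [lo, hi).
 * Returns 0 when there is no fit at all.  When the range has no slack,
 * fall back to the first-fit address 'lo'; otherwise clamp the requested
 * entropy ('rnd_pages' pages) to the available slack -- best effort.
 */
static uintptr_t
pick_base(uintptr_t lo, uintptr_t hi, unsigned long rnd_pages, size_t len)
{
	uintptr_t span;

	if (hi - lo < len)
		return (0);			/* no fit at all */
	span = (hi - lo - len) / PAGE_SIZE;	/* pages of slack */
	if (span == 0)
		return (lo);			/* first fit: no room for entropy */
	if (rnd_pages > span)
		rnd_pages = span;		/* clamp entropy to the gap */
	return (lo + (uintptr_t)(random() % (rnd_pages + 1)) * PAGE_SIZE);
}
```

The returned base is always page aligned and the whole mapping stays inside the requested range, mirroring the guarantee that randomization never causes a spurious failure when a first-fit placement would have succeeded.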
2019-02-10 17:19:45 +00:00
|
|
|
int result;
|
|
|
|
|
|
|
|
result = KERN_INVALID_ARGUMENT;
|
|
|
|
|
|
|
|
vm_map_lock(submap);
|
|
|
|
submap->flags |= MAP_IS_SUB_MAP;
|
|
|
|
vm_map_unlock(submap);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
vm_map_lock(map);
|
|
|
|
VM_MAP_RANGE_CHECK(map, start, end);
|
2020-01-23 16:45:10 +00:00
|
|
|
if (vm_map_lookup_entry(map, start, &entry) && entry->end >= end &&
|
|
|
|
(entry->eflags & MAP_ENTRY_COW) == 0 &&
|
|
|
|
entry->object.vm_object == NULL) {
|
2020-09-09 22:02:30 +00:00
|
|
|
result = vm_map_clip_start(map, entry, start);
|
|
|
|
if (result != KERN_SUCCESS)
|
|
|
|
goto unlock;
|
|
|
|
result = vm_map_clip_end(map, entry, end);
|
|
|
|
if (result != KERN_SUCCESS)
|
|
|
|
goto unlock;
|
VM level code cleanups.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
1998-01-22 17:30:44 +00:00
|
|
|
entry->object.sub_map = submap;
|
1997-01-16 04:16:22 +00:00
|
|
|
entry->eflags |= MAP_ENTRY_IS_SUB_MAP;
|
1994-05-24 10:09:53 +00:00
|
|
|
result = KERN_SUCCESS;
|
|
|
|
}
|
2020-09-09 22:02:30 +00:00
|
|
|
unlock:
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_map_unlock(map);
|
|
|
|
|
2019-02-10 17:19:45 +00:00
|
|
|
if (result != KERN_SUCCESS) {
|
|
|
|
vm_map_lock(submap);
|
|
|
|
submap->flags &= ~MAP_IS_SUB_MAP;
|
|
|
|
vm_map_unlock(submap);
|
|
|
|
}
|
1995-01-09 16:06:02 +00:00
|
|
|
return (result);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
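The control flow above follows a mark-then-roll-back pattern: the submap is tentatively flagged with MAP_IS_SUB_MAP before the clip and link steps, and the flag is cleared again if any step fails. A hedged stand-alone sketch of that pattern (the `struct map` and `install_submap` here are illustrative, not kernel types):

```c
#include <assert.h>

#define MAP_IS_SUB_MAP		0x1
#define KERN_SUCCESS		0
#define KERN_INVALID_ARGUMENT	1

struct map {
	int flags;
};

/*
 * Mimic vm_map_submap()'s shape: mark the submap up front, attempt the
 * update, and undo the mark when the attempt fails, so a failed call
 * leaves the submap exactly as it was found.
 */
static int
install_submap(struct map *submap, int attempt_ok)
{
	int result;

	submap->flags |= MAP_IS_SUB_MAP;
	result = attempt_ok ? KERN_SUCCESS : KERN_INVALID_ARGUMENT;
	if (result != KERN_SUCCESS)
		submap->flags &= ~MAP_IS_SUB_MAP;
	return (result);
}
```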
|
|
|
|
|
2003-07-03 20:18:02 +00:00
|
|
|
/*
|
Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.
Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.
On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping. For
example, both kinds of mappings entail the creation of a single PTE and PV
entry. With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages. Now, it will
create up to 96 base page or superpage mappings.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
|
|
|
* The maximum number of pages to map if MAP_PREFAULT_PARTIAL is specified
|
2003-07-03 20:18:02 +00:00
|
|
|
*/
|
|
|
|
#define MAX_INIT_PT 96
|
|
|
|
|
2003-06-29 23:32:55 +00:00
|
|
|
/*
|
|
|
|
* vm_map_pmap_enter:
|
|
|
|
*
|
2014-06-07 17:12:26 +00:00
|
|
|
* Preload the specified map's pmap with mappings to the specified
|
|
|
|
* object's memory-resident pages. No further physical pages are
|
|
|
|
* allocated, and no further virtual pages are retrieved from secondary
|
|
|
|
* storage. If the specified flags include MAP_PREFAULT_PARTIAL, then a
|
|
|
|
* limited number of page mappings are created at the low-end of the
|
|
|
|
* specified address range. (For this purpose, a superpage mapping
|
|
|
|
* counts as one page mapping.) Otherwise, all resident pages within
|
2016-12-12 17:47:09 +00:00
|
|
|
* the specified address range are mapped.
|
2003-06-29 23:32:55 +00:00
|
|
|
*/
|
2014-09-08 00:19:03 +00:00
|
|
|
static void
|
2004-04-24 03:46:44 +00:00
|
|
|
vm_map_pmap_enter(vm_map_t map, vm_offset_t addr, vm_prot_t prot,
|
2003-06-29 23:32:55 +00:00
|
|
|
vm_object_t object, vm_pindex_t pindex, vm_size_t size, int flags)
|
|
|
|
{
|
2007-03-25 19:33:40 +00:00
|
|
|
vm_offset_t start;
|
2006-06-05 20:35:27 +00:00
|
|
|
vm_page_t p, p_start;
|
2014-06-07 17:12:26 +00:00
|
|
|
vm_pindex_t mask, psize, threshold, tmpidx;
|
2003-06-29 23:32:55 +00:00
|
|
|
|
2005-09-03 18:20:20 +00:00
|
|
|
if ((prot & (VM_PROT_READ | VM_PROT_EXECUTE)) == 0 || object == NULL)
|
2003-07-03 20:18:02 +00:00
|
|
|
return;
|
2009-07-24 13:50:29 +00:00
|
|
|
if (object->type == OBJT_DEVICE || object->type == OBJT_SG) {
|
2013-05-21 20:38:19 +00:00
|
|
|
VM_OBJECT_WLOCK(object);
|
|
|
|
if (object->type == OBJT_DEVICE || object->type == OBJT_SG) {
|
|
|
|
pmap_object_init_pt(map->pmap, addr, object, pindex,
|
|
|
|
size);
|
|
|
|
VM_OBJECT_WUNLOCK(object);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
VM_OBJECT_LOCK_DOWNGRADE(object);
|
2019-11-29 19:57:49 +00:00
|
|
|
} else
|
|
|
|
VM_OBJECT_RLOCK(object);
|
2003-07-03 20:18:02 +00:00
|
|
|
|
|
|
|
psize = atop(size);
|
|
|
|
if (psize + pindex > object->size) {
|
2019-12-04 19:46:48 +00:00
|
|
|
if (pindex >= object->size) {
|
2013-05-21 20:38:19 +00:00
|
|
|
VM_OBJECT_RUNLOCK(object);
|
|
|
|
return;
|
|
|
|
}
|
2003-07-03 20:18:02 +00:00
|
|
|
psize = object->size - pindex;
|
|
|
|
}
|
|
|
|
|
2006-06-05 20:35:27 +00:00
|
|
|
start = 0;
|
|
|
|
p_start = NULL;
|
2014-06-07 17:12:26 +00:00
|
|
|
threshold = MAX_INIT_PT;
|
2003-07-03 20:18:02 +00:00
|
|
|
|
2010-07-04 11:13:33 +00:00
|
|
|
p = vm_page_find_least(object, pindex);
|
2003-07-03 20:18:02 +00:00
|
|
|
/*
|
|
|
|
* Assert: the variable p is either (1) the page with the
|
|
|
|
* least pindex greater than or equal to the parameter pindex
|
|
|
|
* or (2) NULL.
|
|
|
|
*/
|
|
|
|
for (;
|
|
|
|
p != NULL && (tmpidx = p->pindex - pindex) < psize;
|
|
|
|
p = TAILQ_NEXT(p, listq)) {
|
|
|
|
/*
|
|
|
|
* don't allow an madvise to blow away our really
|
|
|
|
* free pages allocating pv entries.
|
|
|
|
*/
|
2014-06-07 17:12:26 +00:00
|
|
|
if (((flags & MAP_PREFAULT_MADVISE) != 0 &&
|
2018-02-06 22:10:07 +00:00
|
|
|
vm_page_count_severe()) ||
|
2014-06-07 17:12:26 +00:00
|
|
|
((flags & MAP_PREFAULT_PARTIAL) != 0 &&
|
|
|
|
tmpidx >= threshold)) {
|
2006-06-17 08:45:01 +00:00
|
|
|
psize = tmpidx;
|
2003-07-03 20:18:02 +00:00
|
|
|
break;
|
|
|
|
}
|
2019-10-15 03:45:41 +00:00
|
|
|
if (vm_page_all_valid(p)) {
|
2006-06-05 20:35:27 +00:00
|
|
|
if (p_start == NULL) {
|
|
|
|
start = addr + ptoa(tmpidx);
|
|
|
|
p_start = p;
|
|
|
|
}
|
2014-06-07 17:12:26 +00:00
|
|
|
/* Jump ahead if a superpage mapping is possible. */
|
|
|
|
if (p->psind > 0 && ((addr + ptoa(tmpidx)) &
|
|
|
|
(pagesizes[p->psind] - 1)) == 0) {
|
|
|
|
mask = atop(pagesizes[p->psind]) - 1;
|
|
|
|
if (tmpidx + mask < psize &&
|
2017-07-14 02:15:48 +00:00
|
|
|
vm_page_ps_test(p, PS_ALL_VALID, NULL)) {
|
2014-06-07 17:12:26 +00:00
|
|
|
p += mask;
|
|
|
|
threshold += mask;
|
|
|
|
}
|
|
|
|
}
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
} else if (p_start != NULL) {
|
2006-06-05 20:35:27 +00:00
|
|
|
pmap_enter_object(map->pmap, start, addr +
|
|
|
|
ptoa(tmpidx), p_start, prot);
|
|
|
|
p_start = NULL;
|
2003-07-03 20:18:02 +00:00
|
|
|
}
|
|
|
|
}
|
2010-05-26 18:00:44 +00:00
|
|
|
if (p_start != NULL)
|
2006-06-17 08:45:01 +00:00
|
|
|
pmap_enter_object(map->pmap, start, addr + ptoa(psize),
|
|
|
|
p_start, prot);
|
2013-05-21 20:38:19 +00:00
|
|
|
VM_OBJECT_RUNLOCK(object);
|
2003-06-29 23:32:55 +00:00
|
|
|
}
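The "jump ahead" test inside the loop above can be restated in isolation: a superpage mapping at page-size index `psind` is possible only when the candidate address is aligned to that page size and the whole reservation still fits inside the `psize`-page range. A sketch under illustrative assumptions (the 4 KB/2 MB `pagesizes` values stand in for amd64 numbers; the validity check on the pages themselves is omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative page sizes: 4 KB base pages, 2 MB superpages. */
static const size_t pagesizes[] = { 4096, 2 * 1024 * 1024 };

#define atop(x)	((x) / pagesizes[0])

/*
 * Return nonzero when a superpage of index psind can start at
 * addr + tmpidx base pages and still lie entirely within the
 * psize-page range -- the same alignment-and-fit conditions
 * vm_map_pmap_enter() checks before jumping ahead by 'mask' pages.
 */
static int
superpage_fits(int psind, uintptr_t addr, size_t tmpidx, size_t psize)
{
	size_t mask;

	if (psind <= 0)
		return (0);
	if (((addr + tmpidx * pagesizes[0]) & (pagesizes[psind] - 1)) != 0)
		return (0);			/* start not aligned */
	mask = atop(pagesizes[psind]) - 1;
	return (tmpidx + mask < psize);		/* whole superpage in range */
}
```

When the check succeeds, advancing both `p` and `threshold` by `mask` makes the superpage count as a single mapping against the MAP_PREFAULT_PARTIAL budget.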
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_protect:
|
|
|
|
*
|
2021-01-12 12:43:39 +00:00
|
|
|
* Sets the protection and/or the maximum protection of the
|
|
|
|
* specified address region in the target map.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
int
|
1997-08-25 22:15:31 +00:00
|
|
|
vm_map_protect(vm_map_t map, vm_offset_t start, vm_offset_t end,
|
2021-01-12 12:43:39 +00:00
|
|
|
vm_prot_t new_prot, vm_prot_t new_maxprot, int flags)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_entry_t entry, first_entry, in_tran, prev_entry;
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when allocation is performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
|
|
|
vm_object_t obj;
|
2010-12-02 17:37:16 +00:00
|
|
|
struct ucred *cred;
|
2009-10-27 10:15:58 +00:00
|
|
|
vm_prot_t old_prot;
|
2019-06-28 02:14:54 +00:00
|
|
|
int rv;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2013-11-20 09:03:48 +00:00
|
|
|
if (start == end)
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
|
2021-01-12 12:43:39 +00:00
|
|
|
if ((flags & (VM_MAP_PROTECT_SET_PROT | VM_MAP_PROTECT_SET_MAXPROT)) ==
|
|
|
|
(VM_MAP_PROTECT_SET_PROT | VM_MAP_PROTECT_SET_MAXPROT) &&
|
|
|
|
(new_prot & new_maxprot) != new_prot)
|
|
|
|
return (KERN_OUT_OF_BOUNDS);
|
|
|
|
|
Fix another race between vm_map_protect() and vm_map_wire().
vm_map_wire() increments entry->wire_count, after that it drops the
map lock both for faulting in the entry' pages, and for marking next
entry in the requested region as IN_TRANSITION. Only after all entries
are faulted in, MAP_ENTRY_USER_WIRE flag is set.
This makes it possible for vm_map_protect() to run while other entry'
MAP_ENTRY_IN_TRANSITION flag is handled, and vm_map_busy() lock does
not prevent it. In particular, if the call to vm_map_protect() adds
VM_PROT_WRITE to CoW entry, it would fail to call
vm_fault_copy_entry(). There are at least two consequences of the
race: the top object in the shadow chain is not populated with
writeable pages, and second, the entry eventually gets contradictory
flags MAP_ENTRY_NEEDS_COPY | MAP_ENTRY_USER_WIRED with VM_PROT_WRITE
set.
Handle it by waiting for all MAP_ENTRY_IN_TRANSITION flags to go away
in vm_map_protect(), which does not drop map lock afterwards. Note
that vm_map_busy_wait() is left as is.
Reported and tested by: pho (previous version)
Reviewed by: Doug Moore <dougm@rice.edu>, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20091
2019-05-01 13:15:06 +00:00
|
|
|
again:
|
|
|
|
in_tran = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_map_lock(map);
|
|
|
|
|
	if ((map->flags & MAP_WXORX) != 0 &&
	    (flags & VM_MAP_PROTECT_SET_PROT) != 0 &&
	    (new_prot & (VM_PROT_WRITE | VM_PROT_EXECUTE)) == (VM_PROT_WRITE |
	    VM_PROT_EXECUTE)) {
		vm_map_unlock(map);
		return (KERN_PROTECTION_FAILURE);
	}
	/*
	 * Ensure that we are not concurrently wiring pages.  vm_map_wire() may
	 * need to fault pages into the map and will drop the map lock while
	 * doing so, and the VM object may end up in an inconsistent state if we
	 * update the protection on the map entry in between faults.
	 */
	vm_map_wait_busy(map);

	VM_MAP_RANGE_CHECK(map, start, end);
	if (!vm_map_lookup_entry(map, start, &first_entry))
		first_entry = vm_map_entry_succ(first_entry);
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
	/*
	 * Make a first pass to check for protection violations.
	 */
	for (entry = first_entry; entry->start < end;
	    entry = vm_map_entry_succ(entry)) {
		if ((entry->eflags & MAP_ENTRY_GUARD) != 0)
			continue;
		if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0) {
			vm_map_unlock(map);
			return (KERN_INVALID_ARGUMENT);
		}
		if ((flags & VM_MAP_PROTECT_SET_PROT) == 0)
			new_prot = entry->protection;
		if ((flags & VM_MAP_PROTECT_SET_MAXPROT) == 0)
			new_maxprot = entry->max_protection;
		if ((new_prot & entry->max_protection) != new_prot ||
		    (new_maxprot & entry->max_protection) != new_maxprot) {
			vm_map_unlock(map);
			return (KERN_PROTECTION_FAILURE);
		}
		if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0)
			in_tran = entry;
	}

	/*
	 * Postpone the operation until all in-transition map entries have
	 * stabilized.  An in-transition entry might already have its pages
	 * wired and wired_count incremented, but not yet have its
	 * MAP_ENTRY_USER_WIRED flag set.  In which case, we would fail to call
	 * vm_fault_copy_entry() in the final loop below.
	 */
	if (in_tran != NULL) {
		in_tran->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
		vm_map_unlock_and_wait(map, 0);
		goto again;
	}
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
which is used to charge the appropriate uid when the allocation is
performed by the kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
	/*
	 * Before changing the protections, try to reserve swap space for any
	 * private (i.e., copy-on-write) mappings that are transitioning from
	 * read-only to read/write access.  If a reservation fails, break out
	 * of this loop early and let the next loop simplify the entries, since
	 * some may now be mergeable.
	 */
	rv = vm_map_clip_start(map, first_entry, start);
	if (rv != KERN_SUCCESS) {
		vm_map_unlock(map);
		return (rv);
	}
	for (entry = first_entry; entry->start < end;
	    entry = vm_map_entry_succ(entry)) {
		rv = vm_map_clip_end(map, entry, end);
		if (rv != KERN_SUCCESS) {
			vm_map_unlock(map);
			return (rv);
		}

		if ((flags & VM_MAP_PROTECT_SET_PROT) == 0 ||
		    ((new_prot & ~entry->protection) & VM_PROT_WRITE) == 0 ||
		    ENTRY_CHARGED(entry) ||
		    (entry->eflags & MAP_ENTRY_GUARD) != 0)
			continue;

		cred = curthread->td_ucred;
		obj = entry->object.vm_object;

		if (obj == NULL ||
		    (entry->eflags & MAP_ENTRY_NEEDS_COPY) != 0) {
			if (!swap_reserve(entry->end - entry->start)) {
				rv = KERN_RESOURCE_SHORTAGE;
				end = entry->end;
				break;
			}
			crhold(cred);
			entry->cred = cred;
			continue;
		}

		if (obj->type != OBJT_DEFAULT && obj->type != OBJT_SWAP)
			continue;
		VM_OBJECT_WLOCK(obj);
		if (obj->type != OBJT_DEFAULT && obj->type != OBJT_SWAP) {
			VM_OBJECT_WUNLOCK(obj);
			continue;
		}

		/*
		 * Charge for the whole object allocation now, since
		 * we cannot distinguish between non-charged and
		 * charged clipped mapping of the same object later.
		 */
		KASSERT(obj->charge == 0,
		    ("vm_map_protect: object %p overcharged (entry %p)",
		    obj, entry));
		if (!swap_reserve(ptoa(obj->size))) {
			VM_OBJECT_WUNLOCK(obj);
			rv = KERN_RESOURCE_SHORTAGE;
			end = entry->end;
			break;
		}

		crhold(cred);
		obj->cred = cred;
		obj->charge = ptoa(obj->size);
		VM_OBJECT_WUNLOCK(obj);
	}

	/*
	 * If enough swap space was available, go back and fix up protections.
	 * Otherwise, just simplify entries, since some may have been modified.
	 * [Note that clipping is not necessary the second time.]
	 */
	for (prev_entry = vm_map_entry_pred(first_entry), entry = first_entry;
	    entry->start < end;
	    vm_map_try_merge_entries(map, prev_entry, entry),
	    prev_entry = entry, entry = vm_map_entry_succ(entry)) {
		if (rv != KERN_SUCCESS ||
		    (entry->eflags & MAP_ENTRY_GUARD) != 0)
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement the stack grow code.  Explicitly track the
stack grow area with the guard, including the stack guard page.  On
stack grow, a trivial shift of the guard map entry and stack map entry
limits performs the stack expansion.  Move the code that detects stack
growth and calls vm_map_growstack() from vm_fault() into
vm_map_lookup().
As a result, it is impossible for a random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
			continue;
		old_prot = entry->protection;

		if ((flags & VM_MAP_PROTECT_SET_MAXPROT) != 0) {
			entry->max_protection = new_maxprot;
			entry->protection = new_maxprot & old_prot;
		}
		if ((flags & VM_MAP_PROTECT_SET_PROT) != 0)
			entry->protection = new_prot;

		/*
		 * For user wired map entries, the normal lazy evaluation of
		 * write access upgrades through soft page faults is
		 * undesirable.  Instead, immediately copy any pages that are
		 * copy-on-write and enable write access in the physical map.
		 */
		if ((entry->eflags & MAP_ENTRY_USER_WIRED) != 0 &&
		    (entry->protection & VM_PROT_WRITE) != 0 &&
		    (old_prot & VM_PROT_WRITE) == 0)
			vm_fault_copy_entry(map, map, entry, entry, NULL);
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2009-11-02 17:45:39 +00:00
|
|
|
* When restricting access, update the physical map. Worry
|
|
|
|
* about copy-on-write here.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-11-25 02:19:47 +00:00
|
|
|
if ((old_prot & ~entry->protection) != 0) {
|
1997-01-16 04:16:22 +00:00
|
|
|
#define MASK(entry) (((entry)->eflags & MAP_ENTRY_COW) ? ~VM_PROT_WRITE : \
|
1994-05-24 10:09:53 +00:00
|
|
|
VM_PROT_ALL)
|
2019-11-25 02:19:47 +00:00
|
|
|
pmap_protect(map->pmap, entry->start,
|
|
|
|
entry->end,
|
|
|
|
entry->protection & MASK(entry));
|
1994-05-24 10:09:53 +00:00
|
|
|
#undef MASK
|
|
|
|
}
|
|
|
|
}
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_map_unlock(map);
|
2019-06-28 02:14:54 +00:00
|
|
|
return (rv);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1996-05-19 07:36:50 +00:00
|
|
|
/*
|
|
|
|
* vm_map_madvise:
|
|
|
|
*
|
2003-11-03 16:14:45 +00:00
|
|
|
 * This routine traverses a process's map, handling the madvise
|
1999-08-13 17:45:34 +00:00
|
|
|
 * system call.  Advisories are classified as either those affecting
|
2003-11-03 16:14:45 +00:00
|
|
|
 * the vm_map_entry structure, or those affecting the underlying
|
1999-08-13 17:45:34 +00:00
|
|
|
* objects.
|
1996-05-19 07:36:50 +00:00
|
|
|
*/
|
1999-09-21 05:00:48 +00:00
|
|
|
int
|
2001-07-04 20:15:18 +00:00
|
|
|
vm_map_madvise(
|
|
|
|
vm_map_t map,
|
2003-11-03 16:14:45 +00:00
|
|
|
vm_offset_t start,
|
2001-07-04 20:15:18 +00:00
|
|
|
vm_offset_t end,
|
|
|
|
int behav)
|
1996-05-19 07:36:50 +00:00
|
|
|
{
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_entry_t entry, prev_entry;
|
2020-09-09 22:02:30 +00:00
|
|
|
int rv;
|
Use a single, consistent approach to returning success versus failure in
vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases. Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic. vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value. That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.
Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.
Eliminate a redundant error check from kern_madvise(). Add a comment
explaining where the check is performed.
Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value. Since MADV_FREE is passed as the
behavior, the return value will always be zero.
Reviewed by: kib, markj
MFC after: 7 days
2018-06-04 16:28:06 +00:00
|
|
|
bool modify_map;
|
1996-05-19 07:36:50 +00:00
|
|
|
|
1999-09-21 05:00:48 +00:00
|
|
|
/*
|
|
|
|
* Some madvise calls directly modify the vm_map_entry, in which case
|
2003-11-03 16:14:45 +00:00
|
|
|
* we need to use an exclusive lock on the map and we need to perform
|
1999-09-21 05:00:48 +00:00
|
|
|
* various clipping operations. Otherwise we only need a read-lock
|
|
|
|
* on the map.
|
|
|
|
*/
|
|
|
|
switch (behav) {
|
|
|
|
case MADV_NORMAL:
|
|
|
|
case MADV_SEQUENTIAL:
|
|
|
|
case MADV_RANDOM:
|
1999-12-12 03:19:33 +00:00
|
|
|
case MADV_NOSYNC:
|
|
|
|
case MADV_AUTOSYNC:
|
2000-02-28 04:10:35 +00:00
|
|
|
case MADV_NOCORE:
|
|
|
|
case MADV_CORE:
|
2013-11-20 09:03:48 +00:00
|
|
|
if (start == end)
|
2018-06-04 16:28:06 +00:00
|
|
|
return (0);
|
|
|
|
modify_map = true;
|
1999-08-13 17:45:34 +00:00
|
|
|
vm_map_lock(map);
|
1999-09-21 05:00:48 +00:00
|
|
|
break;
|
|
|
|
case MADV_WILLNEED:
|
|
|
|
case MADV_DONTNEED:
|
|
|
|
case MADV_FREE:
|
2013-11-20 09:03:48 +00:00
|
|
|
if (start == end)
|
2018-06-04 16:28:06 +00:00
|
|
|
return (0);
|
|
|
|
modify_map = false;
|
1999-08-13 17:45:34 +00:00
|
|
|
vm_map_lock_read(map);
|
1999-09-21 05:00:48 +00:00
|
|
|
break;
|
|
|
|
default:
|
2018-06-04 16:28:06 +00:00
|
|
|
return (EINVAL);
|
1999-09-21 05:00:48 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Locate starting entry and clip if necessary.
|
|
|
|
*/
|
1996-05-19 07:36:50 +00:00
|
|
|
VM_MAP_RANGE_CHECK(map, start, end);
|
|
|
|
|
1999-08-13 17:45:34 +00:00
|
|
|
if (modify_map) {
|
|
|
|
/*
|
|
|
|
* madvise behaviors that are implemented in the vm_map_entry.
|
|
|
|
*
|
|
|
|
* We clip the vm_map_entry so that behavioral changes are
|
|
|
|
* limited to the specified address range.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_map_lookup_clip_start(map, start, &entry, &prev_entry);
|
|
|
|
if (rv != KERN_SUCCESS) {
|
|
|
|
vm_map_unlock(map);
|
|
|
|
return (vm_mmap_to_errno(rv));
|
|
|
|
}
|
|
|
|
|
|
|
|
for (; entry->start < end; prev_entry = entry,
|
|
|
|
entry = vm_map_entry_succ(entry)) {
|
2019-11-25 02:19:47 +00:00
|
|
|
if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0)
|
1999-08-13 17:45:34 +00:00
|
|
|
continue;
|
|
|
|
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_map_clip_end(map, entry, end);
|
|
|
|
if (rv != KERN_SUCCESS) {
|
|
|
|
vm_map_unlock(map);
|
|
|
|
return (vm_mmap_to_errno(rv));
|
|
|
|
}
|
1999-08-13 17:45:34 +00:00
|
|
|
|
|
|
|
switch (behav) {
|
|
|
|
case MADV_NORMAL:
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_entry_set_behavior(entry,
|
|
|
|
MAP_ENTRY_BEHAV_NORMAL);
|
1999-08-13 17:45:34 +00:00
|
|
|
break;
|
|
|
|
case MADV_SEQUENTIAL:
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_entry_set_behavior(entry,
|
|
|
|
MAP_ENTRY_BEHAV_SEQUENTIAL);
|
1999-08-13 17:45:34 +00:00
|
|
|
break;
|
|
|
|
case MADV_RANDOM:
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_entry_set_behavior(entry,
|
|
|
|
MAP_ENTRY_BEHAV_RANDOM);
|
1999-08-13 17:45:34 +00:00
|
|
|
break;
|
1999-12-12 03:19:33 +00:00
|
|
|
case MADV_NOSYNC:
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->eflags |= MAP_ENTRY_NOSYNC;
|
1999-12-12 03:19:33 +00:00
|
|
|
break;
|
|
|
|
case MADV_AUTOSYNC:
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->eflags &= ~MAP_ENTRY_NOSYNC;
|
1999-12-12 03:19:33 +00:00
|
|
|
break;
|
2000-02-28 04:10:35 +00:00
|
|
|
case MADV_NOCORE:
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->eflags |= MAP_ENTRY_NOCOREDUMP;
|
2000-02-28 04:10:35 +00:00
|
|
|
break;
|
|
|
|
case MADV_CORE:
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->eflags &= ~MAP_ENTRY_NOCOREDUMP;
|
2000-02-28 04:10:35 +00:00
|
|
|
break;
|
1999-08-13 17:45:34 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
1996-05-19 07:36:50 +00:00
|
|
|
}
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
1999-08-13 17:45:34 +00:00
|
|
|
vm_map_unlock(map);
|
1999-09-21 05:00:48 +00:00
|
|
|
} else {
|
2012-03-19 18:47:34 +00:00
|
|
|
vm_pindex_t pstart, pend;
|
1999-08-13 17:45:34 +00:00
|
|
|
|
1999-09-21 05:00:48 +00:00
|
|
|
/*
|
|
|
|
* madvise behaviors that are implemented in the underlying
|
|
|
|
* vm_object.
|
|
|
|
*
|
|
|
|
* Since we don't clip the vm_map_entry, we have to clip
|
|
|
|
* the vm_object pindex and count.
|
|
|
|
*/
|
2020-01-24 07:48:11 +00:00
|
|
|
if (!vm_map_lookup_entry(map, start, &entry))
|
|
|
|
entry = vm_map_entry_succ(entry);
|
2019-11-25 02:19:47 +00:00
|
|
|
for (; entry->start < end;
|
|
|
|
entry = vm_map_entry_succ(entry)) {
|
Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).
Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.
Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().
Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
|
|
|
vm_offset_t useEnd, useStart;
|
2000-05-14 18:46:40 +00:00
|
|
|
|
2019-11-25 02:19:47 +00:00
|
|
|
if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0)
|
1999-09-21 05:00:48 +00:00
|
|
|
continue;
|
1999-08-13 17:45:34 +00:00
|
|
|
|
2019-09-04 20:28:16 +00:00
|
|
|
/*
|
|
|
|
* MADV_FREE would otherwise rewind time to
|
|
|
|
* the creation of the shadow object. Because
|
|
|
|
* we hold the VM map read-locked, neither the
|
|
|
|
* entry's object nor the presence of a
|
|
|
|
* backing object can change.
|
|
|
|
*/
|
|
|
|
if (behav == MADV_FREE &&
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->object.vm_object != NULL &&
|
|
|
|
entry->object.vm_object->backing_object != NULL)
|
2019-09-04 20:28:16 +00:00
|
|
|
continue;
|
|
|
|
|
2019-11-25 02:19:47 +00:00
|
|
|
pstart = OFF_TO_IDX(entry->offset);
|
|
|
|
pend = pstart + atop(entry->end - entry->start);
|
|
|
|
useStart = entry->start;
|
|
|
|
useEnd = entry->end;
|
1997-01-22 01:34:48 +00:00
|
|
|
|
2019-11-25 02:19:47 +00:00
|
|
|
if (entry->start < start) {
|
|
|
|
pstart += atop(start - entry->start);
|
2000-05-14 18:46:40 +00:00
|
|
|
useStart = start;
|
1999-09-21 05:00:48 +00:00
|
|
|
}
|
2019-11-25 02:19:47 +00:00
|
|
|
if (entry->end > end) {
|
|
|
|
pend -= atop(entry->end - end);
|
2013-08-29 15:49:05 +00:00
|
|
|
useEnd = end;
|
|
|
|
}
|
1999-08-13 17:45:34 +00:00
|
|
|
|
2012-03-19 18:47:34 +00:00
|
|
|
if (pstart >= pend)
|
1999-09-21 05:00:48 +00:00
|
|
|
continue;
|
|
|
|
|
2013-08-29 15:49:05 +00:00
|
|
|
/*
|
|
|
|
* Perform the pmap_advise() before clearing
|
|
|
|
* PGA_REFERENCED in vm_page_advise(). Otherwise, a
|
|
|
|
* concurrent pmap operation, such as pmap_remove(),
|
|
|
|
* could clear a reference in the pmap and set
|
|
|
|
* PGA_REFERENCED on the page before the pmap_advise()
|
|
|
|
* had completed. Consequently, the page would appear
|
|
|
|
* referenced based upon an old reference that
|
|
|
|
* occurred before this pmap_advise() ran.
|
|
|
|
*/
|
|
|
|
if (behav == MADV_DONTNEED || behav == MADV_FREE)
|
|
|
|
pmap_advise(map->pmap, useStart, useEnd,
|
|
|
|
behav);
|
|
|
|
|
2019-11-25 02:19:47 +00:00
|
|
|
vm_object_madvise(entry->object.vm_object, pstart,
|
2012-03-19 18:47:34 +00:00
|
|
|
pend, behav);
|
2014-09-23 18:54:23 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pre-populate paging structures in the
|
|
|
|
* WILLNEED case. For wired entries, the
|
|
|
|
* paging structures are already populated.
|
|
|
|
*/
|
|
|
|
if (behav == MADV_WILLNEED &&
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->wired_count == 0) {
|
2003-11-03 16:14:45 +00:00
|
|
|
vm_map_pmap_enter(map,
|
2000-05-14 18:46:40 +00:00
|
|
|
useStart,
|
2019-11-25 02:19:47 +00:00
|
|
|
entry->protection,
|
|
|
|
entry->object.vm_object,
|
2012-03-19 18:47:34 +00:00
|
|
|
pstart,
|
|
|
|
ptoa(pend - pstart),
|
2001-10-31 03:06:33 +00:00
|
|
|
MAP_PREFAULT_MADVISE
|
1999-09-21 05:00:48 +00:00
|
|
|
);
|
1996-05-19 07:36:50 +00:00
|
|
|
}
|
|
|
|
}
|
1999-08-13 17:45:34 +00:00
|
|
|
vm_map_unlock_read(map);
|
1996-05-19 07:36:50 +00:00
|
|
|
}
|
2002-03-10 21:52:48 +00:00
|
|
|
return (0);
|
2003-11-03 16:14:45 +00:00
|
|
|
}
|
1996-05-19 07:36:50 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_inherit:
|
|
|
|
*
|
|
|
|
* Sets the inheritance of the specified address
|
|
|
|
* range in the target map. Inheritance
|
|
|
|
* affects how the map will be shared with
|
2008-12-31 05:44:05 +00:00
|
|
|
* child maps at the time of vmspace_fork.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
int
|
1997-08-25 22:15:31 +00:00
|
|
|
vm_map_inherit(vm_map_t map, vm_offset_t start, vm_offset_t end,
|
|
|
|
vm_inherit_t new_inheritance)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2020-09-09 22:02:30 +00:00
|
|
|
vm_map_entry_t entry, lentry, prev_entry, start_entry;
|
|
|
|
int rv;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
switch (new_inheritance) {
|
|
|
|
case VM_INHERIT_NONE:
|
|
|
|
case VM_INHERIT_COPY:
|
|
|
|
case VM_INHERIT_SHARE:
|
2017-03-14 17:10:42 +00:00
|
|
|
case VM_INHERIT_ZERO:
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
default:
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_INVALID_ARGUMENT);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2013-11-20 09:03:48 +00:00
|
|
|
if (start == end)
|
|
|
|
return (KERN_SUCCESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_map_lock(map);
|
|
|
|
VM_MAP_RANGE_CHECK(map, start, end);
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_map_lookup_clip_start(map, start, &start_entry, &prev_entry);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
goto unlock;
|
|
|
|
if (vm_map_lookup_entry(map, end - 1, &lentry)) {
|
|
|
|
rv = vm_map_clip_end(map, lentry, end);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
if (new_inheritance == VM_INHERIT_COPY) {
|
|
|
|
for (entry = start_entry; entry->start < end;
|
|
|
|
prev_entry = entry, entry = vm_map_entry_succ(entry)) {
|
|
|
|
if ((entry->eflags & MAP_ENTRY_SPLIT_BOUNDARY_MASK)
|
|
|
|
!= 0) {
|
|
|
|
rv = KERN_INVALID_ARGUMENT;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
for (entry = start_entry; entry->start < end; prev_entry = entry,
|
|
|
|
entry = vm_map_entry_succ(entry)) {
|
|
|
|
KASSERT(entry->end <= end, ("non-clipped entry %p end %jx %jx",
|
|
|
|
entry, (uintmax_t)entry->end, (uintmax_t)end));
|
2017-06-24 17:01:11 +00:00
|
|
|
if ((entry->eflags & MAP_ENTRY_GUARD) == 0 ||
|
|
|
|
new_inheritance != VM_INHERIT_ZERO)
|
|
|
|
entry->inheritance = new_inheritance;
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
2020-09-09 22:02:30 +00:00
|
|
|
unlock:
|
1994-05-24 10:09:53 +00:00
|
|
|
vm_map_unlock(map);
|
2020-09-09 22:02:30 +00:00
|
|
|
return (rv);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2019-07-19 20:47:35 +00:00
|
|
|
/*
|
|
|
|
* vm_map_entry_in_transition:
|
|
|
|
*
|
|
|
|
* Release the map lock, and sleep until the entry is no longer in
|
|
|
|
 * transition, then reacquire the map lock.  If the map changed while
|
|
|
|
* another held the lock, lookup a possibly-changed entry at or after the
|
|
|
|
* 'start' position of the old entry.
|
|
|
|
*/
|
|
|
|
static vm_map_entry_t
|
|
|
|
vm_map_entry_in_transition(vm_map_t map, vm_offset_t in_start,
|
|
|
|
vm_offset_t *io_end, bool holes_ok, vm_map_entry_t in_entry)
|
|
|
|
{
|
|
|
|
vm_map_entry_t entry;
|
|
|
|
vm_offset_t start;
|
|
|
|
u_int last_timestamp;
|
|
|
|
|
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
KASSERT((in_entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0,
|
|
|
|
("not in-transition map entry %p", in_entry));
|
|
|
|
/*
|
|
|
|
* We have not yet clipped the entry.
|
|
|
|
*/
|
|
|
|
start = MAX(in_start, in_entry->start);
|
|
|
|
in_entry->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
|
|
|
|
last_timestamp = map->timestamp;
|
|
|
|
if (vm_map_unlock_and_wait(map, 0)) {
|
|
|
|
/*
|
|
|
|
* Allow interruption of user wiring/unwiring?
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
vm_map_lock(map);
|
|
|
|
if (last_timestamp + 1 == map->timestamp)
|
|
|
|
return (in_entry);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Look again for the entry because the map was modified while it was
|
|
|
|
* unlocked. Specifically, the entry may have been clipped, merged, or
|
|
|
|
* deleted.
|
|
|
|
*/
|
|
|
|
if (!vm_map_lookup_entry(map, start, &entry)) {
|
|
|
|
if (!holes_ok) {
|
|
|
|
*io_end = start;
|
|
|
|
return (NULL);
|
|
|
|
}
|
2019-11-13 15:56:07 +00:00
|
|
|
entry = vm_map_entry_succ(entry);
|
2019-07-19 20:47:35 +00:00
|
|
|
}
|
|
|
|
return (entry);
|
|
|
|
}
|
|
|
|
|
2002-06-07 18:34:23 +00:00
|
|
|
/*
|
|
|
|
* vm_map_unwire:
|
|
|
|
*
|
2002-06-08 07:32:38 +00:00
|
|
|
* Implements both kernel and user unwiring.
|
2002-06-07 18:34:23 +00:00
|
|
|
*/
|
|
|
|
int
|
|
|
|
vm_map_unwire(vm_map_t map, vm_offset_t start, vm_offset_t end,
|
2003-08-11 07:14:08 +00:00
|
|
|
int flags)
|
2002-06-07 18:34:23 +00:00
|
|
|
{
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_entry_t entry, first_entry, next_entry, prev_entry;
|
2002-06-07 18:34:23 +00:00
|
|
|
int rv;
|
2019-11-20 16:06:48 +00:00
|
|
|
bool holes_ok, need_wakeup, user_unwire;
|
2002-06-07 18:34:23 +00:00
|
|
|
|
2013-11-20 09:03:48 +00:00
|
|
|
if (start == end)
|
|
|
|
return (KERN_SUCCESS);
|
2019-07-04 19:12:13 +00:00
|
|
|
holes_ok = (flags & VM_MAP_WIRE_HOLESOK) != 0;
|
|
|
|
user_unwire = (flags & VM_MAP_WIRE_USER) != 0;
|
2002-06-07 18:34:23 +00:00
|
|
|
vm_map_lock(map);
|
|
|
|
VM_MAP_RANGE_CHECK(map, start, end);
|
2019-06-26 03:12:57 +00:00
|
|
|
if (!vm_map_lookup_entry(map, start, &first_entry)) {
|
2019-07-04 19:12:13 +00:00
|
|
|
if (holes_ok)
|
2019-11-13 15:56:07 +00:00
|
|
|
first_entry = vm_map_entry_succ(first_entry);
|
2019-06-26 03:12:57 +00:00
|
|
|
else {
|
2003-08-11 07:14:08 +00:00
|
|
|
vm_map_unlock(map);
|
|
|
|
return (KERN_INVALID_ADDRESS);
|
|
|
|
}
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
2019-07-04 19:25:30 +00:00
|
|
|
rv = KERN_SUCCESS;
|
2019-11-20 16:06:48 +00:00
|
|
|
for (entry = first_entry; entry->start < end; entry = next_entry) {
|
2002-06-07 18:34:23 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_IN_TRANSITION) {
|
|
|
|
/*
|
|
|
|
* We have not yet clipped the entry.
|
|
|
|
*/
|
2019-11-20 16:06:48 +00:00
|
|
|
next_entry = vm_map_entry_in_transition(map, start,
|
|
|
|
&end, holes_ok, entry);
|
|
|
|
if (next_entry == NULL) {
|
|
|
|
if (entry == first_entry) {
|
2019-07-19 20:47:35 +00:00
|
|
|
vm_map_unlock(map);
|
|
|
|
return (KERN_INVALID_ADDRESS);
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
2019-07-19 20:47:35 +00:00
|
|
|
rv = KERN_INVALID_ADDRESS;
|
|
|
|
break;
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
first_entry = (entry == first_entry) ?
|
|
|
|
next_entry : NULL;
|
2002-06-07 18:34:23 +00:00
|
|
|
continue;
|
|
|
|
}
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_map_clip_start(map, entry, start);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
break;
|
|
|
|
rv = vm_map_clip_end(map, entry, end);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
break;
|
|
|
|
|
2002-06-07 18:34:23 +00:00
|
|
|
/*
|
|
|
|
* Mark the entry in case the map lock is released. (See
|
|
|
|
* above.)
|
|
|
|
*/
|
2013-11-20 08:47:54 +00:00
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IN_TRANSITION) == 0 &&
|
|
|
|
entry->wiring_thread == NULL,
|
|
|
|
("owned map entry %p", entry));
|
2002-06-07 18:34:23 +00:00
|
|
|
entry->eflags |= MAP_ENTRY_IN_TRANSITION;
|
2013-07-11 05:55:08 +00:00
|
|
|
entry->wiring_thread = curthread;
|
2019-11-20 16:06:48 +00:00
|
|
|
next_entry = vm_map_entry_succ(entry);
|
2002-06-07 18:34:23 +00:00
|
|
|
/*
|
|
|
|
* Check the map for holes in the specified region.
|
2019-07-04 19:12:13 +00:00
|
|
|
* If holes_ok, skip this check.
|
2002-06-07 18:34:23 +00:00
|
|
|
*/
|
2019-07-04 19:12:13 +00:00
|
|
|
if (!holes_ok &&
|
2019-11-20 16:06:48 +00:00
|
|
|
entry->end < end && next_entry->start > entry->end) {
|
2002-06-07 18:34:23 +00:00
|
|
|
end = entry->end;
|
|
|
|
rv = KERN_INVALID_ADDRESS;
|
2019-07-04 19:25:30 +00:00
|
|
|
break;
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
|
|
|
/*
|
2004-05-25 05:51:17 +00:00
|
|
|
* If system unwiring, require that the entry is system wired.
|
2002-06-07 18:34:23 +00:00
|
|
|
*/
|
2004-08-10 14:42:48 +00:00
|
|
|
if (!user_unwire &&
|
|
|
|
vm_map_entry_system_wired_count(entry) == 0) {
|
2002-06-07 18:34:23 +00:00
|
|
|
end = entry->end;
|
|
|
|
rv = KERN_INVALID_ARGUMENT;
|
2019-07-04 19:25:30 +00:00
|
|
|
break;
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
|
|
|
}
|
2019-07-04 19:12:13 +00:00
|
|
|
need_wakeup = false;
|
|
|
|
if (first_entry == NULL &&
|
|
|
|
!vm_map_lookup_entry(map, start, &first_entry)) {
|
|
|
|
KASSERT(holes_ok, ("vm_map_unwire: lookup failed"));
|
2019-11-20 16:06:48 +00:00
|
|
|
prev_entry = first_entry;
|
|
|
|
entry = vm_map_entry_succ(first_entry);
|
|
|
|
} else {
|
|
|
|
prev_entry = vm_map_entry_pred(first_entry);
|
|
|
|
entry = first_entry;
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
for (; entry->start < end;
|
|
|
|
prev_entry = entry, entry = vm_map_entry_succ(entry)) {
|
2013-07-11 05:55:08 +00:00
|
|
|
/*
|
2019-07-04 19:12:13 +00:00
|
|
|
* If holes_ok was specified, an empty
|
2013-07-11 05:55:08 +00:00
|
|
|
* space in the unwired region could have been mapped
|
|
|
|
* while the map lock was dropped for draining
|
|
|
|
* MAP_ENTRY_IN_TRANSITION. Moreover, another thread
|
|
|
|
* could be simultaneously wiring this new mapping
|
|
|
|
* entry. Detect these cases and skip any entries
|
|
|
|
* marked as in transition by us.
|
|
|
|
*/
|
|
|
|
if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) == 0 ||
|
|
|
|
entry->wiring_thread != curthread) {
|
2019-07-04 19:12:13 +00:00
|
|
|
KASSERT(holes_ok,
|
2013-07-11 05:55:08 +00:00
|
|
|
("vm_map_unwire: !HOLESOK and new/changed entry"));
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2004-05-25 05:51:17 +00:00
|
|
|
if (rv == KERN_SUCCESS && (!user_unwire ||
|
|
|
|
(entry->eflags & MAP_ENTRY_USER_WIRED))) {
|
When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap. If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.
To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.
At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates on a range of virtual addresses
rather than a single virtual page at a time. Moreover, by operating on
a range, it is superpage friendly. It doesn't waste time performing
unnecessary demotions.
Reported by: markj
Reviewed by: kib
Tested by: pho, jmg (arm)
Sponsored by: EMC / Isilon Storage Division
2014-07-26 18:10:18 +00:00
|
|
|
if (entry->wired_count == 1)
|
|
|
|
vm_map_entry_unwire(map, entry);
|
|
|
|
else
|
|
|
|
entry->wired_count--;
|
Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.
The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
short of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.
The choice to count virtual user-wired pages rather than physical
pages was done for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.
The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.
Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.
Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
|
|
|
if (user_unwire)
|
|
|
|
entry->eflags &= ~MAP_ENTRY_USER_WIRED;
|
2002-06-08 19:00:40 +00:00
|
|
|
}
|
2013-07-11 05:55:08 +00:00
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0,
|
2013-11-20 08:47:54 +00:00
|
|
|
("vm_map_unwire: in-transition flag missing %p", entry));
|
|
|
|
KASSERT(entry->wiring_thread == curthread,
|
|
|
|
("vm_map_unwire: alien wire %p", entry));
|
2002-06-07 18:34:23 +00:00
|
|
|
entry->eflags &= ~MAP_ENTRY_IN_TRANSITION;
|
2013-07-11 05:55:08 +00:00
|
|
|
entry->wiring_thread = NULL;
|
2002-06-07 18:34:23 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_NEEDS_WAKEUP) {
|
|
|
|
entry->eflags &= ~MAP_ENTRY_NEEDS_WAKEUP;
|
2019-07-04 19:12:13 +00:00
|
|
|
need_wakeup = true;
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
2002-06-07 18:34:23 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_try_merge_entries(map, prev_entry, entry);
|
2002-06-07 18:34:23 +00:00
|
|
|
vm_map_unlock(map);
|
|
|
|
if (need_wakeup)
|
|
|
|
vm_map_wakeup(map);
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
2019-05-13 16:38:48 +00:00
|
|
|
static void
|
|
|
|
vm_map_wire_user_count_sub(u_long npages)
|
|
|
|
{
|
|
|
|
|
|
|
|
atomic_subtract_long(&vm_user_wire_count, npages);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool
|
|
|
|
vm_map_wire_user_count_add(u_long npages)
|
|
|
|
{
|
|
|
|
u_long wired;
|
|
|
|
|
|
|
|
wired = vm_user_wire_count;
|
|
|
|
do {
|
|
|
|
if (npages + wired > vm_page_max_user_wired)
|
|
|
|
return (false);
|
|
|
|
} while (!atomic_fcmpset_long(&vm_user_wire_count, &wired,
|
|
|
|
npages + wired));
|
|
|
|
|
|
|
|
return (true);
|
|
|
|
}
|
|
|
|
|
2014-08-02 16:10:24 +00:00
|
|
|
/*
|
|
|
|
* vm_map_wire_entry_failure:
|
|
|
|
*
|
|
|
|
* Handle a wiring failure on the given entry.
|
|
|
|
*
|
|
|
|
* The map should be locked.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vm_map_wire_entry_failure(vm_map_t map, vm_map_entry_t entry,
|
|
|
|
vm_offset_t failed_addr)
|
|
|
|
{
|
|
|
|
|
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 &&
|
|
|
|
entry->wired_count == 1,
|
|
|
|
("vm_map_wire_entry_failure: entry %p isn't being wired", entry));
|
|
|
|
KASSERT(failed_addr < entry->end,
|
|
|
|
("vm_map_wire_entry_failure: entry %p was fully wired", entry));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If any pages at the start of this entry were successfully wired,
|
|
|
|
* then unwire them.
|
|
|
|
*/
|
|
|
|
if (failed_addr > entry->start) {
|
|
|
|
pmap_unwire(map->pmap, entry->start, failed_addr);
|
|
|
|
vm_object_unwire(entry->object.vm_object, entry->offset,
|
|
|
|
failed_addr - entry->start, PQ_ACTIVE);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Assign an out-of-range value to represent the failure to wire this
|
|
|
|
* entry.
|
|
|
|
*/
|
|
|
|
entry->wired_count = -1;
|
|
|
|
}
|
|
|
|
|
2019-05-13 16:38:48 +00:00
|
|
|
int
|
|
|
|
vm_map_wire(vm_map_t map, vm_offset_t start, vm_offset_t end, int flags)
|
|
|
|
{
|
|
|
|
int rv;
|
|
|
|
|
|
|
|
vm_map_lock(map);
|
|
|
|
rv = vm_map_wire_locked(map, start, end, flags);
|
|
|
|
vm_map_unlock(map);
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
2002-06-08 07:32:38 +00:00
|
|
|
/*
|
2019-05-13 16:38:48 +00:00
|
|
|
* vm_map_wire_locked:
|
2002-06-08 07:32:38 +00:00
|
|
|
*
|
2019-05-13 16:38:48 +00:00
|
|
|
 * Implements both kernel and user wiring.  Returns with the map locked;
|
|
|
|
* the map lock may be dropped.
|
2002-06-08 07:32:38 +00:00
|
|
|
*/
|
|
|
|
int
|
2019-05-13 16:38:48 +00:00
|
|
|
vm_map_wire_locked(vm_map_t map, vm_offset_t start, vm_offset_t end, int flags)
|
2002-06-08 07:32:38 +00:00
|
|
|
{
|
2019-11-20 16:06:48 +00:00
|
|
|
vm_map_entry_t entry, first_entry, next_entry, prev_entry;
|
2014-08-02 16:10:24 +00:00
|
|
|
vm_offset_t faddr, saved_end, saved_start;
|
2020-09-09 22:02:30 +00:00
|
|
|
u_long incr, npages;
|
|
|
|
u_int bidx, last_timestamp;
|
2002-06-09 20:25:18 +00:00
|
|
|
int rv;
|
2019-11-20 16:06:48 +00:00
|
|
|
bool holes_ok, need_wakeup, user_wire;
|
2011-03-21 09:40:01 +00:00
|
|
|
vm_prot_t prot;
|
2002-06-08 07:32:38 +00:00
|
|
|
|
2019-05-13 16:38:48 +00:00
|
|
|
VM_MAP_ASSERT_LOCKED(map);
|
|
|
|
|
2013-11-20 09:03:48 +00:00
|
|
|
if (start == end)
|
|
|
|
return (KERN_SUCCESS);
|
2011-03-21 09:40:01 +00:00
|
|
|
prot = 0;
|
|
|
|
if (flags & VM_MAP_WIRE_WRITE)
|
|
|
|
prot |= VM_PROT_WRITE;
|
2019-07-04 19:12:13 +00:00
|
|
|
holes_ok = (flags & VM_MAP_WIRE_HOLESOK) != 0;
|
|
|
|
user_wire = (flags & VM_MAP_WIRE_USER) != 0;
|
2002-06-09 20:25:18 +00:00
|
|
|
VM_MAP_RANGE_CHECK(map, start, end);
|
2019-06-26 03:12:57 +00:00
|
|
|
if (!vm_map_lookup_entry(map, start, &first_entry)) {
|
2019-07-04 19:12:13 +00:00
|
|
|
if (holes_ok)
|
2019-11-13 15:56:07 +00:00
|
|
|
first_entry = vm_map_entry_succ(first_entry);
|
2019-06-26 03:12:57 +00:00
|
|
|
else
|
2003-08-11 07:14:08 +00:00
|
|
|
return (KERN_INVALID_ADDRESS);
|
2002-06-09 20:25:18 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
for (entry = first_entry; entry->start < end; entry = next_entry) {
|
2002-06-09 20:25:18 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_IN_TRANSITION) {
|
|
|
|
/*
|
|
|
|
* We have not yet clipped the entry.
|
|
|
|
*/
|
2019-11-20 16:06:48 +00:00
|
|
|
next_entry = vm_map_entry_in_transition(map, start,
|
|
|
|
&end, holes_ok, entry);
|
|
|
|
if (next_entry == NULL) {
|
|
|
|
if (entry == first_entry)
|
2019-07-19 20:47:35 +00:00
|
|
|
return (KERN_INVALID_ADDRESS);
|
|
|
|
rv = KERN_INVALID_ADDRESS;
|
|
|
|
goto done;
|
2002-06-09 20:25:18 +00:00
|
|
|
}
|
2019-11-20 16:06:48 +00:00
|
|
|
first_entry = (entry == first_entry) ?
|
|
|
|
next_entry : NULL;
|
2002-06-09 20:25:18 +00:00
|
|
|
continue;
|
|
|
|
}
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_map_clip_start(map, entry, start);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
goto done;
|
|
|
|
rv = vm_map_clip_end(map, entry, end);
|
|
|
|
if (rv != KERN_SUCCESS)
|
|
|
|
goto done;
|
|
|
|
|
2002-06-09 20:25:18 +00:00
|
|
|
/*
|
|
|
|
* Mark the entry in case the map lock is released. (See
|
|
|
|
* above.)
|
|
|
|
*/
|
2013-11-20 08:47:54 +00:00
|
|
|
KASSERT((entry->eflags & MAP_ENTRY_IN_TRANSITION) == 0 &&
|
|
|
|
entry->wiring_thread == NULL,
|
|
|
|
("owned map entry %p", entry));
|
2002-06-09 20:25:18 +00:00
|
|
|
entry->eflags |= MAP_ENTRY_IN_TRANSITION;
|
2013-07-11 05:55:08 +00:00
|
|
|
entry->wiring_thread = curthread;
|
2011-03-21 09:40:01 +00:00
|
|
|
if ((entry->protection & (VM_PROT_READ | VM_PROT_EXECUTE)) == 0
|
|
|
|
|| (entry->protection & prot) != prot) {
|
|
|
|
entry->eflags |= MAP_ENTRY_WIRE_SKIPPED;
|
2019-07-04 19:12:13 +00:00
|
|
|
if (!holes_ok) {
|
2011-03-21 09:40:01 +00:00
|
|
|
end = entry->end;
|
|
|
|
rv = KERN_INVALID_ADDRESS;
|
|
|
|
goto done;
|
2009-04-10 10:16:03 +00:00
|
|
|
}
|
2019-07-03 22:41:54 +00:00
|
|
|
} else if (entry->wired_count == 0) {
|
2004-08-10 14:42:48 +00:00
|
|
|
entry->wired_count++;
|
2019-05-13 16:38:48 +00:00
|
|
|
|
|
|
|
npages = atop(entry->end - entry->start);
|
|
|
|
if (user_wire && !vm_map_wire_user_count_add(npages)) {
|
|
|
|
vm_map_wire_entry_failure(map, entry,
|
|
|
|
entry->start);
|
|
|
|
end = entry->end;
|
|
|
|
rv = KERN_RESOURCE_SHORTAGE;
|
|
|
|
goto done;
|
|
|
|
}
|
2014-08-02 16:10:24 +00:00
|
|
|
|
2002-06-09 20:25:18 +00:00
|
|
|
/*
|
|
|
|
* Release the map lock, relying on the in-transition
|
2010-12-09 21:02:22 +00:00
|
|
|
* mark. Mark the map busy for fork.
|
2002-06-09 20:25:18 +00:00
|
|
|
*/
|
2019-05-13 16:38:48 +00:00
|
|
|
saved_start = entry->start;
|
|
|
|
saved_end = entry->end;
|
2019-07-19 20:47:35 +00:00
|
|
|
last_timestamp = map->timestamp;
|
2020-09-09 22:02:30 +00:00
|
|
|
bidx = (entry->eflags & MAP_ENTRY_SPLIT_BOUNDARY_MASK)
|
|
|
|
>> MAP_ENTRY_SPLIT_BOUNDARY_SHIFT;
|
|
|
|
incr = pagesizes[bidx];
|
2010-12-09 21:02:22 +00:00
|
|
|
vm_map_busy(map);
|
2002-06-09 20:25:18 +00:00
|
|
|
vm_map_unlock(map);
|
2014-08-02 16:10:24 +00:00
|
|
|
|
2020-09-09 22:02:30 +00:00
|
|
|
for (faddr = saved_start; faddr < saved_end;
|
|
|
|
faddr += incr) {
|
2014-08-02 16:10:24 +00:00
|
|
|
/*
|
|
|
|
* Simulate a fault to get the page and enter
|
|
|
|
* it into the physical map.
|
|
|
|
*/
|
2020-09-09 22:02:30 +00:00
|
|
|
rv = vm_fault(map, faddr, VM_PROT_NONE,
|
|
|
|
VM_FAULT_WIRE, NULL);
|
|
|
|
if (rv != KERN_SUCCESS)
|
2014-08-02 16:10:24 +00:00
|
|
|
break;
|
2020-09-09 22:02:30 +00:00
|
|
|
}
|
2002-06-09 20:25:18 +00:00
|
|
|
vm_map_lock(map);
|
2010-12-09 21:02:22 +00:00
|
|
|
vm_map_unbusy(map);
|
2002-06-11 19:13:59 +00:00
|
|
|
if (last_timestamp + 1 != map->timestamp) {
|
2002-06-09 20:25:18 +00:00
|
|
|
/*
|
|
|
|
* Look again for the entry because the map was
|
|
|
|
* modified while it was unlocked. The entry
|
|
|
|
* may have been clipped, but NOT merged or
|
|
|
|
* deleted.
|
|
|
|
*/
|
2019-07-04 19:12:13 +00:00
|
|
|
if (!vm_map_lookup_entry(map, saved_start,
|
2019-11-20 16:06:48 +00:00
|
|
|
&next_entry))
|
2019-07-04 19:12:13 +00:00
|
|
|
KASSERT(false,
|
|
|
|
("vm_map_wire: lookup failed"));
|
2019-11-20 16:06:48 +00:00
|
|
|
first_entry = (entry == first_entry) ?
|
|
|
|
next_entry : NULL;
|
|
|
|
for (entry = next_entry; entry->end < saved_end;
|
|
|
|
entry = vm_map_entry_succ(entry)) {
|
2014-08-02 16:10:24 +00:00
|
|
|
/*
|
|
|
|
* In case of failure, handle entries
|
|
|
|
* that were not fully wired here;
|
|
|
|
* fully wired entries are handled
|
|
|
|
* later.
|
|
|
|
*/
|
|
|
|
if (rv != KERN_SUCCESS &&
|
|
|
|
faddr < entry->end)
|
|
|
|
vm_map_wire_entry_failure(map,
|
|
|
|
entry, faddr);
|
2002-06-11 19:13:59 +00:00
|
|
|
}
|
2002-06-09 20:25:18 +00:00
|
|
|
}
|
|
|
|
if (rv != KERN_SUCCESS) {
|
2014-08-02 16:10:24 +00:00
|
|
|
vm_map_wire_entry_failure(map, entry, faddr);
|
2019-05-13 16:38:48 +00:00
|
|
|
if (user_wire)
|
|
|
|
vm_map_wire_user_count_sub(npages);
|
2002-06-09 20:25:18 +00:00
|
|
|
end = entry->end;
|
|
|
|
goto done;
|
|
|
|
}
|
2004-08-10 14:42:48 +00:00
|
|
|
} else if (!user_wire ||
|
|
|
|
(entry->eflags & MAP_ENTRY_USER_WIRED) == 0) {
|
|
|
|
entry->wired_count++;
|
2002-06-09 20:25:18 +00:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Check the map for holes in the specified region.
|
2019-07-04 19:12:13 +00:00
|
|
|
* If holes_ok was specified, skip this check.
|
2002-06-09 20:25:18 +00:00
|
|
|
*/
|
2019-11-20 16:06:48 +00:00
|
|
|
next_entry = vm_map_entry_succ(entry);
|
2019-07-04 19:12:13 +00:00
|
|
|
if (!holes_ok &&
|
2019-11-20 16:06:48 +00:00
|
|
|
entry->end < end && next_entry->start > entry->end) {
|
2002-06-09 20:25:18 +00:00
|
|
|
end = entry->end;
|
|
|
|
rv = KERN_INVALID_ADDRESS;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rv = KERN_SUCCESS;
|
|
|
|
done:
|
2019-07-04 19:12:13 +00:00
|
|
|
need_wakeup = false;
|
|
|
|
if (first_entry == NULL &&
|
|
|
|
!vm_map_lookup_entry(map, start, &first_entry)) {
|
|
|
|
KASSERT(holes_ok, ("vm_map_wire: lookup failed"));
|
2019-11-20 16:06:48 +00:00
|
|
|
prev_entry = first_entry;
|
|
|
|
entry = vm_map_entry_succ(first_entry);
|
|
|
|
} else {
|
|
|
|
prev_entry = vm_map_entry_pred(first_entry);
|
|
|
|
entry = first_entry;
|
2002-06-09 20:25:18 +00:00
|
|
|
}
	for (; entry->start < end;
	    prev_entry = entry, entry = vm_map_entry_succ(entry)) {
		/*
		 * If holes_ok was specified, an empty
		 * space in the unwired region could have been mapped
		 * while the map lock was dropped for faulting in the
		 * pages or draining MAP_ENTRY_IN_TRANSITION.
		 * Moreover, another thread could be simultaneously
		 * wiring this new mapping entry.  Detect these cases
		 * and skip any entries marked as in transition not by us.
		 *
		 * Another way to get an entry not marked with
		 * MAP_ENTRY_IN_TRANSITION is after failed clipping,
		 * which set rv to KERN_INVALID_ARGUMENT.
		 */
		if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) == 0 ||
		    entry->wiring_thread != curthread) {
			KASSERT(holes_ok || rv == KERN_INVALID_ARGUMENT,
			    ("vm_map_wire: !HOLESOK and new/changed entry"));
			continue;
		}

		if ((entry->eflags & MAP_ENTRY_WIRE_SKIPPED) != 0) {
			/* do nothing */
		} else if (rv == KERN_SUCCESS) {
			if (user_wire)
				entry->eflags |= MAP_ENTRY_USER_WIRED;
		} else if (entry->wired_count == -1) {
			/*
			 * Wiring failed on this entry.  Thus, unwiring is
			 * unnecessary.
			 */
			entry->wired_count = 0;
		} else if (!user_wire ||
		    (entry->eflags & MAP_ENTRY_USER_WIRED) == 0) {
			/*
			 * Undo the wiring.  Wiring succeeded on this entry
			 * but failed on a later entry.
			 */
			if (entry->wired_count == 1) {
				vm_map_entry_unwire(map, entry);
				if (user_wire)
					vm_map_wire_user_count_sub(
					    atop(entry->end - entry->start));
			} else
				entry->wired_count--;
		}
		KASSERT((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0,
		    ("vm_map_wire: in-transition flag missing %p", entry));
		KASSERT(entry->wiring_thread == curthread,
		    ("vm_map_wire: alien wire %p", entry));
		entry->eflags &= ~(MAP_ENTRY_IN_TRANSITION |
		    MAP_ENTRY_WIRE_SKIPPED);
		entry->wiring_thread = NULL;
		if (entry->eflags & MAP_ENTRY_NEEDS_WAKEUP) {
			entry->eflags &= ~MAP_ENTRY_NEEDS_WAKEUP;
			need_wakeup = true;
		}
		vm_map_try_merge_entries(map, prev_entry, entry);
	}
	vm_map_try_merge_entries(map, prev_entry, entry);
	if (need_wakeup)
		vm_map_wakeup(map);
	return (rv);
}

/*
 * vm_map_sync
 *
 *	Push any dirty cached pages in the address range to their pager.
 *	If syncio is TRUE, dirty pages are written synchronously.
 *	If invalidate is TRUE, any cached pages are freed as well.
 *
 *	If the size of the region from start to end is zero, we are
 *	supposed to flush all modified pages within the region containing
 *	start.  Unfortunately, a region can be split or coalesced with
 *	neighboring regions, making it difficult to determine what the
 *	original region was.  Therefore, we approximate this requirement by
 *	flushing the current region containing start.
 *
 *	Returns an error if any part of the specified range is not mapped.
 */
int
vm_map_sync(
	vm_map_t map,
	vm_offset_t start,
	vm_offset_t end,
	boolean_t syncio,
	boolean_t invalidate)
{
	vm_map_entry_t entry, first_entry, next_entry;
	vm_size_t size;
	vm_object_t object;
	vm_ooffset_t offset;
	unsigned int last_timestamp;
	int bdry_idx;
	boolean_t failed;

	vm_map_lock_read(map);
	VM_MAP_RANGE_CHECK(map, start, end);
	if (!vm_map_lookup_entry(map, start, &first_entry)) {
		vm_map_unlock_read(map);
		return (KERN_INVALID_ADDRESS);
	} else if (start == end) {
		start = first_entry->start;
		end = first_entry->end;
	}

	/*
	 * Make a first pass to check for user-wired memory, holes,
	 * and partial invalidation of largepage mappings.
	 */
	for (entry = first_entry; entry->start < end; entry = next_entry) {
		if (invalidate) {
			if ((entry->eflags & MAP_ENTRY_USER_WIRED) != 0) {
				vm_map_unlock_read(map);
				return (KERN_INVALID_ARGUMENT);
			}
			bdry_idx = (entry->eflags &
			    MAP_ENTRY_SPLIT_BOUNDARY_MASK) >>
			    MAP_ENTRY_SPLIT_BOUNDARY_SHIFT;
			if (bdry_idx != 0 &&
			    ((start & (pagesizes[bdry_idx] - 1)) != 0 ||
			    (end & (pagesizes[bdry_idx] - 1)) != 0)) {
				vm_map_unlock_read(map);
				return (KERN_INVALID_ARGUMENT);
			}
		}
		next_entry = vm_map_entry_succ(entry);
		if (end > entry->end &&
		    entry->end != next_entry->start) {
			vm_map_unlock_read(map);
			return (KERN_INVALID_ADDRESS);
		}
	}

	if (invalidate)
		pmap_remove(map->pmap, start, end);
	failed = FALSE;

	/*
	 * Make a second pass, cleaning/uncaching pages from the indicated
	 * objects as we go.
	 */
	for (entry = first_entry; entry->start < end;) {
		offset = entry->offset + (start - entry->start);
		size = (end <= entry->end ? end : entry->end) - start;
		if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0) {
			vm_map_t smap;
			vm_map_entry_t tentry;
			vm_size_t tsize;

			smap = entry->object.sub_map;
			vm_map_lock_read(smap);
			(void) vm_map_lookup_entry(smap, offset, &tentry);
			tsize = tentry->end - offset;
			if (tsize < size)
				size = tsize;
			object = tentry->object.vm_object;
			offset = tentry->offset + (offset - tentry->start);
			vm_map_unlock_read(smap);
		} else {
			object = entry->object.vm_object;
		}
		vm_object_reference(object);
		last_timestamp = map->timestamp;
		vm_map_unlock_read(map);
		if (!vm_object_sync(object, offset, size, syncio, invalidate))
			failed = TRUE;
		start += size;
		vm_object_deallocate(object);
		vm_map_lock_read(map);
		if (last_timestamp == map->timestamp ||
		    !vm_map_lookup_entry(map, start, &entry))
			entry = vm_map_entry_succ(entry);
	}

	vm_map_unlock_read(map);
	return (failed ? KERN_FAILURE : KERN_SUCCESS);
}

/*
 *	vm_map_entry_unwire:	[ internal use only ]
 *
 *	Make the region specified by this entry pageable.
 *
 *	The map in question should be locked.
 *	[This is the reason for this routine's existence.]
 */
static void
vm_map_entry_unwire(vm_map_t map, vm_map_entry_t entry)
{
	vm_size_t size;

	VM_MAP_ASSERT_LOCKED(map);
	KASSERT(entry->wired_count > 0,
	    ("vm_map_entry_unwire: entry %p isn't wired", entry));

	size = entry->end - entry->start;
	if ((entry->eflags & MAP_ENTRY_USER_WIRED) != 0)
		vm_map_wire_user_count_sub(atop(size));
	pmap_unwire(map->pmap, entry->start, entry->end);
	vm_object_unwire(entry->object.vm_object, entry->offset, size,
	    PQ_ACTIVE);
	entry->wired_count = 0;
}

static void
vm_map_entry_deallocate(vm_map_entry_t entry, boolean_t system_map)
{

	if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0)
		vm_object_deallocate(entry->object.vm_object);
	uma_zfree(system_map ? kmapentzone : mapentzone, entry);
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_entry_delete: [ internal use only ]
|
|
|
|
*
|
|
|
|
* Deallocate the given entry from the target map.
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
1996-12-07 07:44:05 +00:00
|
|
|
static void
|
2001-07-04 20:15:18 +00:00
|
|
|
vm_map_entry_delete(vm_map_t map, vm_map_entry_t entry)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2003-11-05 05:48:22 +00:00
|
|
|
vm_object_t object;
|
2020-06-25 15:21:21 +00:00
|
|
|
vm_pindex_t offidxstart, offidxend, size1;
|
2017-03-14 19:39:17 +00:00
|
|
|
vm_size_t size;
|
2003-11-05 05:48:22 +00:00
|
|
|
|
Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_unlink(map, entry, UNLINK_MERGE_NONE);
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
|
|
|
object = entry->object.vm_object;
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
	if ((entry->eflags & MAP_ENTRY_GUARD) != 0) {
		MPASS(entry->cred == NULL);
		MPASS((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0);
		MPASS(object == NULL);
		vm_map_entry_deallocate(entry, map->system_map);
		return;
	}

	size = entry->end - entry->start;
	map->size -= size;

	if (entry->cred != NULL) {
		swap_release_by_cred(size, entry->cred);
		crfree(entry->cred);
	}

	if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0 || object == NULL) {
		entry->object.vm_object = NULL;
	} else if ((object->flags & OBJ_ANON) != 0 ||
	    object == kernel_object) {
		KASSERT(entry->cred == NULL || object->cred == NULL ||
		    (entry->eflags & MAP_ENTRY_NEEDS_COPY),
		    ("OVERCOMMIT vm_map_entry_delete: both cred %p", entry));
		offidxstart = OFF_TO_IDX(entry->offset);
		offidxend = offidxstart + atop(size);
		VM_OBJECT_WLOCK(object);
		if (object->ref_count != 1 &&
		    ((object->flags & OBJ_ONEMAPPING) != 0 ||
		    object == kernel_object)) {
			vm_object_collapse(object);

			/*
			 * The option OBJPR_NOTMAPPED can be passed here
			 * because vm_map_delete() already performed
			 * pmap_remove() on the only mapping to this range
			 * of pages.
			 */
			vm_object_page_remove(object, offidxstart, offidxend,
			    OBJPR_NOTMAPPED);
			if (offidxend >= object->size &&
			    offidxstart < object->size) {
				size1 = object->size;
				object->size = offidxstart;
				if (object->cred != NULL) {
					size1 -= object->size;
					KASSERT(object->charge >= ptoa(size1),
					    ("object %p charge < 0", object));
					swap_release_by_cred(ptoa(size1),
					    object->cred);
					object->charge -= ptoa(size1);
				}
			}
		}
		VM_OBJECT_WUNLOCK(object);
	}
	if (map->system_map)
		vm_map_entry_deallocate(entry, TRUE);
	else {
		entry->defer_next = curthread->td_map_def_user;
		curthread->td_map_def_user = entry;
	}
}

/*
 *	vm_map_delete:	[ internal use only ]
 *
 *	Deallocates the given address range from the target
 *	map.
 */
int
vm_map_delete(vm_map_t map, vm_offset_t start, vm_offset_t end)
{
	vm_map_entry_t entry, next_entry, scratch_entry;
	int rv;

	VM_MAP_ASSERT_LOCKED(map);

	if (start == end)
		return (KERN_SUCCESS);

	/*
	 * Find the start of the region, and clip it.
	 * Step through all entries in this region.
	 */
	rv = vm_map_lookup_clip_start(map, start, &entry, &scratch_entry);
	if (rv != KERN_SUCCESS)
		return (rv);
	for (; entry->start < end; entry = next_entry) {
		/*
		 * Wait for wiring or unwiring of an entry to complete.
		 * Also wait for any system wirings to disappear on
		 * user maps.
		 */
		if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 ||
		    (vm_map_pmap(map) != kernel_pmap &&
		    vm_map_entry_system_wired_count(entry) != 0)) {
			unsigned int last_timestamp;
			vm_offset_t saved_start;

			saved_start = entry->start;
			entry->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
			last_timestamp = map->timestamp;
			(void) vm_map_unlock_and_wait(map, 0);
			vm_map_lock(map);
			if (last_timestamp + 1 != map->timestamp) {
				/*
				 * Look again for the entry because the map was
				 * modified while it was unlocked.
				 * Specifically, the entry may have been
				 * clipped, merged, or deleted.
				 */
				rv = vm_map_lookup_clip_start(map, saved_start,
				    &next_entry, &scratch_entry);
				if (rv != KERN_SUCCESS)
					break;
			} else
				next_entry = entry;
			continue;
		}

		/* XXXKIB or delete to the upper superpage boundary ? */
		rv = vm_map_clip_end(map, entry, end);
		if (rv != KERN_SUCCESS)
			break;
		next_entry = vm_map_entry_succ(entry);

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
		/*
		 * Unwire before removing addresses from the pmap; otherwise,
		 * unwiring will put the entries back in the pmap.
		 */
		if (entry->wired_count != 0)
			vm_map_entry_unwire(map, entry);

		/*
		 * Remove mappings for the pages, but only if the
		 * mappings could exist.  For instance, it does not
		 * make sense to call pmap_remove() for guard entries.
		 */
		if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0 ||
		    entry->object.vm_object != NULL)
			pmap_remove(map->pmap, entry->start, entry->end);

Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
		if (entry->end == map->anon_loc)
			map->anon_loc = entry->start;

		/*
		 * Delete the entry only after removing all pmap
		 * entries pointing to its pages.  (Otherwise, its
		 * page frames may be reallocated, and any modify bits
		 * will be set in the wrong object!)
		 */
		vm_map_entry_delete(map, entry);
	}
	return (rv);
}

/*
 *	vm_map_remove:
 *
 *	Remove the given address range from the target map.
 *	This is the exported form of vm_map_delete.
 */
int
vm_map_remove(vm_map_t map, vm_offset_t start, vm_offset_t end)
{
	int result;

	vm_map_lock(map);
	VM_MAP_RANGE_CHECK(map, start, end);
	result = vm_map_delete(map, start, end);
	vm_map_unlock(map);
	return (result);
}

/*
 *	vm_map_check_protection:
 *
 *	Assert that the target map allows the specified privilege on the
 *	entire address region given.  The entire region must be allocated.
 *
 *	WARNING!  This code does not and should not check whether the
 *	contents of the region is accessible.  For example a smaller file
 *	might be mapped into a larger address space.
 *
 *	NOTE!  This code is also called by munmap().
 *
 *	The map must be locked.  A read lock is sufficient.
 */
boolean_t
vm_map_check_protection(vm_map_t map, vm_offset_t start, vm_offset_t end,
			vm_prot_t protection)
{
	vm_map_entry_t entry;
	vm_map_entry_t tmp_entry;

	if (!vm_map_lookup_entry(map, start, &tmp_entry))
		return (FALSE);
	entry = tmp_entry;

	while (start < end) {
		/*
		 * No holes allowed!
		 */
		if (start < entry->start)
			return (FALSE);
		/*
		 * Check protection associated with entry.
		 */
		if ((entry->protection & protection) != protection)
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
			return (FALSE);
		/* go to next entry */
		start = entry->end;
		entry = vm_map_entry_succ(entry);
	}
	return (TRUE);
}

/*
 *
 *	vm_map_copy_swap_object:
 *
 *	Copies a swap-backed object from an existing map entry to a
 *	new one.  Carries forward the swap charge.  May change the
 *	src object on return.
 */
static void
vm_map_copy_swap_object(vm_map_entry_t src_entry, vm_map_entry_t dst_entry,
    vm_offset_t size, vm_ooffset_t *fork_charge)
{
	vm_object_t src_object;
	struct ucred *cred;
	int charged;

	src_object = src_entry->object.vm_object;
	charged = ENTRY_CHARGED(src_entry);
	if ((src_object->flags & OBJ_ANON) != 0) {
		VM_OBJECT_WLOCK(src_object);
		vm_object_collapse(src_object);
		if ((src_object->flags & OBJ_ONEMAPPING) != 0) {
			vm_object_split(src_entry);
			src_object = src_entry->object.vm_object;
		}
		vm_object_reference_locked(src_object);
		vm_object_clear_flag(src_object, OBJ_ONEMAPPING);
		VM_OBJECT_WUNLOCK(src_object);
	} else
		vm_object_reference(src_object);
	if (src_entry->cred != NULL &&
	    !(src_entry->eflags & MAP_ENTRY_NEEDS_COPY)) {
		KASSERT(src_object->cred == NULL,
		    ("OVERCOMMIT: vm_map_copy_anon_entry: cred %p",
		    src_object));
		src_object->cred = src_entry->cred;
		src_object->charge = size;
	}
	dst_entry->object.vm_object = src_object;
	if (charged) {
		cred = curthread->td_ucred;
		crhold(cred);
		dst_entry->cred = cred;
		*fork_charge += size;
		if (!(src_entry->eflags & MAP_ENTRY_NEEDS_COPY)) {
			crhold(cred);
			src_entry->cred = cred;
			*fork_charge += size;
		}
	}
}

/*
 *	vm_map_copy_entry:
 *
 *	Copies the contents of the source entry to the destination
 *	entry.  The entries *must* be aligned properly.
 */
static void
vm_map_copy_entry(
	vm_map_t src_map,
	vm_map_t dst_map,
	vm_map_entry_t src_entry,
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation is performed by
kernel, e.g. md(4).
Several syscalls, among them fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
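The charge hand-off described above (entry to object on allocation or first fault, back to entry granularity on a COW fork) can be sketched in userspace. All names here (`swap_cred`, `swap_obj`, `swap_entry`, `charge_to_object`, `fork_entry`) are hypothetical stand-ins; the kernel keeps the charge in `vm_map_entry->cred` or `vm_object->cred`/`charge`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the per-uid swap accounting. */
struct swap_cred { long reserved; };		/* per-uid reservation */
struct swap_obj  { struct swap_cred *cred; long charge; };
struct swap_entry {
	struct swap_cred *cred;			/* NULL once moved to obj */
	long size;
	struct swap_obj *obj;
};

/* On object allocation (mmap) or first fault, move charge entry -> object. */
static void
charge_to_object(struct swap_entry *e, struct swap_obj *o)
{
	o->cred = e->cred;
	o->charge = e->size;
	e->obj = o;
	e->cred = NULL;
}

/*
 * On fork with COW, accounting returns to entry granularity: the child
 * entry is charged, and the parent entry is charged again if it no
 * longer carries a charge itself, since either side may later take a
 * private copy.  Returns the total added charge.
 */
static long
fork_entry(struct swap_entry *src, struct swap_entry *dst,
    struct swap_cred *cred)
{
	long charged = 0;

	dst->obj = src->obj;		/* shared copy-on-write */
	dst->cred = cred;
	charged += src->size;
	if (src->cred == NULL) {	/* parent side charged again */
		src->cred = cred;
		charged += src->size;
	}
	return (charged);
}
```

This is only a model of the bookkeeping, not kernel code; the real path also involves `crhold()` reference counting on the ucred.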
	vm_map_entry_t dst_entry,
	vm_ooffset_t *fork_charge)
{
	vm_object_t src_object;
	vm_map_entry_t fake_entry;
	vm_offset_t size;

	VM_MAP_ASSERT_LOCKED(dst_map);

	if ((dst_entry->eflags|src_entry->eflags) & MAP_ENTRY_IS_SUB_MAP)
		return;

	if (src_entry->wired_count == 0 ||
	    (src_entry->protection & VM_PROT_WRITE) == 0) {
		/*
		 * If the source entry is marked needs_copy, it is already
		 * write-protected.
		 */
		if ((src_entry->eflags & MAP_ENTRY_NEEDS_COPY) == 0 &&
		    (src_entry->protection & VM_PROT_WRITE) != 0) {
VM level code cleanups.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
			pmap_protect(src_map->pmap,
			    src_entry->start,
			    src_entry->end,
			    src_entry->protection & ~VM_PROT_WRITE);
		}

		/*
		 * Make a copy of the object.
		 */
		size = src_entry->end - src_entry->start;
		if ((src_object = src_entry->object.vm_object) != NULL) {
			if (src_object->type == OBJT_DEFAULT ||
			    src_object->type == OBJT_SWAP) {
				vm_map_copy_swap_object(src_entry, dst_entry,
				    size, fork_charge);
				/* May have split/collapsed, reload obj. */
				src_object = src_entry->object.vm_object;
			} else {
				vm_object_reference(src_object);
				dst_entry->object.vm_object = src_object;
			}
			src_entry->eflags |= MAP_ENTRY_COW |
			    MAP_ENTRY_NEEDS_COPY;
			dst_entry->eflags |= MAP_ENTRY_COW |
			    MAP_ENTRY_NEEDS_COPY;
			dst_entry->offset = src_entry->offset;
			if (src_entry->eflags & MAP_ENTRY_WRITECNT) {
				/*
				 * MAP_ENTRY_WRITECNT cannot
				 * indicate write reference from
				 * src_entry, since the entry is
				 * marked as needs copy.  Allocate a
				 * fake entry that is used to
				 * decrement object->un_pager writecount
				 * at the appropriate time.  Attach
				 * fake_entry to the deferred list.
				 */
				fake_entry = vm_map_entry_create(dst_map);
				fake_entry->eflags = MAP_ENTRY_WRITECNT;
				src_entry->eflags &= ~MAP_ENTRY_WRITECNT;
				vm_object_reference(src_object);
				fake_entry->object.vm_object = src_object;
				fake_entry->start = src_entry->start;
				fake_entry->end = src_entry->end;
				fake_entry->defer_next =
				    curthread->td_map_def_user;
				curthread->td_map_def_user = fake_entry;
			}

			pmap_copy(dst_map->pmap, src_map->pmap,
			    dst_entry->start, dst_entry->end - dst_entry->start,
			    src_entry->start);
		} else {
			dst_entry->object.vm_object = NULL;
			dst_entry->offset = 0;
			if (src_entry->cred != NULL) {
				dst_entry->cred = curthread->td_ucred;
				crhold(dst_entry->cred);
				*fork_charge += size;
			}
		}
	} else {
		/*
		 * We don't want to make writeable wired pages copy-on-write.
		 * Immediately copy these pages into the new map by simulating
		 * page faults.  The new pages are pageable.
		 */
		vm_fault_copy_entry(dst_map, src_map, dst_entry, src_entry,
		    fork_charge);
	}
}

/*
 * vmspace_map_entry_forked:
 * Update the newly-forked vmspace each time a map entry is inherited
 * or copied.  The values for vm_dsize and vm_tsize are approximate
 * (and mostly-obsolete ideas in the face of mmap(2) et al.)
 */
static void
vmspace_map_entry_forked(const struct vmspace *vm1, struct vmspace *vm2,
    vm_map_entry_t entry)
{
	vm_size_t entrysize;
	vm_offset_t newend;

Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitly track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As a result, it is impossible for a random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
	if ((entry->eflags & MAP_ENTRY_GUARD) != 0)
		return;
	entrysize = entry->end - entry->start;
	vm2->vm_map.size += entrysize;
	if (entry->eflags & (MAP_ENTRY_GROWS_DOWN | MAP_ENTRY_GROWS_UP)) {
		vm2->vm_ssize += btoc(entrysize);
	} else if (entry->start >= (vm_offset_t)vm1->vm_daddr &&
	    entry->start < (vm_offset_t)vm1->vm_daddr + ctob(vm1->vm_dsize)) {
		newend = MIN(entry->end,
		    (vm_offset_t)vm1->vm_daddr + ctob(vm1->vm_dsize));
		vm2->vm_dsize += btoc(newend - entry->start);
	} else if (entry->start >= (vm_offset_t)vm1->vm_taddr &&
	    entry->start < (vm_offset_t)vm1->vm_taddr + ctob(vm1->vm_tsize)) {
		newend = MIN(entry->end,
		    (vm_offset_t)vm1->vm_taddr + ctob(vm1->vm_tsize));
		vm2->vm_tsize += btoc(newend - entry->start);
	}
}

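The dsize/tsize updates in vmspace_map_entry_forked() are plain interval clipping: an entry counts toward a region only if it starts inside it, and only the overlapping part is counted. A standalone sketch (hypothetical helper name, plain integers in place of `vm_offset_t`, division standing in for `btoc()`):

```c
#include <stddef.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Pages contributed by [entry_start, entry_end) to the region
 * [region_start, region_end), using the same clipping rule as
 * vmspace_map_entry_forked().
 */
static size_t
forked_region_pages(uintptr_t entry_start, uintptr_t entry_end,
    uintptr_t region_start, uintptr_t region_end, size_t page_size)
{
	uintptr_t newend;

	if (entry_start < region_start || entry_start >= region_end)
		return (0);	/* entry does not begin inside the region */
	newend = MIN(entry_end, region_end);	/* clip to the region end */
	return ((newend - entry_start) / page_size);	/* btoc() analogue */
}
```

Note the approximation the code comment admits to: an entry that merely begins before the region contributes nothing, which is acceptable for the mostly-obsolete vm_dsize/vm_tsize statistics.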
/*
 * vmspace_fork:
 * Create a new process vmspace structure and vm_map
 * based on those of an existing process.  The new map
 * is based on the old map, according to the inheritance
 * values on the regions in that map.
 *
 * XXX It might be worth coalescing the entries added to the new vmspace.
 *
 * The source map must not be locked.
 */
struct vmspace *
vmspace_fork(struct vmspace *vm1, vm_ooffset_t *fork_charge)
{
	struct vmspace *vm2;
	vm_map_t new_map, old_map;
	vm_map_entry_t new_entry, old_entry;
	vm_object_t object;
	int error, locked;
	vm_inherit_t inh;

	old_map = &vm1->vm_map;

	/* Copy immutable fields of vm1 to vm2. */
	vm2 = vmspace_alloc(vm_map_min(old_map), vm_map_max(old_map),
	    pmap_pinit);
	if (vm2 == NULL)
		return (NULL);

	vm2->vm_taddr = vm1->vm_taddr;
	vm2->vm_daddr = vm1->vm_daddr;
	vm2->vm_maxsaddr = vm1->vm_maxsaddr;
	vm_map_lock(old_map);
	if (old_map->busy)
		vm_map_wait_busy(old_map);
	new_map = &vm2->vm_map;
	locked = vm_map_trylock(new_map); /* trylock to silence WITNESS */
	KASSERT(locked, ("vmspace_fork: lock failed"));

	error = pmap_vmspace_copy(new_map->pmap, old_map->pmap);
	if (error != 0) {
		sx_xunlock(&old_map->lock);
		sx_xunlock(&new_map->lock);
		vm_map_process_deferred();
		vmspace_free(vm2);
		return (NULL);
	}

Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not alter the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing room on
architectures with small address spaces, such as i386. This is tied
to the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on a per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allows use of the sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
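The best-effort policy described above can be sketched in portable C. This is a minimal illustration, assuming hypothetical names (aslr_pick_base, rnd_pages); the kernel's actual placement logic lives in vm_map_findspace() and scales the entropy by aslr_pages_rnd:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE	4096UL

/*
 * Best-effort randomized placement sketch: shift the hinted base by a
 * random, page-aligned amount of up to rnd_pages pages.  If the shifted
 * range no longer fits below 'limit', fall back to first fit at 'base',
 * mirroring the kernel's fallback when fragmentation prevents entropy
 * injection.  (All names here are illustrative, not the kernel's.)
 */
static uintptr_t
aslr_pick_base(uintptr_t base, uintptr_t limit, size_t len,
    unsigned long rnd_pages)
{
	uintptr_t shift;

	shift = (random() % rnd_pages) * PAGE_SIZE;
	if (base + shift + len <= limit)
		return (base + shift);	/* randomized placement */
	return (base);			/* first-fit fallback */
}
```

A "strong" mode, as the message notes, would instead fail the request when the randomized range does not fit.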
2019-02-10 17:19:45 +00:00
|
|
|
new_map->anon_loc = old_map->anon_loc;
|
2021-01-12 06:09:59 +00:00
|
|
|
new_map->flags |= old_map->flags & (MAP_ASLR | MAP_ASLR_IGNSTART |
|
|
|
|
MAP_WXORX);
|
2019-02-20 09:51:13 +00:00
|
|
|
|
2019-11-25 02:19:47 +00:00
|
|
|
VM_MAP_ENTRY_FOREACH(old_entry, old_map) {
|
|
|
|
if ((old_entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
panic("vm_map_fork: encountered a submap");
|
|
|
|
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement the stack grow code. Explicitly track the
stack grow area with the guard, including the stack guard page. On
stack grow, a trivial shift of the guard map entry and stack map entry
limits performs the stack expansion. Move the code that detects stack
grow and calls vm_map_growstack() from vm_fault() into vm_map_lookup().
As a result, it is impossible for a random mapping to land in the
stack grow area or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
inh = old_entry->inheritance;
|
|
|
|
if ((old_entry->eflags & MAP_ENTRY_GUARD) != 0 &&
|
|
|
|
inh != VM_INHERIT_NONE)
|
|
|
|
inh = VM_INHERIT_COPY;
|
|
|
|
|
|
|
|
switch (inh) {
|
1994-05-24 10:09:53 +00:00
|
|
|
case VM_INHERIT_NONE:
|
|
|
|
break;
|
|
|
|
|
1997-07-27 04:44:12 +00:00
|
|
|
case VM_INHERIT_SHARE:
|
|
|
|
/*
|
2019-11-25 02:19:47 +00:00
|
|
|
* Clone the entry, creating the shared object if
|
|
|
|
* necessary.
|
1997-07-27 04:44:12 +00:00
|
|
|
*/
|
|
|
|
object = old_entry->object.vm_object;
|
|
|
|
if (object == NULL) {
|
2019-06-13 20:09:07 +00:00
|
|
|
vm_map_entry_back(old_entry);
|
|
|
|
object = old_entry->object.vm_object;
|
1999-05-28 03:39:44 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the reference before calling vm_object_shadow
|
|
|
|
* to insure that a shadow object is created.
|
|
|
|
*/
|
|
|
|
vm_object_reference(object);
|
|
|
|
if (old_entry->eflags & MAP_ENTRY_NEEDS_COPY) {
|
1997-01-31 04:10:41 +00:00
|
|
|
vm_object_shadow(&old_entry->object.vm_object,
|
2011-02-04 21:49:24 +00:00
|
|
|
&old_entry->offset,
|
2019-12-01 20:43:04 +00:00
|
|
|
old_entry->end - old_entry->start,
|
|
|
|
old_entry->cred,
|
|
|
|
/* Transfer the second reference too. */
|
|
|
|
true);
|
1997-01-31 04:10:41 +00:00
|
|
|
old_entry->eflags &= ~MAP_ENTRY_NEEDS_COPY;
|
2019-12-01 20:43:04 +00:00
|
|
|
old_entry->cred = NULL;
|
2009-02-08 20:00:33 +00:00
|
|
|
|
|
|
|
/*
|
2019-08-25 07:06:51 +00:00
|
|
|
* As in vm_map_merged_neighbor_dispose(),
|
|
|
|
* the vnode lock will not be acquired in
|
2009-02-08 20:00:33 +00:00
|
|
|
* this call to vm_object_deallocate().
|
|
|
|
*/
|
2001-03-09 18:25:54 +00:00
|
|
|
vm_object_deallocate(object);
|
1997-01-31 04:10:41 +00:00
|
|
|
object = old_entry->object.vm_object;
|
2019-12-01 20:43:04 +00:00
|
|
|
} else {
|
|
|
|
VM_OBJECT_WLOCK(object);
|
|
|
|
vm_object_clear_flag(object, OBJ_ONEMAPPING);
|
|
|
|
if (old_entry->cred != NULL) {
|
|
|
|
KASSERT(object->cred == NULL,
|
|
|
|
("vmspace_fork both cred"));
|
|
|
|
object->cred = old_entry->cred;
|
|
|
|
object->charge = old_entry->end -
|
|
|
|
old_entry->start;
|
|
|
|
old_entry->cred = NULL;
|
|
|
|
}
|
2013-04-09 10:04:10 +00:00
|
|
|
|
2019-12-01 20:43:04 +00:00
|
|
|
/*
|
|
|
|
* Assert the correct state of the vnode
|
|
|
|
* v_writecount while the object is locked, to
|
|
|
|
* not relock it later for the assertion
|
|
|
|
* correctness.
|
|
|
|
*/
|
|
|
|
if (old_entry->eflags & MAP_ENTRY_WRITECNT &&
|
|
|
|
object->type == OBJT_VNODE) {
|
|
|
|
KASSERT(((struct vnode *)object->
|
|
|
|
handle)->v_writecount > 0,
|
|
|
|
("vmspace_fork: v_writecount %p",
|
|
|
|
object));
|
|
|
|
KASSERT(object->un_pager.vnp.
|
|
|
|
writemappings > 0,
|
|
|
|
("vmspace_fork: vnp.writecount %p",
|
|
|
|
object));
|
|
|
|
}
|
|
|
|
VM_OBJECT_WUNLOCK(object);
|
2013-04-09 10:04:10 +00:00
|
|
|
}
|
1997-01-22 01:34:48 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1999-03-27 23:46:04 +00:00
|
|
|
* Clone the entry, referencing the shared object.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
new_entry = vm_map_entry_create(new_map);
|
|
|
|
*new_entry = *old_entry;
|
2009-02-08 19:41:08 +00:00
|
|
|
new_entry->eflags &= ~(MAP_ENTRY_USER_WIRED |
|
|
|
|
MAP_ENTRY_IN_TRANSITION);
|
2013-07-11 05:55:08 +00:00
|
|
|
new_entry->wiring_thread = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
new_entry->wired_count = 0;
|
2019-09-03 20:31:48 +00:00
|
|
|
if (new_entry->eflags & MAP_ENTRY_WRITECNT) {
|
|
|
|
vm_pager_update_writecount(object,
|
2012-02-23 21:07:16 +00:00
|
|
|
new_entry->start, new_entry->end);
|
|
|
|
}
|
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.
Operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writechk()
is now racy and its use was eliminated everywhere except access(2).
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around the VOP_SETATTR() call; the previous lack of this was arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and the deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
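The sign convention described above (write references push v_writecount positive, text references push it negative, and the two must never mix) can be sketched as follows. The struct and helper names are illustrative, not the kernel's VOPs:

```c
#include <stdbool.h>

/*
 * v_writecount > 0: active write references;
 * v_writecount < 0: active text (executable image) references;
 * v_writecount == 0: neither.  The two states are mutually exclusive.
 */
struct vnode_sketch {
	int v_writecount;
};

static bool
add_write_ref(struct vnode_sketch *vp)
{
	if (vp->v_writecount < 0)
		return (false);	/* vnode is an active text file */
	vp->v_writecount++;
	return (true);
}

static bool
add_text_ref(struct vnode_sketch *vp)
{
	if (vp->v_writecount > 0)
		return (false);	/* vnode is open for writing */
	vp->v_writecount--;	/* each text reference decrements */
	return (true);
}
```

In the kernel the checks and updates happen atomically under the vnode interlock; this sketch only shows the invariant that replaced the old VV_TEXT flag.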
2019-05-05 11:20:43 +00:00
|
|
|
vm_map_entry_set_vnode_text(new_entry, true);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Insert the entry into the new map -- we know we're
|
|
|
|
* inserting at the end of the new map.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
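The max_free invariant (p->max_free is the size of the largest gap with an endpoint in p's subtree) is what lets the splay search skip whole subtrees whose gaps are all too small. The gap arithmetic itself can be shown with a flat, linear sketch over sorted allocations; the names and the linear scan are illustrative, not the kernel's O(log n) tree walk:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find the start of the first gap of at least 'len' bytes between
 * sorted, non-overlapping allocations [start[i], end[i]), searching
 * upward from 'min'.  vm_map_findspace() does the same search over
 * the entry tree, pruning any subtree whose max_free < len.
 */
static uintptr_t
first_fit(const uintptr_t *start, const uintptr_t *end, size_t n,
    uintptr_t min, size_t len)
{
	uintptr_t cand;
	size_t i;

	cand = min;
	for (i = 0; i < n; i++) {
		if (start[i] >= cand + len)
			return (cand);	/* gap before entry i fits */
		if (end[i] > cand)
			cand = end[i];	/* skip past entry i */
	}
	return (cand);			/* gap after the last entry */
}
```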
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_link(new_map, new_entry);
|
2004-06-24 22:43:46 +00:00
|
|
|
vmspace_map_entry_forked(vm1, vm2, new_entry);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Update the physical map
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
pmap_copy(new_map->pmap, old_map->pmap,
|
1995-01-09 16:06:02 +00:00
|
|
|
new_entry->start,
|
|
|
|
(old_entry->end - old_entry->start),
|
|
|
|
old_entry->start);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case VM_INHERIT_COPY:
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Clone the entry and link into the map.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
new_entry = vm_map_entry_create(new_map);
|
|
|
|
*new_entry = *old_entry;
|
2012-02-23 21:07:16 +00:00
|
|
|
/*
|
|
|
|
* Copied entry is COW over the old object.
|
|
|
|
*/
|
2009-02-08 19:41:08 +00:00
|
|
|
new_entry->eflags &= ~(MAP_ENTRY_USER_WIRED |
|
2019-09-03 20:31:48 +00:00
|
|
|
MAP_ENTRY_IN_TRANSITION | MAP_ENTRY_WRITECNT);
|
2013-07-11 05:55:08 +00:00
|
|
|
new_entry->wiring_thread = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
new_entry->wired_count = 0;
|
|
|
|
new_entry->object.vm_object = NULL;
|
2010-12-02 17:37:16 +00:00
|
|
|
new_entry->cred = NULL;
|
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_link(new_map, new_entry);
|
2004-06-24 22:43:46 +00:00
|
|
|
vmspace_map_entry_forked(vm1, vm2, new_entry);
|
1996-01-19 04:00:31 +00:00
|
|
|
vm_map_copy_entry(old_map, new_map, old_entry,
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when allocation is performed by the
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
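The entry-to-object charge motion described above (charge moves to the object once it is allocated, and moves back to the entry when fork sets up COW) can be sketched with illustrative structs; these are not the kernel's vm_map_entry/vm_object fields, only the accounting flow:

```c
#include <stddef.h>

struct obj_sketch {
	size_t charge;		/* swap reservation owned by the object */
};

struct entry_sketch {
	size_t charge;		/* reservation still held by the entry */
	struct obj_sketch *object;
};

/* First fault (or mmap with immediate allocation): object takes over. */
static void
charge_to_object(struct entry_sketch *e, struct obj_sketch *o)
{
	o->charge += e->charge;
	e->charge = 0;
	e->object = o;
}

/* COW setup at fork: the charge moves back to the entry. */
static void
charge_back_on_fork(struct entry_sketch *e)
{
	e->charge += e->object->charge;
	e->object->charge = 0;
}
```

Keeping the charge on the entry until the object exists is what makes the accounting fair across uid changes, as the message notes.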
2009-06-23 20:45:22 +00:00
|
|
|
new_entry, fork_charge);
|
2019-05-05 11:20:43 +00:00
|
|
|
vm_map_entry_set_vnode_text(new_entry, true);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
2017-03-14 17:10:42 +00:00
|
|
|
|
|
|
|
case VM_INHERIT_ZERO:
|
|
|
|
/*
|
|
|
|
* Create a new anonymous mapping entry modelled from
|
|
|
|
* the old one.
|
|
|
|
*/
|
|
|
|
new_entry = vm_map_entry_create(new_map);
|
|
|
|
memset(new_entry, 0, sizeof(*new_entry));
|
|
|
|
|
|
|
|
new_entry->start = old_entry->start;
|
|
|
|
new_entry->end = old_entry->end;
|
|
|
|
new_entry->eflags = old_entry->eflags &
|
|
|
|
~(MAP_ENTRY_USER_WIRED | MAP_ENTRY_IN_TRANSITION |
|
2020-09-09 22:02:30 +00:00
|
|
|
MAP_ENTRY_WRITECNT | MAP_ENTRY_VN_EXEC |
|
|
|
|
MAP_ENTRY_SPLIT_BOUNDARY_MASK);
|
2017-03-14 17:10:42 +00:00
|
|
|
new_entry->protection = old_entry->protection;
|
|
|
|
new_entry->max_protection = old_entry->max_protection;
|
|
|
|
new_entry->inheritance = VM_INHERIT_ZERO;
|
|
|
|
|
2019-03-29 16:53:46 +00:00
|
|
|
vm_map_entry_link(new_map, new_entry);
|
2017-03-14 17:10:42 +00:00
|
|
|
vmspace_map_entry_forked(vm1, vm2, new_entry);
|
|
|
|
|
|
|
|
new_entry->cred = curthread->td_ucred;
|
|
|
|
crhold(new_entry->cred);
|
|
|
|
*fork_charge += (new_entry->end - new_entry->start);
|
|
|
|
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2012-02-23 21:07:16 +00:00
|
|
|
/*
|
|
|
|
* Use inlined vm_map_unlock() to postpone handling the deferred
|
|
|
|
* map entries, which cannot be done until both old_map and
|
|
|
|
* new_map locks are released.
|
|
|
|
*/
|
|
|
|
sx_xunlock(&old_map->lock);
|
2012-02-25 17:49:59 +00:00
|
|
|
sx_xunlock(&new_map->lock);
|
2012-02-23 21:07:16 +00:00
|
|
|
vm_map_process_deferred();
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1995-01-09 16:06:02 +00:00
|
|
|
return (vm2);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2017-06-30 15:49:36 +00:00
|
|
|
/*
|
|
|
|
* Create a process's stack for exec_new_vmspace(). This function is never
|
|
|
|
* asked to wire the newly created stack.
|
|
|
|
*/
|
1999-06-17 00:39:26 +00:00
|
|
|
int
|
2003-09-27 22:28:14 +00:00
|
|
|
vm_map_stack(vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize,
|
|
|
|
vm_prot_t prot, vm_prot_t max, int cow)
|
2014-06-09 03:37:41 +00:00
|
|
|
{
|
|
|
|
vm_size_t growsize, init_ssize;
|
2017-06-30 15:49:36 +00:00
|
|
|
rlim_t vmemlim;
|
2014-06-09 03:37:41 +00:00
|
|
|
int rv;
|
|
|
|
|
2017-06-30 15:49:36 +00:00
|
|
|
MPASS((map->flags & MAP_WIREFUTURE) == 0);
|
2014-06-09 03:37:41 +00:00
|
|
|
growsize = sgrowsiz;
|
|
|
|
init_ssize = (max_ssize < growsize) ? max_ssize : growsize;
|
|
|
|
vm_map_lock(map);
|
2015-06-10 10:48:12 +00:00
|
|
|
vmemlim = lim_cur(curthread, RLIMIT_VMEM);
|
2014-06-09 03:37:41 +00:00
|
|
|
/* If we would blow our VMEM resource limit, no go */
|
|
|
|
if (map->size + init_ssize > vmemlim) {
|
|
|
|
rv = KERN_NO_SPACE;
|
|
|
|
goto out;
|
|
|
|
}
|
2014-06-15 07:52:59 +00:00
|
|
|
rv = vm_map_stack_locked(map, addrbos, max_ssize, growsize, prot,
|
2014-06-09 03:37:41 +00:00
|
|
|
max, cow);
|
|
|
|
out:
|
|
|
|
vm_map_unlock(map);
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
2017-06-25 20:06:05 +00:00
|
|
|
static int stack_guard_page = 1;
|
|
|
|
SYSCTL_INT(_security_bsd, OID_AUTO, stack_guard_page, CTLFLAG_RWTUN,
|
|
|
|
&stack_guard_page, 0,
|
|
|
|
"Specifies the number of guard pages for a stack that grows");
|
|
|
|
|
2014-06-09 03:37:41 +00:00
|
|
|
static int
|
|
|
|
vm_map_stack_locked(vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize,
|
|
|
|
vm_size_t growsize, vm_prot_t prot, vm_prot_t max, int cow)
|
1999-06-17 00:39:26 +00:00
|
|
|
{
|
2019-06-26 03:12:57 +00:00
|
|
|
vm_map_entry_t new_entry, prev_entry;
|
2017-06-24 17:01:11 +00:00
|
|
|
vm_offset_t bot, gap_bot, gap_top, top;
|
2017-06-25 20:06:05 +00:00
|
|
|
vm_size_t init_ssize, sgp;
|
2003-09-27 22:28:14 +00:00
|
|
|
int orient, rv;
|
1999-06-17 00:39:26 +00:00
|
|
|
|
2003-09-27 22:28:14 +00:00
|
|
|
/*
|
|
|
|
* The stack orientation is piggybacked with the cow argument.
|
|
|
|
* Extract it into orient and mask the cow argument so that we
|
|
|
|
* don't pass it around further.
|
|
|
|
*/
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
orient = cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP);
|
2003-09-27 22:28:14 +00:00
|
|
|
KASSERT(orient != 0, ("No stack grow direction"));
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
KASSERT(orient != (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP),
|
|
|
|
("bi-dir stack"));

	if (max_ssize == 0 ||
	    !vm_map_range_valid(map, addrbos, addrbos + max_ssize))
		return (KERN_INVALID_ADDRESS);
	sgp = ((curproc->p_flag2 & P2_STKGAP_DISABLE) != 0 ||
	    (curproc->p_fctl0 & NT_FREEBSD_FCTL_STKGAP_DISABLE) != 0) ? 0 :
	    (vm_size_t)stack_guard_page * PAGE_SIZE;
	if (sgp >= max_ssize)
		return (KERN_INVALID_ARGUMENT);

	init_ssize = growsize;
	if (max_ssize < init_ssize + sgp)
		init_ssize = max_ssize - sgp;

	/* If addr is already mapped, no go */
	if (vm_map_lookup_entry(map, addrbos, &prev_entry))
		return (KERN_NO_SPACE);

	/*
	 * If we can't accommodate max_ssize in the current mapping, no go.
	 */
	if (vm_map_entry_succ(prev_entry)->start < addrbos + max_ssize)
		return (KERN_NO_SPACE);

	/*
	 * We initially map a stack of only init_ssize.  We will grow as
	 * needed later.  Depending on the orientation of the stack (i.e.
	 * the grow direction) we either map at the top of the range, the
	 * bottom of the range or in the middle.
	 *
	 * Note: we would normally expect prot and max to be VM_PROT_ALL,
	 * and cow to be 0.  Possibly we should eliminate these as input
	 * parameters, and just pass these values here in the insert call.
	 */
	if (orient == MAP_STACK_GROWS_DOWN) {
		bot = addrbos + max_ssize - init_ssize;
		top = bot + init_ssize;
		gap_bot = addrbos;
		gap_top = bot;
	} else /* if (orient == MAP_STACK_GROWS_UP) */ {
		bot = addrbos;
		top = bot + init_ssize;
		gap_bot = top;
		gap_top = addrbos + max_ssize;
	}
	rv = vm_map_insert(map, NULL, 0, bot, top, prot, max, cow);
	if (rv != KERN_SUCCESS)
		return (rv);
	new_entry = vm_map_entry_succ(prev_entry);
	KASSERT(new_entry->end == top || new_entry->start == bot,
	    ("Bad entry start/end for new stack entry"));
	KASSERT((orient & MAP_STACK_GROWS_DOWN) == 0 ||
	    (new_entry->eflags & MAP_ENTRY_GROWS_DOWN) != 0,
	    ("new entry lacks MAP_ENTRY_GROWS_DOWN"));
	KASSERT((orient & MAP_STACK_GROWS_UP) == 0 ||
	    (new_entry->eflags & MAP_ENTRY_GROWS_UP) != 0,
	    ("new entry lacks MAP_ENTRY_GROWS_UP"));
	if (gap_bot == gap_top)
		return (KERN_SUCCESS);
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
rv = vm_map_insert(map, NULL, 0, gap_bot, gap_top, VM_PROT_NONE,
|
|
|
|
VM_PROT_NONE, MAP_CREATE_GUARD | (orient == MAP_STACK_GROWS_DOWN ?
|
|
|
|
MAP_CREATE_STACK_GAP_DN : MAP_CREATE_STACK_GAP_UP));
|
2019-08-24 14:29:13 +00:00
|
|
|
if (rv == KERN_SUCCESS) {
|
|
|
|
/*
|
|
|
|
* Gap can never successfully handle a fault, so
|
|
|
|
* read-ahead logic is never used for it. Re-use
|
|
|
|
* next_read of the gap entry to store
|
|
|
|
* stack_guard_page for vm_map_growstack().
|
|
|
|
*/
|
|
|
|
if (orient == MAP_STACK_GROWS_DOWN)
|
2019-11-13 15:56:07 +00:00
|
|
|
vm_map_entry_pred(new_entry)->next_read = sgp;
|
2019-08-24 14:29:13 +00:00
|
|
|
else
|
2019-11-13 15:56:07 +00:00
|
|
|
vm_map_entry_succ(new_entry)->next_read = sgp;
|
2019-08-24 14:29:13 +00:00
|
|
|
} else {
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
(void)vm_map_delete(map, bot, top);
|
2019-08-24 14:29:13 +00:00
|
|
|
}
|
1999-06-17 00:39:26 +00:00
|
|
|
return (rv);
|
|
|
|
}
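
/*
 * Worked example (illustrative): for a grows-down stack with
 * addrbos = 0x100000, max_ssize = 0x80000 and init_ssize = 0x8000,
 * the layout computed above is
 *	bot = 0x178000, top = 0x180000		(initial stack mapping)
 *	gap_bot = 0x100000, gap_top = 0x178000	(guard gap below it)
 * so the stack occupies the top of the reserved range and the gap
 * entry covers the remainder into which the stack may later grow.
 */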

/*
 * Attempts to grow a vm stack entry.  Returns KERN_SUCCESS if we
 * successfully grow the stack.
 */
static int
vm_map_growstack(vm_map_t map, vm_offset_t addr, vm_map_entry_t gap_entry)
{
	vm_map_entry_t stack_entry;
	struct proc *p;
	struct vmspace *vm;
	struct ucred *cred;
	vm_offset_t gap_end, gap_start, grow_start;
	vm_size_t grow_amount, guard, max_grow;
	rlim_t lmemlim, stacklim, vmemlim;
	int rv, rv1;
	bool gap_deleted, grow_down, is_procstack;
#ifdef notyet
	uint64_t limit;
#endif
#ifdef RACCT
	int error;
#endif

	p = curproc;
	vm = p->p_vmspace;

	/*
	 * Disallow stack growth when the access is performed by a
	 * debugger or AIO daemon.  The reason is that the wrong
	 * resource limits are applied.
	 */
	if (p != initproc && (map != &p->p_vmspace->vm_map ||
	    p->p_textvp == NULL))
		return (KERN_FAILURE);

	MPASS(!map->system_map);

	lmemlim = lim_cur(curthread, RLIMIT_MEMLOCK);
	stacklim = lim_cur(curthread, RLIMIT_STACK);
	vmemlim = lim_cur(curthread, RLIMIT_VMEM);
retry:
	/* If addr is not in a hole for a stack grow area, no need to grow. */
	if (gap_entry == NULL && !vm_map_lookup_entry(map, addr, &gap_entry))
		return (KERN_FAILURE);
	if ((gap_entry->eflags & MAP_ENTRY_GUARD) == 0)
		return (KERN_SUCCESS);
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
if ((gap_entry->eflags & MAP_ENTRY_STACK_GAP_DN) != 0) {
|
2019-11-13 15:56:07 +00:00
|
|
|
stack_entry = vm_map_entry_succ(gap_entry);
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
if ((stack_entry->eflags & MAP_ENTRY_GROWS_DOWN) == 0 ||
|
|
|
|
stack_entry->start != gap_entry->end)
|
|
|
|
return (KERN_FAILURE);
|
|
|
|
grow_amount = round_page(stack_entry->start - addr);
|
|
|
|
grow_down = true;
|
|
|
|
} else if ((gap_entry->eflags & MAP_ENTRY_STACK_GAP_UP) != 0) {
|
2019-11-13 15:56:07 +00:00
|
|
|
stack_entry = vm_map_entry_pred(gap_entry);
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
if ((stack_entry->eflags & MAP_ENTRY_GROWS_UP) == 0 ||
|
|
|
|
stack_entry->end != gap_entry->start)
|
|
|
|
return (KERN_FAILURE);
|
|
|
|
grow_amount = round_page(addr + 1 - stack_entry->end);
|
|
|
|
grow_down = false;
|
2003-08-30 21:25:23 +00:00
|
|
|
} else {
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitely track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
return (KERN_FAILURE);
|
2003-08-30 21:25:23 +00:00
|
|
|
}
|
2019-11-17 14:54:07 +00:00
|
|
|
guard = ((curproc->p_flag2 & P2_STKGAP_DISABLE) != 0 ||
|
|
|
|
(curproc->p_fctl0 & NT_FREEBSD_FCTL_STKGAP_DISABLE) != 0) ? 0 :
|
2019-09-03 18:56:25 +00:00
|
|
|
gap_entry->next_read;
|
2017-07-01 23:39:49 +00:00
|
|
|
max_grow = gap_entry->end - gap_entry->start;
|
|
|
|
if (guard > max_grow)
|
|
|
|
return (KERN_NO_SPACE);
|
|
|
|
max_grow -= guard;
|
	if (grow_amount > max_grow)
		return (KERN_NO_SPACE);
	/*
	 * If this is the main process stack, see if we're over the stack
	 * limit.
	 */
	is_procstack = addr >= (vm_offset_t)vm->vm_maxsaddr &&
	    addr < (vm_offset_t)p->p_sysent->sv_usrstack;
	if (is_procstack && (ctob(vm->vm_ssize) + grow_amount > stacklim))
		return (KERN_NO_SPACE);
#ifdef RACCT
	if (racct_enable) {
		PROC_LOCK(p);
		if (is_procstack && racct_set(p, RACCT_STACK,
		    ctob(vm->vm_ssize) + grow_amount)) {
			PROC_UNLOCK(p);
			return (KERN_NO_SPACE);
		}
		PROC_UNLOCK(p);
	}
#endif
	grow_amount = roundup(grow_amount, sgrowsiz);
	if (grow_amount > max_grow)
		grow_amount = max_grow;
	if (is_procstack && (ctob(vm->vm_ssize) + grow_amount > stacklim)) {
		grow_amount = trunc_page((vm_size_t)stacklim) -
		    ctob(vm->vm_ssize);
	}
#ifdef notyet
	PROC_LOCK(p);
	limit = racct_get_available(p, RACCT_STACK);
	PROC_UNLOCK(p);
	if (is_procstack && (ctob(vm->vm_ssize) + grow_amount > limit))
		grow_amount = limit - ctob(vm->vm_ssize);
#endif
	if (!old_mlock && (map->flags & MAP_WIREFUTURE) != 0) {
		if (ptoa(pmap_wired_count(map->pmap)) + grow_amount > lmemlim) {
			rv = KERN_NO_SPACE;
			goto out;
		}
#ifdef RACCT
		if (racct_enable) {
			PROC_LOCK(p);
			if (racct_set(p, RACCT_MEMLOCK,
			    ptoa(pmap_wired_count(map->pmap)) + grow_amount)) {
				PROC_UNLOCK(p);
				rv = KERN_NO_SPACE;
				goto out;
			}
			PROC_UNLOCK(p);
		}
#endif
	}
	/* If we would blow our VMEM resource limit, no go */
	if (map->size + grow_amount > vmemlim) {
		rv = KERN_NO_SPACE;
		goto out;
	}
#ifdef RACCT
	if (racct_enable) {
		PROC_LOCK(p);
		if (racct_set(p, RACCT_VMEM, map->size + grow_amount)) {
			PROC_UNLOCK(p);
			rv = KERN_NO_SPACE;
			goto out;
		}
		PROC_UNLOCK(p);
	}
#endif
	if (vm_map_lock_upgrade(map)) {
		gap_entry = NULL;
		vm_map_lock_read(map);
		goto retry;
	}
	if (grow_down) {
		grow_start = gap_entry->end - grow_amount;
		if (gap_entry->start + grow_amount == gap_entry->end) {
			gap_start = gap_entry->start;
			gap_end = gap_entry->end;
			vm_map_entry_delete(map, gap_entry);
			gap_deleted = true;
		} else {
			MPASS(gap_entry->start < gap_entry->end - grow_amount);
			vm_map_entry_resize(map, gap_entry, -grow_amount);
			gap_deleted = false;
		}
		rv = vm_map_insert(map, NULL, 0, grow_start,
		    grow_start + grow_amount,
		    stack_entry->protection, stack_entry->max_protection,
		    MAP_STACK_GROWS_DOWN);
		if (rv != KERN_SUCCESS) {
			if (gap_deleted) {
				rv1 = vm_map_insert(map, NULL, 0, gap_start,
				    gap_end, VM_PROT_NONE, VM_PROT_NONE,
				    MAP_CREATE_GUARD | MAP_CREATE_STACK_GAP_DN);
				MPASS(rv1 == KERN_SUCCESS);
			} else
				vm_map_entry_resize(map, gap_entry,
				    grow_amount);
		}
	} else {
		grow_start = stack_entry->end;
		cred = stack_entry->cred;
		if (cred == NULL && stack_entry->object.vm_object != NULL)
			cred = stack_entry->object.vm_object->cred;
		if (cred != NULL && !swap_reserve_by_cred(grow_amount, cred))
Implement global and per-uid accounting of anonymous memory. Add
rlimit RLIMIT_SWAP, which limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either the map
entry, or the vm object backing the entry, assuming the object is the
first one in the shadow chain and the entry does not require COW. The
charge is moved from entry to object on allocation of the object, e.g.
during mmap, assuming the object is allocated, or on the first page
fault on the entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during their lifetime, and decrements the
charge for the proper uid when a region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
which is used to charge the appropriate uid when allocation is performed
by the kernel, e.g. md(4).
Several syscalls, among them fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
			rv = KERN_NO_SPACE;
		/* Grow the underlying object if applicable. */
		else if (stack_entry->object.vm_object == NULL ||
		    vm_object_coalesce(stack_entry->object.vm_object,
		    stack_entry->offset,
		    (vm_size_t)(stack_entry->end - stack_entry->start),
		    grow_amount, cred != NULL)) {
			if (gap_entry->start + grow_amount == gap_entry->end) {
				vm_map_entry_delete(map, gap_entry);
				vm_map_entry_resize(map, stack_entry,
				    grow_amount);
			} else {
				gap_entry->start += grow_amount;
				stack_entry->end += grow_amount;
			}
			map->size += grow_amount;
			rv = KERN_SUCCESS;
		} else
			rv = KERN_FAILURE;
	}

	if (rv == KERN_SUCCESS && is_procstack)
		vm->vm_ssize += btoc(grow_amount);

	/*
	 * Heed the MAP_WIREFUTURE flag if it was set for this process.
	 */
	if (rv == KERN_SUCCESS && (map->flags & MAP_WIREFUTURE) != 0) {
Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.
The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit, which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
short of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.
The choice to count virtual user-wired pages rather than physical
pages was made for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.
The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.
Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.
Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
		rv = vm_map_wire_locked(map, grow_start,
		    grow_start + grow_amount,
		    VM_MAP_WIRE_USER | VM_MAP_WIRE_NOHOLES);
	}
	vm_map_lock_downgrade(map);

out:
#ifdef RACCT
	if (racct_enable && rv != KERN_SUCCESS) {
		PROC_LOCK(p);
		error = racct_set(p, RACCT_VMEM, map->size);
		KASSERT(error == 0, ("decreasing RACCT_VMEM failed"));
		if (!old_mlock) {
			error = racct_set(p, RACCT_MEMLOCK,
			    ptoa(pmap_wired_count(map->pmap)));
			KASSERT(error == 0, ("decreasing RACCT_MEMLOCK failed"));
		}
		error = racct_set(p, RACCT_STACK, ctob(vm->vm_ssize));
		KASSERT(error == 0, ("decreasing RACCT_STACK failed"));
		PROC_UNLOCK(p);
	}
#endif

	return (rv);
}

/*
 * Unshare the specified VM space for exec.  If other processes are
 * mapped to it, then create a new one.  The new vmspace is null.
 */
int
vmspace_exec(struct proc *p, vm_offset_t minuser, vm_offset_t maxuser)
{
	struct vmspace *oldvmspace = p->p_vmspace;
	struct vmspace *newvmspace;

When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace. The thread executing an exec syscall gets a new vmspace
assigned, and the old vmspace is freed if it is only referenced by the current
process. The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.
If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successful exec. Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.
To fix, postpone the free of old vmspace until the threads are resumed
and exited. To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.
http://bugs.debian.org/743141
Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
2014-05-20 09:19:35 +00:00
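The fix described above is a deferred-free protocol keyed on a per-thread flag: the exec path stashes the old address space and sets the flag, and the actual release happens only after the other threads have been resumed and have exited. A minimal model of that protocol, with hypothetical struct and field names (the flag value here is illustrative, not the kernel's TDP_EXECVMSPC bit):

```c
#include <stdbool.h>
#include <stddef.h>

#define FLAG_EXECVMSPC 0x1	/* illustrative per-thread flag bit */

struct vmspace_model { int refcnt; };
struct thread_model {
	int td_pflags;
	struct vmspace_model *td_oldvmspace;
};

/* Exec path: record the old space instead of freeing it immediately. */
void
defer_vmspace_free(struct thread_model *td, struct vmspace_model *oldvm)
{
	td->td_oldvmspace = oldvm;
	td->td_pflags |= FLAG_EXECVMSPC;
}

/* Called once the other threads are known to have exited. */
bool
finish_deferred_free(struct thread_model *td)
{
	if ((td->td_pflags & FLAG_EXECVMSPC) == 0)
		return (false);
	td->td_oldvmspace->refcnt--;	/* stands in for vmspace_free() */
	td->td_oldvmspace = NULL;
	td->td_pflags &= ~FLAG_EXECVMSPC;
	return (true);
}
```

The flag makes the release idempotent: a second call to the finish step is a no-op, so only one thread ever drops the last reference to the stale paging structures.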
|
|
|
KASSERT((curthread->td_pflags & TDP_EXECVMSPC) == 0,
|
|
|
|
("vmspace_exec recursed"));
|
2018-11-25 17:56:49 +00:00
|
|
|
newvmspace = vmspace_alloc(minuser, maxuser, pmap_pinit);
|
2007-11-05 11:36:16 +00:00
|
|
|
if (newvmspace == NULL)
|
|
|
|
return (ENOMEM);
|
2004-07-24 07:40:35 +00:00
|
|
|
newvmspace->vm_swrss = oldvmspace->vm_swrss;
|
1997-04-13 01:48:35 +00:00
|
|
|
/*
|
|
|
|
* This code is written like this for prototype purposes. The
|
|
|
|
* goal is to avoid running down the vmspace here, but let the
|
|
|
|
* other processes that are still using the vmspace finally
|
|
|
|
* run it down. Even though there is little or no chance of blocking
|
|
|
|
* here, it is a good idea to keep this form for future mods.
|
|
|
|
*/
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_LOCK(p);
|
1997-04-13 01:48:35 +00:00
|
|
|
p->p_vmspace = newvmspace;
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
2008-03-12 10:12:01 +00:00
|
|
|
if (p == curthread->td_proc)
|
2001-09-12 08:38:13 +00:00
|
|
|
pmap_activate(curthread);
|
2014-05-20 09:19:35 +00:00
|
|
|
curthread->td_pflags |= TDP_EXECVMSPC;
|
2007-11-05 11:36:16 +00:00
|
|
|
return (0);
|
1997-04-13 01:48:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Unshare the specified VM space for forcing COW. This
|
|
|
|
* is called by rfork, for the (RFMEM|RFPROC) == 0 case.
|
|
|
|
*/
|
2007-11-05 11:36:16 +00:00
|
|
|
int
|
2001-07-04 20:15:18 +00:00
|
|
|
vmspace_unshare(struct proc *p)
|
|
|
|
{
|
1997-04-13 01:48:35 +00:00
|
|
|
struct vmspace *oldvmspace = p->p_vmspace;
|
|
|
|
struct vmspace *newvmspace;
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when allocation is performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
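The per-uid charge mechanism described above boils down to bounded reserve/release accounting: a reservation fails, and the syscall returns ENOMEM, when the charge would exceed the uid's limit. This is a hedged userspace model in the spirit of `swap_reserve_by_cred()`; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative per-uid swap charge, modeling an RLIMIT_SWAP-style cap. */
struct uid_charge {
	size_t reserved;	/* bytes currently charged to this uid */
	size_t limit;		/* cap on reservable bytes */
};

bool
swap_reserve_model(struct uid_charge *uc, size_t incr)
{
	if (uc->reserved > uc->limit)		/* defensive */
		return (false);
	if (incr > uc->limit - uc->reserved)	/* overflow-safe compare */
		return (false);
	uc->reserved += incr;
	return (true);
}

/* On unmap, the charge for the region is returned to the uid. */
void
swap_release_model(struct uid_charge *uc, size_t decr)
{
	uc->reserved -= decr;
}
```

The release on unmap is what makes the accounting fair for processes that change uid over their lifetime: the charge is decremented for the uid that actually holds it.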
|
|
|
vm_ooffset_t fork_charge;
|
1997-04-13 01:48:35 +00:00
|
|
|
|
2020-11-04 16:30:56 +00:00
|
|
|
if (refcount_load(&oldvmspace->vm_refcnt) == 1)
|
2007-11-05 11:36:16 +00:00
|
|
|
return (0);
|
2009-06-23 20:45:22 +00:00
|
|
|
fork_charge = 0;
|
|
|
|
newvmspace = vmspace_fork(oldvmspace, &fork_charge);
|
2007-11-05 11:36:16 +00:00
|
|
|
if (newvmspace == NULL)
|
|
|
|
return (ENOMEM);
|
2010-12-02 17:37:16 +00:00
|
|
|
if (!swap_reserve_by_cred(fork_charge, p->p_ucred)) {
|
2009-06-23 20:45:22 +00:00
|
|
|
vmspace_free(newvmspace);
|
|
|
|
return (ENOMEM);
|
|
|
|
}
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_LOCK(p);
|
1997-04-13 01:48:35 +00:00
|
|
|
p->p_vmspace = newvmspace;
|
2006-05-29 21:28:56 +00:00
|
|
|
PROC_VMSPACE_UNLOCK(p);
|
2008-03-12 10:12:01 +00:00
|
|
|
if (p == curthread->td_proc)
|
2001-09-12 08:38:13 +00:00
|
|
|
pmap_activate(curthread);
|
2004-02-02 23:23:48 +00:00
|
|
|
vmspace_free(oldvmspace);
|
2007-11-05 11:36:16 +00:00
|
|
|
return (0);
|
1997-04-13 01:48:35 +00:00
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_lookup:
|
|
|
|
*
|
|
|
|
* Finds the VM object, offset, and
|
|
|
|
* protection for a given virtual address in the
|
|
|
|
* specified map, assuming a page fault of the
|
|
|
|
* type specified.
|
|
|
|
*
|
|
|
|
* Leaves the map in question locked for read; return
|
|
|
|
* values are guaranteed until a vm_map_lookup_done
|
|
|
|
* call is performed. Note that the map argument
|
|
|
|
* is in/out; the returned map must be used in
|
|
|
|
* the call to vm_map_lookup_done.
|
|
|
|
*
|
|
|
|
* A handle (out_entry) is returned for use in
|
|
|
|
* vm_map_lookup_done, to make that fast.
|
|
|
|
*
|
|
|
|
* If a lookup is requested with "write protection"
|
|
|
|
* specified, the map may be changed to perform virtual
|
|
|
|
* copying operations, although the data referenced will
|
|
|
|
* remain the same.
|
|
|
|
*/
|
|
|
|
int
|
1997-08-25 22:15:31 +00:00
|
|
|
vm_map_lookup(vm_map_t *var_map, /* IN/OUT */
|
|
|
|
vm_offset_t vaddr,
|
1998-01-17 09:17:02 +00:00
|
|
|
vm_prot_t fault_typea,
|
1997-08-25 22:15:31 +00:00
|
|
|
vm_map_entry_t *out_entry, /* OUT */
|
|
|
|
vm_object_t *object, /* OUT */
|
|
|
|
vm_pindex_t *pindex, /* OUT */
|
|
|
|
vm_prot_t *out_prot, /* OUT */
|
VM level code cleanups.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
1998-01-22 17:30:44 +00:00
|
|
|
boolean_t *wired) /* OUT */
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1998-04-29 04:28:22 +00:00
|
|
|
vm_map_entry_t entry;
|
|
|
|
vm_map_t map = *var_map;
|
|
|
|
vm_prot_t prot;
|
2019-12-06 23:39:08 +00:00
|
|
|
vm_prot_t fault_type;
|
2009-06-23 20:45:22 +00:00
|
|
|
vm_object_t eobject;
|
2011-02-04 21:49:24 +00:00
|
|
|
vm_size_t size;
|
2010-12-02 17:37:16 +00:00
|
|
|
struct ucred *cred;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitly track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
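The lookup-time decision sketched in the commit message above is a flag test: a fault on an otherwise inaccessible guard entry that is marked as a stack gap triggers stack growth instead of failing outright. A minimal model, with illustrative flag values standing in for MAP_ENTRY_GUARD and MAP_ENTRY_STACK_GAP_DN/UP:

```c
#include <stdbool.h>

/* Illustrative entry flags; the kernel's bit values differ. */
#define ENTRY_GUARD        0x1
#define ENTRY_STACK_GAP_DN 0x2
#define ENTRY_STACK_GAP_UP 0x4

/*
 * A hypothetical sketch of the check: growth is attempted only for a
 * guard entry with no access permissions (VM_PROT_NONE) that is also
 * flagged as a stack gap in either direction.
 */
bool
should_try_growstack(int eflags, int prot)
{
	if (prot != 0)			/* guard entries carry VM_PROT_NONE */
		return (false);
	if ((eflags & ENTRY_GUARD) == 0)
		return (false);
	return ((eflags & (ENTRY_STACK_GAP_DN | ENTRY_STACK_GAP_UP)) != 0);
}
```

A plain guard entry (no gap flag) still refuses the fault, which is exactly what prevents random mappings from landing in the stack grow area.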
|
|
|
RetryLookup:
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
vm_map_lock_read(map);
|
|
|
|
|
2017-06-24 17:01:11 +00:00
|
|
|
RetryLookupLocked:
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2008-12-30 19:48:03 +00:00
|
|
|
* Lookup the faulting address.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2008-12-30 20:51:07 +00:00
|
|
|
if (!vm_map_lookup_entry(map, vaddr, out_entry)) {
|
|
|
|
vm_map_unlock_read(map);
|
|
|
|
return (KERN_INVALID_ADDRESS);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2008-12-30 19:48:03 +00:00
|
|
|
entry = *out_entry;
|
2003-11-03 16:14:45 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* Handle submaps.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1997-01-16 04:16:22 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_IS_SUB_MAP) {
|
1995-01-09 16:06:02 +00:00
|
|
|
vm_map_t old_map = map;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
*var_map = map = entry->object.sub_map;
|
|
|
|
vm_map_unlock_read(old_map);
|
|
|
|
goto RetryLookup;
|
|
|
|
}
|
1997-04-06 02:29:45 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Check whether this task is allowed to have this page.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2009-11-26 05:16:07 +00:00
|
|
|
prot = entry->protection;
|
2017-06-24 17:01:11 +00:00
|
|
|
if ((fault_typea & VM_PROT_FAULT_LOOKUP) != 0) {
|
|
|
|
fault_typea &= ~VM_PROT_FAULT_LOOKUP;
|
|
|
|
if (prot == VM_PROT_NONE && map != kernel_map &&
|
|
|
|
(entry->eflags & MAP_ENTRY_GUARD) != 0 &&
|
|
|
|
(entry->eflags & (MAP_ENTRY_STACK_GAP_DN |
|
|
|
|
MAP_ENTRY_STACK_GAP_UP)) != 0 &&
|
|
|
|
vm_map_growstack(map, vaddr, entry) == KERN_SUCCESS)
|
|
|
|
goto RetryLookupLocked;
|
|
|
|
}
|
2019-12-06 23:39:08 +00:00
|
|
|
fault_type = fault_typea & VM_PROT_ALL;
|
2009-11-18 18:05:54 +00:00
|
|
|
if ((fault_type & prot) != fault_type || prot == VM_PROT_NONE) {
|
2008-12-30 20:51:07 +00:00
|
|
|
vm_map_unlock_read(map);
|
|
|
|
return (KERN_PROTECTION_FAILURE);
|
1998-01-17 09:17:02 +00:00
|
|
|
}
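The access check just above reduces to a bitmask containment test: every requested access bit must be present in the entry's protection, and a wholly inaccessible entry fails outright. A self-contained sketch, with illustrative bit values mirroring the shape of vm_prot_t:

```c
#include <stdbool.h>

/* Illustrative protection bits, in the style of vm_prot_t. */
#define PROT_READ  0x1
#define PROT_WRITE 0x2
#define PROT_EXEC  0x4

/*
 * Model of the lookup-time check: the fault succeeds only when the
 * requested fault type is a subset of the entry's protection, and a
 * VM_PROT_NONE entry always fails.
 */
bool
fault_type_allowed(int fault_type, int prot)
{
	if (prot == 0)			/* VM_PROT_NONE */
		return (false);
	return ((fault_type & prot) == fault_type);
}
```

For example, a write fault against a read-only entry fails the subset test, which is what maps to KERN_PROTECTION_FAILURE in the code above.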
|
Remove a check which caused spurious SIGSEGV on usermode access to the
mapped address without valid pte installed, when parallel wiring of
the entry happen. The entry must be copy on write. If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear. After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check. Note that this is a race: if the address is accessed after
the wiring is done, the entry does not fault at all.
There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it. Initially this
was done in r24666, since that kernel did not support proper
copy-on-write for wired text, which was fixed in r199869. The r251901
revision re-introduced the r24666 fix for the current VM.
Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW. In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared. Put the
assert stating the invariant, instead of returning the error.
Reported and debugging help by: peter
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
2015-09-09 06:19:33 +00:00
|
|
|
KASSERT((prot & VM_PROT_WRITE) == 0 || (entry->eflags &
|
|
|
|
(MAP_ENTRY_USER_WIRED | MAP_ENTRY_NEEDS_COPY)) !=
|
|
|
|
(MAP_ENTRY_USER_WIRED | MAP_ENTRY_NEEDS_COPY),
|
|
|
|
("entry %p flags %x", entry, entry->eflags));
|
2013-06-18 07:02:35 +00:00
|
|
|
if ((fault_typea & VM_PROT_COPY) != 0 &&
|
|
|
|
(entry->max_protection & VM_PROT_WRITE) == 0 &&
|
|
|
|
(entry->eflags & MAP_ENTRY_COW) == 0) {
|
|
|
|
vm_map_unlock_read(map);
|
|
|
|
return (KERN_PROTECTION_FAILURE);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* If this page is not pageable, we have to get it for all possible
|
|
|
|
* accesses.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1994-10-09 01:52:19 +00:00
|
|
|
*wired = (entry->wired_count != 0);
|
|
|
|
if (*wired)
|
2009-11-26 05:16:07 +00:00
|
|
|
fault_type = entry->protection;
|
2009-06-23 20:45:22 +00:00
|
|
|
size = entry->end - entry->start;
|
2019-12-01 20:43:04 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
* If the entry was copy-on-write, we either ...
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1997-01-16 04:16:22 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_NEEDS_COPY) {
|
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
|
|
|
* If we want to write the page, we may as well handle that
|
1999-03-27 23:46:04 +00:00
|
|
|
* now since we've got the map locked.
|
1995-05-30 08:16:23 +00:00
|
|
|
*
|
1995-01-09 16:06:02 +00:00
|
|
|
* If we don't need to write the page, we just demote the
|
|
|
|
* permissions allowed.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2009-11-26 05:16:07 +00:00
|
|
|
if ((fault_type & VM_PROT_WRITE) != 0 ||
|
|
|
|
(fault_typea & VM_PROT_COPY) != 0) {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Make a new object, and place it in the object
|
|
|
|
* chain. Note that no new references have appeared
|
1999-03-27 23:46:04 +00:00
|
|
|
* -- one just moved from the map to the new
|
1995-01-09 16:06:02 +00:00
|
|
|
* object.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2002-03-18 15:08:09 +00:00
|
|
|
if (vm_map_lock_upgrade(map))
|
1994-05-24 10:09:53 +00:00
|
|
|
goto RetryLookup;
|
2002-05-31 03:48:55 +00:00
|
|
|
|
2010-12-02 17:37:16 +00:00
|
|
|
if (entry->cred == NULL) {
|
2009-06-23 20:45:22 +00:00
|
|
|
/*
|
|
|
|
* The debugger owner is charged for
|
|
|
|
* the memory.
|
|
|
|
*/
|
2010-12-02 17:37:16 +00:00
|
|
|
cred = curthread->td_ucred;
|
|
|
|
crhold(cred);
|
|
|
|
if (!swap_reserve_by_cred(size, cred)) {
|
|
|
|
crfree(cred);
|
2009-06-23 20:45:22 +00:00
|
|
|
vm_map_unlock(map);
|
|
|
|
return (KERN_RESOURCE_SHORTAGE);
|
|
|
|
}
|
2010-12-02 17:37:16 +00:00
|
|
|
entry->cred = cred;
|
2009-06-23 20:45:22 +00:00
|
|
|
}
|
|
|
|
eobject = entry->object.vm_object;
|
2019-12-01 20:43:04 +00:00
|
|
|
vm_object_shadow(&entry->object.vm_object,
|
|
|
|
&entry->offset, size, entry->cred, false);
|
|
|
|
if (eobject == entry->object.vm_object) {
|
2009-06-23 20:45:22 +00:00
|
|
|
/*
|
|
|
|
* The object was not shadowed.
|
|
|
|
*/
|
2010-12-02 17:37:16 +00:00
|
|
|
swap_release_by_cred(size, entry->cred);
|
|
|
|
crfree(entry->cred);
|
2009-06-23 20:45:22 +00:00
|
|
|
}
|
2019-12-01 20:43:04 +00:00
|
|
|
entry->cred = NULL;
|
|
|
|
entry->eflags &= ~MAP_ENTRY_NEEDS_COPY;
|
2002-05-31 03:48:55 +00:00
|
|
|
|
1999-02-19 03:11:37 +00:00
|
|
|
vm_map_lock_downgrade(map);
|
1995-01-09 16:06:02 +00:00
|
|
|
} else {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* We're attempting to read a copy-on-write page --
|
|
|
|
* don't allow writes.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
VM level code cleanups.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
1998-01-22 17:30:44 +00:00
|
|
|
prot &= ~VM_PROT_WRITE;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
1998-01-22 17:30:44 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Create an object if necessary.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-12-01 20:43:04 +00:00
|
|
|
if (entry->object.vm_object == NULL && !map->system_map) {
|
2003-11-03 16:14:45 +00:00
|
|
|
if (vm_map_lock_upgrade(map))
|
1994-05-24 10:09:53 +00:00
|
|
|
goto RetryLookup;
|
2019-12-01 20:43:04 +00:00
|
|
|
entry->object.vm_object = vm_object_allocate_anon(atop(size),
|
|
|
|
NULL, entry->cred, entry->cred != NULL ? size : 0);
|
1994-05-24 10:09:53 +00:00
|
|
|
entry->offset = 0;
|
2019-12-01 20:43:04 +00:00
|
|
|
entry->cred = NULL;
|
1999-02-19 03:11:37 +00:00
|
|
|
vm_map_lock_downgrade(map);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1996-06-16 20:37:31 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Return the object/offset from this entry. If the entry was
|
|
|
|
* copy-on-write or empty, it has been fixed up.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2018-12-02 13:16:46 +00:00
|
|
|
*pindex = OFF_TO_IDX((vaddr - entry->start) + entry->offset);
|
1994-05-24 10:09:53 +00:00
|
|
|
*object = entry->object.vm_object;
|
|
|
|
|
|
|
|
*out_prot = prot;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it; it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c, vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft that supported the past pseudo-coherency; the
code no longer needs it.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
return (KERN_SUCCESS);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2004-08-12 20:14:49 +00:00
|
|
|
/*
|
|
|
|
* vm_map_lookup_locked:
|
|
|
|
*
|
|
|
|
* Lookup the faulting address. A version of vm_map_lookup that returns
|
|
|
|
* KERN_FAILURE instead of blocking on map lock or memory allocation.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
vm_map_lookup_locked(vm_map_t *var_map, /* IN/OUT */
|
|
|
|
vm_offset_t vaddr,
|
|
|
|
vm_prot_t fault_typea,
|
|
|
|
vm_map_entry_t *out_entry, /* OUT */
|
|
|
|
vm_object_t *object, /* OUT */
|
|
|
|
vm_pindex_t *pindex, /* OUT */
|
|
|
|
vm_prot_t *out_prot, /* OUT */
|
|
|
|
boolean_t *wired) /* OUT */
|
|
|
|
{
|
|
|
|
vm_map_entry_t entry;
|
|
|
|
vm_map_t map = *var_map;
|
|
|
|
vm_prot_t prot;
|
|
|
|
vm_prot_t fault_type = fault_typea;
|
|
|
|
|
|
|
|
/*
|
2008-12-30 19:48:03 +00:00
|
|
|
* Lookup the faulting address.
|
2004-08-12 20:14:49 +00:00
|
|
|
*/
|
2008-12-30 19:48:03 +00:00
|
|
|
if (!vm_map_lookup_entry(map, vaddr, out_entry))
|
|
|
|
return (KERN_INVALID_ADDRESS);
|
2004-08-12 20:14:49 +00:00
|
|
|
|
2008-12-30 19:48:03 +00:00
|
|
|
entry = *out_entry;
|
2004-08-12 20:14:49 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Fail if the entry refers to a submap.
|
|
|
|
*/
|
|
|
|
if (entry->eflags & MAP_ENTRY_IS_SUB_MAP)
|
|
|
|
return (KERN_FAILURE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether this task is allowed to have this page.
|
|
|
|
*/
|
2009-11-26 05:16:07 +00:00
|
|
|
prot = entry->protection;
|
2004-08-12 20:14:49 +00:00
|
|
|
fault_type &= VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE;
|
|
|
|
if ((fault_type & prot) != fault_type)
|
|
|
|
return (KERN_PROTECTION_FAILURE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this page is not pageable, we have to get it for all possible
|
|
|
|
* accesses.
|
|
|
|
*/
|
|
|
|
*wired = (entry->wired_count != 0);
|
|
|
|
if (*wired)
|
2009-11-26 05:16:07 +00:00
|
|
|
fault_type = entry->protection;
|
2004-08-12 20:14:49 +00:00
|
|
|
|
|
|
|
if (entry->eflags & MAP_ENTRY_NEEDS_COPY) {
|
|
|
|
/*
|
|
|
|
* Fail if the entry was copy-on-write for a write fault.
|
|
|
|
*/
|
|
|
|
if (fault_type & VM_PROT_WRITE)
|
|
|
|
return (KERN_FAILURE);
|
|
|
|
/*
|
|
|
|
* We're attempting to read a copy-on-write page --
|
|
|
|
* don't allow writes.
|
|
|
|
*/
|
|
|
|
prot &= ~VM_PROT_WRITE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fail if an object should be created.
|
|
|
|
*/
|
|
|
|
if (entry->object.vm_object == NULL && !map->system_map)
|
|
|
|
return (KERN_FAILURE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the object/offset from this entry. If the entry was
|
|
|
|
* copy-on-write or empty, it has been fixed up.
|
|
|
|
*/
|
2018-12-02 13:16:46 +00:00
|
|
|
*pindex = OFF_TO_IDX((vaddr - entry->start) + entry->offset);
|
2004-08-12 20:14:49 +00:00
|
|
|
*object = entry->object.vm_object;
|
|
|
|
|
|
|
|
*out_prot = prot;
|
|
|
|
return (KERN_SUCCESS);
|
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* vm_map_lookup_done:
|
|
|
|
*
|
|
|
|
* Releases locks acquired by a vm_map_lookup
|
|
|
|
* (according to the handle returned by that lookup).
|
|
|
|
*/
|
1995-05-30 08:16:23 +00:00
|
|
|
void
|
2001-07-04 20:15:18 +00:00
|
|
|
vm_map_lookup_done(vm_map_t map, vm_map_entry_t entry)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
|
|
|
/*
|
1995-01-09 16:06:02 +00:00
|
|
|
* Unlock the main-level map
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
vm_map_unlock_read(map);
|
|
|
|
}
|
|
|
|
|
2018-03-30 10:55:31 +00:00
|
|
|
vm_offset_t
|
|
|
|
vm_map_max_KBI(const struct vm_map *map)
|
|
|
|
{
|
|
|
|
|
2018-08-29 12:24:19 +00:00
|
|
|
return (vm_map_max(map));
|
2018-03-30 10:55:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
vm_offset_t
|
|
|
|
vm_map_min_KBI(const struct vm_map *map)
|
|
|
|
{
|
|
|
|
|
2018-08-29 12:24:19 +00:00
|
|
|
return (vm_map_min(map));
|
2018-03-30 10:55:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
pmap_t
|
|
|
|
vm_map_pmap_KBI(vm_map_t map)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (map->pmap);
|
|
|
|
}
|
|
|
|
|
2020-07-13 16:39:27 +00:00
|
|
|
bool
|
|
|
|
vm_map_range_valid_KBI(vm_map_t map, vm_offset_t start, vm_offset_t end)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (vm_map_range_valid(map, start, end));
|
|
|
|
}
|
|
|
|
|
2019-10-14 17:15:42 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
static void
|
2019-11-09 17:08:27 +00:00
|
|
|
_vm_map_assert_consistent(vm_map_t map, int check)
|
2019-10-14 17:15:42 +00:00
|
|
|
{
|
|
|
|
vm_map_entry_t entry, prev;
|
2019-12-07 17:14:33 +00:00
|
|
|
vm_map_entry_t cur, header, lbound, ubound;
|
2019-10-14 17:15:42 +00:00
|
|
|
vm_size_t max_left, max_right;
|
|
|
|
|
2019-11-29 02:06:45 +00:00
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
++map->nupdates;
|
|
|
|
#endif
|
2019-11-09 17:08:27 +00:00
|
|
|
if (enable_vmmap_check != check)
|
2019-10-14 17:15:42 +00:00
|
|
|
return;
|
|
|
|
|
2019-12-07 17:14:33 +00:00
|
|
|
header = prev = &map->header;
|
2019-10-14 17:15:42 +00:00
|
|
|
VM_MAP_ENTRY_FOREACH(entry, map) {
|
|
|
|
KASSERT(prev->end <= entry->start,
|
|
|
|
("map %p prev->end = %jx, start = %jx", map,
|
|
|
|
(uintmax_t)prev->end, (uintmax_t)entry->start));
|
|
|
|
KASSERT(entry->start < entry->end,
|
|
|
|
("map %p start = %jx, end = %jx", map,
|
|
|
|
(uintmax_t)entry->start, (uintmax_t)entry->end));
|
2019-12-07 17:14:33 +00:00
|
|
|
KASSERT(entry->left == header ||
|
2019-10-14 17:15:42 +00:00
|
|
|
entry->left->start < entry->start,
|
|
|
|
("map %p left->start = %jx, start = %jx", map,
|
|
|
|
(uintmax_t)entry->left->start, (uintmax_t)entry->start));
|
2019-12-07 17:14:33 +00:00
|
|
|
KASSERT(entry->right == header ||
|
2019-10-14 17:15:42 +00:00
|
|
|
entry->start < entry->right->start,
|
|
|
|
("map %p start = %jx, right->start = %jx", map,
|
|
|
|
(uintmax_t)entry->start, (uintmax_t)entry->right->start));
|
2019-12-07 17:14:33 +00:00
|
|
|
cur = map->root;
|
|
|
|
lbound = ubound = header;
|
|
|
|
for (;;) {
|
|
|
|
if (entry->start < cur->start) {
|
|
|
|
ubound = cur;
|
|
|
|
cur = cur->left;
|
|
|
|
KASSERT(cur != lbound,
|
|
|
|
("map %p cannot find %jx",
|
2019-12-08 00:02:36 +00:00
|
|
|
map, (uintmax_t)entry->start));
|
2019-12-07 17:14:33 +00:00
|
|
|
} else if (cur->end <= entry->start) {
|
|
|
|
lbound = cur;
|
|
|
|
cur = cur->right;
|
|
|
|
KASSERT(cur != ubound,
|
|
|
|
("map %p cannot find %jx",
|
2019-12-08 00:02:36 +00:00
|
|
|
map, (uintmax_t)entry->start));
|
2019-12-07 17:14:33 +00:00
|
|
|
} else {
|
|
|
|
KASSERT(cur == entry,
|
|
|
|
("map %p cannot find %jx",
|
2019-12-08 00:02:36 +00:00
|
|
|
map, (uintmax_t)entry->start));
|
2019-12-07 17:14:33 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
max_left = vm_map_entry_max_free_left(entry, lbound);
|
|
|
|
max_right = vm_map_entry_max_free_right(entry, ubound);
|
|
|
|
KASSERT(entry->max_free == vm_size_max(max_left, max_right),
|
2019-10-14 17:15:42 +00:00
|
|
|
("map %p max = %jx, max_left = %jx, max_right = %jx", map,
|
|
|
|
(uintmax_t)entry->max_free,
|
|
|
|
(uintmax_t)max_left, (uintmax_t)max_right));
|
|
|
|
prev = entry;
|
2019-11-13 15:56:07 +00:00
|
|
|
}
|
2019-10-14 17:15:42 +00:00
|
|
|
KASSERT(prev->end <= entry->start,
|
|
|
|
("map %p prev->end = %jx, start = %jx", map,
|
|
|
|
(uintmax_t)prev->end, (uintmax_t)entry->start));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
1996-09-14 11:54:59 +00:00
|
|
|
#include "opt_ddb.h"
|
1995-04-16 12:56:22 +00:00
|
|
|
#ifdef DDB
|
1996-09-14 11:54:59 +00:00
|
|
|
#include <sys/kernel.h>
|
|
|
|
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
|
2012-11-12 00:30:40 +00:00
|
|
|
static void
|
|
|
|
vm_map_print(vm_map_t map)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2019-06-14 03:15:54 +00:00
|
|
|
vm_map_entry_t entry, prev;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1999-03-02 05:43:18 +00:00
|
|
|
db_iprintf("Task map %p: pmap=%p, nentries=%d, version=%u\n",
|
|
|
|
(void *)map,
|
1998-07-14 12:14:58 +00:00
|
|
|
(void *)map->pmap, map->nentries, map->timestamp);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1996-09-14 11:54:59 +00:00
|
|
|
db_indent += 2;
|
2019-10-14 17:15:42 +00:00
|
|
|
prev = &map->header;
|
|
|
|
VM_MAP_ENTRY_FOREACH(entry, map) {
|
Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement stack grow code. Explicitly track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().
As a result, it is impossible for a random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
|
|
|
db_iprintf("map entry %p: start=%p, end=%p, eflags=%#x, \n",
|
|
|
|
(void *)entry, (void *)entry->start, (void *)entry->end,
|
|
|
|
entry->eflags);
|
1999-03-02 05:43:18 +00:00
|
|
|
{
|
2020-02-23 03:32:04 +00:00
|
|
|
static const char * const inheritance_name[4] =
|
1995-01-09 16:06:02 +00:00
|
|
|
{"share", "copy", "none", "donate_copy"};
|
|
|
|
|
Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_objects are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.
1998-01-06 05:26:17 +00:00
|
|
|
db_iprintf(" prot=%x/%x/%s",
|
1995-01-09 16:06:02 +00:00
|
|
|
entry->protection,
|
|
|
|
entry->max_protection,
|
2019-06-14 03:15:54 +00:00
|
|
|
inheritance_name[(int)(unsigned char)
|
|
|
|
entry->inheritance]);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (entry->wired_count != 0)
|
1998-01-06 05:26:17 +00:00
|
|
|
db_printf(", wired");
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1999-02-07 21:48:23 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_IS_SUB_MAP) {
|
2002-11-07 22:49:07 +00:00
|
|
|
db_printf(", share=%p, offset=0x%jx\n",
|
1999-02-07 21:48:23 +00:00
|
|
|
(void *)entry->object.sub_map,
|
2002-11-07 22:49:07 +00:00
|
|
|
(uintmax_t)entry->offset);
|
2019-06-14 03:15:54 +00:00
|
|
|
if (prev == &map->header ||
|
|
|
|
prev->object.sub_map !=
|
|
|
|
entry->object.sub_map) {
|
1996-09-14 11:54:59 +00:00
|
|
|
db_indent += 2;
|
2012-11-12 00:30:40 +00:00
|
|
|
vm_map_print((vm_map_t)entry->object.sub_map);
|
1996-09-14 11:54:59 +00:00
|
|
|
db_indent -= 2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1995-01-09 16:06:02 +00:00
|
|
|
} else {
|
2010-12-02 17:37:16 +00:00
|
|
|
if (entry->cred != NULL)
|
|
|
|
db_printf(", ruid %d", entry->cred->cr_ruid);
|
2002-11-07 22:49:07 +00:00
|
|
|
db_printf(", object=%p, offset=0x%jx",
|
1998-07-14 12:14:58 +00:00
|
|
|
(void *)entry->object.vm_object,
|
2002-11-07 22:49:07 +00:00
|
|
|
(uintmax_t)entry->offset);
|
2010-12-02 17:37:16 +00:00
|
|
|
if (entry->object.vm_object && entry->object.vm_object->cred)
|
|
|
|
db_printf(", obj ruid %d charge %jx",
|
|
|
|
entry->object.vm_object->cred->cr_ruid,
|
Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when the allocation is performed by
the kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
2009-06-23 20:45:22 +00:00
|
|
|
(uintmax_t)entry->object.vm_object->charge);
|
1997-01-16 04:16:22 +00:00
|
|
|
if (entry->eflags & MAP_ENTRY_COW)
|
1996-09-14 11:54:59 +00:00
|
|
|
db_printf(", copy (%s)",
|
1997-01-16 04:16:22 +00:00
|
|
|
(entry->eflags & MAP_ENTRY_NEEDS_COPY) ? "needed" : "done");
|
1996-09-14 11:54:59 +00:00
|
|
|
db_printf("\n");
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2019-06-14 03:15:54 +00:00
|
|
|
if (prev == &map->header ||
|
|
|
|
prev->object.vm_object !=
|
|
|
|
entry->object.vm_object) {
|
1996-09-14 11:54:59 +00:00
|
|
|
db_indent += 2;
|
1998-07-14 12:14:58 +00:00
|
|
|
vm_object_print((db_expr_t)(intptr_t)
|
|
|
|
entry->object.vm_object,
|
2014-05-10 16:36:13 +00:00
|
|
|
0, 0, (char *)0);
|
1996-09-14 11:54:59 +00:00
|
|
|
db_indent -= 2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2019-10-14 17:15:42 +00:00
|
|
|
prev = entry;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1996-09-14 11:54:59 +00:00
|
|
|
db_indent -= 2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1998-01-06 05:26:17 +00:00
|
|
|
|
2012-11-12 00:30:40 +00:00
|
|
|
DB_SHOW_COMMAND(map, map)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (!have_addr) {
|
|
|
|
db_printf("usage: show map <addr>\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
vm_map_print((vm_map_t)addr);
|
|
|
|
}
|
1998-01-06 05:26:17 +00:00
|
|
|
|
|
|
|
DB_SHOW_COMMAND(procvm, procvm)
|
|
|
|
{
|
|
|
|
struct proc *p;
|
|
|
|
|
|
|
|
if (have_addr) {
|
2016-12-13 19:22:43 +00:00
|
|
|
p = db_lookup_proc(addr);
|
1998-01-06 05:26:17 +00:00
|
|
|
} else {
|
|
|
|
p = curproc;
|
|
|
|
}
|
|
|
|
|
1998-07-11 07:46:16 +00:00
|
|
|
db_printf("p = %p, vmspace = %p, map = %p, pmap = %p\n",
|
|
|
|
(void *)p, (void *)p->p_vmspace, (void *)&p->p_vmspace->vm_map,
|
1999-02-19 14:25:37 +00:00
|
|
|
(void *)vmspace_pmap(p->p_vmspace));
|
1998-01-06 05:26:17 +00:00
|
|
|
|
2012-11-12 00:30:40 +00:00
|
|
|
vm_map_print((vm_map_t)&p->p_vmspace->vm_map);
|
1998-01-06 05:26:17 +00:00
|
|
|
}
|
|
|
|
|
1996-09-14 11:54:59 +00:00
|
|
|
#endif /* DDB */
|