address space for an address as aligned by the new pmap_align_tlb()
function, which is for constraints imposed by the TLB. [1]
o) Add a kmem_alloc_nofault_space() function, which acts like
kmem_alloc_nofault() but allows the caller to specify which find-space
option to use. [1]
o) Use kmem_alloc_nofault_space() with VMFS_TLB_ALIGNED_SPACE to allocate the
kernel stack address on MIPS. [1]
o) Make pmap_align_tlb() on MIPS align addresses so that they do not start on
an odd boundary within the TLB, so that they are suitable for insertion as
wired entries and do not have to share a TLB entry with another mapping,
assuming they are appropriately-sized.
o) Eliminate md_realstack now that the kstack will be appropriately-aligned on
MIPS.
o) Increase the number of guard pages to 2 so that we retain the proper
alignment of the kstack address.
Reviewed by: [1] alc
X-MFC-after: Making sure alc has not come up with a better interface.
that page only makes sense if the advice is MADV_WILLNEED. In that case,
the intention is to activate the page, so discouraging the page daemon
from reclaiming the page makes sense. In contrast, in the other cases,
MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage
the page daemon from reclaiming the page by setting PG_REFERENCED.
Wrap a nearby line.
Discussed with: kib
MFC after: 3 weeks
sleeping on that page is nonsensical. Doing so reduces the likelihood
that the page daemon will reclaim the page before the thread waiting in
vm_object_backing_scan() is reawakened. However, it does not guarantee
that the page is not reclaimed, so vm_object_backing_scan() restarts
after reawakening. More importantly, this muddles the meaning of
PG_REFERENCED. There is no reason to believe that the caller of
vm_object_backing_scan() is going to use (i.e., access) the contents of
the page. There is especially no reason to believe that an access is
more likely because vm_object_backing_scan() had to sleep on the page.
Discussed with: kib
MFC after: 3 weeks
either redundant or harmful, depending on the caller. For example, when
called by vm_fault(), it is redundant. However, when called by
vm_thread_swapin(), it is harmful. Specifically, if the thread is later
swapped out, having PG_REFERENCED set on its stack pages leads the page
daemon to reactivate these stack pages and delay their reclamation.
Reviewed by: kib
MFC after: 3 weeks
Previously, one of these limits was initialized in two places to a
different value in each place. Moreover, because an unsigned int was used
to represent the amount of pageable physical memory, some of these limits
were incorrectly initialized on 64-bit architectures. (Currently, this
error is masked by login.conf's default settings.)
Make vm_thread_swapin() and vm_thread_swapout() static.
Submitted by: bde (an earlier version)
Reviewed by: kib
memory with the specified physical attributes. In particular, like
kmem_alloc_contig(), the caller can specify the physical address range
from which the physical pages are allocated and the memory attributes
(i.e., cache behavior) for these physical pages. However, in contrast to
kmem_alloc_contig() or contigmalloc(), the physical pages that are
allocated by kmem_alloc_attr() are not necessarily physically contiguous.
This function is needed by DRM and VirtualBox.
Correct an error in the prototype for kmem_malloc(). The third argument
had the wrong type.
Tested by: rnoland
MFC after: 3 days
killed by OOM. When killed process waits for a page allocation, try to
satisfy the request as fast as possible.
This removes the often encountered deadlock, where OOM continously
selects the same victim process, that sleeps uninterruptibly waiting
for a page. The killed process may still sleep if page cannot be
obtained immediately, but testing has shown that system has much
higher chance to survive in OOM situation with the patch.
In collaboration with: pho
Reviewed by: alc
MFC after: 4 weeks
no superpage mappings are created within the clean submap, aligning the
start of the clean submap helps to prevent interference with kmem_alloc()'s
use of superpages.
reference count, and decrements it on dereference. If referenced object
is deallocated, object type is reset to OBJT_DEAD. Consequently, all
vnode references that are owned by object references are never released.
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent leak.
Add an assertion to the vm_pageout() to make sure that we never get
reference on the vnode but then do not execute code to release it.
In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks
This replaces d_mmap() with the d_mmap2() implementation and also
changes the type of offset to vm_ooffset_t.
Purge d_mmap2().
All driver modules will need to be rebuilt since D_VERSION is also
bumped.
Reviewed by: jhb@
MFC after: Not in this lifetime...
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.
Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages. Text pages are not just write protected but they are also
copy-on-write. VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy. However, here is where things become
confused. It is the debugger, not the process being debugged that requires
write access to the copied page. Nonetheless, the copied page is being
mapped into the process with write access enabled. In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page. Whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access. VM_PROT_COPY addresses
this problem. The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.
Reviewed by: kib
MFC after: 4 weeks
pages.
(Note: Claims made in the comments about the handling of breakpoints in
wired pages have been false for roughly a decade. This and another bug
involving breakpoints will be fixed in coming changes.)
Reviewed by: kib
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.
Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
from tuning(7). One of the descriptions references tuning(7) because
it is too complex to adequatly describe here (it is not a simple
boolean sysctl) and users should be warned to that.
Reviewed by: alc, kib
Approved by: gnn (mentor)
version of this file. When a process forks, any wired pages are immediately
copied because copy-on-write is not supported for wired pages. In other
words, the child process is given its own private copy of each wired page
from its parent's address space. Unfortunately, to date, these copied pages
have been mapped into the child's address space with the wrong permissions,
typically VM_PROT_ALL. This change corrects the permissions.
Reviewed by: kib
install new shadow object behind the map entry and copy the pages
from the underlying objects to it. This makes the mprotect(2) call to
actually perform the requested operation instead of silently do nothing
and return success, that causes SIGSEGV on later write access to the
mapping.
Reuse vm_fault_copy_entry() to do the copying, modifying it to behave
correctly when src_entry == dst_entry.
Reviewed by: alc
MFC after: 3 weeks
the memory or D-cache, depending on the semantics of the platform.
vm_sync_icache() is basically a wrapper around pmap_sync_icache(),
that translates the vm_map_t argumument to pmap_t.
o Introduce pmap_sync_icache() to all PMAP implementation. For powerpc
it replaces the pmap_page_executable() function, added to solve
the I-cache problem in uiomove_fromphys().
o In proc_rwmem() call vm_sync_icache() when writing to a page that
has execute permissions. This assures that when breakpoints are
written, the I-cache will be coherent and the process will actually
hit the breakpoint.
o This also fixes the Book-E PMAP implementation that was missing
necessary locking while trying to deal with the I-cache coherency
in pmap_enter() (read: mmu_booke_enter_locked).
The key property of this change is that the I-cache is made coherent
*after* writes have been done. Doing it in the PMAP layer when adding
or changing a mapping means that the I-cache is made coherent *before*
any writes happen. The difference is key when the I-cache prefetches.
Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when per-uid limit is
actually exceed.
Both changes aim at calling priv_check(9) only for the cases when
privilege is actually exercised by the process.
Reported and tested by: rwatson
Reviewed by: alc
MFC after: 3 days
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities. While this of course does not fix vulnerabilities,
it does mitigate their impact.
Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.
This restriction can be disabled with the security.bsd.mmap_zero
sysctl variable.
Discussed with: rwatson, bz
Tested by: bz (Wine), simon (VirtualBox)
Submitted by: jhb
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.
For a.out and pre-8 ELF binaries, allow the mmap of zero length.
Reported by: tegge
Reviewed by: tegge, alc, jhb
MFC after: 3 days
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].
This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.
Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.
Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarious)
MFC after: 1 week
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].
This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.
Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.
Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho
MFC after: 1 week
accidentally lost at one point during the PAT development. Without this
fix vm_pager_get_pages() was zeroing each of the pages.
Submitted by: czander @ NVidia
MFC after: 3 days
pages in an object.
- Add a new variant of d_mmap() currently called d_mmap2() which accepts
an additional in/out parameter that is the memory attribute to use for
the requested page.
- A driver either uses d_mmap() or d_mmap2() for all requests but not both.
The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate
that the driver provides a d_mmap2() handler instead of d_mmap(). This
is done to make the change ABI compatible with existing drivers and
MFC'able to 7 and 8.
Submitted by: alc
MFC after: 1 month
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.
Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks
amd64 and i386. Essentially, fictitious pages provide a mechanism for
creating aliases for either normal or device-backed pages. Therefore,
pmap_page_set_memattr() on a fictitious page needn't update the direct
map or flush the cache. Such actions are the responsibility of the
"primary" instance of the page or the device driver that "owns" the
physical address. For example, these actions are already performed by
pmap_mapdev().
The device pager needn't restore the memory attributes on a fictitious
page before releasing it. It's now pointless.
Add pmap_page_set_memattr() to the Xen pmap.
Approved by: re (kib)
configuring machine-dependent memory attributes...":
Don't set the memory attribute for a "real" page that is allocated to
a device object in vm_page_alloc(). It is a pointless act, because
the device pager replaces this "real" page with a "fake" page and sets
the memory attribute on that "fake" page.
Eliminate pointless code from pmap_cache_bits() on amd64.
Employ the "Self Snoop" feature supported by some x86 processors to
avoid cache flushes in the pmap.
Approved by: re (kib)
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.
Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week
dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.
In collaboration with: jhb
Approved by: re (kib)
non-readable and non-executable map entry, the entry is skipped from
wiring and loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not
set for the map entry, its wired_count is later erronously decremented.
vm_map_delete(9) for such map entry stuck in "vmmaps".
Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.
Reported by: John Marshall <john.marshall riverwillow com au>
Approved by: re (kensmith)
charge the objects created by vm_fault_copy_entry. The object charge
was set, but reserve not incremented.
Reported by: Greg Rivers <gcr+freebsd-current tharned org>
Reviewed by: alc (previous version)
Approved by: re (kensmith)
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.
In collaboration with: jhb
Note that this does not actually enable full-range i/o requests for
64 architectures, and is done now to update KBI only.
Tested by: pho
Reviewed by: jhb, bde (as part of the review of the bigger patch)
valid mask. Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.
Eliminate an unnecessary #include.
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
calls to vdrop() until after the free page queues lock is released. This
eliminates repeatedly releasing and reacquiring the free page queues lock
each time the last cached page is reclaimed from a vnode-backed object.
structure. When the page is shared, the kernel mapping becomes a special
type of managed page to force the cache off the page mappings. This is
needed to avoid stale entries on all ARM VIVT caches, and VIPT caches
with cache color issue.
Submitted by: Mark Tinguely
Reviewed by: alc
Tested by: Grzegorz Bernacki, thompsa
with the malloc tag and calls a new back-end, kmem_alloc_contig(), that
allocates the pages and maps them.
The motivations for this change are two-fold: (1) A cache mode parameter
will be added to kmem_alloc_contig(). In other words, kmem_alloc_contig()
will be extended to support the allocation of memory with caller-specified
caching. (2) The UMA allocation function that is used by the two jumbo
frames zones can use kmem_alloc_contig() in place of contigmalloc() and
thereby avoid having free jumbo frames held by the zone counted as live
malloc()ed memory.
kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the
page's valid bits until contigmapping() when the mapping is known to be
successful.
memory with 4MB pages was added to pmap_object_init_pt(). This code
assumes that the pages of a OBJT_DEVICE object are always physically
contiguous. Unfortunately, this is not always the case. For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
a OBJT_DEVICE object that violates this assumption. Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous. This
revision also changes some inconsistent if not buggy behavior. For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid. However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page. The amd64 version has a bug of
my own creation. It potentially busies the wrong page and always an
insufficent number of pages if it blocks allocating a page table page.
To my knowledge, there have been no reports of these bugs, hence,
their persistance. I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms. However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping devices pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity). These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)
Discussed with: jhb@
vm_map_pmap_enter(). The immediate effect of this change is that automatic
prefaulting by mmap() for small mappings is performed on POSIX shared memory
objects just the same as it is on ordinary files.
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared. Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
added to cdevsw. This function is called for each mmap() request.
If it returns ENODEV, then the mmap() request will fall back to using
the device's device pager object and d_mmap(). Otherwise, the method
can return a VM object to satisfy this entire mmap() request via
*object. It can also modify the starting offset into this object via
*foff. This allows device drivers to use the file offset as a cookie
to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
mapping V_CHR vnodes. This avoids duplicating all the cdev mmap
handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02. Older device drivers
using D_VERSION_01 are still supported.
MFC after: 1 month
rather than unconditionally making partially dirty pages fully dirty, only
make partially dirty pages fully dirty if the pmap says that the page has
been modified.
(This change is also a small optimization. It eliminate an unnecessary call
to pmap_is_modified() on pages that are mapped read only.)
Suggested by: tegge
following changes:
Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge
Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.
Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.
Introduce vm_page_set_valid().
Reviewed by: tegge
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed. Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed. Reviewed by: tegge
Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs(). Instead, assert that the page is clean.
Reviewed by: tegge
Eliminate some blank lines.
Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old(). The page is not mapped. Therefore, it cannot have
any page table entries that are modified.
Eliminate an incorrect comment from vnode_pager_generic_getpages().