sleeping on that page is nonsensical. Doing so reduces the likelihood
that the page daemon will reclaim the page before the thread waiting in
vm_object_backing_scan() is reawakened. However, it does not guarantee
that the page is not reclaimed, so vm_object_backing_scan() restarts
after reawakening. More importantly, this muddles the meaning of
PG_REFERENCED. There is no reason to believe that the caller of
vm_object_backing_scan() is going to use (i.e., access) the contents of
the page. There is especially no reason to believe that an access is
more likely because vm_object_backing_scan() had to sleep on the page.
Discussed with: kib
MFC after: 3 weeks
either redundant or harmful, depending on the caller. For example, when
called by vm_fault(), it is redundant. However, when called by
vm_thread_swapin(), it is harmful. Specifically, if the thread is later
swapped out, having PG_REFERENCED set on its stack pages leads the page
daemon to reactivate these stack pages and delay their reclamation.
Reviewed by: kib
MFC after: 3 weeks
Previously, one of these limits was initialized in two places to a
different value in each place. Moreover, because an unsigned int was used
to represent the amount of pageable physical memory, some of these limits
were incorrectly initialized on 64-bit architectures. (Currently, this
error is masked by login.conf's default settings.)
Make vm_thread_swapin() and vm_thread_swapout() static.
Submitted by: bde (an earlier version)
Reviewed by: kib
memory with the specified physical attributes. In particular, like
kmem_alloc_contig(), the caller can specify the physical address range
from which the physical pages are allocated and the memory attributes
(i.e., cache behavior) for these physical pages. However, in contrast to
kmem_alloc_contig() or contigmalloc(), the physical pages that are
allocated by kmem_alloc_attr() are not necessarily physically contiguous.
This function is needed by DRM and VirtualBox.
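For illustration only, a call might look like the following sketch; the
signature (map, size, malloc flags, physical range, memory attribute) is
as introduced by this change, and the write-combining attribute is an
arbitrary example:

    vm_offset_t va;

    /* One zeroed page below 4GB, mapped write-combining. */
    va = kmem_alloc_attr(kernel_map, PAGE_SIZE, M_NOWAIT | M_ZERO,
        0, 0xffffffff, VM_MEMATTR_WRITE_COMBINING);
    if (va == 0)
        return (ENOMEM);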
Correct an error in the prototype for kmem_malloc(). The third argument
had the wrong type.
Tested by: rnoland
MFC after: 3 days
killed by OOM. When the killed process waits for a page allocation, try to
satisfy the request as fast as possible.
This removes the often-encountered deadlock, where OOM continuously
selects the same victim process, which sleeps uninterruptibly waiting
for a page. The killed process may still sleep if a page cannot be
obtained immediately, but testing has shown that the system has a much
higher chance of surviving an OOM situation with the patch.
In collaboration with: pho
Reviewed by: alc
MFC after: 4 weeks
no superpage mappings are created within the clean submap, aligning the
start of the clean submap helps to prevent interference with kmem_alloc()'s
use of superpages.
reference count, and decrements it on dereference. If the referenced object
is deallocated, the object type is reset to OBJT_DEAD. Consequently, all
vnode references that are owned by object references are never released.
vunref() the vnode in the vm object deallocation code for OBJT_VNODE an
appropriate number of times to prevent a leak.
Add an assertion to vm_pageout() to make sure that we never take a
reference on the vnode but then fail to execute the code that releases it.
In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks
This replaces d_mmap() with the d_mmap2() implementation and also
changes the type of offset to vm_ooffset_t.
Purge d_mmap2().
All driver modules will need to be rebuilt since D_VERSION is also
bumped.
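For illustration, a driver's handler under the new interface would take
roughly this shape; foo_paddr() is a hypothetical helper and the
attribute override is optional:

    static int
    foo_mmap(struct cdev *dev, vm_ooffset_t offset, vm_paddr_t *paddr,
        int nprot, vm_memattr_t *memattr)
    {

        /* Translate the (now 64-bit) offset to a physical address. */
        *paddr = foo_paddr(dev, offset);
        /* Optionally override the memory attribute for this page. */
        *memattr = VM_MEMATTR_UNCACHEABLE;
        return (0);
    }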
Reviewed by: jhb@
MFC after: Not in this lifetime...
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month
flag. Besides providing redundant information, the need to update both
vnode and object flags causes more acquisitions of the vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.
Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages. Text pages are not just write protected but they are also
copy-on-write. VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy. However, here is where things become
confused. It is the debugger, not the process being debugged that requires
write access to the copied page. Nonetheless, the copied page is being
mapped into the process with write access enabled. In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page, whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access. VM_PROT_COPY addresses
this problem. The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.
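The mechanism reduces to a fault request of roughly this form (a sketch;
the debugger-side write path is assumed to fault the page in before
performing the write):

    /*
     * Force replication of the COW text page for the debugger's write
     * while the debuggee keeps only read access to its private copy.
     */
    error = vm_fault(map, trunc_page(va), VM_PROT_READ | VM_PROT_COPY,
        VM_FAULT_NORMAL);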
Reviewed by: kib
MFC after: 4 weeks
pages.
(Note: Claims made in the comments about the handling of breakpoints in
wired pages have been false for roughly a decade. This and another bug
involving breakpoints will be fixed in coming changes.)
Reviewed by: kib
This improvement aims to avoid further cache misses in scheduler-specific
functions which need to keep track of average thread running time, and
further locking in places setting this flag.
Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
from tuning(7). One of the descriptions references tuning(7) because
it is too complex to adequately describe here (it is not a simple
boolean sysctl), and users should be pointed there.
Reviewed by: alc, kib
Approved by: gnn (mentor)
version of this file. When a process forks, any wired pages are immediately
copied because copy-on-write is not supported for wired pages. In other
words, the child process is given its own private copy of each wired page
from its parent's address space. Unfortunately, to date, these copied pages
have been mapped into the child's address space with the wrong permissions,
typically VM_PROT_ALL. This change corrects the permissions.
Reviewed by: kib
install a new shadow object behind the map entry and copy the pages
from the underlying objects to it. This makes the mprotect(2) call
actually perform the requested operation instead of silently doing
nothing and returning success, which causes SIGSEGV on later write
access to the mapping.
Reuse vm_fault_copy_entry() to do the copying, modifying it to behave
correctly when src_entry == dst_entry.
Reviewed by: alc
MFC after: 3 weeks
the memory or D-cache, depending on the semantics of the platform.
vm_sync_icache() is basically a wrapper around pmap_sync_icache()
that translates the vm_map_t argument to pmap_t.
o Introduce pmap_sync_icache() to all PMAP implementations. For powerpc
it replaces the pmap_page_executable() function, added to solve
the I-cache problem in uiomove_fromphys().
o In proc_rwmem() call vm_sync_icache() when writing to a page that
has execute permissions. This assures that when breakpoints are
written, the I-cache will be coherent and the process will actually
hit the breakpoint.
o This also fixes the Book-E PMAP implementation that was missing
necessary locking while trying to deal with the I-cache coherency
in pmap_enter() (read: mmu_booke_enter_locked).
The key property of this change is that the I-cache is made coherent
*after* writes have been done. Doing it in the PMAP layer when adding
or changing a mapping means that the I-cache is made coherent *before*
any writes happen. The difference is key when the I-cache prefetches.
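As a sketch of the proc_rwmem() side, with writing/executable standing
in for the actual checks:

    /* After the breakpoint bytes have been stored... */
    if (writing && executable)
        vm_sync_icache(map, uva, len);  /* make I-cache coherent */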
Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when the per-uid limit is
actually exceeded.
Both changes aim at calling priv_check(9) only for the cases when
privilege is actually exercised by the process.
Reported and tested by: rwatson
Reviewed by: alc
MFC after: 3 days
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities. While this of course does not fix vulnerabilities,
it does mitigate their impact.
Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.
This restriction can be disabled with the security.bsd.map_at_zero
sysctl variable.
Discussed with: rwatson, bz
Tested by: bz (Wine), simon (VirtualBox)
Submitted by: jhb
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.
For a.out and pre-8 ELF binaries, allow the mmap of zero length.
Reported by: tegge
Reviewed by: tegge, alc, jhb
MFC after: 3 days
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For a thread that has its
kernel stack cached, verify that the requested stack size is equal to the
actual one, and reallocate the stack if the sizes differ [1].
This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.
Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce a separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add a vm_lowmem handler to prune the cache on
low memory conditions. This way, a system with a reasonable number of
threads gets lower thread-creation latency, while still not exhausting
a significant portion of KVA for unused kstacks.
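A minimal sketch of such a hook; kstack_cache_purge() is a hypothetical
helper that returns the cached stacks to the VM:

    static void
    kstack_lowmem(void *arg __unused, int flags __unused)
    {

        kstack_cache_purge();
    }

    /* Registered once during initialization of the cache: */
    EVENTHANDLER_REGISTER(vm_lowmem, kstack_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);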
Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarios)
MFC after: 1 week
accidentally lost at one point during the PAT development. Without this
fix vm_pager_get_pages() was zeroing each of the pages.
Submitted by: czander @ NVidia
MFC after: 3 days
pages in an object.
- Add a new variant of d_mmap() currently called d_mmap2() which accepts
an additional in/out parameter that is the memory attribute to use for
the requested page.
- A driver either uses d_mmap() or d_mmap2() for all requests but not both.
The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate
that the driver provides a d_mmap2() handler instead of d_mmap(). This
is done to make the change ABI compatible with existing drivers and
MFC'able to 7 and 8.
Submitted by: alc
MFC after: 1 month
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.
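A sketch of how a driver might build such an object for one physically
contiguous region; the trailing credential argument follows the later,
extended vm_pager_allocate() interface:

    struct sglist *sg;
    vm_object_t obj;

    sg = sglist_alloc(1, M_WAITOK);
    sglist_append_phys(sg, paddr, size);
    /* Offsets are resolved through the sglist, not d_mmap(). */
    obj = vm_pager_allocate(OBJT_SG, sg, size, nprot, 0,
        curthread->td_ucred);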
Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks
amd64 and i386. Essentially, fictitious pages provide a mechanism for
creating aliases for either normal or device-backed pages. Therefore,
pmap_page_set_memattr() on a fictitious page needn't update the direct
map or flush the cache. Such actions are the responsibility of the
"primary" instance of the page or the device driver that "owns" the
physical address. For example, these actions are already performed by
pmap_mapdev().
The device pager needn't restore the memory attributes on a fictitious
page before releasing it. It's now pointless.
Add pmap_page_set_memattr() to the Xen pmap.
Approved by: re (kib)
configuring machine-dependent memory attributes...":
Don't set the memory attribute for a "real" page that is allocated to
a device object in vm_page_alloc(). It is a pointless act, because
the device pager replaces this "real" page with a "fake" page and sets
the memory attribute on that "fake" page.
Eliminate pointless code from pmap_cache_bits() on amd64.
Employ the "Self Snoop" feature supported by some x86 processors to
avoid cache flushes in the pmap.
Approved by: re (kib)
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.
Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week
dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.
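Condensed to a sketch, the new interface is used along these lines
(error handling omitted; the attribute values are arbitrary examples):

    /* Default attribute for all future pages of the object. */
    VM_OBJECT_LOCK(object);
    vm_object_set_memattr(object, VM_MEMATTR_WRITE_COMBINING);
    VM_OBJECT_UNLOCK(object);

    /* Retag a single page; the pmap flushes caches as needed. */
    pmap_page_set_memattr(m, VM_MEMATTR_UNCACHEABLE);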
In collaboration with: jhb
Approved by: re (kib)
non-readable and non-executable map entry, the entry is skipped from
wiring and the loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not
set for the map entry, its wired_count is later erroneously decremented.
A subsequent vm_map_delete(9) on such a map entry gets stuck in "vmmaps".
Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.
Reported by: John Marshall <john.marshall riverwillow com au>
Approved by: re (kensmith)
charge the objects created by vm_fault_copy_entry. The object charge
was set, but the reserve was not incremented.
Reported by: Greg Rivers <gcr+freebsd-current tharned org>
Reviewed by: alc (previous version)
Approved by: re (kensmith)
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this change adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.
In collaboration with: jhb
Note that this does not actually enable full-range i/o requests for
64-bit architectures; it is done now only to update the KBI.
Tested by: pho
Reviewed by: jhb, bde (as part of the review of the bigger patch)
valid mask. Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.
Eliminate an unnecessary #include.
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding a struct
ucred *, which is used to charge the appropriate uid when the allocation
is performed by the kernel, e.g. by md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
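The accounting reduces to paired calls of roughly this form (a sketch;
swap_reserve() returns non-zero on success):

    /* Charge the current uid before the anonymous region grows. */
    if (!swap_reserve(incr))
        return (ENOMEM);

    /* ... and return the charge when the entry or object dies. */
    swap_release(incr);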
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
calls to vdrop() until after the free page queues lock is released. This
eliminates repeatedly releasing and reacquiring the free page queues lock
each time the last cached page is reclaimed from a vnode-backed object.
structure. When the page is shared, the kernel mapping becomes a special
type of managed page to force the cache off the page mappings. This is
needed to avoid stale entries on all ARM VIVT caches, and VIPT caches
with cache color issues.
Submitted by: Mark Tinguely
Reviewed by: alc
Tested by: Grzegorz Bernacki, thompsa
with the malloc tag and calls a new back-end, kmem_alloc_contig(), that
allocates the pages and maps them.
The motivations for this change are two-fold: (1) A cache mode parameter
will be added to kmem_alloc_contig(). In other words, kmem_alloc_contig()
will be extended to support the allocation of memory with caller-specified
caching. (2) The UMA allocation function that is used by the two jumbo
frames zones can use kmem_alloc_contig() in place of contigmalloc() and
thereby avoid having free jumbo frames held by the zone counted as live
malloc()ed memory.
kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the
page's valid bits until contigmapping() when the mapping is known to be
successful.
memory with 4MB pages was added to pmap_object_init_pt(). This code
assumes that the pages of a OBJT_DEVICE object are always physically
contiguous. Unfortunately, this is not always the case. For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
a OBJT_DEVICE object that violates this assumption. Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous. This
revision also changes some inconsistent if not buggy behavior. For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid. However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page. The amd64 version has a bug of
my own creation. It potentially busies the wrong page, and always an
insufficient number of pages, if it blocks allocating a page table page.
To my knowledge, there have been no reports of these bugs, hence,
their persistence. I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms. However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping devices pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITIOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity). These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)
Discussed with: jhb@
vm_map_pmap_enter(). The immediate effect of this change is that automatic
prefaulting by mmap() for small mappings is performed on POSIX shared memory
objects just the same as it is on ordinary files.
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared. Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
added to cdevsw. This function is called for each mmap() request.
If it returns ENODEV, then the mmap() request will fall back to using
the device's device pager object and d_mmap(). Otherwise, the method
can return a VM object to satisfy this entire mmap() request via
*object. It can also modify the starting offset into this object via
*foff. This allows device drivers to use the file offset as a cookie
to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
mapping V_CHR vnodes. This avoids duplicating all the cdev mmap
handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02. Older device drivers
using D_VERSION_01 are still supported.
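A skeletal handler under these rules; foo_lookup_object() is a
hypothetical per-driver lookup:

    static int
    foo_mmap_single(struct cdev *cdev, vm_ooffset_t *offset, vm_size_t size,
        vm_object_t *object, int nprot)
    {
        vm_object_t obj;

        obj = foo_lookup_object(cdev, *offset, size, nprot);
        if (obj == NULL)
            return (ENODEV);    /* fall back to d_mmap() */
        *object = obj;
        *offset = 0;            /* map from the object's start */
        return (0);
    }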
MFC after: 1 month
rather than unconditionally making partially dirty pages fully dirty, only
make partially dirty pages fully dirty if the pmap says that the page has
been modified.
(This change is also a small optimization. It eliminates an unnecessary call
to pmap_is_modified() on pages that are mapped read only.)
Suggested by: tegge
following changes:
Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge
Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.
Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.
Introduce vm_page_set_valid().
Reviewed by: tegge
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed. Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed. Reviewed by: tegge
Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs(). Instead, assert that the page is clean.
Reviewed by: tegge
Eliminate some blank lines.
Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old(). The page is not mapped. Therefore, it cannot have
any page table entries that are modified.
Eliminate an incorrect comment from vnode_pager_generic_getpages().
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access. Instead, assert that the page is not
mapped or not mapped for write access as appropriate.
Eliminate unnecessary clearing of a page's dirty mask. Instead, assert
that the page's dirty mask is clear.
vmopag" implementation. The vm_page_lookup() code modifies splay tree
of the object pages, and asserts that object lock is taken. First issue
could cause kernel data corruption, and second one instantly panics the
INVARIANTS-enabled kernel.
Take the advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.
While there, make the code slightly more style-compliant by moving
variables declarations to the right place.
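The replacement loop has essentially this shape (a sketch):

    vm_page_t m;

    /*
     * memq is sorted by pindex, so runs of consecutive pages can be
     * found in a single pass without touching the splay tree.
     */
    TAILQ_FOREACH(m, &object->memq, listq) {
        /*
         * Compare m->pindex with the previous page's index to
         * extend or terminate the current run.
         */
    }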
Discussed with: jhb, alc
Reviewed by: alc
MFC after: 2 weeks
the vmspace of the examined process instead of directly accessing its
vmspace, which may change. Also, as an optimization, check for the
P_INEXEC flag before examining the process.
Reported and tested by: pho (previous version)
Reviewed by: alc
MFC after: 3 weeks
busy count. Only mappings that allow write access should be prevented by
a non-zero busy count.
(The prohibition on mapping pages for read access when they have a non-
zero busy count originated in revision 1.202 of i386/i386/pmap.c when
this code was a part of the pmap.)
Reviewed by: tegge
a reservation, unless all of the reservation's pages were free, the
reservation was moved to the head of the partially-populated reservations
queue, where it would be the next reservation to be broken in case the
free page queues were emptied. Now, instead, I am moving it to the tail.
Very likely this reservation is in the process of being freed in its
entirety, so placing it at the tail of the queue makes it more likely that
the underlying physical memory will be returned to the free page queues as
one contiguous chunk. If a reservation must be broken, it will, instead,
be the longest unchanged reservation, which is arguably the reservation
that is least likely to ever achieve promotion or be freed in its entirety.
MFC after: 6 weeks
the mappings without any read or execute rights, in particular,
the PROT_NONE entries. This makes mlockall(2) work for process
address spaces that have such mappings.
Since protection mode of the entry may change between setting
MAP_ENTRY_IN_TRANSITION and final pass over the region that records
the wire status of the entries, allocate new map entry flag
MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.
Reported and tested by: Hans Ottevanger <fbsdhackers beasties demon nl>
Reviewed by: alc
MFC after: 3 weeks
but I see no benefit from it today.
VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission. On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission. The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c
revision 1.56.)
Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list. So,
the need for coalescing vm map entries is not as great as it once was.
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.
MFC after: 1 month
fault. In r188331, because of synchronization changes, this update was
relocated to a place where it would occur on both hard and soft faults.
This change again restricts the update to hard faults.
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when the map is unlocked.
Tested by: pho
Reviewed by: alc
1 will trigger a pass through the VM's low-memory handlers, such as
protocol and UMA drain routines. This makes it easier to exercise
these otherwise rarely-invoked code paths.
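In essence, the handler just fires the low-memory event (a sketch of the
sysctl handler):

    static int
    debug_vm_lowmem(SYSCTL_HANDLER_ARGS)
    {
        int error, i;

        i = 0;
        error = sysctl_handle_int(oidp, &i, 0, req);
        if (error != 0 || req->newptr == NULL)
            return (error);
        if (i != 0)
            EVENTHANDLER_INVOKE(vm_lowmem, 0);
        return (0);
    }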
MFC after: 3 days
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until the caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into a freelist
that is released by the new helper function vm_map_entry_free_freelist().
Reviewed by: tegge, alc
Tested by: pho
Reference object, drop the map lock, and then call vm_object_sync().
The object sync might require vnode lock for OBJT_VNODE type objects.
Reviewed by: tegge
Tested by: pho
acquire vnode lock for OBJT_VNODE object after map lock is dropped.
Because we have the busy page(s) in the object, sleeping there would
result in a deadlock with vnode resize. Try to get the lock without
sleeping and, if the attempt fails, drop the state, lock the vnode, and
restart the fault handler from the start with the vnode already locked.
Because the vnode_pager_lock() function is inlined in vm_fault(),
axe it.
Based on suggestion by: alc
Reviewed by: tegge, alc
Tested by: pho
describing why several calls to vm_deallocate_object() with locked map
do not result in the acquisition of the vnode lock after map lock.
Suggested and reviewed by: tegge
be accessible outside vmspace_fork() yet, but locking it would satisfy
the protocol of the vm_map_entry_link() and other functions called
from vmspace_fork().
Use a trylock, which supposedly cannot fail, to silence the WITNESS
warning about the nested acquisition of an sx lock with the same name.
Suggested and reviewed by: tegge
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.
Sponsored by: Nokia
inside the SYSCTL() macros and thus does not need to be done for
all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
of the counter, which may happen when too many sendfile(2) calls are
being executed with this vnode [1].
To keep the size of struct vm_page and the offsets of the fields
accessed by out-of-tree modules unchanged, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]
Reported by: Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by: alc [2]
Reviewed by: alc
MFC after: 2 weeks
vm_map_lookup{,_locked}() to vm_map_lookup_entry(). Having the fast path
in vm_map_lookup{,_locked}() limits its benefits to page faults. Moving
it to vm_map_lookup_entry() extends its benefits to other operations on
the vm map.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()
and BO_SYNC() "virtual method" of the buffer objects set.
The main consumers of the bufobj functions are affected by this change
too; in particular, the functions whose KPI changed are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please consider the 'curthread' argument passed to
VOP_SYNC() (in bufsync()) as just temporary; it will be axed out ASAP
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.
Call vm_pageout_oom() from swp_pager_meta_build() when the swap zone
is exhausted.
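A sketch of the new call site, using the reason constants introduced for
this purpose:

    /* In swp_pager_meta_build(): the swap metadata zone is empty. */
    if (uma_zone_exhausted(swap_zone))
        vm_pageout_oom(VM_OOM_SWAPZ);   /* pick a process to kill */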
Reviewed by: alc
Tested by: pho, jhb
MFC after: 2 weeks
file descriptor into it. Make sure that td_fpop is NULL when calling
d_mmap from dev_pager_getpages().
The change guards against the td_fpop field being non-NULL with private
state for another device, and against sudden clearing of td_fpop. This
could occur when either a driver method calls another driver through a
file descriptor operation, or a page fault happens while a driver is
writing to memory backed by another driver.
Noted by: rwatson
Tested by: rnoland
MFC after: 3 days
that redzone adds to the allocation for storing its metadata is at least as
large as the metadata that it will store there.
Submitted by: Nima Misaghian
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).
With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.
Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().
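Consumers now follow a pattern of roughly this shape (a sketch;
kick_proc0() wakes the swapper):

    int wakeup_swapper;

    sleepq_lock(wchan);
    wakeup_swapper = sleepq_broadcast(wchan, SLEEPQ_SLEEP, 0, 0);
    sleepq_release(wchan);
    /* The sleepq lock is no longer held, so this is LOR-free. */
    if (wakeup_swapper)
        kick_proc0();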
Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks
to downgrade the exclusive lock to a shared one when the exclusive lock
owner requests a shared lock. The new lockmgr panics instead.
The vnode_pager_lock() function requests a shared lock on the vnode
backing the OBJT_VNODE object, and can be called when the current thread
already holds an exclusive lock on the vnode. For instance, this happens
when handling a page fault from the VOP_WRITE() uiomove that writes to
the file, with the faulted-in page fetched from the vm object backed by
the same file. We then get the situation described above.
Verify whether the vnode is already exclusively locked by curthread and,
if so, request a recursed exclusive vnode lock instead of a shared one.
Reported by: gallatin
Discussed with: attilio
for the bio for the swapout write. This allows the page allocator to
drain the free page list deeper. As a result, a deadlock where the
pageout daemon sleeps waiting for a bio to be allocated for swapout is
no longer reproducible in practice.
Alan said that M_USE_RESERVE shall be resurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.
Tested by: pho, kris
Reviewed by: alc
No objections from: phk
MFC after: 2 weeks (RELENG_7 only)
on the amd64 architecture. The amd64 architecture requires kernel code and
global variables to reside in the highest 2GB of the 64-bit virtual address
space. Thus, the memory allocated during bootstrap, before the call to
kmem_init(), starts at KERNBASE, which is not necessarily the same as
VM_MIN_KERNEL_ADDRESS on amd64.
PowerPC/AIM. Consequently, it should not be used to determine the maximum
number of kernel map entries. Instead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.
Tested by: marcel@ (PowerPC/AIM)
work. (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple. Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise(). This
function moves the page to the inactive queue. Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon. On the other hand, if it is dirty,
it is placed at the tail. Let's further examine the case in which the page
is clean. Recall that the page is at the head of the line for processing by
the page daemon. The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon. (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.) The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced. In effect, the madvise() was for naught.
The case in which the page was dirty is not too different. Instead of being
laundered, the page is reactivated.
Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field. So, MADV_FREE is always
executing the clean case above.
This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.
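The fix amounts to adding the equivalent of the following to
vm_page_dontneed() (a sketch; the page queues lock is assumed held):

    /*
     * Clear both reference indicators so that the page daemon does
     * not immediately reactivate the page.
     */
    pmap_clear_reference(m);
    vm_page_flag_clear(m, PG_REFERENCED);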
MFC after: 6 weeks
entirety of the specified range be mapped. Specifically, it has
returned EINVAL if the entire range is not mapped. There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement. This
revision removes this requirement.
Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks
Directory IO without a VM object will store data in 'malloced' buffers,
severely limiting caching of the data. Without this change, VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.
Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
superpage-aligned virtual address for the mapping. Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address. Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address. If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated. In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.
This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().
physical address of the device's memory. This enables
pmap_align_superpage() to propose a virtual address for mapping the
device memory that permits the use of superpage mappings.
used to request superpage alignment for the submap.
Request superpage alignment for the kmem_map.
Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)
Remove a stale comment from kmem_malloc().
support for VMFS_ALIGNED_SPACE, which requests the allocation of an
address range best suited to superpages. The old options TRUE and FALSE
are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no
immediate need to update all of vm_map_find(9)'s callers.
While I'm here, correct a misstatement about vm_map_find(9)'s return
values in the man page.
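With the new option, a call takes this shape (a sketch against this
era's vm_map_find() signature):

    /* Let the VM choose a superpage-aligned address. */
    error = vm_map_find(map, object, offset, &addr, size,
        VMFS_ALIGNED_SPACE, prot, maxprot, cow);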
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.
Sponsored by: Nokia
contigmalloc(9) as a last resort to steal pages from an inactive,
partially-used superpage reservation.
Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and
refactor it so that a separate subroutine is responsible for breaking
the selected reservation. This subroutine is also used by
vm_reserv_reclaim_contig().
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.
Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
vm_object_reference(). This is intended to get rid of vget()
consumers who don't wish to acquire a lock. This is functionally
the same as calling vref(). vm_object_reference_locked() already
uses vref().
Discussed with: alc
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.
PR: kern/119422
MFC after: 1 week
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.
vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c. Today, vm/vm_pageq.c
has withered to the point that it contains only four short functions,
two of which are only used by vm/vm_page.c. Since I can't foresee any
reason for vm/vm_pageq.c to grow, it is time to fold the remaining
contents of vm/vm_pageq.c back into vm/vm_page.c.
Add some comments. Rename one of the functions, vm_pageq_enqueue(),
that is now static within vm/vm_page.c to vm_page_enqueue().
Eliminate PQ_MAXCOUNT as it no longer serves any purpose.
Instead of checking each page for PG_UNMANAGED, perform a one-time
check whether the object is OBJT_PHYS. (PG_UNMANAGED pages only
belong to OBJT_PHYS objects.)
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.
Reviewed by: jhb, peter
Specifically, since the delete-behind heuristic is never applied to a
device-backed object, there is no point in checking whether each of the
object's pages is fictitious. (Only device-backed objects have
fictitious pages.)
structure. This allows per-CPU variations of struct pmap on a
single architecture without affecting the machine-independent
fields. As such, the PMAP variations don't affect the ABI. They
become part of it.
pmap_remove_all() must not be called on fictitious pages. To date,
fictitious pages have been allocated from zeroed memory, effectively
hiding this problem because the fictitious pages appear to have an empty
pv list. Submitted by: Kostik Belousov
Rewrite the comments describing vm_object_page_remove() to better
describe what it does. Add an assertion. Reviewed by: Kostik Belousov
MFC after: 1 week
only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should
ever have OBJ_ONEMAPPING set. However, vm_object_deallocate() was
setting it on device (OBJT_DEVICE) objects. As a result,
vm_object_page_remove() could be called on a device object and if that
occurred pmap_remove_all() would be called on the device object's pages.
However, a device object's pages are fictitious, and fictitious pages do
not have an initialized pv list (struct md_page).
To date, fictitious pages have been allocated from zeroed memory,
effectively hiding this problem. Now, however, the conversion of rotting
diagnostics to invariants in the amd64 and i386 pmaps has revealed the
problem. Specifically, assertion failures have occurred during the
initialization phase of the X server on some hardware.
MFC after: 1 week
Discussed with: Kostik Belousov
Reported by: Michiel Boland
conjunction with a 'thread' argument that is always curthread.
Remove the useless extra argument and explicitly pass curthread to lower
layer functions, when necessary.
The KPI is broken by this change, which should affect several ports, so
a version bump and manpage update will be committed later.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
address space in the kmem map, invoke the vm_lowmem event in a loop and
wait a bit for subsystems to reclaim some memory, which in turn will
reclaim address space as well.
Note, this is a work-around.
Reviewed by: alc
Approved by: alc
MFC after: 3 days
Remove this argument and pass curthread directly to the underlying
VOP_LOCK1() VFS method. This change makes the code cleaner and, in
particular, removes an annoying dependency, helping the next lockmgr()
cleanup.
The KPI is, obviously, changed.
The manpage and __FreeBSD_version will be updated through further
commits.
As a side note, it is worth mentioning that the next commits will
address a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.
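From userland the new calls compose like ordinary POSIX shared memory; a
minimal sketch using the SHM_ANON extension (error checking omitted):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int fd = shm_open(SHM_ANON, O_RDWR, 0600);  /* unnamed object */
    ftruncate(fd, 4096);                        /* size starts at zero */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);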
Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
assertion hit in swapoff_one() when we unmount a swap partition. We
should be using curthread where we used thread0 before. This change
also replaces the thread argument with a credential argument, as the
MAC framework only requires the cred.
It should be noted that this allows the machine to be rebooted without
panicking with "cannot differ from curthread or NULL" when MAC is enabled.
Submitted by: rwatson
Reviewed by: attilio
MFC after: 2 weeks
queues lock is acquired. Otherwise, the state of a reservation's
pages' flags and its population count can be inconsistent. That could
result in a page being freed twice.
Reported by: kris
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.
Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)
Recycle the vm object's "pg_color" field to represent the color of the
first virtual page address at which the object is mapped instead of the
color of the object's first physical page. Since an object may not be
mapped, introduce a flag "OBJ_COLORED" that indicates whether "pg_color"
is valid.
page to be in the free lists. Instead, it now returns TRUE if it
removed the page from the free lists and FALSE if the page was not
in the free lists.
This change is required to support superpage reservations. Specifically,
once reservations are introduced, a cached page can either be in the
free lists or a reservation.
linker interfaces for looking up function names and offsets from
instruction pointers. Create two variants of each call: one that is
"DDB-safe" and avoids locking in the linker, and one that is safe for
use in live kernels, by virtue of observing locking, and in particular
safe when kernel modules are being loaded and unloaded simultaneous to
their use. This will allow them to be used outside of debugging
contexts.
Modify two of three current stack(9) consumers to use the DDB-safe
interfaces, as they run in low-level debugging contexts, such as inside
lockmgr(9) and the kernel memory allocator.
Update man page.
vm_pageout_fallback_object_lock() in vm_contig_launder_page() to better
handle a lock-ordering problem. Consequently, trylock's failure on the
page's containing object no longer implies that the page cannot be
laundered.
MFC after: 6 weeks
malloc_type_allocated(..., 0) calls that occur when contigmalloc() has
failed. Eliminate the acquisition and release of the page queues lock
from vm_page_release_contig(). Rename contigmalloc2() to
contigmapping(), reflecting what it does.
comments from vnode_pager_setsize(). This call was introduced in
revision 1.140 to address a problem that no longer exists.
Specifically, pmap_zero_page_area() has replaced a (possibly)
problematic implementation of page zeroing that was based on
vm_pager_map(), bzero(), and vm_pager_unmap().
First, a file is mmap(2)ed and then mlock(2)ed. Later, it is truncated.
Under "normal" circumstances, i.e., when the file is not mlock(2)ed, the
pages beyond the EOF are unmapped and freed. However, when the file is
mlock(2)ed, the pages beyond the EOF are unmapped but not freed because
they have a non-zero wire count. This can be a mistake. Specifically,
it is a mistake if the sole reason why the pages are wired is because of
wired, managed mappings. Previously, unmapping the pages destroys these
wired, managed mappings, but does not reduce the pages' wire count.
Consequently, when the file is unmapped, the pages are not unwired
because the wired mapping has been destroyed. Moreover, when the vm
object is finally destroyed, the pages are leaked because they are still
wired. The fix is to reduce the pages' wired count by the number of
wired, managed mappings destroyed. To do this, I introduce a new pmap
function pmap_page_wired_mappings() that returns the number of managed
mappings to the given physical page that are wired, and I use this
function in vm_object_page_remove().
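Inside vm_object_page_remove(), the unwiring then looks roughly like
this (a sketch):

    int wirings;

    /* How many of the page's wirings come from managed mappings? */
    wirings = pmap_page_wired_mappings(m);
    pmap_remove_all(m);
    /* Drop the wirings destroyed along with those mappings. */
    if (wirings != 0)
        m->wire_count -= wirings;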
Reviewed by: tegge
MFC after: 6 weeks
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return an error instead of panicking or dereferencing NULL.
As a consequence, vmspace_exec() and vmspace_unshare() return an errno
int. A struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.
The kernel stack for the thread is now set up in thread_alloc(), which
itself may return NULL. Also, allocation of the first process thread is
performed in fork1() to properly deal with stack allocation failure.
proc_linkup() is separated into proc_linkup(), called from fork1(), and
proc_linkup0(), which is used to set up the kernel process (formerly
known as the swapper).
In collaboration with: Peter Holm
Reviewed by: jhb
default object rather than cache it was to have
vm_pager_has_page(object, pindex, ...) == FALSE to imply that there is
no cached page in the object at pindex. This makes it possible to avoid
explicit checks for cached pages in vm_object_backing_scan().
For now, we need the same bandaid for the swap object; otherwise both
vm_page_lookup() and the pager can report that there is no page at an
offset while the page is stored in the cache. Also, this fixes another
instance of the KASSERT("object type is incompatible") failure in the
vm_page_cache_transfer().
Reported and tested by: Peter Holm
Reviewed by: alc
MFC after: 3 days
that would have an offset beyond the end of the target object. Such
pages should remain in the source object.
MFC after: 3 days
Diagnosed and reviewed by: Kostik Belousov
Reported and tested by: Peter Holm
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
cache: vnode_pager_setsize() must handle the case where a file is
truncated to a non-page-size-aligned boundary and there is a cached
page underlying the new end of file.
Reported by: kris, tegge
Tested by: kris
MFC after: 3 days
since revision 1.1. Specifically, neither traversal of the vm map checks
whether the end of the vm map has been reached. Consequently, the first
traversal can wrap around and bogusly return an error.
This error has gone unnoticed for so long because no one had ever before
tried msync(2)ing a region above the stack.
Reported by: peter
MFC after: 1 week
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.
I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
it must first ensure that the page is no longer mapped. This is
trivially accomplished by calling pmap_remove_all() a little earlier
in vm_page_cache(). While I'm in the neighborhood, make a related
panic message a little more useful.
Approved by: re (kensmith)
Reported by: Peter Holm and Konstantin Belousov
Reviewed by: Konstantin Belousov
a consequence of sparc64/sparc64/vm_machdep.c revision 1.76. It occurs
when uma_small_free() frees a page. The solution has two parts: (1) Mark
pages allocated with VM_ALLOC_NOOBJ as PG_UNMANAGED. (2) Defer the lock
assertion in pmap_page_is_mapped() until after PG_UNMANAGED is tested.
This is safe because both PG_UNMANAGED and PG_FICTITIOUS are immutable
flags, i.e., they do not change state between the time that a page is
allocated and freed.
Approved by: re (kensmith)
PR: 116794
cache: vm_object_page_remove() should convert any cached pages that
fall within the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.
Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.
Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.
Approved by: re