rather than unconditionally making partially dirty pages fully dirty, only
make partially dirty pages fully dirty if the pmap says that the page has
been modified.
(This change is also a small optimization. It eliminates an unnecessary call
to pmap_is_modified() on pages that are mapped read-only.)
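A minimal sketch of the idea (not the actual buffer-cache diff; only the
condition ordering is illustrated):

    /*
     * Sketch: promote a partially dirty page to fully dirty only when
     * the pmap reports that some mapping has actually modified it.
     */
    if (m->dirty != 0 && m->dirty != VM_PAGE_BITS_ALL &&
        pmap_is_modified(m))
            vm_page_dirty(m);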
Suggested by: tegge
following changes:
Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does.
Suggested by: tegge
Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.
Update two of the three callers of the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.
Introduce vm_page_set_valid().
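A hedged sketch of how a caller chooses between the two variants after this
change (the surrounding condition and argument lists are illustrative; only
the two function names come from the change itself):

    if (needs_dirty_bits_cleared)           /* hypothetical condition */
            /* also clears the dirty mask; needs the page queues lock */
            vfs_page_set_validclean(bp, off, m);
    else
            /* marks bits valid only; no page queues lock required */
            vm_page_set_valid(m, base, size);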
Reviewed by: tegge
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed. Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed.
Reviewed by: tegge
Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs(). Instead, assert that the page is clean.
Reviewed by: tegge
Eliminate some blank lines.
Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old(). The page is not mapped. Therefore, it cannot have
any page table entries that are modified.
Eliminate an incorrect comment from vnode_pager_generic_getpages().
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access. Instead, assert that the page is not
mapped or not mapped for write access as appropriate.
Eliminate unnecessary clearing of a page's dirty mask. Instead, assert
that the page's dirty mask is clear.
vmopag" implementation. The vm_page_lookup() code modifies splay tree
of the object pages, and asserts that object lock is taken. First issue
could cause kernel data corruption, and second one instantly panics the
INVARIANTS-enabled kernel.
Take the advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.
While there, make the code slightly more style-compliant by moving
variables declarations to the right place.
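A small, self-contained model of the memq-walking approach (toy types, not
the kernel's vm_page/vm_object): because the list is sorted by page index,
runs of consecutive pages fall out of a single pass with no lookups.

    #include <sys/queue.h>

    #include <stdio.h>

    /* Stand-in page type; only the sorted pindex matters here. */
    struct page {
            unsigned long pindex;
            TAILQ_ENTRY(page) listq;
    };
    TAILQ_HEAD(pglist, page);

    static void
    print_runs(struct pglist *memq)
    {
            struct page *m;
            unsigned long start = 0, prev = 0;
            int in_run = 0;

            TAILQ_FOREACH(m, memq, listq) {
                    if (in_run && m->pindex == prev + 1) {
                            prev = m->pindex;
                            continue;
                    }
                    if (in_run)
                            printf("run: %lu-%lu\n", start, prev);
                    start = prev = m->pindex;
                    in_run = 1;
            }
            if (in_run)
                    printf("run: %lu-%lu\n", start, prev);
    }

    int
    main(void)
    {
            struct pglist memq = TAILQ_HEAD_INITIALIZER(memq);
            struct page pages[] = {{0}, {1}, {2}, {5}, {6}, {9}};

            for (unsigned i = 0; i < sizeof(pages) / sizeof(pages[0]); i++)
                    TAILQ_INSERT_TAIL(&memq, &pages[i], listq);
            print_runs(&memq);      /* prints 0-2, 5-6, 9-9 */
            return (0);
    }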
Discussed with: jhb, alc
Reviewed by: alc
MFC after: 2 weeks
the vmspace of the examined process instead of directly accessing its
vmspace, which may change. Also, as an optimization, check for the
P_INEXEC flag before examining the process.
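A hedged sketch of the access pattern (loop context omitted;
vmspace_acquire_ref() and vmspace_free() are existing kernel interfaces,
the rest is illustrative):

    /*
     * Skip processes that are execing, then take a reference on the
     * vmspace rather than dereferencing p->p_vmspace directly.
     */
    if ((p->p_flag & P_INEXEC) != 0)
            continue;
    vm = vmspace_acquire_ref(p);
    if (vm == NULL)
            continue;
    /* ... examine vm->vm_map ... */
    vmspace_free(vm);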
Reported and tested by: pho (previous version)
Reviewed by: alc
MFC after: 3 weeks
busy count. Only mappings that allow write access should be prevented by
a non-zero busy count.
(The prohibition on mapping pages for read access when they have a non-
zero busy count originated in revision 1.202 of i386/i386/pmap.c when
this code was a part of the pmap.)
Reviewed by: tegge
a reservation, unless all of the reservation's pages were free, the
reservation was moved to the head of the partially-populated reservations
queue, where it would be the next reservation to be broken in case the
free page queues were emptied. Now, instead, I am moving it to the tail.
Very likely this reservation is in the process of being freed in its
entirety, so placing it at the tail of the queue makes it more likely that
the underlying physical memory will be returned to the free page queues as
one contiguous chunk. If a reservation must be broken, it will, instead,
be the longest unchanged reservation, which is arguably the reservation
that is least likely to ever achieve promotion or be freed in its entirety.
MFC after: 6 weeks
the mappings that have neither read nor execute rights, in particular,
the PROT_NONE entries. This makes mlockall(2) work for a process
address space that contains such mappings.
Since protection mode of the entry may change between setting
MAP_ENTRY_IN_TRANSITION and final pass over the region that records
the wire status of the entries, allocate new map entry flag
MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.
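A small userland illustration of the case this fixes (sizes arbitrary): an
address space with a PROT_NONE region can now be wired with mlockall(2),
with the PROT_NONE entry skipped and marked MAP_ENTRY_WIRE_SKIPPED.

    #include <sys/mman.h>

    #include <err.h>

    int
    main(void)
    {
            /* A PROT_NONE guard region, as many runtimes create. */
            void *guard = mmap(NULL, 1 << 20, PROT_NONE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (guard == MAP_FAILED)
                    err(1, "mmap");

            /*
             * Before the change, wiring an address space containing such
             * a mapping did not work; now the PROT_NONE entry is skipped
             * while the rest of the address space is wired.
             */
            if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                    err(1, "mlockall");
            return (0);
    }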
Reported and tested by: Hans Ottevanger <fbsdhackers beasties demon nl>
Reviewed by: alc
MFC after: 3 weeks
but I see no benefit from it today.
VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission. On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission. The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c
revision 1.56.)
Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list. So,
the need for coalescing vm map entries is not as great as it once was.
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.
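Purely illustrative arithmetic showing the class of overflow (the in-kernel
tuning formulas are more involved; MAXBSIZE is used here only as a
representative buffer size):

    #include <limits.h>
    #include <stdio.h>

    #define MAXBSIZE 65536          /* representative maximum buffer size */

    int
    main(void)
    {
            int nbuf = 44000;       /* a modest kern.nbuf setting */

            /*
             * The product exceeds INT_MAX, so int-based tuning arithmetic
             * overflows; done in 64-bit arithmetic it is unremarkable.
             */
            long long space = (long long)nbuf * MAXBSIZE;
            printf("%d * %d = %lld (INT_MAX = %d)\n",
                nbuf, MAXBSIZE, space, INT_MAX);
            return (0);
    }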
MFC after: 1 month
fault. In r188331 this update was relocated because of synchronization
changes to a place where it would occur on both hard and soft faults. This
change again restricts the update to hard faults.
function, done in r188334. Instead, collect the entries that shall be
freed in the deferred_freelist member of the map. Automatically purge
the deferred freelist when the map is unlocked.
Tested by: pho
Reviewed by: alc
1 will trigger a pass through the VM's low-memory handlers, such as
protocol and UMA drain routines. This makes it easier to exercise
these otherwise rarely-invoked code paths.
MFC after: 3 days
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into a freelist
that is released by the new helper function vm_map_entry_free_freelist().
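A hedged sketch of the resulting calling pattern (how the freelist is handed
back is an assumption; only vm_map_delete() and vm_map_entry_free_freelist()
are named by the change):

    /*
     * While the map is locked, doomed entries are only unlinked and
     * chained together; their backing objects are deallocated after the
     * unlock, where acquiring a vnode lock for an OBJT_VNODE object is
     * safe.
     */
    vm_map_lock(map);
    vm_map_delete(map, start, end, &freelist);      /* collect entries */
    vm_map_unlock(map);
    vm_map_entry_free_freelist(map, freelist);      /* may lock vnodes */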
Reviewed by: tegge, alc
Tested by: pho
Reference the object, drop the map lock, and then call vm_object_sync().
The object sync might require vnode lock for OBJT_VNODE type objects.
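A hedged sketch of the ordering (the unlock primitive and the
vm_object_sync() argument list are shown loosely and may differ from the
actual code):

    vm_object_reference(object);    /* keep the object alive */
    vm_map_unlock_read(map);        /* drop the map lock first */
    /* May acquire the vnode lock for OBJT_VNODE objects. */
    (void)vm_object_sync(object, offset, size, syncio, invalidate);
    vm_object_deallocate(object);   /* drop our reference */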
Reviewed by: tegge
Tested by: pho
acquire vnode lock for OBJT_VNODE object after map lock is dropped.
Because we have the busy page(s) in the object, sleeping there would
result in a deadlock with vnode resize. Try to get the lock without
sleeping and, if the attempt fails, drop the state, lock the vnode, and
restart the fault handler from the beginning with the vnode already locked.
Because the vnode_pager_lock() function is inlined in vm_fault(),
axe it.
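A simplified userland model of the retry protocol (a pthread mutex stands in
for the vnode lock and comments stand in for the fault state; none of this
is the vm_fault() code itself):

    #include <pthread.h>

    static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER;

    static void
    fault_like(void)
    {
            int vnode_locked = 0;

    restart:
            /* ... look up the object, busy the page(s) ... */
            if (!vnode_locked) {
                    if (pthread_mutex_trylock(&vnode_lock) != 0) {
                            /*
                             * Sleeping here could deadlock against a
                             * vnode resize, so unwind, block for the
                             * lock, and restart with it already held.
                             */
                            /* ... unbusy page(s), drop the state ... */
                            pthread_mutex_lock(&vnode_lock);
                            vnode_locked = 1;
                            goto restart;
                    }
                    vnode_locked = 1;
            }
            /* ... handle the fault with the vnode lock held ... */
            pthread_mutex_unlock(&vnode_lock);
    }

    int
    main(void)
    {
            fault_like();
            return (0);
    }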
Based on suggestion by: alc
Reviewed by: tegge, alc
Tested by: pho
describing why several calls to vm_object_deallocate() with the map locked
do not result in the acquisition of the vnode lock after the map lock.
Suggested and reviewed by: tegge
be accessible outside vmspace_fork() yet, but locking it would satisfy
the protocol of the vm_map_entry_link() and other functions called
from vmspace_fork().
Use a trylock, which supposedly cannot fail, to silence the WITNESS
warning about the nested acquisition of the sx lock with the same name.
Suggested and reviewed by: tegge
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.
Sponsored by: Nokia
inside the SYSCTL() macros and thus does not need to be done for
all of the nodes scattered across the source tree.
- Mark the name-cache related sysctls (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
of the counter, which may happen when too many sendfile(2) calls are
being executed with this vnode [1].
To keep the size of struct vm_page and the offsets of the fields
accessed by out-of-tree modules unchanged, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]
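A hedged sketch of the added check (field width, limit, and the error used
to trigger the fallback are all illustrative; only the idea of detecting an
imminent wrap and falling back to a normal copy comes from the change):

    if (m->cow == USHRT_MAX)        /* counter is about to wrap */
            return (EOPNOTSUPP);    /* caller falls back to a normal copy */
    m->cow++;                       /* safe to account another COW user */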
Reported by: Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by: alc [2]
Reviewed by: alc
MFC after: 2 weeks
vm_map_lookup{,_locked}() to vm_map_lookup_entry(). Having the fast path
in vm_map_lookup{,_locked}() limits its benefits to page faults. Moving
it to vm_map_lookup_entry() extends its benefits to other operations on
the vm map.
In particular, the KPI of the following functions has changed:
- bufobj_invalbuf()
- bufsync()
as has that of the BO_SYNC() "virtual method" of the buffer objects set.
The main consumers of the bufobj functions are affected by this change
as well; in particular, the functions whose KPI changed are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please treat the 'curthread' argument passed to VOP_SYNC()
(in bufsync()) as temporary; it will be axed ASAP.
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.
Call vm_pageout_oom() from swp_pager_meta_build() when the swap zone
is exhausted.
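A hedged sketch of the new call site (the condition and the reason constant
are illustrative of the interface described above):

    /*
     * swp_pager_meta_build(), sketched: the swap metadata zone is
     * exhausted, so invoke the OOM killer with a swap-specific reason
     * rather than waiting for the page daemon to notice.
     */
    if (swap_zone_is_exhausted)             /* illustrative condition */
            vm_pageout_oom(VM_OOM_SWZ);     /* vs. VM_OOM_MEM from pageout */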
Reviewed by: alc
Tested by: pho, jhb
MFC after: 2 weeks
filedescriptor into it. Make sure that td_fpop is NULL when calling
d_mmap from dev_pager_getpages().
The change guards against the td_fpop field being non-NULL with private
state for another device, and against sudden clearing of td_fpop. This
could occur when either a driver method calls another driver through the
filedescriptor operation, or a page fault happens while the driver is
writing to memory backed by another driver.
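A hedged sketch of the save/clear/restore discipline (the d_mmap() argument
list follows the cdevsw of that era; treat the details as illustrative):

    /*
     * dev_pager_getpages(), sketched: hide any file pointer belonging to
     * another device's operation while the driver's d_mmap() runs.
     */
    fpop = td->td_fpop;
    td->td_fpop = NULL;
    ret = csw->d_mmap(dev, offset, &paddr, prot);
    td->td_fpop = fpop;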
Noted by: rwatson
Tested by: rnoland
MFC after: 3 days
that redzone adds to the allocation for storing its metadata is at least as
large as the metadata that it will store there.
Submitted by: Nima Misaghian
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).
With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.
Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().
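A hedged sketch of the new calling convention for a sleepq consumer (names
follow the sleepq(9) and swapper interfaces mentioned above; the exact
arguments are illustrative):

    /*
     * Wake the sleepers, remember whether the swapper needs a kick, drop
     * the sleepq lock, and only then wake up proc0.
     */
    wakeup_swapper = sleepq_broadcast(wchan, SLEEPQ_SLEEP, 0, 0);
    sleepq_release(wchan);
    if (wakeup_swapper)
            kick_proc0();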
Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks
to downgrade the exclusive lock to a shared one when the exclusive lock
owner requested a shared lock. The new lockmgr panics instead.
The vnode_pager_lock function requests shared lock on the vnode backing
the OBJT_VNODE, and can be called when the current thread already holds
an exclusive lock on the vnode. For instance, it happens when handling
page fault from the VOP_WRITE() uiomove that writes to the file, with
the faulted in page fetched from the vm object backed by the same file.
We then get the situation described above.
Verify whether the vnode is already exclusively locked by the curthread
and request recursed exclusive vnode lock instead of shared, if true.
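A hedged sketch of the check (the ISLOCKED query and lock flags are shown
loosely; the actual code in vnode_pager_lock() may differ):

    /*
     * The new lockmgr panics if the exclusive owner asks for a shared
     * lock, so if curthread already owns the vnode exclusively, request
     * a recursed exclusive lock instead.
     */
    if (VOP_ISLOCKED(vp) == LK_EXCLUSIVE)
            flags = LK_EXCLUSIVE | LK_CANRECURSE;
    else
            flags = LK_SHARED;
    vn_lock(vp, flags | LK_RETRY);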
Reported by: gallatin
Discussed with: attilio
for the bio for the swapout write. It allows the page allocator to drain
the free page list deeper. As a result, a deadlock where the pageout daemon
sleeps waiting for a bio to be allocated for swapout is no longer
reproducible in practice.
Alan said that M_USE_RESERVE shall be resurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.
Tested by: pho, kris
Reviewed by: alc
No objections from: phk
MFC after: 2 weeks (RELENG_7 only)
on the amd64 architecture. The amd64 architecture requires kernel code and
global variables to reside in the highest 2GB of the 64-bit virtual address
space. Thus, the memory allocated during bootstrap, before the call to
kmem_init(), starts at KERNBASE, which is not necessarily the same as
VM_MIN_KERNEL_ADDRESS on amd64.
PowerPC/AIM. Consequently, it should not be used to determine the maximum
number of kernel map entries. Instead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.
Tested by: marcel@ (PowerPC/AIM)
work. (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple. Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise(). This
function moves the page to the inactive queue. Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon. On the other hand, if it is dirty,
it is placed at the tail. Let's further examine the case in which the page
is clean. Recall that the page is at the head of the line for processing by
the page daemon. The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon. (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.) The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced. In effect, the madvise() was for naught.
The case in which the page was dirty is not too different. Instead of being
laundered, the page is reactivated.
Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field. So, MADV_FREE is always
executing the clean case above.
This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.
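A small userland illustration of the affected interface (sizes arbitrary):
with the reference bits now cleared by vm_page_dontneed(), the advice below
actually lets the page daemon reclaim the pages instead of reactivating them.

    #include <sys/mman.h>

    #include <err.h>
    #include <string.h>

    int
    main(void)
    {
            size_t len = 16 * 4096;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);

            if (p == MAP_FAILED)
                    err(1, "mmap");
            memset(p, 0xa5, len);           /* touch (and dirty) the pages */

            /*
             * Hint that the contents are no longer needed.  With the
             * reference bits cleared, the page daemon can reclaim these
             * pages instead of immediately reactivating them.
             */
            if (madvise(p, len, MADV_FREE) != 0)
                    err(1, "madvise");
            return (0);
    }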
MFC after: 6 weeks
entirety of the specified range be mapped. Specifically, it has
returned EINVAL if the entire range is not mapped. There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement. This
revision removes this requirement.
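A small userland illustration of the behavior change: msync(2) over a range
containing an unmapped gap used to fail with EINVAL and now succeeds.

    #include <sys/mman.h>

    #include <err.h>
    #include <unistd.h>

    int
    main(void)
    {
            size_t pg = (size_t)getpagesize();
            char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);

            if (p == MAP_FAILED)
                    err(1, "mmap");
            /* Punch a hole in the middle of the region. */
            if (munmap(p + pg, pg) != 0)
                    err(1, "munmap");
            /* Used to fail with EINVAL because of the hole; now succeeds. */
            if (msync(p, 3 * pg, MS_SYNC) != 0)
                    err(1, "msync");
            return (0);
    }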
Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks
Directory IO without a VM object will store data in 'malloced' buffers,
severely limiting caching of the data. Without this change, VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.
Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
superpage-aligned virtual address for the mapping. Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address. Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address. If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated. In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.
This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().
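A hedged sketch contrasting the two approaches (the superpage size, the
rounding macro use, and the vm_map_find() argument list are illustrative;
VMFS_ALIGNED_SPACE is the option named above):

    /*
     * Old approach: round the hint up to the next superpage boundary and
     * hope that the range there happens to be free.
     */
    addr = roundup2(addr, SUPERPAGE_SIZE);          /* e.g. 2 MB */

    /*
     * New approach: let vm_map_find() search for a range that is itself
     * suited to superpage mappings.
     */
    error = vm_map_find(map, object, offset, &addr, size,
        VMFS_ALIGNED_SPACE, prot, maxprot, cow);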
physical address of the device's memory. This enables
pmap_align_superpage() to propose a virtual address for mapping the
device memory that permits the use of superpage mappings.
used to request superpage alignment for the submap.
Request superpage alignment for the kmem_map.
Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)
Remove a stale comment from kmem_malloc().
support for VMFS_ALIGNED_SPACE, which requests the allocation of an
address range best suited to superpages. The old options TRUE and FALSE
are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no
immediate need to update all of vm_map_find(9)'s callers.
While I'm here, correct a misstatement about vm_map_find(9)'s return
values in the man page.