mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD
1.1.5.1 can issue such requests.
Reported and tested by: Dan Plassche <dplassche@gmail.com>
MFC after: 1 week
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.
Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.
Noted and reviewed by: alc
MFC after: 1 week
ago, sleeping on busy pages in vm_pageout_launder() made sense. The call
to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages
blocked vm_pageout_launder() until the flush had completed. However, in
CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was
changed to request synchronous I/O, but the sleep on busy pages was not
removed.
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.
Suggested and reviewed by: alc
MFC after: 2 weeks
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().
Reviewed by: alc
MFC after: 1 week
them alone.
Process the act_count updates for the held pages in the vm_pageout
loop over the inactive queue, instead of refusing to do anything with
such page.
Clarify the intent of the addl_page_shortage counter and change its
use for pages which are not processed in the loop according to the
description.
Reviewed by: alc
MFC after: 2 weeks
it can't sleep, it can still move clean pages from the inactive queue to
the cache. Also, when a page is cached, there is no need to restart the
scan. The "next" page pointer held by vm_contig_launder() is still
valid. Finally, add a comment summarizing what vm_contig_grow_cache()
does based upon the value of "tries".
MFC after: 3 weeks
VM_KMEM_MAX_SIZE.
The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.
Found by: Justin Hibbits
This makes the RED/BLACK support go away and simplifies a lot vmradix
functions used here. This happens because with patricia trie support
the trie will be little enough that keeping 2 diffetnt will be
efficient too.
- Reduce differences with head, in places like backing scan where the
optimizazions used shuffled the code a little bit around.
Tested by: flo, Andrea Barberio
inactive queue, unless busy page is found.
Dropping the mutex often should allow the other lock acquires to
proceed without waiting for whole inactive scan to finish. On machines
with lot of physical memory scan often need to iterate a lot before it
finishes or finds a page which requires laundring, causing high
latency for other lock waiters.
Suggested and reviewed by: alc
MFC after: 3 weeks
exhausted while searching and when a "maximum" value is passed as end
(or end == 0).
This allow for avoiding starting address overflow while searching
through and avoids livelock with "start" wrapping up to "end".
Reported by: pho (supposedly)
"next" index, scanning 2 times in a row the same object.
This was hidden because when cache and resident tries are merged
together there is a check to skip different objects in all the
vm_radix_lookupn() usages, in order to fix a race with RED nodes.
by vm_objects.
- Add flags for the per-object lock and free pages queue mutex lock.
Use the newly added flags to mark the cache root within the vm_object
structure.
Please note that other vm_object members should be marked with correct
locking but they are left for other commits.
In collabouration with: alc
MFC after: 3 days3 days3 days
in vm_map_process_deferred() which is then iterated to release map entries.
This avoids having a nested vm map unlock operation called from the loop
body attempt to recuse into vm_map_process_deferred(). This can happen if
the vm_map_remove() triggers the OOM killer.
Reviewed by: alc, kib
MFC after: 1 week
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.
Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.
As an added bonus, tidy up some nearby comments concerning page flags.
Reviewed by: kib
MFC after: 6 weeks
propagate the stack execution permissions when stack is grown down.
First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack
protection for current ABI, so the new stack entry was typically marked
executable always. Second, for non-main stack MAP_STACK mapping,
the PROT_ flags should be used which were specified at the mmap(2) call
time, and not sv_stackprot.
MFC after: 1 week
remove the RED/BLACK concept.
This is based on the assumption that path-compressed tries will be
small and fast enough that a separate trie for cached pages will make
sense and will leave the trie code simple enough (along with removing
a lot of differences in the userend code).
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
the recovery path. The bulk of vm_radix_remove() is put into a generic
function vm_radix_sweep() which allows 2 different modes (hard and soft):
the soft one will deal with half-constructed paths by cleaning them up.
Ideally all these complications should go once that a way to pre-allocate
is implemented, possibly by implementing path compression.
Requested and discussed with: jeff
Tested by: pho
low memory situation. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stable at a value lower
than cnt.v_free_min and that caused massive performance drop.
Reviewed by: alc
MFC after: 1 week
In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past
the end of vm_page_array was incorrect causing it to return NULL. This
value is then used in vm_phys_add_page causing a data abort.
Reviewed by: alc, kib, imp
Tested by: stas
range operations like pmap_remove() and pmap_protect() as well as allowing
simple operations like pmap_extract() not to involve any global state.
This substantially reduces lock coverages for the global table lock and
improves concurrency.
vm_pager_object_lookup() already referenced the object.
Note that there is no in-tree consumers of cdev_pager_lookup(). The
only known user of the function is i915 gem driver, which is not yet
imported. This should make the KPI change minor.
Submitted by: avg
MFC after: 1 week
which carries fictitous managed pages. In particular, the consumers of
the new object type can remove all mappings of the device page with
pmap_remove_all().
The range of physical addresses used for fake page allocation shall be
registered with vm_phys_fictitious_reg_range() interface to allow the
PHYS_TO_VM_PAGE() to work in pmap.
Most likely, only i386 and amd64 pmaps can handle fictitious managed
pages right now.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns proper fictitious vm_page_t. The range should be de-registered
after consumer stopped using it.
De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.
A hash container might be developed instead of range registration
interface, and fake pages could be put automatically into the hash,
were PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month
vm_page into new interface vm_page_initfake(). Handle the case of fake
page re-initialization with changed memattr.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month
64-bits numbers. ktr_tracepoint() infacts casts all the passed value to
u_long values as that is what the ktr entries can handle.
However, we have to work a lot with vm_pindex_t which are always 64-bit
also on 32-bits architectures (most notable case being i386).
Use macros to split the 64 bits printing into 32-bits chunks which
KTR can correctly handle.
Reported and tested by: flo
There are two aspects to the sequential access optimization: (1) read ahead
of pages that are expected to be accessed in the near future and (2) unmap
and cache behind of pages that are not expected to be accessed again. This
revision changes both aspects.
The read ahead optimization is now more effective. It starts with the same
initial read window as before, but arithmetically grows the window on
sequential page faults. This can yield increased read bandwidth. For
example, on one of my machines, a program using mmap() to read a file that
is several times larger than the machine's physical memory takes about 17%
less time to complete.
The unmap and cache behind optimization is now more selectively applied.
The read ahead window must grow to its maximum size before unmap and cache
behind is performed. This significantly reduces the number of times that
pages are unmapped and cached only to be reactivated a short time later.
The unmap and cache behind optimization now clears each page's referenced
flag. Previously, in the case of dirty pages, if the containing file was
still mapped at the time that the page daemon examined the dirty pages,
they would be reactivated.
From a stylistic standpoint, this revision also cleanly separates the
implementation of the read ahead and unmap/cache behind optimizations.
Glanced at: kib
MFC after: 2 weeks
the page. This PMAP requires an additional lock besides the PMAP lock
in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.
Suggested by: kib
MFC after: 4 days
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.
Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.
Reported and tested by: Sushanth Rai <sushanth_rai yahoo com>
Reviewed by: alc
MFC after: 1 week
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.
Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039
that it will be freed to the cache pool rather than the default pool.
Otherwise, the cached pages within the reservation may be recycled sooner
than necessary.
Reported by: Andrey Zonov
(if not already fictious) the code can panic when trying to first insert
a fictious page because of the overridden pindex.
Fix this by applying the same spinning pattern of vm_page_rename().
Reported by: pho
vm_page_cache_remove() should only be used in very little and specific
cases (and marked as static likely) where the callers is going to take
care also of the page flags appropriately, otherwise one can end up
with a corrupted page.
Reported by: pho
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take. The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.
Reviewed by: kib
MFC after: 2 weeks
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.
Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.
This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.
Reviewed by: kib
kernel.
When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.
In collaboration with: kib
Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.
MFC after: 1 week
than 4GB. Specifically, the inlined version of 'ptoa' of the the 'int'
count of pages overflowed on 64-bit platforms. While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
if the filesystem performed short write and we are skipping the page
due to this.
Propogate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.
While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.
PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks
- Fix bugs in the free path where the pages were not unwired and
relevant locking wasn't acquired.
- Introduce the rnode_map, submap of kernel_map, where to allocate from.
The reason is that, in architectures without direct-mapping,
kmem_alloc*() will try to insert the newly created mapping while
holding the vm_object lock introducing a LOR or lock recursion.
rnode_map is however a leafly-used submap, thus there cannot be any
deadlock.
Notes: Size the submap in order to be, by default, around 64 MB and
decrase the size of the nodes as the allocation will be much smaller
(and when the compacting code in the vm_radix will be implemented this
will aim for much less space to be used). However note that the
size of the submap can be changed at boot time via the
hw.rnode_map_scale scaling factor.
- Use uma_zone_set_max() covering the size of the submap.
Tested by: flo
still as it can be useful.
- Make most of the interface private as it is unnecessary public right
now. This will help in making nodes changing with arch and still avoid
namespace pollution.
external pagers in Mach. FreeBSD doesn't implement external pagers.
Moreover, it don't pageout the kernel object. So, the reasons for
having code don't hold.
Reviewed by: kib
MFC after: 6 weeks
v_writecount. Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.
Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry. During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and
decrements writemappings for the vm object.
Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBUSY, as it
should be.
Noted by: tegge
Reviewed by: tegge (long time ago, early version), alc
Tested by: pho
MFC after: 3 weeks
for a shared mapping and marking the entry for inheritance.
Other thread might execute vmspace_fork() in between (e.g. by fork(2)),
resulting in the mapping becoming private.
Noted and reviewed by: alc
MFC after: 1 week
Code should just use the devtoname() function to obtain the name of a
character device. Also add const keywords to pieces of code that need it
to build properly.
MFC after: 2 weeks
callers of vm_page_insert().
The default action for every caller is to unwind-back the operation
besides vm_page_rename() where this has proven to be impossible to do.
For that case, it just spins until the page is not available to be
allocated. However, due to vm_page_rename() to be mostly rare (and
having never hit this panic in the past) it is tought to be a very
seldom thing and not a possible performance factor.
The patch has been tested with an atomic counter returning NULL from
the zone allocator every 1/100000 allocations. Per-printf, I've verified
that a typical buildkernel could trigger this 30 times. The patch
survived to 2 hours of repeated buildkernel/world.
Several technical notes:
- The vm_page_insert() is moved, in several callers, closer to failure
points. This could be committed separately before vmcontention hits
the tree just to verify -CURRENT is happy with it.
- vm_page_rename() does not need to have the page lock in the callers
as it hide that as an implementation detail. Do the locking internally.
- now vm_page_insert() returns an int, with 0 meaning everything was ok,
thus KPI is broken by this patch.
disconnected swap device.
This is quick and imperfect solution, as swap device will still be opened
and GEOM will not be able to destroy it. Proper solution would be to
automatically turn off and close disconnected swap device, but with existing
code it will cause panic if there is at least one page on device, even if
it is unimportant page of the user-level process. It needs some work.
Reviewed by: kib@
MFC after: 1 week
wrap-up at some point.
This bug is triggered very easilly by indirect blocks in UFS which grow
negative resulting in very high counts.
In collabouration with: flo
excluding other allocations including UMA now entails the addition of
a single flag to kmem_alloc or uma zone create
Reviewed by: alc, avg
MFC after: 2 weeks
u_int. With the auto-sized buffer cache on the modern machines, UFS
metadata can generate more the 65535 pages belonging to the buffers
undergoing i/o, overflowing the counter.
Reported and tested by: jimharris
Reviewed by: alc
MFC after: 1 week
generation change if requested mode is async. The object generation is
only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async
mode it is enough to write each dirty page, not to make a guarantee that
all pages are cleared after the vm_object_page_clean() returned.
Diagnosed by: truckman
Tested by: flo
Reviewed by: alc, truckman
MFC after: 2 weeks
MS_SYNC flag. The system must guarantee that all writes are finished
before syscalls returned. Schedule the writes in async mode, which is
much faster and allows the clustering to occur. Wait for writes using
VOP_FSYNC(), since we are syncing the whole file mapping.
Potentially, the restriction to only apply the optimization can be
relaxed by not requiring that the mapping cover whole file, as it is
done by other OSes.
Reported and tested by: az
Reviewed by: alc
MFC after: 2 weeks
without the VM_OBJECT_LOCK held, thus can be concurrent with BLACK ones.
However, also use a write memory barrier in order to not reorder the
operation of decrementing rn_count in respect fetching the pointer.
Discussed with: jeff
- Avoid to use atomic to manipulate it at level0 because it seems
unneeded and introduces a bug on big-endian architectures where only
the top half (2 bits) of the double-words are written (as sparc64,
for example, doesn't support atomics at 16-bits) heading to a wrong
handling of rn_count.
Reported by: flo, andreast
Found by: marius
No answer by: jeff
use superpage reservations. So, for the first time, kernel virtual memory
that is allocated by contigmalloc(), kmem_alloc_attr(), and
kmem_alloc_contig() can be promoted to superpages. In fact, even a series
of small contigmalloc() allocations may collectively result in a promoted
superpage.
Eliminate some duplication of code in vm_reserv_alloc_page().
Change the type of vm_reserv_reclaim_contig()'s first parameter in order
that it be consistent with other vm_*_contig() functions.
Tested by: marius (sparc64)
Since the address of vm_page lock mutex depends on the kernel options,
it is easy for module to get out of sync with the kernel.
No vm_page_lockptr() accessor is provided for modules. It can be added
later if needed, unless proper KPI is developed to serve the needs.
Reviewed by: attilio, alc
MFC after: 3 weeks
defined and will allow consumers, willing to provide options, file and
line to locking requests, to not worry about options redefining the
interfaces.
This is typically useful when there is the need to build another
locking interface on top of the mutex one.
The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_
Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specification in the
ppbus code by using the same macro as done in this patch (but this is
left to the ppbus maintainer)
- all the other locking interfaces may require a similar cleanup, where
the most notable case is sx which will allow a further cleanup of
vm_map locking facilities
- The patch should be fully compatible with older branches, thus a MFC
is previewed (infact it uses all the underlying mechanisms already
present).
Comments review by: eadler, Ben Kaduk
Discussed with: kib, jhb
MFC after: 1 month
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.
From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.
Reviewed by: kib
Discussed with: jhb
layer for old KPI and KBI. New interface should be used together with
d_mmap_single cdevsw method.
Device pager can be allocated with the cdev_pager_allocate(9)
function, which takes struct cdev_pager_ops, containing
constructor/destructor and page fault handler methods supplied by
driver.
Constructor and destructor, called at the pager allocation and
deallocation time, allow the driver to handle per-object private data.
The pager handler is called to handle page fault on the vm map entry
backed by the driver pager. Driver shall return either the vm_page_t
which should be mapped, or error code (which does not cause kernel
panic anymore). The page handler interface has a placeholder to
specify the access mode causing the fault, but currently PROT_READ is
always passed there.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
allocate the requested page because too few pages are cached or free.
Document the VM_ALLOC_COUNT() option to vm_page_alloc() and
vm_page_alloc_freelist().
Make style changes to vm_page_alloc() and vm_page_alloc_freelist(),
such as using a variable name that more closely corresponds to the
comments.
Use the defined types instead of int when manipulating masks.
Supposedly, it could fix support for 32KB page size in the
machine-independend VM layer.
Reviewed by: alc
MFC after: 2 weeks
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.
Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.
The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.
To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.
Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month
and use these new options in the mips pmap.
Wake up the page daemon in vm_page_alloc_freelist() if the number of free
and cached pages becomes too low.
Tidy up vm_page_alloc_init(). In particular, add a comment about an
important restriction on its use.
Tested by: jchandra@
Likely this file needs some more restructuration (and we should
make a lot of macros private to radix implementation) but leave them
as they are so far because we may enrich the KPI much further.
tree.
Reclaim all the nodes related to the radix tree for a specified
vm_object when calling vm_object_terminate() via the newly added
interface vm_radix_reclaim_nodes().
The function is recursive, but we have a well-defined maximum depth,
thus the amount of necessary stack can be easilly calculated.
Reported by: alc
Discussed and reviewed by: jeff
first leaf page in a specified range. This permits us to make many
search & operate functions without much code duplication.
- Make a generic iterator for radix items.
Black nodes support standard active pages and red nodes support cached
pages. Red nodes may be removed without the object lock but will not
collapse unused tree nodes. Red nodes may not be directly inserted,
instead a new function is supplied to convert between black and red.
- Handle cached pages and active pages in the same loop in vm_object_split,
vm_object_backing_scan, and vm_object_terminate.
- Retire the splay page handling as the ifdefs are too difficult to
maintain.
- Slightly optimize the vm_radix_lookupn() function.
eliminating duplicated code in the various pmap implementations.
Micro-optimize vm_phys_free_pages().
Introduce vm_phys_free_contig(). It is fast routine for freeing an
arbitrary number of physically contiguous pages. In particular, it
doesn't require the number of pages to be a power of two.
Use "u_long" instead of "unsigned long".
Bruce Evans (bde@) has convinced me that the "boundary" parameters
to kmem_alloc_contig(), vm_phys_alloc_contig(), and
vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not
"u_long". Make this change.
height and a pointer so that the update to the root is atomic. This
permits safe lookups in parallel with tree expansion. Shrinking the
space requirements is a small bonus.
for the kernel_map/kmem_map recursion because it uses direct mapping
provided by amd64 to avoid object and map search and recursion.
Probabilly all the others architectures using UMA_MD_SMALL_ALLOC are also
fixed by this, but other remains, where the most notable case is i386.
For it a solution has still to be determined. A way to do this would
be to have a reserved map just for radix node and mark all accesses to
its lock to be witness safe, but that would still be unoptimal due to
the large amount of virtual address space needed to cater the whole
tree.
more general VM system interfaces. So, their implementation can now
reside in kern_malloc.c alongside the other functions that are declared
in malloc.h.
common cases that can be handled in constant time. The insight being
that a page's parent in the vm object's tree is very often its
predecessor or successor in the vm object's ordered memq.
Tested by: jhb
MFC after: 10 days
the vm object pages splay.
TODO:
- Handle differently the negative keys for having smaller depth
index nodes (negative keys caming from indirect blocks)
- Fix the get_node() by having support for a low reserved objects
directly from UMA
- Implement the lookup_le and re-enable VM_NRESERVELEVEL = 1
- Try to rework the superpage splay of idle pages and the cache splay
for every vm object in order to regain space on vm_page structure
- Verify performance and improve them (likely by having consumers to deal
with several ranges of pages manually?)
Obtained from: jeff, Mayur Shardul (GSoC 2009)
word to handle the dirty mask updates in vm_page_clear_dirty_mask().
Remove the vm page queue lock around vm_page_dirty() call in vm_fault_hold()
the sole purpose of which was to protect dirty on architectures which
does not provide short or byte-wide atomics.
Reviewed by: alc, attilio
Tested by: flo (sparc64)
MFC after: 2 weeks
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
after the conversion of the swap device size to the page size units,
not before. That lifts the limit on the usable swap partition size
from 32GB to 256GB, that is less depressing for the modern systems.
Submitted by: Alexander V. Chernikov <melifaro ipfw ru>
Reviewed by: alc
Approved by: re (bz)
MFC after: 2 weeks
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code now can use use just
VPO_UNMANAGED to decide whether the page is unmanaged.
Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)
configured swap devices in the Linux-compatible format.
Based on the submission by: Robert Millan <rmh debian org>
PR: kern/159281
Reviewed by: bde
Approved by: re (kensmith)
MFC after: 2 weeks
allocated the device pager for the given handle, then the object
fictitious pages list and the object membership in the global object
list still need to be initialized. Otherwise, dev_pager_dealloc() will
traverse uninitialized pointers.
Reported and tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
MFC after: 1 week
function vm_mmap_to_errno(). It is useful for the drivers that implement
mmap(2)-like functionality, to be able to return error codes consistent
with mmap(2).
Sponsored by: The FreeBSD Foundation
No objections from: alc
MFC after: 1 week
uiomove generates EFAULT if any accessed address is not mapped, as
opposed to handling the fault.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc (previous version)
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.
This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.
Update all of the existing assertions on pmap_remove_all() to reflect this
change.
Reviewed by: kib
(Saying that the lock on the object that the page belongs to must be held
only represents one aspect of the rules.)
Eliminate the use of the page queues lock for atomically performing read-
modify-write operations on the dirty field when the underlying architecture
supports atomic operations on char and short types.
Document the fact that 32KB pages aren't really supported.
Reviewed by: attilio, kib
vm_page_undirty(). The assert is not precise due to VPO_BUSY owner
to tracked, so assertion does not catch the case when VPO_BUSY is
owned by other thread.
Reviewed by: alc
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().
VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.
The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.
Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().
2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.
3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.
Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk. Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.
Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file. (Darn the unfamiliar development
environment).
Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.
Requested by: alc
MFC after: 1 week
MFC with: r221853
Hold the vnode around the region where object lock is dropped, until
vnode lock is acquired.
Do not drop the vnode reference for a case when the object was
deallocated during unlock. Note that in this case, VV_TEXT is cleared
by vnode_pager_dealloc().
Reported and tested by: pho
Reviewed by: alc
MFC after: 3 days
If supplied length is zero, and user address is invalid, function
might return -1, due to the truncation and rounding of the address.
The callers interpret the situation as EFAULT. Instead of handling
the zero length in caller, filter it in vm_fault_quick_hold_pages().
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.
MFC after: 1 week
which are not yet fully initialized (i.e. ones with p_state == PRS_NEW).
Without it, we could panic in _thread_lock_flags().
Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that
require similar fix.
Reported by: pho, keramida
Discussed with: kib
As it was pointed out by Alan Cox, that no longer serves its purpose with
the modern UMA allocator compared to the old one used in 4.x days.
The removal of sysctl eliminates max_proc_mmap type overflow leading to
the broken mmap(2) seen with large amount of physical memory on arches
with factually unbound KVA space (such as amd64). It was found that
slightly less than 256GB of physmem was enough to trigger the overflow.
Reviewed by: alc, kib
Approved by: avg (mentor)
MFC after: 2 months
vm_map_insert(), the kmem_back() assumption about newly inserted
entry might be broken due to interference of two factors. In the low
memory condition, when vm_page_alloc() returns NULL, supplied map is
unlocked. If another thread performs kmem_malloc() meantime, and its
map entry is placed right next to our thread map entry in the map,
both entries wire count is still 0 and entries are coalesced due to
vm_map_simplify_entry().
Mark new entry with MAP_ENTRY_IN_TRANSITION to prevent coalesce.
Fix some style issues, tighten the assertions to account for
MAP_ENTRY_IN_TRANSITION state.
Reported and tested by: pho
Reviewed by: alc