Commit Graph

3016 Commits

Author SHA1 Message Date
Eitan Adler
3eb9ab5255 Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR:		kern/155491
PR:		kern/155490
PR:		kern/155489
Submitted by:	Galimov Albert <wtfcrap@mail.ru>
Approved by:	bde
Reviewed by:	jhb
MFC after:	1 week
2011-12-13 00:38:50 +00:00
Konstantin Belousov
134465d732 Fix printf.
Submitted by:	az
MFC after:	1 week
2011-12-12 10:04:04 +00:00
Alan Cox
c68c35372e Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to
use superpage reservations.  So, for the first time, kernel virtual memory
that is allocated by contigmalloc(), kmem_alloc_attr(), and
kmem_alloc_contig() can be promoted to superpages.  In fact, even a series
of small contigmalloc() allocations may collectively result in a promoted
superpage.

Eliminate some duplication of code in vm_reserv_alloc_page().

Change the type of vm_reserv_reclaim_contig()'s first parameter in order
that it be consistent with other vm_*_contig() functions.

Tested by:	marius (sparc64)
2011-12-05 18:29:25 +00:00
Konstantin Belousov
dc874f9881 Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by:	attilio, alc
2011-11-30 17:39:00 +00:00
Konstantin Belousov
cf1911a9ad Hide the internals of vm_page_lock(9) from the loadable modules.
Since the address of vm_page lock mutex depends on the kernel options,
it is easy for module to get out of sync with the kernel.

No vm_page_lockptr() accessor is provided for modules. It can be added
later if needed, unless proper KPI is developed to serve the needs.

Reviewed by:	attilio, alc
MFC after:	3 weeks
2011-11-29 13:07:32 +00:00
Attilio Rao
9fde98bba3 Introduce the same mutex-wise fix in r227758 for sx locks.
The functions that offer file and line specifications are:
- sx_assert_
- sx_downgrade_
- sx_slock_
- sx_slock_sig_
- sx_sunlock_
- sx_try_slock_
- sx_try_xlock_
- sx_try_upgrade_
- sx_unlock_
- sx_xlock_
- sx_xlock_sig_
- sx_xunlock_

Now vm_map locking is fully converted and can avoid to know specifics
about locking procedures.
Reviewed by:	kib
MFC after:	1 month
2011-11-21 12:59:52 +00:00
Attilio Rao
ccdf233323 Introduce macro stubs in the mutex implementation that will be always
defined and will allow consumers, willing to provide options, file and
line to locking requests, to not worry about options redefining the
interfaces.
This is typically useful when there is the need to build another
locking interface on top of the mutex one.

The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_

Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specification in the
  ppbus code by using the same macro as done in this patch (but this is
  left to the ppbus maintainer)
- all the other locking interfaces may require a similar cleanup, where
  the most notable case is sx which will allow a further cleanup of
  vm_map locking facilities
- The patch should be fully compatible with older branches, thus a MFC
  is previewed (infact it uses all the underlying mechanisms already
  present).

Comments review by:	eadler, Ben Kaduk
Discussed with:		kib, jhb
MFC after:	1 month
2011-11-20 16:33:09 +00:00
Alan Cox
5ff276b7f4 Eliminate end-of-line white space. 2011-11-17 06:54:49 +00:00
Alan Cox
fbd80bd047 Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig().  This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig().  For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space.  It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages.  Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary.  That said, at present, this new function can't be applied to all
types of vm objects.  However, that restriction will be eliminated in the
coming weeks.

From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't.  Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations.  Now, vm_page_alloc_contig() is responsible
for these things.

Reviewed by:	kib
Discussed with:	jhb
2011-11-16 16:46:09 +00:00
Konstantin Belousov
286790a7dd Update the device pager interface, while keeping the compatibility
layer for old KPI and KBI.  New interface should be used together with
d_mmap_single cdevsw method.

Device pager can be allocated with the cdev_pager_allocate(9)
function, which takes struct cdev_pager_ops, containing
constructor/destructor and page fault handler methods supplied by
driver.

Constructor and destructor, called at the pager allocation and
deallocation time, allow the driver to handle per-object private data.

The pager handler is called to handle page fault on the vm map entry
backed by the driver pager. Driver shall return either the vm_page_t
which should be mapped, or error code (which does not cause kernel
panic anymore). The page handler interface has a placeholder to
specify the access mode causing the fault, but currently PROT_READ is
always passed there.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2011-11-15 14:40:00 +00:00
Konstantin Belousov
bf277cf450 Remove the condition that is always true.
Submitted by:	alc
MFC after:	1 week
2011-11-15 14:09:53 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Alan Cox
c835bd16a8 Wake up the page daemon in vm_page_alloc_freelist() if it couldn't
allocate the requested page because too few pages are cached or free.

Document the VM_ALLOC_COUNT() option to vm_page_alloc() and
vm_page_alloc_freelist().

Make style changes to vm_page_alloc() and vm_page_alloc_freelist(),
such as using a variable name that more closely corresponds to the
comments.
2011-11-06 02:03:27 +00:00
Konstantin Belousov
7845becfe8 Remove redundand definitions. The chunk was missed from r227102.
MFC after:	2 weeks
2011-11-05 09:03:18 +00:00
Konstantin Belousov
561cc9fcb5 Provide typedefs for the type of bit mask for the page bits.
Use the defined types instead of int when manipulating masks.
Supposedly, it could fix support for 32KB page size in the
machine-independend VM layer.

Reviewed by:	alc
MFC after:	2 weeks
2011-11-05 08:20:32 +00:00
Alan Cox
2614c5c47c Simplify the implementation of the failure case in kmem_alloc_attr(). 2011-11-04 04:41:58 +00:00
John Baldwin
936c09ac0f Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
Alan Cox
8393768074 Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist()
and use these new options in the mips pmap.

Wake up the page daemon in vm_page_alloc_freelist() if the number of free
and cached pages becomes too low.

Tidy up vm_page_alloc_init().  In particular, add a comment about an
important restriction on its use.

Tested by:	jchandra@
2011-11-02 05:42:51 +00:00
Alan Cox
5c1f2cc4c2 Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at
eliminating duplicated code in the various pmap implementations.

Micro-optimize vm_phys_free_pages().

Introduce vm_phys_free_contig().  It is fast routine for freeing an
arbitrary number of physically contiguous pages.  In particular, it
doesn't require the number of pages to be a power of two.

Use "u_long" instead of "unsigned long".

Bruce Evans (bde@) has convinced me that the "boundary" parameters
to kmem_alloc_contig(), vm_phys_alloc_contig(), and
vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not
"u_long".  Make this change.
2011-10-30 05:06:14 +00:00
Alan Cox
1933a67cf4 Use "u_long" instead of "unsigned long". 2011-10-28 22:36:15 +00:00
Alan Cox
125b695b6e Tidy up the comment at the head of vm_page_alloc, and mention that the
returned page has the flag VPO_BUSY set.
2011-10-27 17:29:19 +00:00
Alan Cox
703dec68bf Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to
vm_page_alloc().  While I'm here, for the sake of consistency, always
specify the allocation class, such as VM_ALLOC_NORMAL, as the first of
the flags.
2011-10-27 16:39:17 +00:00
Alan Cox
f346986b76 contigmalloc(9) and contigfree(9) are now implemented in terms of other
more general VM system interfaces.  So, their implementation can now
reside in kern_malloc.c alongside the other functions that are declared
in malloc.h.
2011-10-27 02:52:24 +00:00
Alan Cox
9c60ca3238 Speed up vm_page_cache() and vm_page_remove() by checking for a few
common cases that can be handled in constant time.  The insight being
that a page's parent in the vm object's tree is very often its
predecessor or successor in the vm object's ordered memq.

Tested by:	jhb
MFC after:	10 days
2011-10-25 16:35:08 +00:00
Attilio Rao
2d5106600e VN_NRESERVLEVEL is used in this file but opt_vm is not included
thus the stub switch won't be correctly handled.
Include opt_vm.h.

Submitted by:	jeff
MFC after:	3 days
2011-10-22 22:00:35 +00:00
Konstantin Belousov
126b36a21e Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by:	marcel
2011-10-15 12:35:18 +00:00
John Baldwin
9860134635 Fix a typo in a comment. 2011-10-14 11:48:32 +00:00
Marcel Moolenaar
5f81660285 In sys_obreak() and when compiling for amd64 or ia64, when the process
is ILP32 (i.e. i386) grant execute permissions by default. The JDK 1.4.x
depends on being able to execute from the heap on i386.
2011-10-13 16:20:10 +00:00
Gleb Smirnoff
8d689e042f Make memguard(9) capable to guard uma(9) allocations. 2011-10-12 18:08:28 +00:00
Konstantin Belousov
17514c1bd9 Style nit.
Submitted by:	jhb
MFC after:	2 weeks
2011-09-29 00:44:34 +00:00
Konstantin Belousov
2042bb377a Fix grammar.
Submitted by:	bf
MFC after:	2 weeks
2011-09-28 16:12:15 +00:00
Konstantin Belousov
abb9b935ca Use the trick of performing the atomic operation on the contained aligned
word to handle the dirty mask updates in vm_page_clear_dirty_mask().
Remove the vm page queue lock around vm_page_dirty() call in vm_fault_hold()
the sole purpose of which was to protect dirty on architectures which
does not provide short or byte-wide atomics.

Reviewed by:	alc, attilio
Tested by:	flo (sparc64)
MFC after:	2 weeks
2011-09-28 14:57:50 +00:00
Konstantin Belousov
005f609130 Use the explicitly-sized types for the dirty and valid masks.
Requested by:	attilio
Reviewed by:	alc
MFC after:	2 weeks
2011-09-28 14:51:28 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Konstantin Belousov
3407fefef6 Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by:    alc, attilio
Tested by:      marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by:	re (bz)
2011-09-06 10:30:11 +00:00
Konstantin Belousov
15523cf799 Update some comments in swap_pager.c.
Reviewed and most wording by:	alc
MFC after:	1 week
Approved by:	re (bz)
2011-08-22 20:44:18 +00:00
Konstantin Belousov
6e903bd0d6 Apply the limit to avoid the overflows in the radix tree subr_blist.c
after the conversion of the swap device size to the page size units,
not before. That lifts the limit on the usable swap partition size
from 32GB to 256GB, that is less depressing for the modern systems.

Submitted by:   Alexander V. Chernikov <melifaro ipfw ru>
Reviewed by:    alc
Approved by:	re (bz)
MFC after:      2 weeks
2011-08-22 11:18:47 +00:00
Robert Watson
a9d2f8d84f Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
Konstantin Belousov
d98d0ce27a - Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
  lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
  VPO_UNMANAGED. As a consequence, pmap code now can use use just
  VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by:	alc
Tested by:	pho (x86, previous version), marius (sparc64),
    marcel (arm, ia64, powerpc), ray (mips)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (bz)
2011-08-09 21:01:36 +00:00
Alan Cox
12f4b65fa6 Fix an error in kmem_alloc_attr(). Unless "tries" is updated,
kmem_alloc_attr() could get stuck in a loop.

Approved by:	re (kib)
MFC after:	3 days
2011-08-07 00:11:39 +00:00
Konstantin Belousov
dda4f96087 Implement the linprocfs swaps file, providing information about the
configured swap devices in the Linux-compatible format.

Based on the submission by:	Robert Millan <rmh debian org>
PR:	kern/159281
Reviewed by:	bde
Approved by:	re (kensmith)
MFC after:	2 weeks
2011-08-01 19:12:15 +00:00
Konstantin Belousov
339772b003 Fix a race in the device pager allocation. If another thread won and
allocated the device pager for the given handle, then the object
fictitious pages list and the object membership in the global object
list still need to be initialized. Otherwise, dev_pager_dealloc() will
traverse uninitialized pointers.

Reported and tested by: pho
Reviewed by:    jhb
Approved by:	re (kensmith)
MFC after:      1 week
2011-07-30 14:13:57 +00:00
Konstantin Belousov
2e32165ce0 Extract the code to translate VM error into errno, into an exported
function vm_mmap_to_errno(). It is useful for the drivers that implement
mmap(2)-like functionality, to be able to return error codes consistent
with mmap(2).

Sponsored by:	The FreeBSD Foundation
No objections from:	alc
MFC after:	1 week
2011-07-10 20:49:13 +00:00
Konstantin Belousov
3103730c82 Style.
MFC after:	3 days
2011-07-10 20:45:13 +00:00
Konstantin Belousov
2801687d56 Add a facility to disable processing page faults. When activated,
uiomove generates EFAULT if any accessed address is not mapped, as
opposed to handling the fault.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc (previous version)
2011-07-09 15:21:10 +00:00
Edward Tomasz Napierala
afcc55f318 All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0.  This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
2011-07-06 20:06:44 +00:00
Attilio Rao
91a1929f07 Handle a race between device_pager and devsw in a more graceful manner:
return an error code rather than panic the kernel.

Sponsored by:	Sandvine Incorporated
Reviewed by:	kib
Tested by:	pho
MFC after:	2 weeks
2011-07-06 15:09:52 +00:00
Alan Cox
a8229fa37c Initialize marker pages as held rather than fictitious/wired. Marking the
page as held is more useful as a safety precaution in case someone forgets
to check for PG_MARKER.

Reviewed by:	kib
2011-07-02 23:34:47 +00:00
Alan Cox
6bbee8e28a Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
Alan Cox
1bfec3dfb6 Revert to using the page queues lock in vm_page_clear_dirty_mask() on
MIPS.  (At present, although atomic_clear_char() is defined by atomic.h
on MIPS, it is not actually implemented by support.S.)
2011-06-23 05:23:59 +00:00
Alan Cox
3c76db4c64 Precisely document the synchronization rules for the page's dirty field.
(Saying that the lock on the object that the page belongs to must be held
only represents one aspect of the rules.)

Eliminate the use of the page queues lock for atomically performing read-
modify-write operations on the dirty field when the underlying architecture
supports atomic operations on char and short types.

Document the fact that 32KB pages aren't really supported.

Reviewed by:	attilio, kib
2011-06-19 19:13:24 +00:00
Konstantin Belousov
3b1025d200 Assert that page is VPO_BUSY or page owner object is locked in
vm_page_undirty(). The assert is not precise due to VPO_BUSY owner
to tracked, so assertion does not catch the case when VPO_BUSY is
owned by other thread.

Reviewed by:	alc
2011-06-11 20:15:19 +00:00
Konstantin Belousov
9d17da3bef Fix a bug in r222586. Lock the page owner object around the modification
of the m->dirty.

Reported and tested by:	nwhitehorn
Reviewed by:	alc
2011-06-11 20:13:28 +00:00
Konstantin Belousov
031ec8c10a In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by:	gavin
Reviewed by:	alc, rmacklem
MFC after:	1 week
2011-06-01 21:00:28 +00:00
Alan Cox
8cd02d00be Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined,
startup_alloc() must be used until uma_startup2() is called.

Reported by:	jh
2011-05-22 17:46:16 +00:00
Alan Cox
342f1793ba 1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called.  Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc().  Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors.  Thus, a Boolean can no longer
effectively describe the state of the UMA allocator.  Instead, make "booted"
have three values to describe how far initialization has progressed.  This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM.  It has only been used between
r182028 and r204128.

Reviewed by:	attilio [1], nwhitehorn [3]
Tested by:	sbruno
2011-05-21 17:43:43 +00:00
Alan Cox
59d7277f4a Fix spelling errors. 2011-05-20 17:28:00 +00:00
Alan Cox
df1bc9de7c Eliminate a redundant #include. ("vm/vm_param.h" already includes
"machine/vmparam.h".)
2011-05-20 15:26:31 +00:00
Matthew D Fleming
cfb00e5aa7 Move the ZERO_REGION_SIZE to a machine-dependent file, as on many
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk.  Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.

Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file.  (Darn the unfamiliar development
environment).

Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.

Requested by:	alc
MFC after:	1 week
MFC with:	r221853
2011-05-13 19:35:01 +00:00
Matthew D Fleming
89cb2a19ec Usa a globally visible region of zeros for both /dev/zero and the md
device.  There are likely other kernel uses of "blob of zeros" than can
be converted.

Reviewed by:	alc
MFC after:	1 week
2011-05-13 18:48:00 +00:00
Max Laier
e18cc7bf3e Another long standing vm bug found at Isilon:
Fix a race between vm_object_collapse and vm_fault.

Reviewed by:	alc@
MFC after:	3 days
2011-05-09 20:27:49 +00:00
David E. O'Brien
cec9f109bb Reap old SPL comments.
Reviewed by:	alc
2011-04-26 22:18:53 +00:00
Konstantin Belousov
86769ac0a4 Fix two bugs in r218670.
Hold the vnode around the region where object lock is dropped, until
vnode lock is acquired.

Do not drop the vnode reference for a case when the object was
deallocated during unlock. Note that in this case, VV_TEXT is cleared
by vnode_pager_dealloc().

Reported and tested by:	pho
Reviewed by:	alc
MFC after:	3 days
2011-04-23 21:38:21 +00:00
John Baldwin
e806d352d2 Fix several places to ignore processes that are not yet fully constructed.
MFC after:	1 week
2011-04-06 17:47:22 +00:00
Edward Tomasz Napierala
f497cda257 In vm_daemon(), do not skip processes stopped with SIGSTOP. 2011-04-06 16:27:04 +00:00
Edward Tomasz Napierala
099e7e950f Add RACCT_RSS.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-06 16:24:24 +00:00
Edward Tomasz Napierala
1ba5ad4210 Add accounting for most of the memory-related resources.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-05 20:23:59 +00:00
Konstantin Belousov
af32c4196f Handle the corner case in vm_fault_quick_hold_pages().
If supplied length is zero, and user address is invalid, function
might return -1, due to the truncation and rounding of the address.
The callers interpret the situation as EFAULT. Instead of handling
the zero length in caller, filter it in vm_fault_quick_hold_pages().

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
2011-03-25 16:38:10 +00:00
John Baldwin
8e6fa660f2 Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
  in fork to honor the locking requirements.  While here, expand the scope
  of the PROC_LOCK() on the new process (p2) to avoid some LORs.  Previously
  the code was locking the new child process (p2) after it had locked the
  parent process (p1).  However, when locking two processes, the safe order
  is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
  having the process locked to use PROC_LOCK().  Every place was already
  locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
  p_state against PRS_NEW.  The PROC_LOCK() alone is sufficient for reading
  the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after:	1 week
2011-03-24 18:40:11 +00:00
Jeff Roberson
e4cd31dd3c - Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
   and other miscellaneous small features.
2011-03-21 09:40:01 +00:00
Edward Tomasz Napierala
3fccbe4397 In vm_daemon(), when iterating over all processes in the system, skip those
which are not yet fully initialized (i.e. ones with p_state == PRS_NEW).
Without it, we could panic in _thread_lock_flags().

Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that
require similar fix.

Reported by:	pho, keramida
Discussed with:	kib
2011-03-18 06:47:23 +00:00
Alan Cox
10cf256074 Eliminate duplication of the fake page code and zone by the device and sg
pagers.

Reviewed by:	jhb
2011-03-11 07:07:48 +00:00
Rebecca Cran
2860553a86 Change the return type of vmspace_swap_count to a long to match the other
vmspace_*_count functions.

MFC after:	3 days
2011-03-01 11:04:30 +00:00
Sergey Kandaurov
7ec9c8d170 Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion.
As it was pointed out by Alan Cox, that no longer serves its purpose with
the modern UMA allocator compared to the old one used in 4.x days.

The removal of sysctl eliminates max_proc_mmap type overflow leading to
the broken mmap(2) seen with large amount of physical memory on arches
with factually unbound KVA space (such as amd64).  It was found that
slightly less than 256GB of physmem was enough to trigger the overflow.

Reviewed by:	alc, kib
Approved by:	avg (mentor)
MFC after:	2 months
2011-02-24 09:22:56 +00:00
Rebecca Cran
65d8409cee Calculate and return the count in vmspace_swap_count as a vm_offset_t
instead of an int to avoid overflow.

While here, clean up some style(9) issues.

PR:		kern/152200
Reviewed by:	kib
MFC after:	2 weeks
2011-02-23 10:28:37 +00:00
Alan Cox
e6ffa21488 Remove pmap fields that are either unused or not fully implemented.
Discussed with:	kib
2011-02-17 15:36:29 +00:00
Konstantin Belousov
56bdf2dbc2 Since r218070 reenabled the call to vm_map_simplify_entry() from
vm_map_insert(), the kmem_back() assumption about newly inserted
entry might be broken due to interference of two factors. In the low
memory condition, when vm_page_alloc() returns NULL, supplied map is
unlocked. If another thread performs kmem_malloc() meantime, and its
map entry is placed right next to our thread map entry in the map,
both entries wire count is still 0 and entries are coalesced due to
vm_map_simplify_entry().

Mark new entry with MAP_ENTRY_IN_TRANSITION to prevent coalesce.
Fix some style issues, tighten the assertions to account for
MAP_ENTRY_IN_TRANSITION state.

Reported and tested by:	pho
Reviewed by:	alc
2011-02-15 09:03:58 +00:00
Konstantin Belousov
03fa5b34a0 Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note
mentioning that vnode lock is needed.

Reviewed by:	alc
Tested by:	pho
MFC after:	1 week
2011-02-13 21:52:26 +00:00
Juli Mallett
6edf6104a9 Use CPU_FOREACH rather than expecting CPUs 0 through mp_ncpus-1 to be present.
Don't micro-optimize the uniprocessor case; use the same loop there.

Submitted by:	Bhanu Prakash
Reviewed by:	kib, jhb
2011-02-12 02:10:08 +00:00
Alan Cox
d7b20e4b45 Retire VFS_BIO_DEBUG. Convert those checks that were still valid into
KASSERT()s and eliminate the rest.

Replace excessive printf()s and a panic() in bufdone_finish() with a
KASSERT() in vm_page_io_finish().

Reviewed by:	kib
2011-02-12 01:00:00 +00:00
Alan Cox
17f3095d1a Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean().  They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by:	kib
MFC after:	3 weeks
2011-02-05 21:21:27 +00:00
Alan Cox
0cc74f144e Since the last parameter to vm_object_shadow() is a vm_size_t and not a
vm_pindex_t, it makes no sense for its callers to perform atop().  Let
vm_object_shadow() do that instead.
2011-02-04 21:49:24 +00:00
Alan Cox
3d05198e23 Release the free page queues lock earlier in vm_page_alloc().
Discussed with:	kib@
2011-01-30 23:55:48 +00:00
Alan Cox
d2a444c0da Reenable the call to vm_map_simplify_entry() from vm_map_insert() for non-
MAP_STACK_* entries.  (See r71983 and r74235.)

In some cases, performing this call to vm_map_simplify_entry() halves the
number of vm map entries used by the Sun JDK.
2011-01-29 15:23:02 +00:00
Matthew D Fleming
00f0e671ff Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9).  This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by:	rwatson
2011-01-27 00:34:12 +00:00
Sergey Kandaurov
4053b05b91 Make MSGBUF_SIZE kernel option a loader tunable kern.msgbufsize.
Submitted by:	perryh pluto.rain.com (previous version)
Reviewed by:	jhb
Approved by:	kib (mentor)
Tested by:	universe
2011-01-21 10:26:26 +00:00
Alan Cox
2c4992db70 Move the definition of M_VMPGDATA to the swap pager, where the only
remaining uses are.
2011-01-18 04:54:43 +00:00
Alan Cox
44e46b9e53 Explicitly initialize the page's queue field to PQ_NONE instead of relying
on PQ_NONE being zero.

Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for
PQ_NONE.

Reviewed by:	kib@
2011-01-17 19:17:26 +00:00
Alan Cox
9454f82862 Sort function prototypes. 2011-01-16 20:40:50 +00:00
Alan Cox
43319c116a Update a lock annotation on the page structure. 2011-01-16 18:04:01 +00:00
Alan Cox
4c6a2e7a1f Shift responsibility for synchronizing access to the page's act_count
field to the object's lock.

Reviewed by:	kib@
2011-01-16 18:01:39 +00:00
Alan Cox
9648f3447d Clean up the start of vm_page_alloc(). In particular, eliminate an
assertion that is no longer required.  Long ago, calls to vm_page_alloc()
from an interrupt handler had to specify VM_ALLOC_INTERRUPT so that
vm_page_alloc() would not attempt to reclaim a PQ_CACHE page from another vm
object.  Today, with the synchronization on a vm object's collection of
PQ_CACHE pages, this is no longer an issue.  In fact, VM_ALLOC_INTERRUPT now
reclaims PQ_CACHE pages just like VM_ALLOC_{NORMAL,SYSTEM}.

MFC after:	3 weeks
2011-01-16 17:33:34 +00:00
Konstantin Belousov
c6c9025b65 For consistency, use kernel_object instead of &kernel_object_store
when initializing the object mutex. Do the same for kmem_object.

Discussed with:	alc
MFC after:	1 week
2011-01-15 21:56:38 +00:00
Alan Cox
ff5958e785 For some time now, the kernel and kmem objects have been ordinary
OBJT_PHYS objects.  Thus, there is no need for handling them specially
in vm_fault().  In fact, this special case handling would have led to
an assertion failure just before the call to pmap_enter().

Reviewed by:	kib@
MFC after:	6 weeks
2011-01-15 19:21:28 +00:00
John Baldwin
58ccf5b41c Remove unneeded includes of <sys/linker_set.h>. Other headers that use
it internally contain nested includes.

Reviewed by:	bde
2011-01-11 13:59:06 +00:00
Konstantin Belousov
50a57dfbec Move repeated MAXSLP definition from machine/vmparam.h to sys/vmmeter.h.
Update the outdated comments describing MAXSLP and the process
selection algorithm for swap out.

Comments wording and reviewed by:	alc
2011-01-09 12:50:44 +00:00
Alan Cox
27772ddf45 Eliminate a redundant alignment directive on the page locks array. 2011-01-09 04:34:02 +00:00
Alan Cox
ce8a13bdb9 Eliminate the counting of vm_page_pa_tryrelock calls. We really don't
need it anymore.  Moreover, its implementation had a type mismatch, a
long is not necessarily an uint64_t.  (This mismatch was hidden by
casting.)  Move the remaining two counters up a level in the sysctl
hierarchy.  There is no reason for them to be under the vm.pmap node.

Reviewed by:	kib
2011-01-08 22:45:22 +00:00
Alan Cox
17f6a17bf7 Release the page lock early in vm_pageout_clean(). There is no reason to
hold this lock until the end of the function.

With the aforementioned change to vm_pageout_clean(), page locks don't need
to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions.

Reviewed by:	kib
2011-01-03 00:41:56 +00:00
Alan Cox
edf93b25d3 Make a couple refinements to r216799 and r216810. In particular, revise
a comment and move it to its proper place.

Reviewed by:	kib
2011-01-01 17:39:38 +00:00
Rebecca Cran
4c18dec9a9 There can be more than 0x20000000 swap meta blocks allocated if a swap-backed
md(4) device is used. Don't panic when deallocating such a device if swap
has been used.

PR:	kern/133170
Discussed with:	kib
MFC after:	3 days
2011-01-01 16:59:05 +00:00
Konstantin Belousov
50cfe7fa50 Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

 Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by:	alc (as a part of the bigger patch)
MFC after:	1 month (after r216799 is merged)
2010-12-29 22:26:49 +00:00
Alan Cox
fef87167c9 There is no point in vm_contig_launder{,_page}() flushing held pages,
instead skip over them.  As long as a page is held, it can't be reclaimed by
contigmalloc(M_WAITOK).  Moreover, a held page may be undergoing
modification, e.g., vmapbuf(), so even if the hold were released before the
completion of contigmalloc(), the page might have to be flushed again.

MFC after:	3 weeks
2010-12-29 20:35:36 +00:00
Konstantin Belousov
3280870dca Move the increment of vm object generation count into
vm_object_set_writeable_dirty().

Fix an issue where restart of the scan in vm_object_page_clean() did
not removed write permissions for newly added pages or, if the mapping
for some already scanned page changed to writeable due to fault.
Merge the two loops in vm_object_page_clean(), doing the remove of
write permission and cleaning in the same loop. The restart of the
loop then correctly downgrade writeable mappings.

Fix an issue where a second caller to msync() might actually return
before the first caller had actually completed flushing the
pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not
before.

Calls to pmap_is_modified() are not needed after pmap_remove_write()
there.

Proposed, reviewed and tested by:	alc
MFC after:	1 week
2010-12-29 12:53:53 +00:00
Alan Cox
a5dbab5444 Correct a typo in vm_fault_quick_hold_pages().
Reported by:	Bartosz Stec
2010-12-28 20:02:30 +00:00
Alan Cox
4de2261903 Move vm_object_print()'s prototype to the expected place. 2010-12-27 07:12:22 +00:00
Alan Cox
0b47b37621 Retire vm_fault_quick(). It's no longer used.
Reviewed by:	kib@
2010-12-25 23:54:50 +00:00
Alan Cox
82de724fe1 Introduce and use a new VM interface for temporarily pinning pages. This
new interface replaces the combined use of vm_fault_quick() and
pmap_extract_and_hold() throughout the kernel.

In collaboration with:	kib@
2010-12-25 21:26:56 +00:00
Alan Cox
acd11c7499 Introduce vm_fault_hold() and use it to (1) eliminate a long-standing race
condition in proc_rwmem() and to (2) simplify the implementation of the
cxgb driver's vm_fault_hold_user_pages().  Specifically, in proc_rwmem()
the requested read or write could fail because the targeted page could be
reclaimed between the calls to vm_fault() and vm_page_hold().

In collaboration with:	kib@
MFC after:	6 weeks
2010-12-20 22:49:31 +00:00
Alan Cox
8c22654d7e Implement and use a single optimized function for unholding a set of pages.
Reviewed by:	kib@
2010-12-17 22:41:22 +00:00
Alan Cox
7984ab250b Change memguard_fudge() so that it can handle km_max being zero. Not
every platform defines VM_KMEM_SIZE_MAX, and on those platforms km_max
will be zero.

Reviewed by:	mdf
Tested by:	marius
2010-12-14 05:47:35 +00:00
Max Laier
a5db445da4 Fix a long standing (from the original 4.4BSD lite sources) race between
vmspace_fork and vm_map_wire that would lead to "vm_fault_copy_wired: page
missing" panics.  While faulting in pages for a map entry that is being
wired down, mark the containing map as busy.  In vmspace_fork wait until the
map is unbusy, before we try to copy the entries.

Reviewed by:	kib
MFC after:	5 days
Sponsored by:	Isilon Systems, Inc.
2010-12-09 21:02:22 +00:00
Jayachandran C.
48772ca4aa Revert the vm/vm_page.c change in r216317.
This adds back changes in r216141, which was reverted by the above
check in.
2010-12-09 07:39:06 +00:00
Jayachandran C.
aa93efedd8 swi_vm() for mips. 2010-12-09 06:54:06 +00:00
Edward Tomasz Napierala
a2f510e8ec Fix comment intentation. 2010-12-04 17:41:58 +00:00
Warner Losh
6f1a8765be To make minidumps work properly on mips for memory that's direct
mapped and entered via vm_page_setup, keep track of it like we do
for amd64.

# A separate commit will be made to move this to a capability-based ifdef
# rather than arch-based ifdef.

Submitted by:	alc@
MFC after:	1 week
2010-12-03 04:39:48 +00:00
Edward Tomasz Napierala
ef694c1ac4 Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object".  This is required to make it possible to account
for per-jail swap usage.

Reviewed by:	kib@
Tested by:	pho@
Sponsored by:	FreeBSD Foundation
2010-12-02 17:37:16 +00:00
Alan Cox
05cb58f669 Correct an error in the allocation of the vm_page_dump array in
vm_page_startup().  Specifically, the dump_avail array should be used
instead of the phys_avail array to calculate the size of vm_page_dump.  For
example, the pages for the message buffer are allocated prior to
vm_page_startup() by subtracting them from the last entry in the phys_avail
array, but the first thing that vm_page_startup() does after creating the
vm_page_dump array is to set the bits corresponding to the message buffer
pages in that array.  However, these bits might not actually exist in the
array, because the size of the array is determined by the current value in
the last entry of the phys_avail array.  In general, the only reason why
this doesn't always result in an out-of-bounds array access is that the size
of the vm_page_dump array is rounded up to the next page boundary.  This
change eliminates that dependence on rounding (and luck).

MFC after:	6 weeks
2010-12-01 03:35:19 +00:00
Jayachandran C.
aa54636620 Fix issue noted by alc while reviewing r215938:
The current implementation of vm_page_alloc_freelist() does not handle
order > 0 correctly. Remove order parameter to the function and use it
only for order 0 pages.

Submitted by:	alc
2010-11-28 05:51:31 +00:00
Konstantin Belousov
780636b72a After the sleep caused by encountering a busy page, relookup the page.
Submitted and reviewed by:	alc
Reprted and tested by:	pho
MFC after:	5 days
2010-11-24 12:25:17 +00:00
Konstantin Belousov
3157c50313 Eliminate the mab, maf arrays and related variables.
The change also fixes off-by-one error in the calculation of mreq.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	5 days
2010-11-21 10:18:28 +00:00
Alan Cox
17ea6f00d5 Optimize vm_object_terminate().
Reviewed by:	kib
MFC after:	1 week
2010-11-20 22:30:09 +00:00
Konstantin Belousov
4c7b9a2063 The runlen returned from vm_pageout_flush() might be zero legitimately,
when mreq page has status VM_PAGER_AGAIN.

MFC after:	5 days
2010-11-20 17:27:38 +00:00
Alan Cox
00f8bffc22 Reduce the amount of detail printed by vm_page_free_toq() when it panics.
Reviewed by:	kib
2010-11-19 17:49:08 +00:00
Max Laier
85f2a0c91e Off by one page in vm_reserv_reclaim_contig(): Also reclaim reservations
with only a single free page if that satisfies the requested size.

MFC after:	3 days
Reviewed by:	alc
2010-11-19 04:30:33 +00:00
Konstantin Belousov
1e8a675c73 vm_pageout_flush() might cache the pages that finished write to the
backing storage. Such pages might be then reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
write-protected still. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of successfully written run from vm_pageout_flush(),
that is, the count of pages between requested page and first page
after requested with status VM_PAGER_AGAIN. Supply the requested page
index in the array to vm_pageout_flush(). Use the returned run length
to forward the index of next page to clean in vm_object_page_clean().

Reported by:	avg
Reviewed by:	alc
MFC after:	1 week
2010-11-18 21:09:02 +00:00
Konstantin Belousov
4166faaee0 Only increment object generation count when inserting the page into
object page list.  The only use of object generation count now is a
restart of the scan in vm_object_page_clean(), which makes sense to do
on the page addition. Page removals do not affect the dirtiness of the
object, as well as manipulations with the shadow chain.

Suggested and reviewed by:	alc
MFC after:    1 week
2010-11-18 20:46:28 +00:00
Konstantin Belousov
7022f954c3 Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by:	    swell.k gmail com
MFC after:   1 week
2010-11-14 21:59:11 +00:00
Konstantin Belousov
94bce4535d Use symbolic names instead of hardcoding values for magic p_osrel constants.
MFC after:   1 week
2010-11-14 18:24:12 +00:00
Konstantin Belousov
9a6d144ff8 Implement a (soft) stack guard page for auto-growing stack mappings.
The unmapped page separates the tip of the stack and possible adjanced
segment, making some uses of stack overflow harder.  The stack growing
code refuses to expand the segment to the last page of the reseved
region when sysctl security.bsd.stack_guard_page is set to 1. The
default value for sysctl and accompanying tunable is 0.

Please note that mmap(MAP_FIXED) still can place a mapping right up to
the stack, making continuous region.

Reviewed by:	alc
MFC after:	1 week
2010-11-14 17:53:52 +00:00
Alan Cox
2cf36c8f67 Enable reservation-based physical memory allocation. Even without the
creation of large page mappings in the pmap, it can provide modest
performance benefits.  In particular, for a "buildworld" on a 2x 1GHz
Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system
time by 12.6%.

Tested by:	marius@
2010-11-10 17:57:34 +00:00
Alan Cox
e48262487a In case the stack size reaches its limit and its growth must be restricted,
ensure that grow_amount is a multiple of the page size.  Otherwise, the
kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and
produce a core file with a missing stack on FreeBSD 7.x.

Diagnosed and reported by: jilles
Reviewed by:	kib
MFC after:	1 week
2010-11-07 21:40:34 +00:00
Oleksandr Tymoshenko
903ba3da86 - Add minidump support for FreeBSD/mips 2010-11-07 03:09:02 +00:00
John Baldwin
e9a069d8af Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc().  This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by:	trema
Reviewed by:	attilio
MFC after:	1 week
2010-11-04 15:33:50 +00:00
Alan Cox
d689bc0082 Correct some format strings used by sysctls.
MFC after:	1 week
2010-10-30 18:00:53 +00:00
John Baldwin
1a587ef2a5 - Make 'vm_refcnt' volatile so that compilers won't be tempted to treat
its value as a loop invariant.  Currently this is a no-op because
  'atomic_cmpset_int()' clobbers all memory on current architectures.
- Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop
  a reference in vmspace_free().

Reviewed by:	alc
MFC after:	1 month
2010-10-21 17:29:32 +00:00
Andriy Gapon
55144670c2 PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments
Reviewed by:	alc
MFC after:	4 days
2010-10-20 05:17:23 +00:00
Matthew D Fleming
20ed0cb0c6 uma_zfree(zone, NULL) should do nothing, to match free(9).
Noticed by:	Ron Steinke <rsteinke at isilon dot com>
MFC after:	3 days
2010-10-19 16:06:00 +00:00
Lawrence Stewart
1c6cae9711 Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jhb
2010-10-16 04:41:45 +00:00
Lawrence Stewart
c4ae7908a7 - Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
  a zone. This is useful for providing stats via sysctl amongst other things.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jhb
MFC after:	2 weeks
2010-10-16 04:14:45 +00:00
Alan Cox
f8616ebfae If vm_map_find() is asked to allocate a superpage-aligned region of virtual
addresses that is greater than a superpage in size but not a multiple of
the superpage size, then vm_map_find() is not always expanding the kernel
pmap to support the last few small pages being allocated.  These failures
are not commonplace, so this was first noticed by someone porting FreeBSD
to a new architecture.  Previously, we grew the kernel page table in
vm_map_findspace() when we found the first available virtual address.
This works most of the time because we always grow the kernel pmap or page
table by an amount that is a multiple of the superpage size.  Now, instead,
we defer the call to pmap_growkernel() until we are committed to a range
of virtual addresses in vm_map_insert().  In general, there is another
reason to prefer calling pmap_growkernel() in vm_map_insert().  It makes
it possible for someone to do the equivalent of an mmap(MAP_FIXED) on the
kernel map.

Reported by:	Svatopluk Kraus
Reviewed by:	kib@
MFC after:	3 weeks
2010-10-04 16:49:40 +00:00
Matthew D Fleming
d69b01efc2 Replace an XXX comment with the appropriate code.
Submitted by:	alc
2010-09-20 20:41:59 +00:00
Alan Cox
da0483096d Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8.  This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR:		150260
MFC after:	3 weeks
2010-09-19 19:42:04 +00:00
Alan Cox
8304adaac6 Make refinements to r212824. In particular, don't make
vm_map_unlock_nodefer() part of the synchronization interface for maps.

Add comments to vm_map_unlock_and_wait() and vm_map_wakeup() describing
how they should be used.  In particular, describe the deferred deallocations
issue with vm_map_unlock_and_wait().

Redo the implementation of vm_map_unlock_and_wait() so that it passes
along the caller's file and line information, just like the other map
locking primitives.

Reviewed by:	kib
X-MFC after:	r212824
2010-09-19 17:43:22 +00:00
Konstantin Belousov
0b367bd8c0 Adopt the deferring of object deallocation for the deleted map entries
on map unlock to the lock downgrade and later read unlock operation.

System map entries cannot be backed by OBJT_VNODE objects, no need to
defer deallocation for them. Map entries from user maps do not require
the owner map for deallocation, and can be accumulated in the
thread-local list for freeing when a user map is unlocked.

Move the collection of entries for deferred reclamation into
vm_map_delete(). Create helper vm_map_process_deferred(), that is
called from locations where processing is feasible. Do not process
deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
held.

Reviewed by:	alc, rstone (previous versions)
Tested by:	pho
MFC after:	2 weeks
2010-09-18 15:03:31 +00:00
Matthew D Fleming
4e6571599b Re-add r212370 now that the LOR in powerpc64 has been resolved:
Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte.  This behaviour was preserved, though it should not be
necessary.

Reviewed by:    phk (original patch)
2010-09-16 16:13:12 +00:00
Matthew D Fleming
404a593e28 Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn
2010-09-13 18:48:23 +00:00
Matthew D Fleming
dd67e2103c Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte.  This behaviour was preserved, though it should not be necessary.

Reviewed by:	phk
2010-09-09 18:33:46 +00:00
Nathan Whitehorn
42768fec0f On architectures with non-tree-based page tables like PowerPC, every page
in a range must be checked when calling pmap_remove(). Calling
pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range
of the map could result in attempting to demap an extraordinary number
of pages (> 10^15), so iterate through each map entry and unmap each of
them individually.

MFC after:	6 weeks
2010-09-09 13:32:58 +00:00
Ryan Stone
d473d3a164 Fix a typo in r212281. uintptr -> uintptr_t
Pointy hat to:  rstone

Approved by:    emaste (mentor)
MFC after:      2 weeks
2010-09-07 02:51:11 +00:00
Ryan Stone
0d41964095 In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock.  This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map.  Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by:	fabient
Approved by:	emaste (mentor)
MFC after:	2 weeks
2010-09-07 00:23:45 +00:00
Andriy Gapon
a9b89cf1c1 vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup
vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump.
If opt_msgbuf.h is not included and MSGBUF_SIZE is overriden in kernel
config, then not all msgbuf pages will be dumped.  And most importantly,
struct msgbuf itself will not be included.  Thus the dump would look
corrupted/incomplete to tools like kgdb, dmesg, etc that try to access
struct msgbuf as one of the first things they do when working on a crash
dump.

MFC after:	5 days
2010-09-03 10:40:53 +00:00
Matthew D Fleming
a2a200a24d Have memguard(9) crash with an easier-to-debug message on double-free.
Reviewed by:    zml
MFC after:      3 weeks
2010-08-31 17:43:47 +00:00
Matthew D Fleming
6d3ed393d6 The realloc case for memguard(9) will copy too many bytes when
reallocating to a smaller-sized allocation.  Fix this issue.

Noticed by:     alc
Reviewed by:    alc
Approved by:    zml (mentor)
MFC after:      3 weeks
2010-08-31 16:57:58 +00:00
Alan Cox
74ffb9af15 Add the MAP_PREFAULT_READ option to mmap(2).
Reviewed by:	jhb, kib
2010-08-28 16:57:07 +00:00
Andre Oppermann
e49471b04b Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally).  This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value.  The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.

Reviewed by:	jeffr
MFC after:	1 week
2010-08-16 14:24:00 +00:00
Matthew D Fleming
f02d86e269 Fix compile. It seemed better to have memguard.c include opt_vm.h in
case future compile-time knobs were added that it wants to use.
Also add include guards and forward declarations to vm/memguard.h.

Approved by:    zml (mentor)
MFC after:      1 month
2010-08-12 16:54:43 +00:00
Matthew D Fleming
e3813573bd Rework memguard(9) to reserve significantly more KVA to detect
use-after-free over a longer time.  Also release the backing pages of
a guarded allocation at free(9) time to reduce the overhead of using
memguard(9).  Allow setting and varying the malloc type at run-time.
Add knobs to allow:

 - randomly guarding memory
 - adding un-backed KVA guard pages to detect underflow and overflow
 - a lower limit on the size of allocations that are guarded

Reviewed by:    alc
Reviewed by:    brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
Silence from:   -arch
Approved by:    zml (mentor)
MFC after:      1 month
2010-08-11 22:10:37 +00:00
Konstantin Belousov
3979450b4c Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with:	pho
MFC after:	1 month
2010-08-06 09:42:15 +00:00
John Baldwin
a3870a1826 Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy.  This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
  via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
  a CPU belongs to.  Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
  (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
  The MD code is required to populate an array of mem_affinity structures.
  Each entry in the array defines a range of memory (start and end) and a
  domain for the range.  Multiple entries may be present for a single
  domain.  The list is terminated by an entry where all fields are zero.
  This array of structures is used to split up phys_avail[] regions that
  fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
  used when fulfulling a physical memory allocation.  Right now the
  per-domain freelists are listed in a round-robin order for each domain.
  In the future a table such as the ACPI SLIT table may be used to order
  the per-domain lookup lists based on the penalty for each memory domain
  relative to a specific domain.  The lookup lists may be examined via a
  new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
  pick a lookup list when allocating memory.

Reviewed by:	alc
2010-07-27 20:33:50 +00:00
Edward Tomasz Napierala
fd6f4ffb27 Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.
2010-07-27 19:26:18 +00:00
Alan Cox
2af6e14d39 Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name.  The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by:	kib
2010-07-27 17:31:03 +00:00
Alan Cox
9e4e511499 Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer.  For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%.  It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings.  Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings.  While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map.  In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated.  The old size was one page too
large.  (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by:	kib
2010-07-25 17:43:38 +00:00
Jayachandran C.
49ca10d40c Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards  will introduce
a race condtion(noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the  function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by:	alc
2010-07-21 09:27:00 +00:00
Alan Cox
b99348e5ea Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit.  Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by:	kib
2010-07-09 19:38:30 +00:00
Konstantin Belousov
1d9e77f6bf Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the
flag is always provided, and unconditionally retry after sleep for the
busy page or failed allocation.

The intent is to remove VM_ALLOC_RETRY eventually.

Proposed and reviewed by:	alc
2010-07-08 08:37:51 +00:00
Konstantin Belousov
5f195aa32e Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by:	alc
MFC after:	2 weeks
2010-07-05 21:13:32 +00:00
Konstantin Belousov
757216f3a5 Several cleanups for the r209686:
- remove unused defines;
- remove unused curgeneration argument for vm_object_page_collect_flush();
- always assert that vm_object_page_clean() is called for OBJT_VNODE;
- move vm_page_find_least() into for() statement initial clause.

Submitted by:	alc
2010-07-04 19:02:32 +00:00
Konstantin Belousov
e239bb9730 Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by:	alc
Tested by:	pho
2010-07-04 11:26:56 +00:00
Konstantin Belousov
b382c10a57 Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by:	alc
Tested by:	pho
MFC after:	1 week
2010-07-04 11:13:33 +00:00
Alan Cox
b64400a03f Improve the comment and man page for vm_page_alloc(). Specifically,
document one of the optional flags; clarify which of the flags are
optional (and which are not), and remove mention of a restriction on
the reclamation of cached pages that no longer holds since version 7.

MFC after:	1 week
2010-07-03 18:25:37 +00:00
Alan Cox
cdaba1f2be Push down the acquisition of the page queues lock into
vm_pageout_page_stats().  In particular, avoid acquiring the page
queues lock unless iterating over the active queue.
2010-07-02 20:56:22 +00:00
Alan Cox
4b0640310a Use vm_page_prev() instead of vm_page_lookup() in the implementation of
vm_fault()'s automatic delete-behind heuristic.
vm_page_prev() is typically faster.
2010-07-02 19:59:18 +00:00
Alan Cox
9cf5198832 With the demise of page coloring, the page queue macros no longer serve any
useful purpose.  Eliminate them.

Reviewed by:	kib
2010-07-02 15:02:51 +00:00
Alan Cox
95976f3f38 Simplify entry to vm_pageout_clean(). Expect the page to be locked.
Previously, the caller unlocked the page, and vm_pageout_clean()
immediately reacquired the page lock.  Also, assert rather than test
that the page is neither busy nor held.  Since vm_pageout_clean() is
called with the object and page locked, the page can't have changed
state since the caller verified that the page is neither busy nor
held.
2010-06-30 17:20:33 +00:00
Alan Cox
91b4f42767 Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean().  When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.

Reviewed by:	kib@
MFC after:	3 weeks
2010-06-21 23:27:24 +00:00
Sean Bruno
bf96595915 Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by:	rwatson alc
Approved by:	scottl (mentor) peter
Obtained from:	Yahoo Inc.
2010-06-15 19:28:37 +00:00
Alan Cox
9ee2165f5d Eliminate checks for a page having a NULL object in vm_pageout_scan()
and vm_pageout_page_stats().  These checks were recently introduced by
the first page locking commit, r207410, but they are not needed.  At
the same time, eliminate some redundant accesses to the page's object
field.  (These accesses should have neen eliminated by r207410.)

Make the assertion in vm_page_flag_set() stricter.  Specifically, only
managed pages should have PG_WRITEABLE set.

Add a comment documenting an assertion to vm_page_flag_clear().

It has long been the case that fictitious pages have their wire count
permanently set to one.  Add comments to vm_page_wire() and
vm_page_unwire() documenting this.  Add assertions to these functions
as well.

Update the comment describing vm_page_unwire().  Much of the old
comment had little to do with vm_page_unwire(), but a lot to do with
_vm_page_deactivate().  Move relevant parts of the old comment to
_vm_page_deactivate().

Only pages that belong to an object can be paged out.  Therefore, it
is pointless for vm_page_unwire() to acquire the page queues lock and
enqueue such pages in one of the paging queues.  Generally speaking,
such pages are immediately freed after the call to vm_page_unwire().
Previously, it was the call to vm_page_free() that reacquired the page
queues lock and removed these pages from the paging queues.  Now, we
will never acquire the page queues lock for this case.  (It is also
worth noting that since both vm_page_unwire() and vm_page_free()
occurred with the page locked, the page daemon never saw the page with
its object field set to NULL.)

Change the panic with vm_page_unwire() to provide a more precise message.

Reviewed by:	kib@
2010-06-14 19:54:19 +00:00
John Baldwin
3aa6d94e0c Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
Alan Cox
ce18658792 Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines.  Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked.  Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick().  (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page.  Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete.  Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used.  Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by:	kib@
2010-06-10 16:56:35 +00:00
Jayachandran C.
17dca144a2 Make vm_contig_grow_cache() extern, and use it when vm_phys_alloc_contig()
fails to allocate MIPS page table pages.  The current usage of VM_WAIT in
case of vm_phys_alloc_contig() failure is not correct, because:

"There is no guarantee that any of the available free (or cached) pages
after the VM_WAIT will fall within the range of suitable physical
addresses.  Every time this function sleeps and a single page is freed
(or cached) by someone else, this function will be reawakened.  With
a little bad luck, you could spin indefinitely."

We also add low and high parameters to vm_contig_grow_cache() and
vm_contig_launder() so that we restrict vm_contig_launder() to the range
of pages we are interested in.

Reported by: alc

Reviewed by:	alc
Approved by:	rrs (mentor)
2010-06-04 06:35:36 +00:00
Konstantin Belousov
4d65036b4f Do not leak vm page lock in vm_contig_launder(), vm_pageout_page_lock()
always returns with the page locked.

Submitted by:	alc
Pointy hat to:	kib
2010-06-03 18:34:34 +00:00
Konstantin Belousov
2bbfbc3fe2 Add assertion and comment in vm_page_flag_set() describing the expectations
when the PG_WRITEABLE flag is set.

Reviewed by:	alc
2010-06-03 10:11:45 +00:00
Alan Cox
f4e10cdaa6 Maintain the pretense that we support 32KB pages for the sake of the ia64
LINT build.
2010-06-03 02:24:53 +00:00
Alan Cox
c8fa870982 Minimize the use of the page queues lock for synchronizing access to the
page's dirty field.  With the exception of one case, access to this field
is now synchronized by the object lock.
2010-06-02 15:46:37 +00:00
Alan Cox
8f0d5d3b9f When I pushed down the page queues lock into pmap_is_modified(), I created
an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls
vm_page_dirty() must perform the call first.  Otherwise, pmap_is_modified()
could return FALSE without acquiring the page queues lock because the page
is not (currently) writeable, and the caller to pmap_is_modified() might
believe that the page's dirty field is clear because it has not seen the
effect of the vm_page_dirty() call.

When I pushed down the page queues lock into pmap_is_modified(), I
overlooked one place where this ordering dependence is violated:
pmap_enter().  In a rare situation pmap_enter() can be called to replace a
dirty mapping to one page with a mapping to another page.  (I say rare
because replacements generally occur as a result of a copy-on-write fault,
and so the old page is not dirty.)  This change delays clearing PG_WRITEABLE
until after vm_page_dirty() has been called.

Fixing the ordering dependency also makes it easy to introduce a small
optimization: When pmap_enter() used to replace a mapping to one page with a
mapping to another page, it freed the pv entry for the first mapping and
later called the pv entry allocator for the new mapping.  Now, pmap_enter()
attempts to recycle the old pv entry, saving two calls to the pv entry
allocator.

There is no point in setting PG_WRITEABLE on unmanaged pages, so don't.
Update a comment to reflect this.

Tidy up the variable declarations at the start of pmap_enter().
2010-05-29 17:10:45 +00:00
Alan Cox
c46b90e90a Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced().  Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively.  In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting.  This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages().  This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition.  Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by:	kib
2010-05-26 18:00:44 +00:00
Alan Cox
e98d019d3c Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages().  It is no longer needed.

Submitted by:	kib
2010-05-25 02:26:25 +00:00
Alan Cox
567e51e18c Roughly half of a typical pmap_mincore() implementation is machine-
independent code.  Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified().  Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization.  Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean().  Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by:	kib (an earlier version)
2010-05-24 14:26:57 +00:00
Konstantin Belousov
a6e38685f3 When waiting for the busy page, do not unlock the object unless unlock
cannot be avoided.

Reviewed by:	alc
MFC after:	1 week
2010-05-20 08:51:01 +00:00
Alan Cox
aa12e8b71d The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty().  Perform some style clean up while I'm here.

Reviewed by:	kib
2010-05-18 16:40:29 +00:00
Alan Cox
9ab6032f73 On entry to pmap_enter(), assert that the page is busy. While I'm
here, make the style of assertion used by pmap_enter() consistent
across all architectures.

On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.

With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy.  Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.

Correct a long-standing bug in vm_page_cowsetup().  We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try.  Previously, the call to pmap_remove_write()
would have failed silently.
2010-05-16 23:45:10 +00:00
Alan Cox
a4bc2c8929 Correct an error of omission in r202897: Now that amd64 uses the direct map
to access the message buffer, we must explicitly request that the underlying
physical pages are included in a crash dump.

Reported by:	Benjamin Kaduk
2010-05-16 19:25:56 +00:00
Alan Cox
a1a95cd608 Add a comment about the proper use of vm_object_page_remove().
MFC after:	1 week
2010-05-16 16:54:05 +00:00
Alan Cox
b27753086c Update synchronization annotations for struct vm_page. Add a comment
explaining how the setting of PG_WRITEABLE is synchronized.
2010-05-11 01:29:18 +00:00
Konstantin Belousov
6e2175fc06 Continue cleaning the queue instead of moving to the next queue or
bailing out if acquisition of page lock caused page position in the
queue to change.

Pointed out by:	alc
2010-05-10 11:53:40 +00:00
Alan Cox
eee9d99231 Push down the acquisition of the page queues lock into vm_pageq_remove().
(This eliminates a surprising number of page queues lock acquisitions by
vm_fault() because the page's queue is PQ_NONE and thus the page queues
lock is not needed to remove the page from a queue.)
2010-05-09 16:55:42 +00:00
Alan Cox
db1f085eee Call vm_page_deactivate() rather than vm_page_dontneed() in
swp_pager_force_pagein().  By dirtying the page, swp_pager_force_pagein()
forces vm_page_dontneed() to insert the page at the head of the inactive
queue, just like vm_page_deactivate() does.  Moreover, because the page
was invalid, it can't have been mapped, and thus the other effect of
vm_page_dontneed(), clearing the page's reference bits has no effect.  In
summary, there is no reason to call vm_page_dontneed() since its effect
will be identical to calling the simpler vm_page_deactivate().
2010-05-09 16:27:42 +00:00
Alan Cox
d061cdd513 Remove the page queues lock around a call to vm_page_activate(). Make the
page dirty before adding it to the active queue.
2010-05-09 00:32:52 +00:00
Alan Cox
34e7251f10 Minimize the scope of the page queues lock in vm_fault(). 2010-05-08 21:35:51 +00:00