Commit Graph

2954 Commits

Author SHA1 Message Date
Jayachandran C.
48772ca4aa Revert the vm/vm_page.c change in r216317.
This adds back changes in r216141, which was reverted by the above
check in.
2010-12-09 07:39:06 +00:00
Jayachandran C.
aa93efedd8 swi_vm() for mips. 2010-12-09 06:54:06 +00:00
Edward Tomasz Napierala
a2f510e8ec Fix comment intentation. 2010-12-04 17:41:58 +00:00
Warner Losh
6f1a8765be To make minidumps work properly on mips for memory that's direct
mapped and entered via vm_page_setup, keep track of it like we do
for amd64.

# A separate commit will be made to move this to a capability-based ifdef
# rather than arch-based ifdef.

Submitted by:	alc@
MFC after:	1 week
2010-12-03 04:39:48 +00:00
Edward Tomasz Napierala
ef694c1ac4 Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object".  This is required to make it possible to account
for per-jail swap usage.

Reviewed by:	kib@
Tested by:	pho@
Sponsored by:	FreeBSD Foundation
2010-12-02 17:37:16 +00:00
Alan Cox
05cb58f669 Correct an error in the allocation of the vm_page_dump array in
vm_page_startup().  Specifically, the dump_avail array should be used
instead of the phys_avail array to calculate the size of vm_page_dump.  For
example, the pages for the message buffer are allocated prior to
vm_page_startup() by subtracting them from the last entry in the phys_avail
array, but the first thing that vm_page_startup() does after creating the
vm_page_dump array is to set the bits corresponding to the message buffer
pages in that array.  However, these bits might not actually exist in the
array, because the size of the array is determined by the current value in
the last entry of the phys_avail array.  In general, the only reason why
this doesn't always result in an out-of-bounds array access is that the size
of the vm_page_dump array is rounded up to the next page boundary.  This
change eliminates that dependence on rounding (and luck).

MFC after:	6 weeks
2010-12-01 03:35:19 +00:00
Jayachandran C.
aa54636620 Fix issue noted by alc while reviewing r215938:
The current implementation of vm_page_alloc_freelist() does not handle
order > 0 correctly. Remove order parameter to the function and use it
only for order 0 pages.

Submitted by:	alc
2010-11-28 05:51:31 +00:00
Konstantin Belousov
780636b72a After the sleep caused by encountering a busy page, relookup the page.
Submitted and reviewed by:	alc
Reprted and tested by:	pho
MFC after:	5 days
2010-11-24 12:25:17 +00:00
Konstantin Belousov
3157c50313 Eliminate the mab, maf arrays and related variables.
The change also fixes off-by-one error in the calculation of mreq.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	5 days
2010-11-21 10:18:28 +00:00
Alan Cox
17ea6f00d5 Optimize vm_object_terminate().
Reviewed by:	kib
MFC after:	1 week
2010-11-20 22:30:09 +00:00
Konstantin Belousov
4c7b9a2063 The runlen returned from vm_pageout_flush() might be zero legitimately,
when mreq page has status VM_PAGER_AGAIN.

MFC after:	5 days
2010-11-20 17:27:38 +00:00
Alan Cox
00f8bffc22 Reduce the amount of detail printed by vm_page_free_toq() when it panics.
Reviewed by:	kib
2010-11-19 17:49:08 +00:00
Max Laier
85f2a0c91e Off by one page in vm_reserv_reclaim_contig(): Also reclaim reservations
with only a single free page if that satisfies the requested size.

MFC after:	3 days
Reviewed by:	alc
2010-11-19 04:30:33 +00:00
Konstantin Belousov
1e8a675c73 vm_pageout_flush() might cache the pages that finished write to the
backing storage. Such pages might be then reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
write-protected still. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of successfully written run from vm_pageout_flush(),
that is, the count of pages between requested page and first page
after requested with status VM_PAGER_AGAIN. Supply the requested page
index in the array to vm_pageout_flush(). Use the returned run length
to forward the index of next page to clean in vm_object_page_clean().

Reported by:	avg
Reviewed by:	alc
MFC after:	1 week
2010-11-18 21:09:02 +00:00
Konstantin Belousov
4166faaee0 Only increment object generation count when inserting the page into
object page list.  The only use of object generation count now is a
restart of the scan in vm_object_page_clean(), which makes sense to do
on the page addition. Page removals do not affect the dirtiness of the
object, as well as manipulations with the shadow chain.

Suggested and reviewed by:	alc
MFC after:    1 week
2010-11-18 20:46:28 +00:00
Konstantin Belousov
7022f954c3 Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by:	    swell.k gmail com
MFC after:   1 week
2010-11-14 21:59:11 +00:00
Konstantin Belousov
94bce4535d Use symbolic names instead of hardcoding values for magic p_osrel constants.
MFC after:   1 week
2010-11-14 18:24:12 +00:00
Konstantin Belousov
9a6d144ff8 Implement a (soft) stack guard page for auto-growing stack mappings.
The unmapped page separates the tip of the stack and possible adjanced
segment, making some uses of stack overflow harder.  The stack growing
code refuses to expand the segment to the last page of the reseved
region when sysctl security.bsd.stack_guard_page is set to 1. The
default value for sysctl and accompanying tunable is 0.

Please note that mmap(MAP_FIXED) still can place a mapping right up to
the stack, making continuous region.

Reviewed by:	alc
MFC after:	1 week
2010-11-14 17:53:52 +00:00
Alan Cox
2cf36c8f67 Enable reservation-based physical memory allocation. Even without the
creation of large page mappings in the pmap, it can provide modest
performance benefits.  In particular, for a "buildworld" on a 2x 1GHz
Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system
time by 12.6%.

Tested by:	marius@
2010-11-10 17:57:34 +00:00
Alan Cox
e48262487a In case the stack size reaches its limit and its growth must be restricted,
ensure that grow_amount is a multiple of the page size.  Otherwise, the
kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and
produce a core file with a missing stack on FreeBSD 7.x.

Diagnosed and reported by: jilles
Reviewed by:	kib
MFC after:	1 week
2010-11-07 21:40:34 +00:00
Oleksandr Tymoshenko
903ba3da86 - Add minidump support for FreeBSD/mips 2010-11-07 03:09:02 +00:00
John Baldwin
e9a069d8af Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc().  This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by:	trema
Reviewed by:	attilio
MFC after:	1 week
2010-11-04 15:33:50 +00:00
Alan Cox
d689bc0082 Correct some format strings used by sysctls.
MFC after:	1 week
2010-10-30 18:00:53 +00:00
John Baldwin
1a587ef2a5 - Make 'vm_refcnt' volatile so that compilers won't be tempted to treat
its value as a loop invariant.  Currently this is a no-op because
  'atomic_cmpset_int()' clobbers all memory on current architectures.
- Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop
  a reference in vmspace_free().

Reviewed by:	alc
MFC after:	1 month
2010-10-21 17:29:32 +00:00
Andriy Gapon
55144670c2 PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments
Reviewed by:	alc
MFC after:	4 days
2010-10-20 05:17:23 +00:00
Matthew D Fleming
20ed0cb0c6 uma_zfree(zone, NULL) should do nothing, to match free(9).
Noticed by:	Ron Steinke <rsteinke at isilon dot com>
MFC after:	3 days
2010-10-19 16:06:00 +00:00
Lawrence Stewart
1c6cae9711 Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jhb
2010-10-16 04:41:45 +00:00
Lawrence Stewart
c4ae7908a7 - Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
  a zone. This is useful for providing stats via sysctl amongst other things.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jhb
MFC after:	2 weeks
2010-10-16 04:14:45 +00:00
Alan Cox
f8616ebfae If vm_map_find() is asked to allocate a superpage-aligned region of virtual
addresses that is greater than a superpage in size but not a multiple of
the superpage size, then vm_map_find() is not always expanding the kernel
pmap to support the last few small pages being allocated.  These failures
are not commonplace, so this was first noticed by someone porting FreeBSD
to a new architecture.  Previously, we grew the kernel page table in
vm_map_findspace() when we found the first available virtual address.
This works most of the time because we always grow the kernel pmap or page
table by an amount that is a multiple of the superpage size.  Now, instead,
we defer the call to pmap_growkernel() until we are committed to a range
of virtual addresses in vm_map_insert().  In general, there is another
reason to prefer calling pmap_growkernel() in vm_map_insert().  It makes
it possible for someone to do the equivalent of an mmap(MAP_FIXED) on the
kernel map.

Reported by:	Svatopluk Kraus
Reviewed by:	kib@
MFC after:	3 weeks
2010-10-04 16:49:40 +00:00
Matthew D Fleming
d69b01efc2 Replace an XXX comment with the appropriate code.
Submitted by:	alc
2010-09-20 20:41:59 +00:00
Alan Cox
da0483096d Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8.  This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR:		150260
MFC after:	3 weeks
2010-09-19 19:42:04 +00:00
Alan Cox
8304adaac6 Make refinements to r212824. In particular, don't make
vm_map_unlock_nodefer() part of the synchronization interface for maps.

Add comments to vm_map_unlock_and_wait() and vm_map_wakeup() describing
how they should be used.  In particular, describe the deferred deallocations
issue with vm_map_unlock_and_wait().

Redo the implementation of vm_map_unlock_and_wait() so that it passes
along the caller's file and line information, just like the other map
locking primitives.

Reviewed by:	kib
X-MFC after:	r212824
2010-09-19 17:43:22 +00:00
Konstantin Belousov
0b367bd8c0 Adopt the deferring of object deallocation for the deleted map entries
on map unlock to the lock downgrade and later read unlock operation.

System map entries cannot be backed by OBJT_VNODE objects, no need to
defer deallocation for them. Map entries from user maps do not require
the owner map for deallocation, and can be accumulated in the
thread-local list for freeing when a user map is unlocked.

Move the collection of entries for deferred reclamation into
vm_map_delete(). Create helper vm_map_process_deferred(), that is
called from locations where processing is feasible. Do not process
deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
held.

Reviewed by:	alc, rstone (previous versions)
Tested by:	pho
MFC after:	2 weeks
2010-09-18 15:03:31 +00:00
Matthew D Fleming
4e6571599b Re-add r212370 now that the LOR in powerpc64 has been resolved:
Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte.  This behaviour was preserved, though it should not be
necessary.

Reviewed by:    phk (original patch)
2010-09-16 16:13:12 +00:00
Matthew D Fleming
404a593e28 Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn
2010-09-13 18:48:23 +00:00
Matthew D Fleming
dd67e2103c Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte.  This behaviour was preserved, though it should not be necessary.

Reviewed by:	phk
2010-09-09 18:33:46 +00:00
Nathan Whitehorn
42768fec0f On architectures with non-tree-based page tables like PowerPC, every page
in a range must be checked when calling pmap_remove(). Calling
pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range
of the map could result in attempting to demap an extraordinary number
of pages (> 10^15), so iterate through each map entry and unmap each of
them individually.

MFC after:	6 weeks
2010-09-09 13:32:58 +00:00
Ryan Stone
d473d3a164 Fix a typo in r212281. uintptr -> uintptr_t
Pointy hat to:  rstone

Approved by:    emaste (mentor)
MFC after:      2 weeks
2010-09-07 02:51:11 +00:00
Ryan Stone
0d41964095 In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock.  This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map.  Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by:	fabient
Approved by:	emaste (mentor)
MFC after:	2 weeks
2010-09-07 00:23:45 +00:00
Andriy Gapon
a9b89cf1c1 vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup
vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump.
If opt_msgbuf.h is not included and MSGBUF_SIZE is overriden in kernel
config, then not all msgbuf pages will be dumped.  And most importantly,
struct msgbuf itself will not be included.  Thus the dump would look
corrupted/incomplete to tools like kgdb, dmesg, etc that try to access
struct msgbuf as one of the first things they do when working on a crash
dump.

MFC after:	5 days
2010-09-03 10:40:53 +00:00
Matthew D Fleming
a2a200a24d Have memguard(9) crash with an easier-to-debug message on double-free.
Reviewed by:    zml
MFC after:      3 weeks
2010-08-31 17:43:47 +00:00
Matthew D Fleming
6d3ed393d6 The realloc case for memguard(9) will copy too many bytes when
reallocating to a smaller-sized allocation.  Fix this issue.

Noticed by:     alc
Reviewed by:    alc
Approved by:    zml (mentor)
MFC after:      3 weeks
2010-08-31 16:57:58 +00:00
Alan Cox
74ffb9af15 Add the MAP_PREFAULT_READ option to mmap(2).
Reviewed by:	jhb, kib
2010-08-28 16:57:07 +00:00
Andre Oppermann
e49471b04b Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally).  This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value.  The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.

Reviewed by:	jeffr
MFC after:	1 week
2010-08-16 14:24:00 +00:00
Matthew D Fleming
f02d86e269 Fix compile. It seemed better to have memguard.c include opt_vm.h in
case future compile-time knobs were added that it wants to use.
Also add include guards and forward declarations to vm/memguard.h.

Approved by:    zml (mentor)
MFC after:      1 month
2010-08-12 16:54:43 +00:00
Matthew D Fleming
e3813573bd Rework memguard(9) to reserve significantly more KVA to detect
use-after-free over a longer time.  Also release the backing pages of
a guarded allocation at free(9) time to reduce the overhead of using
memguard(9).  Allow setting and varying the malloc type at run-time.
Add knobs to allow:

 - randomly guarding memory
 - adding un-backed KVA guard pages to detect underflow and overflow
 - a lower limit on the size of allocations that are guarded

Reviewed by:    alc
Reviewed by:    brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
Silence from:   -arch
Approved by:    zml (mentor)
MFC after:      1 month
2010-08-11 22:10:37 +00:00
Konstantin Belousov
3979450b4c Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with:	pho
MFC after:	1 month
2010-08-06 09:42:15 +00:00
John Baldwin
a3870a1826 Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy.  This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
  via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
  a CPU belongs to.  Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
  (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
  The MD code is required to populate an array of mem_affinity structures.
  Each entry in the array defines a range of memory (start and end) and a
  domain for the range.  Multiple entries may be present for a single
  domain.  The list is terminated by an entry where all fields are zero.
  This array of structures is used to split up phys_avail[] regions that
  fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
  used when fulfulling a physical memory allocation.  Right now the
  per-domain freelists are listed in a round-robin order for each domain.
  In the future a table such as the ACPI SLIT table may be used to order
  the per-domain lookup lists based on the penalty for each memory domain
  relative to a specific domain.  The lookup lists may be examined via a
  new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
  pick a lookup list when allocating memory.

Reviewed by:	alc
2010-07-27 20:33:50 +00:00
Edward Tomasz Napierala
fd6f4ffb27 Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.
2010-07-27 19:26:18 +00:00
Alan Cox
2af6e14d39 Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name.  The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by:	kib
2010-07-27 17:31:03 +00:00
Alan Cox
9e4e511499 Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer.  For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%.  It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings.  Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings.  While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map.  In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated.  The old size was one page too
large.  (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by:	kib
2010-07-25 17:43:38 +00:00
Jayachandran C.
49ca10d40c Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards  will introduce
a race condtion(noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the  function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by:	alc
2010-07-21 09:27:00 +00:00
Alan Cox
b99348e5ea Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit.  Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by:	kib
2010-07-09 19:38:30 +00:00
Konstantin Belousov
1d9e77f6bf Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the
flag is always provided, and unconditionally retry after sleep for the
busy page or failed allocation.

The intent is to remove VM_ALLOC_RETRY eventually.

Proposed and reviewed by:	alc
2010-07-08 08:37:51 +00:00
Konstantin Belousov
5f195aa32e Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by:	alc
MFC after:	2 weeks
2010-07-05 21:13:32 +00:00
Konstantin Belousov
757216f3a5 Several cleanups for the r209686:
- remove unused defines;
- remove unused curgeneration argument for vm_object_page_collect_flush();
- always assert that vm_object_page_clean() is called for OBJT_VNODE;
- move vm_page_find_least() into for() statement initial clause.

Submitted by:	alc
2010-07-04 19:02:32 +00:00
Konstantin Belousov
e239bb9730 Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by:	alc
Tested by:	pho
2010-07-04 11:26:56 +00:00
Konstantin Belousov
b382c10a57 Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by:	alc
Tested by:	pho
MFC after:	1 week
2010-07-04 11:13:33 +00:00
Alan Cox
b64400a03f Improve the comment and man page for vm_page_alloc(). Specifically,
document one of the optional flags; clarify which of the flags are
optional (and which are not), and remove mention of a restriction on
the reclamation of cached pages that no longer holds since version 7.

MFC after:	1 week
2010-07-03 18:25:37 +00:00
Alan Cox
cdaba1f2be Push down the acquisition of the page queues lock into
vm_pageout_page_stats().  In particular, avoid acquiring the page
queues lock unless iterating over the active queue.
2010-07-02 20:56:22 +00:00
Alan Cox
4b0640310a Use vm_page_prev() instead of vm_page_lookup() in the implementation of
vm_fault()'s automatic delete-behind heuristic.
vm_page_prev() is typically faster.
2010-07-02 19:59:18 +00:00
Alan Cox
9cf5198832 With the demise of page coloring, the page queue macros no longer serve any
useful purpose.  Eliminate them.

Reviewed by:	kib
2010-07-02 15:02:51 +00:00
Alan Cox
95976f3f38 Simplify entry to vm_pageout_clean(). Expect the page to be locked.
Previously, the caller unlocked the page, and vm_pageout_clean()
immediately reacquired the page lock.  Also, assert rather than test
that the page is neither busy nor held.  Since vm_pageout_clean() is
called with the object and page locked, the page can't have changed
state since the caller verified that the page is neither busy nor
held.
2010-06-30 17:20:33 +00:00
Alan Cox
91b4f42767 Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean().  When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.

Reviewed by:	kib@
MFC after:	3 weeks
2010-06-21 23:27:24 +00:00
Sean Bruno
bf96595915 Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by:	rwatson alc
Approved by:	scottl (mentor) peter
Obtained from:	Yahoo Inc.
2010-06-15 19:28:37 +00:00
Alan Cox
9ee2165f5d Eliminate checks for a page having a NULL object in vm_pageout_scan()
and vm_pageout_page_stats().  These checks were recently introduced by
the first page locking commit, r207410, but they are not needed.  At
the same time, eliminate some redundant accesses to the page's object
field.  (These accesses should have neen eliminated by r207410.)

Make the assertion in vm_page_flag_set() stricter.  Specifically, only
managed pages should have PG_WRITEABLE set.

Add a comment documenting an assertion to vm_page_flag_clear().

It has long been the case that fictitious pages have their wire count
permanently set to one.  Add comments to vm_page_wire() and
vm_page_unwire() documenting this.  Add assertions to these functions
as well.

Update the comment describing vm_page_unwire().  Much of the old
comment had little to do with vm_page_unwire(), but a lot to do with
_vm_page_deactivate().  Move relevant parts of the old comment to
_vm_page_deactivate().

Only pages that belong to an object can be paged out.  Therefore, it
is pointless for vm_page_unwire() to acquire the page queues lock and
enqueue such pages in one of the paging queues.  Generally speaking,
such pages are immediately freed after the call to vm_page_unwire().
Previously, it was the call to vm_page_free() that reacquired the page
queues lock and removed these pages from the paging queues.  Now, we
will never acquire the page queues lock for this case.  (It is also
worth noting that since both vm_page_unwire() and vm_page_free()
occurred with the page locked, the page daemon never saw the page with
its object field set to NULL.)

Change the panic with vm_page_unwire() to provide a more precise message.

Reviewed by:	kib@
2010-06-14 19:54:19 +00:00
John Baldwin
3aa6d94e0c Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
Alan Cox
ce18658792 Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines.  Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked.  Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick().  (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page.  Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete.  Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used.  Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by:	kib@
2010-06-10 16:56:35 +00:00
Jayachandran C.
17dca144a2 Make vm_contig_grow_cache() extern, and use it when vm_phys_alloc_contig()
fails to allocate MIPS page table pages.  The current usage of VM_WAIT in
case of vm_phys_alloc_contig() failure is not correct, because:

"There is no guarantee that any of the available free (or cached) pages
after the VM_WAIT will fall within the range of suitable physical
addresses.  Every time this function sleeps and a single page is freed
(or cached) by someone else, this function will be reawakened.  With
a little bad luck, you could spin indefinitely."

We also add low and high parameters to vm_contig_grow_cache() and
vm_contig_launder() so that we restrict vm_contig_launder() to the range
of pages we are interested in.

Reported by: alc

Reviewed by:	alc
Approved by:	rrs (mentor)
2010-06-04 06:35:36 +00:00
Konstantin Belousov
4d65036b4f Do not leak vm page lock in vm_contig_launder(), vm_pageout_page_lock()
always returns with the page locked.

Submitted by:	alc
Pointy hat to:	kib
2010-06-03 18:34:34 +00:00
Konstantin Belousov
2bbfbc3fe2 Add assertion and comment in vm_page_flag_set() describing the expectations
when the PG_WRITEABLE flag is set.

Reviewed by:	alc
2010-06-03 10:11:45 +00:00
Alan Cox
f4e10cdaa6 Maintain the pretense that we support 32KB pages for the sake of the ia64
LINT build.
2010-06-03 02:24:53 +00:00
Alan Cox
c8fa870982 Minimize the use of the page queues lock for synchronizing access to the
page's dirty field.  With the exception of one case, access to this field
is now synchronized by the object lock.
2010-06-02 15:46:37 +00:00
Alan Cox
8f0d5d3b9f When I pushed down the page queues lock into pmap_is_modified(), I created
an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls
vm_page_dirty() must perform the call first.  Otherwise, pmap_is_modified()
could return FALSE without acquiring the page queues lock because the page
is not (currently) writeable, and the caller to pmap_is_modified() might
believe that the page's dirty field is clear because it has not seen the
effect of the vm_page_dirty() call.

When I pushed down the page queues lock into pmap_is_modified(), I
overlooked one place where this ordering dependence is violated:
pmap_enter().  In a rare situation pmap_enter() can be called to replace a
dirty mapping to one page with a mapping to another page.  (I say rare
because replacements generally occur as a result of a copy-on-write fault,
and so the old page is not dirty.)  This change delays clearing PG_WRITEABLE
until after vm_page_dirty() has been called.

Fixing the ordering dependency also makes it easy to introduce a small
optimization: When pmap_enter() used to replace a mapping to one page with a
mapping to another page, it freed the pv entry for the first mapping and
later called the pv entry allocator for the new mapping.  Now, pmap_enter()
attempts to recycle the old pv entry, saving two calls to the pv entry
allocator.

There is no point in setting PG_WRITEABLE on unmanaged pages, so don't.
Update a comment to reflect this.

Tidy up the variable declarations at the start of pmap_enter().
2010-05-29 17:10:45 +00:00
Alan Cox
c46b90e90a Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced().  Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively.  In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting.  This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages().  This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition.  Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by:	kib
2010-05-26 18:00:44 +00:00
Alan Cox
e98d019d3c Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages().  It is no longer needed.

Submitted by:	kib
2010-05-25 02:26:25 +00:00
Alan Cox
567e51e18c Roughly half of a typical pmap_mincore() implementation is machine-
independent code.  Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified().  Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization.  Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean().  Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by:	kib (an earlier version)
2010-05-24 14:26:57 +00:00
Konstantin Belousov
a6e38685f3 When waiting for the busy page, do not unlock the object unless unlock
cannot be avoided.

Reviewed by:	alc
MFC after:	1 week
2010-05-20 08:51:01 +00:00
Alan Cox
aa12e8b71d The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty().  Perform some style clean up while I'm here.

Reviewed by:	kib
2010-05-18 16:40:29 +00:00
Alan Cox
9ab6032f73 On entry to pmap_enter(), assert that the page is busy. While I'm
here, make the style of assertion used by pmap_enter() consistent
across all architectures.

On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.

With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy.  Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.

Correct a long-standing bug in vm_page_cowsetup().  We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try.  Previously, the call to pmap_remove_write()
would have failed silently.
2010-05-16 23:45:10 +00:00
Alan Cox
a4bc2c8929 Correct an error of omission in r202897: Now that amd64 uses the direct map
to access the message buffer, we must explicitly request that the underlying
physical pages are included in a crash dump.

Reported by:	Benjamin Kaduk
2010-05-16 19:25:56 +00:00
Alan Cox
a1a95cd608 Add a comment about the proper use of vm_object_page_remove().
MFC after:	1 week
2010-05-16 16:54:05 +00:00
Alan Cox
b27753086c Update synchronization annotations for struct vm_page. Add a comment
explaining how the setting of PG_WRITEABLE is synchronized.
2010-05-11 01:29:18 +00:00
Konstantin Belousov
6e2175fc06 Continue cleaning the queue instead of moving to the next queue or
bailing out if acquisition of page lock caused page position in the
queue to change.

Pointed out by:	alc
2010-05-10 11:53:40 +00:00
Alan Cox
eee9d99231 Push down the acquisition of the page queues lock into vm_pageq_remove().
(This eliminates a surprising number of page queues lock acquisitions by
vm_fault() because the page's queue is PQ_NONE and thus the page queues
lock is not needed to remove the page from a queue.)
2010-05-09 16:55:42 +00:00
Alan Cox
db1f085eee Call vm_page_deactivate() rather than vm_page_dontneed() in
swp_pager_force_pagein().  By dirtying the page, swp_pager_force_pagein()
forces vm_page_dontneed() to insert the page at the head of the inactive
queue, just like vm_page_deactivate() does.  Moreover, because the page
was invalid, it can't have been mapped, and thus the other effect of
vm_page_dontneed(), clearing the page's reference bits has no effect.  In
summary, there is no reason to call vm_page_dontneed() since its effect
will be identical to calling the simpler vm_page_deactivate().
2010-05-09 16:27:42 +00:00
Alan Cox
d061cdd513 Remove the page queues lock around a call to vm_page_activate(). Make the
page dirty before adding it to the active queue.
2010-05-09 00:32:52 +00:00
Alan Cox
34e7251f10 Minimize the scope of the page queues lock in vm_fault(). 2010-05-08 21:35:51 +00:00
Alan Cox
3c4a24406b Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free().  Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped().  (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.
2010-05-08 20:34:01 +00:00
Jung-uk Kim
af394cfa36 Fix a typo in the previous commit. 2010-05-07 21:06:52 +00:00
Konstantin Belousov
af4b86b949 One more use for vm_pageout_init_marker().
Reviewed by:	alc
2010-05-07 18:57:26 +00:00
Alan Cox
97c3834772 Eliminate unnecessary page queues locking. 2010-05-07 16:22:06 +00:00
Alan Cox
03679e2334 Push down the page queues lock into vm_page_activate(). 2010-05-07 15:49:43 +00:00
Alan Cox
c1f98960a3 Update the synchronization requirements for the page usage count. 2010-05-07 06:58:53 +00:00
Alan Cox
7072188017 Eliminate acquisitions of the page queues lock that are no longer needed.
Switch to a per-processor counter for the number of pages freed during
process termination.
2010-05-07 05:23:15 +00:00
Alan Cox
9402dff3de Push down the page queues lock into vm_page_deactivate(). Eliminate an
incorrect comment.
2010-05-07 04:14:07 +00:00
Alan Cox
eb00b276ab Eliminate page queues locking around most calls to vm_page_free(). 2010-05-06 18:58:32 +00:00
Alan Cox
fd8c28bfdf Update a comment to say that access to a page's wire count is now
synchronized by the page lock.
2010-05-06 17:28:59 +00:00
Alan Cox
7024db1d40 Push down the page queues lock inside of vm_page_free_toq() and
pmap_page_is_mapped() in preparation for removing page queues locking
around calls to vm_page_free().  Setting aside the assertion that calls
pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page
queues lock just long enough to actually add or remove the page from the
paging queues.

Update vm_page_unhold() to reflect the above change.
2010-05-06 16:39:43 +00:00
Konstantin Belousov
8c6162468b Add a helper function vm_pageout_page_lock(), similar to tegge'
vm_pageout_fallback_object_lock(), to obtain the page lock
while having page queue lock locked, and still maintain the
page position in a queue.

Use the helper to lock the page in the pageout daemon and contig launder
iterators instead of skipping the page if its lock is contested.
Skipping locked pages easily causes pagedaemon or launder to not make a
progress with page cleaning.

Proposed and reviewed by:	alc
2010-05-06 04:57:33 +00:00
Alan Cox
5ac59343be Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held.  (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements.  It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib
2010-05-05 18:16:06 +00:00
Alan Cox
e3ef0d2fcf Push down the acquisition of the page queues lock into vm_page_unwire().
Update the comment describing which lock should be held on entry to
vm_page_wire().

Reviewed by:	kib
2010-05-05 03:45:46 +00:00
Alan Cox
a7283d3213 Add page locking to the vm_page_cow* functions.
Push down the acquisition and release of the page queues lock into
vm_page_wire().

Reviewed by:	kib
2010-05-04 15:55:41 +00:00
Alan Cox
0c41a69e71 Add lock assertions. 2010-05-04 05:55:19 +00:00
Konstantin Belousov
5637a59143 Handle busy status of the page in a way expected for pager_getpage().
Flush requested page, unbusy other pages, do not clear m->busy.

Reviewed by:	alc
MFC after:	1 week
2010-05-03 19:19:58 +00:00
Alan Cox
2d5d7f7f61 Acquire the page lock around vm_page_wire() in vm_page_grab().
Assert that the page lock is held in vm_page_wire().
2010-05-03 17:55:32 +00:00
Alan Cox
451033a48a It makes more sense for the object-based backend allocator to use OBJT_PHYS
objects instead of OBJT_DEFAULT objects because we never reclaim or pageout
the allocated pages.  Moreover, they are mapped with pmap_qenter(), which
creates unmanaged mappings.

Reviewed by:	kib
2010-05-03 17:35:31 +00:00
Alan Cox
746c2ddee8 The pages allocated by kmem_alloc_attr() and kmem_malloc() are unmanaged.
Consequently, neither the page lock nor the page queues lock is needed to
unwire and free them.
2010-05-03 07:08:16 +00:00
Alan Cox
9f2512bab5 Assert that the page queues lock is held in vm_page_remove() and
vm_page_unwire() only if the page is managed, i.e., pageable.
2010-05-03 07:00:50 +00:00
Alan Cox
b8d36afcfe Add page lock assertions where we access the page's hold_count. 2010-05-02 23:33:10 +00:00
Alan Cox
6c56db5c9e Eliminate an assignment that was made redundant by r207410. 2010-05-02 21:04:59 +00:00
Alan Cox
447fe2a4c6 Defer the acquisition of the page and page queues locks in
vm_pageout_object_deactivate_pages().
2010-05-02 20:46:17 +00:00
Alan Cox
f623e55269 Simplify vm_fault(). The introduction of the new page lock renders a bit of
cleverness by vm_fault() to avoid repeatedly releasing and reacquiring the
page queues lock pointless.

Reviewed by:	kib, kmacy
2010-05-02 20:24:25 +00:00
Alan Cox
ac800a8490 Correct an error in r207410: Remove an unlock of a lock that is no longer
held.
2010-05-02 18:09:33 +00:00
Alan Cox
b88b6c9d80 It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(),
to unconditionally set PG_REFERENCED on a page before sleeping.  In many
cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by
the page daemon, before the caller to vm_page_sleep() is reawakened.
Instead, we now explicitly set PG_REFERENCED in those cases where having
the page persist until the caller is awakened is clearly desirable.  Note,
however, that setting PG_REFERENCED on the page is still only a hint,
and not a guarantee that the page should persist.
2010-05-02 17:33:46 +00:00
Alan Cox
f6c8c187d4 This change addresses the race condition that was introduced by the previous
revision, r207450, to this file.  Specifically, between dropping the page
queues lock in vm_contig_launder() and reacquiring it in
vm_contig_launder_page(), the page may be removed from the active or
inactive queue.  It could be wired, freed, cached, etc.  None of which
vm_contig_launder_page() is prepared for.

Reviewed by:	kib, kmacy
2010-05-02 16:44:06 +00:00
Alan Cox
9b55fc0429 Correct an error of omission in r206819. If VMFS_TLB_ALIGNED_SPACE is
specified to vm_map_find(), then retry the vm_map_findspace() if
vm_map_insert() fails because the aligned space is already partly used.

Reported by:	Neel Natu
2010-05-02 01:25:03 +00:00
Kip Macy
0ce3ba8cd5 Update locking comment above vm_page:
- re-assign page queue lock "Q"
       - assign page lock "P"
       - update several uncommented fields
       - observe that hold_count is now protected by the page lock "P"
2010-05-01 03:41:21 +00:00
Kip Macy
7bec141b12 push up dropping of the page queue lock to avoid holding it in vm_pageout_flush 2010-04-30 22:31:37 +00:00
Kip Macy
ad0c05daf9 don't call vm_pageout_flush with the page queue mutex held
Reported by: Michael Butler
2010-04-30 21:21:21 +00:00
Kip Macy
6dd8b893d6 - acquire the page lock in vm_contig_launder_page before checking page fields
- release page queue lock before calling vm_pageout_flush
2010-04-30 21:20:14 +00:00
Kip Macy
e8f263195d - don't check hold_count without the page lock held
- don't leak the page lock if m->object is NULL
  (assuming that that check will in fact even be valid when m->object is protected by the page lock)
2010-04-30 19:40:37 +00:00
Konstantin Belousov
e20e8c1558 Unlock page lock instead of recursively locking it. 2010-04-30 16:20:14 +00:00
Kip Macy
6d74d042e3 don't allow unsynchronized free in vm_page_unhold 2010-04-30 02:46:49 +00:00
Kip Macy
2965a45315 On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib
2010-04-30 00:46:43 +00:00
Alan Cox
82bfb965d1 Simplify the inner loop of vm_pageout_object_deactivate_pages(). Rather
than checking each page for PG_UNMANAGED, check the vm object's type.
Only OBJT_PHYS can have unmanaged pages.  Eliminate a pointless counter.
The vm object is locked, that lock is never released by the inner loop,
and the set of pages contained by the vm object is not changed by the
inner loop.  Therefore, the counter serves no purpose.
2010-04-29 16:18:45 +00:00
Konstantin Belousov
6fb8c0c117 When doing kstack swapin, read as much pages in one run as possible.
Suggested and reviewed by:	alc (previous version)
Tested by:	pho
MFC after:	2 weeks
2010-04-29 09:59:16 +00:00
Konstantin Belousov
e86a87e97e In swap pager, do not free the non-requested pages from the run if they are
wired. Kstack pages are wired, this change prepares swap pager for handling
of long runs of kstack pages.

Noted and reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
2010-04-29 09:57:25 +00:00
Alan Cox
77d6d85393 Setting PG_REFERENCED on a page at the end of vm_fault() is redundant since
the page table entry's accessed bit is either preset by the immediately
preceding call to pmap_enter() or by hardware (or software) upon return
from vm_fault() when the faulting access is restarted.
2010-04-28 06:34:47 +00:00
Alan Cox
6a2a3d7338 Change vm_object_madvise() so that it checks whether the page is invalid
or unmanaged before acquiring the page queues lock.  Neither of these
tests require that lock.  Moreover, a better way of testing if the page
is unmanaged is to test the type of vm object.  This avoids a pointless
vm_page_lookup().

MFC after:	3 weeks
2010-04-28 04:57:32 +00:00
Alan Cox
7b85f59183 Resurrect pmap_is_referenced() and use it in mincore(). Essentially,
pmap_ts_referenced() is not always appropriate for checking whether or
not pages have been referenced because it clears any reference bits
that it encounters.  For example, in mincore(), clearing the reference
bits has two negative consequences.  First, it throws off the activity
count calculations performed by the page daemon.  Specifically, a page
on which mincore() has called pmap_ts_referenced() looks less active
to the page daemon than it should.  Consequently, the page could be
deactivated prematurely by the page daemon.  Arguably, this problem
could be fixed by having mincore() duplicate the activity count
calculation on the page.  However, there is a second problem for which
that is not a solution.  In order to clear a reference on a 4KB page,
it may be necessary to demote a 2/4MB page mapping.  Thus, a mincore()
by one process can have the side effect of demoting a superpage
mapping within another process!
2010-04-24 17:32:52 +00:00
Alan Cox
5d4a7b7945 Eliminate an unnecessary call to pmap_remove_all(). If a page belongs to
an object whose reference count is zero, then that page cannot possibly
be mapped.
2010-04-20 04:16:39 +00:00
Alan Cox
7b7d5b6c58 vm_thread_swapout() can safely dirty the page before rather than after
acquiring the page queues lock.
2010-04-19 00:18:14 +00:00
Juli Mallett
ca596a25f0 o) Add a VM find-space option, VMFS_TLB_ALIGNED_SPACE, which searches the
address space for an address as aligned by the new pmap_align_tlb()
   function, which is for constraints imposed by the TLB. [1]
o) Add a kmem_alloc_nofault_space() function, which acts like
   kmem_alloc_nofault() but allows the caller to specify which find-space
   option to use. [1]
o) Use kmem_alloc_nofault_space() with VMFS_TLB_ALIGNED_SPACE to allocate the
   kernel stack address on MIPS. [1]
o) Make pmap_align_tlb() on MIPS align addresses so that they do not start on
   an odd boundary within the TLB, so that they are suitable for insertion as
   wired entries and do not have to share a TLB entry with another mapping,
   assuming they are appropriately-sized.
o) Eliminate md_realstack now that the kstack will be appropriately-aligned on
   MIPS.
o) Increase the number of guard pages to 2 so that we retain the proper
   alignment of the kstack address.

Reviewed by:	[1] alc
X-MFC-after:	Making sure alc has not come up with a better interface.
2010-04-18 22:32:07 +00:00
Alan Cox
b28889a2fc Remove a nonsensical test from vm_pageout_clean(). A page can't be in the
inactive queue and have a non-zero wire count.

Reviewed by:	kib
MFC after:	3 weeks
2010-04-18 21:29:28 +00:00
Alan Cox
4b9dd5d537 There is no justification for vm_object_split() setting PG_REFERENCED on a
page that it is going to sleep on.  Eliminate it.

MFC after:	3 weeks
2010-04-18 17:50:09 +00:00
Alan Cox
b11b56b55b In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on
that page only makes sense if the advice is MADV_WILLNEED.  In that case,
the intention is to activate the page, so discouraging the page daemon
from reclaiming the page makes sense.  In contrast, in the other cases,
MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage
the page daemon from reclaiming the page by setting PG_REFERENCED.

Wrap a nearby line.

Discussed with:	kib
MFC after:	3 weeks
2010-04-17 21:14:37 +00:00
Alan Cox
aefea7f519 In vm_object_backing_scan(), setting PG_REFERENCED on a page before
sleeping on that page is nonsensical.  Doing so reduces the likelihood
that the page daemon will reclaim the page before the thread waiting in
vm_object_backing_scan() is reawakened.  However, it does not guarantee
that the page is not reclaimed, so vm_object_backing_scan() restarts
after reawakening.  More importantly, this muddles the meaning of
PG_REFERENCED.  There is no reason to believe that the caller of
vm_object_backing_scan() is going to use (i.e., access) the contents of
the page.  There is especially no reason to believe that an access is
more likely because vm_object_backing_scan() had to sleep on the page.

Discussed with:	kib
MFC after:	3 weeks
2010-04-17 18:35:07 +00:00
Alan Cox
0b6ace4743 Setting PG_REFERENCED on the requested page in swap_pager_getpages() is
either redundant or harmful, depending on the caller.  For example, when
called by vm_fault(), it is redundant.  However, when called by
vm_thread_swapin(), it is harmful.  Specifically, if the thread is later
swapped out, having PG_REFERENCED set on its stack pages leads the page
daemon to reactivate these stack pages and delay their reclamation.

Reviewed by:	kib
MFC after:	3 weeks
2010-04-17 17:02:17 +00:00
Alan Cox
6c68c971cb Simplify vm_thread_swapin(). 2010-04-13 06:48:37 +00:00
Alan Cox
ac45ee97c9 Initialize the virtual memory-related resource limits in a single place.
Previously, one of these limits was initialized in two places to a
different value in each place.  Moreover, because an unsigned int was used
to represent the amount of pageable physical memory, some of these limits
were incorrectly initialized on 64-bit architectures.  (Currently, this
error is masked by login.conf's default settings.)

Make vm_thread_swapin() and vm_thread_swapout() static.

Submitted by:	bde (an earlier version)
Reviewed by:	kib
2010-04-11 16:26:07 +00:00
Alan Cox
8ef9d880e6 Introduce the function kmem_alloc_attr(), which allocates kernel virtual
memory with the specified physical attributes.  In particular, like
kmem_alloc_contig(), the caller can specify the physical address range
from which the physical pages are allocated and the memory attributes
(i.e., cache behavior) for these physical pages.  However, in contrast to
kmem_alloc_contig() or contigmalloc(), the physical pages that are
allocated by kmem_alloc_attr() are not necessarily physically contiguous.
This function is needed by DRM and VirtualBox.

Correct an error in the prototype for kmem_malloc().  The third argument
had the wrong type.

Tested by:	rnoland
MFC after:	3 days
2010-04-09 02:39:20 +00:00
Joel Dahl
c0587701ad Start copyright notice with /*- 2010-04-07 16:29:10 +00:00
Konstantin Belousov
3f1c4c4f31 When OOM searches for a process to kill, ignore the processes already
killed by OOM. When killed process waits for a page allocation, try to
satisfy the request as fast as possible.

This removes the often encountered deadlock, where OOM continously
selects the same victim process, that sleeps uninterruptibly waiting
for a page. The killed process may still sleep if page cannot be
obtained immediately, but testing has shown that system has much
higher chance to survive in OOM situation with the patch.

In collaboration with:	pho
Reviewed by:	alc
MFC after:	4 weeks
2010-04-06 10:43:01 +00:00
Alan Cox
f6d00b38c7 vm_reserv_alloc_page() should never be called on an OBJT_SG object, just as
it is never called on an OBJT_DEVICE object.  (This change should have been
included in r195840.)

Reported by:	dougb@, avg@
MFC after:	3 days
2010-04-05 06:23:31 +00:00
Alan Cox
92351f162e Make _vm_map_init() the one place where the vm map's pmap field is
initialized.

Reviewed by:	kib
2010-04-03 19:07:05 +00:00
Alan Cox
0ef12795b5 Re-enable the call to pmap_release() by vmspace_dofree(). The accounting
problem that is described in the comment has been addressed.

Submitted by:	kib
Tested by:	pho (a few months ago)
MFC after:	6 weeks
2010-04-03 16:20:22 +00:00
John Baldwin
5711bf30da Reject attempts to create a MAP_ANON mapping with a non-zero offset.
PR:		kern/71258
Submitted by:	Alexander Best
MFC after:	2 weeks
2010-03-23 21:08:07 +00:00
Kip Macy
1a23373ceb - enable alignment on amd64 only
- only align pcpu caches and the volatile portion of uma_zone
2010-03-22 22:39:32 +00:00
Kip Macy
6b4391d786 turn 205266 in to a no-op until the problem can be properly diagnosed 2010-03-18 20:30:25 +00:00
Kip Macy
5e4bb93cca Cache line align various structures and move volatile counters to
not share a cache line with (mostly) immutable state

Reviewed by:	jeff@
MFC after:	7 days
2010-03-17 21:18:28 +00:00
Konstantin Belousov
ddb16cfc32 Update comment for vm_page_alloc(9), listing all acceptable flags [1].
Note that the function does not sleep, it can block.

Submitted by:	Giovanni Trematerra <giovanni.trematerra gmail com> [1]
MFC after:	3 days
2010-02-27 17:09:28 +00:00
Konstantin Belousov
d7de6e2cb6 Remove write-only variable.
MFC after:	3 days
2010-02-22 16:00:56 +00:00
Alan Cox
fd776d18a0 Align the start of the clean submap to a superpage boundary. Although
no superpage mappings are created within the clean submap, aligning the
start of the clean submap helps to prevent interference with kmem_alloc()'s
use of superpages.
2010-02-21 22:23:13 +00:00
Konstantin Belousov
41c2274481 The MAP_ENTRY_NEEDS_COPY flag belongs to protoeflags, cow variable
uses different namespace.

Reported by:	Jonathan Anderson <jonathan.anderson cl cam ac uk>
MFC after:	3 days
2010-01-29 19:25:45 +00:00
Konstantin Belousov
b9f180d1de When a vnode-backed vm object is referenced, it increments the vnode
reference count, and decrements it on dereference. If referenced object
is deallocated, object type is reset to OBJT_DEAD. Consequently, all
vnode references that are owned by object references are never released.
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent leak.

Add an assertion to the vm_pageout() to make sure that we never get
reference on the vnode but then do not execute code to release it.

In collaboration with:	pho
Reviewed by:	alc
MFC after:	3 weeks
2010-01-17 21:26:14 +00:00
Robert Noland
cfd7bacef2 Update d_mmap() to accept vm_ooffset_t and vm_memattr_t.
This replaces d_mmap() with the d_mmap2() implementation and also
changes the type of offset to vm_ooffset_t.

Purge d_mmap2().

All driver modules will need to be rebuilt since D_VERSION is also
bumped.

Reviewed by:	jhb@
MFC after:	Not in this lifetime...
2009-12-29 21:51:28 +00:00
Antoine Brodin
13e403fdea (S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR:		137213
Submitted by:	Eygene Ryabinkin (initial version)
MFC after:	1 month
2009-12-28 22:56:30 +00:00
Konstantin Belousov
49e3050e6c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
Antoine Brodin
4e2d83fc4a Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.
MFC after:	1 month
2009-12-05 17:45:56 +00:00
Alan Cox
79f6ebe233 Properly synchronize the previous change. 2009-11-28 00:50:09 +00:00
Alan Cox
d8778512cf Support the new VM_PROT_COPY option on wired pages. The effect of which
is that a debugger can now set a breakpoint in a program that uses mlock(2)
on its text segment or mlockall(2) on its entire address space.
2009-11-27 22:08:29 +00:00
Alan Cox
e2997fea72 Simplify the invocation of vm_fault(). Specifically, eliminate the flag
VM_FAULT_DIRTY.  The information provided by this flag can be trivially
inferred by vm_fault().

Discussed with:	kib
2009-11-27 20:24:11 +00:00
Alan Cox
a6d42a0d62 Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages.  Text pages are not just write protected but they are also
copy-on-write.  VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy.  However, here is where things become
confused.  It is the debugger, not the process being debugged that requires
write access to the copied page.  Nonetheless, the copied page is being
mapped into the process with write access enabled.  In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page.  Whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access.  VM_PROT_COPY addresses
this problem.  The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.

Reviewed by:	kib
MFC after:	4 weeks
2009-11-26 05:16:07 +00:00
Alan Cox
2db65ab46e Simplify both the invocation and the implementation of vm_fault() for wiring
pages.

(Note: Claims made in the comments about the handling of breakpoints in
wired pages have been false for roughly a decade.  This and another bug
involving breakpoints will be fixed in coming changes.)

Reviewed by:	kib
2009-11-18 18:05:54 +00:00
Alan Cox
2dd02f4773 Eliminate an unnecessary #include. (This #include should have been removed
in r188331 when vnode_pager_lock() was eliminated.)
2009-11-04 03:12:56 +00:00
Alan Cox
86684848b6 Eliminate a bit of hackery from vm_fault(). The operations that this
hackery sought to prevent are now properly supported by vm_map_protect().
(See r198505.)

Reviewed by:	kib
2009-11-03 17:15:15 +00:00
Attilio Rao
1b9d701fee Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.

Reported by:	jeff (originally), kris (currently)
Reviewed by:	jhb
Tested by:	Giuseppe Cocomazzi <sbudella at email dot it>
2009-11-03 16:46:52 +00:00
Alan Cox
2fafce9e4c Avoid pointless calls to pmap_protect().
Reviewed by:	kib
2009-11-02 17:45:39 +00:00
Ivan Voras
8a9c731f13 Add sysctl documentation strings. The descriptions are derived
from tuning(7). One of the descriptions references tuning(7) because
it is too complex to adequatly describe here (it is not a simple
boolean sysctl) and users should be warned to that.

Reviewed by:	alc, kib
Approved by:	gnn (mentor)
2009-11-02 16:56:59 +00:00
Alan Cox
e4ed417a35 Correct an error in vm_fault_copy_entry() that has existed since the first
version of this file.  When a process forks, any wired pages are immediately
copied because copy-on-write is not supported for wired pages.  In other
words, the child process is given its own private copy of each wired page
from its parent's address space.  Unfortunately, to date, these copied pages
have been mapped into the child's address space with the wrong permissions,
typically VM_PROT_ALL.  This change corrects the permissions.

Reviewed by:	kib
2009-10-31 17:39:56 +00:00
Konstantin Belousov
210a688642 When protection of wired read-only mapping is changed to read-write,
install new shadow object behind the map entry and copy the pages
from the underlying objects to it. This makes the mprotect(2) call to
actually perform the requested operation instead of silently do nothing
and return success, that causes SIGSEGV on later write access to the
mapping.

Reuse vm_fault_copy_entry() to do the copying, modifying it to behave
correctly when src_entry == dst_entry.

Reviewed by:	alc
MFC after:	3 weeks
2009-10-27 10:15:58 +00:00
Alan Cox
7afab86c3d Simplify the inner loop of vm_fault_copy_entry().
Reviewed by:	kib
2009-10-26 00:01:52 +00:00
Alan Cox
36930fc947 Eliminate an unnecessary check from vm_fault_prefault(). 2009-10-25 17:30:50 +00:00
Marcel Moolenaar
1a4fcaebe3 o Introduce vm_sync_icache() for making the I-cache coherent with
the memory or D-cache, depending on the semantics of the platform.
    vm_sync_icache() is basically a wrapper around pmap_sync_icache(),
    that translates the vm_map_t argumument to pmap_t.
o   Introduce pmap_sync_icache() to all PMAP implementation. For powerpc
    it replaces the pmap_page_executable() function, added to solve
    the I-cache problem in uiomove_fromphys().
o   In proc_rwmem() call vm_sync_icache() when writing to a page that
    has execute permissions. This assures that when breakpoints are
    written, the I-cache will be coherent and the process will actually
    hit the breakpoint.
o   This also fixes the Book-E PMAP implementation that was missing
    necessary locking while trying to deal with the I-cache coherency
    in pmap_enter() (read: mmu_booke_enter_locked).

The key property of this change is that the I-cache is made coherent
*after* writes have been done. Doing it in the PMAP layer when adding
or changing a mapping means that the I-cache is made coherent *before*
any writes happen. The difference is key when the I-cache prefetches.
2009-10-21 18:38:02 +00:00
Konstantin Belousov
5c0e1c11a3 Remove spurious call to priv_check(PRIV_VM_SWAP_NOQUOTA).
Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when per-uid limit is
actually exceed.

Both changes aim at calling priv_check(9) only for the cases when
privilege is actually exercised by the process.

Reported and tested by:	rwatson
Reviewed by:	alc
MFC after:	3 days
2009-10-18 12:55:39 +00:00
Alan Cox
e67e0775e6 Align and pad the page queue and free page queue locks so that the linker
can't possibly place them together within the same cache line.

MFC after:	3 weeks
2009-10-04 18:53:10 +00:00
Bjoern A. Zeeb
fc50832731 Back out the functional parts from r197537. After r197711, affecting all
user mappings, mmap no longer needs special treatment.
2009-10-02 17:51:46 +00:00
Konstantin Belousov
6fecb26b6b Move the annotation for vm_map_startup() immediately before the function.
MFC after:	3 days
2009-10-01 12:48:35 +00:00
Simon L. B. Nielsen
27bfa95847 Do not allow mmap with the MAP_FIXED argument to map at address zero.
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities.  While this of course does not fix vulnerabilities,
it does mitigate their impact.

Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.

This restriction can be disabled with the security.bsd.mmap_zero
sysctl variable.

Discussed with:	rwatson, bz
Tested by:	bz (Wine), simon (VirtualBox)
Submitted by:	jhb
2009-09-27 14:49:51 +00:00
Konstantin Belousov
497a82382b Old (a.out) rtld attempts to mmap zero-length region, e.g. when bss
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.

For a.out and pre-8 ELF binaries, allow the mmap of zero length.

Reported by:	tegge
Reviewed by:	tegge, alc, jhb
MFC after:	3 days
2009-09-20 12:40:56 +00:00
Konstantin Belousov
8a945d109c Reintroduce the r196640, after fixing the problem with my testing.
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by:	peter [1]
Discussed with:	jhb, julian, peter
Reviewed by:	jhb
Tested by:	pho (and retested according to new test scenarious)
MFC after:	1 week
2009-09-01 11:41:51 +00:00
Konstantin Belousov
f25fa6abb2 Reverse r196640 and r196644 for now. 2009-08-29 21:53:08 +00:00
Konstantin Belousov
c3cf0b476f Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by:	peter [1]
Discussed with:	jhb, julian, peter
Reviewed by:	jhb
Tested by:	pho
MFC after:	1 week
2009-08-29 13:28:02 +00:00
John Baldwin
58e279db02 Mark the fake pages constructed by the OBJT_SG pager valid. This was
accidentally lost at one point during the PAT development.  Without this
fix vm_pager_get_pages() was zeroing each of the pages.

Submitted by:	czander @ NVidia
MFC after:	3 days
2009-08-29 02:17:40 +00:00
John Baldwin
2fa8c8d21e Extend the device pager to support different memory attributes on different
pages in an object.
- Add a new variant of d_mmap() currently called d_mmap2() which accepts
  an additional in/out parameter that is the memory attribute to use for
  the requested page.
- A driver either uses d_mmap() or d_mmap2() for all requests but not both.
  The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate
  that the driver provides a d_mmap2() handler instead of d_mmap().  This
  is done to make the change ABI compatible with existing drivers and
  MFC'able to 7 and 8.

Submitted by:	alc
MFC after:	1 month
2009-08-28 14:06:55 +00:00
John Baldwin
63cbfee7b0 Remove debugging that crept in with previous commit.
Reported by:	nwhitehorn
Approved by:	re (kib)
2009-07-24 15:06:49 +00:00
John Baldwin
013818111a Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses.  The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-24 13:50:29 +00:00
Alan Cox
9861cbc6ca Change the handling of fictitious pages by pmap_page_set_memattr() on
amd64 and i386.  Essentially, fictitious pages provide a mechanism for
creating aliases for either normal or device-backed pages.  Therefore,
pmap_page_set_memattr() on a fictitious page needn't update the direct
map or flush the cache.  Such actions are the responsibility of the
"primary" instance of the page or the device driver that "owns" the
physical address.  For example, these actions are already performed by
pmap_mapdev().

The device pager needn't restore the memory attributes on a fictitious
page before releasing it.  It's now pointless.

Add pmap_page_set_memattr() to the Xen pmap.

Approved by:	re (kib)
2009-07-19 21:40:19 +00:00
Alan Cox
13de722155 An addendum to r195649, "Add support to the virtual memory system for
configuring machine-dependent memory attributes...":

Don't set the memory attribute for a "real" page that is allocated to
a device object in vm_page_alloc().  It is a pointless act, because
the device pager replaces this "real" page with a "fake" page and sets
the memory attribute on that "fake" page.

Eliminate pointless code from pmap_cache_bits() on amd64.

Employ the "Self Snoop" feature supported by some x86 processors to
avoid cache flushes in the pmap.

Approved by:	re (kib)
2009-07-18 01:50:05 +00:00
John Baldwin
0fe0ed8bf8 - Change mmap() to fail requests with EINVAL that pass a length of 0. This
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
  (such as > 2GB on 32-bit platforms).  The 'len' parameter is actually
  an unsigned 'size_t' so negative values don't really make sense.

Submitted by:	Alexander Best  alexbestms at math.uni-muenster.de
Reviewed by:	alc
Approved by:	re (kib)
MFC after:	1 week
2009-07-14 19:45:36 +00:00
Alan Cox
3153e878dd Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t.  The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes.  Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures.  The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map.  The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

  kmem_alloc_contig() can now be used to allocate kernel memory with
  non-default memory attributes on amd64 and i386.

  vm_page_alloc() and the device pager will set the memory attributes
  for the real or fictitious page according to the object's default
  memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386.  In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by:	re (kib)
2009-07-12 23:31:20 +00:00
Konstantin Belousov
529ab57b9a When VM_MAP_WIRE_HOLESOK is not specified and vm_map_wire(9) encounters
non-readable and non-executable map entry, the entry is skipped from
wiring and loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not
set for the map entry, its wired_count is later erronously decremented.
vm_map_delete(9) for such map entry stuck in "vmmaps".

Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.

Reported by:	John Marshall <john.marshall riverwillow com au>
Approved by:	re (kensmith)
2009-07-12 12:37:38 +00:00
Konstantin Belousov
121fd46175 When forking a vm space that has wired map entries, do not forget to
charge the objects created by vm_fault_copy_entry. The object charge
was set, but reserve not incremented.

Reported by:	Greg Rivers <gcr+freebsd-current tharned org>
Reviewed by:	alc (previous version)
Approved by:	re (kensmith)
2009-07-03 22:17:37 +00:00
Konstantin Belousov
9b4d473a6e Eliminiate code duplication by calling vm_object_destroy()
from vm_object_collapse().

Requested and reviewed by:	alc
Approved by:	re (kensmith)
2009-06-28 08:42:17 +00:00
Alan Cox
e999111ae7 This change is the next step in implementing the cache control functionality
required by video card drivers.  Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures.  In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig().  These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with:	jhb
2009-06-26 04:47:43 +00:00
Konstantin Belousov
9f80ce043d Change the type of uio_resid member of struct uio from int to ssize_t.
Note that this does not actually enable full-range i/o requests for
64 architectures, and is done now to update KBI only.

Tested by:	pho
Reviewed by:	jhb, bde (as part of the review of the bigger patch)
2009-06-25 18:46:30 +00:00
Konstantin Belousov
7a8af8eee2 Initialize the uip to silence gcc warning that seems to sneak in in some
build environments.

Reported by:	alc, bf1783 at googlemail com
2009-06-24 09:26:33 +00:00
Alan Cox
26f4eea53f The bits set in a page's dirty mask are a subset of the bits set in its
valid mask.  Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.

Eliminate an unnecessary #include.
2009-06-24 04:45:03 +00:00
Konstantin Belousov
3364c323e6 Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with:	pho
Reviewed by:	alc
Approved by:	re (kensmith)
2009-06-23 20:45:22 +00:00
Alan Cox
ce1b857f33 Validate the page in one place, dev_pager_getpages(), rather than doing it
in two places, dev_pager_getfake() and dev_pager_updatefake().

Compare a pointer to "NULL" rather than "0".
2009-06-22 19:09:48 +00:00
Alan Cox
ef327c3ee7 Implement a mechanism within vm_phys_alloc_contig() to defer all necessary
calls to vdrop() until after the free page queues lock is released.  This
eliminates repeatedly releasing and reacquiring the free page queues lock
each time the last cached page is reclaimed from a vnode-backed object.
2009-06-21 20:29:14 +00:00
Alan Cox
6f0489c670 Strive for greater consistency among the places that implement real,
fictious, and contiguous page allocation.  Eliminate unnecessary
reinitialization of a page's fields.
2009-06-21 00:21:33 +00:00
Andrew Thompson
f06a3a36ac Track the kernel mapping of a physical page by a new entry in vm_page
structure. When the page is shared, the kernel mapping becomes a special
type of managed page to force the cache off the page mappings. This is
needed to avoid stale entries on all ARM VIVT caches, and VIPT caches
with cache color issue.

Submitted by:	Mark Tinguely
Reviewed by:	alc
Tested by:	Grzegorz Bernacki, thompsa
2009-06-18 20:42:37 +00:00
Alan Cox
aea6e893ed Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an
unnecessary newline character from the end of two panic messages.)
2009-06-18 07:27:11 +00:00
Alan Cox
f0553fdbc4 Eliminate unnecessary forward declarations. 2009-06-17 20:12:23 +00:00
Alan Cox
d78200e4e8 Refactor contigmalloc() into two functions: a simple front-end that deals
with the malloc tag and calls a new back-end, kmem_alloc_contig(), that
allocates the pages and maps them.

The motivations for this change are two-fold: (1) A cache mode parameter
will be added to kmem_alloc_contig().  In other words, kmem_alloc_contig()
will be extended to support the allocation of memory with caller-specified
caching. (2) The UMA allocation function that is used by the two jumbo
frames zones can use kmem_alloc_contig() in place of contigmalloc() and
thereby avoid having free jumbo frames held by the zone counted as live
malloc()ed memory.
2009-06-17 17:19:48 +00:00
Alan Cox
2d59a004af Pass the size of the mapping to contigmapping() as a "vm_size_t" rather
than a "vm_pindex_t".  A "vm_size_t" is more convenient for it to use.
2009-06-17 07:11:38 +00:00
Alan Cox
ead1d027bd Make the maintenance of a page's valid bits by contigmalloc() more like
kmem_alloc() and kmem_malloc().  Specifically, defer the setting of the
page's valid bits until contigmapping() when the mapping is known to be
successful.
2009-06-17 04:57:32 +00:00
Alan Cox
387aabc513 Long, long ago in r27464 special case code for mapping device-backed
memory with 4MB pages was added to pmap_object_init_pt().  This code
assumes that the pages of a OBJT_DEVICE object are always physically
contiguous.  Unfortunately, this is not always the case.  For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
a OBJT_DEVICE object that violates this assumption.  Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous.  This
revision also changes some inconsistent if not buggy behavior.  For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid.  However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page.  The amd64 version has a bug of
my own creation.  It potentially busies the wrong page and always an
insufficent number of pages if it blocks allocating a page table page.

To my knowledge, there have been no reports of these bugs, hence,
their persistance.  I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms.  However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping devices pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity).  These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)

Discussed with:	jhb@
2009-06-14 19:51:43 +00:00
Alan Cox
53f55a430f Eliminate an unnecessary clearing of a page's dirty bits in
phys_pager_getpages().
2009-06-13 20:58:12 +00:00
Alan Cox
1136ed06b9 Eliminate an unnecessary restriction on the vm object type from
vm_map_pmap_enter().  The immediate effect of this change is that automatic
prefaulting by mmap() for small mappings is performed on POSIX shared memory
objects just the same as it is on ordinary files.
2009-06-09 17:04:39 +00:00
Alan Cox
0a2e596a93 Eliminate unnecessary obfuscation when testing a page's valid bits. 2009-06-07 19:38:26 +00:00
Alan Cox
fe6ad778fe Eliminate an unneeded forward declaration. (This should have been removed
in revision 1.42.)
2009-06-06 21:23:29 +00:00
Alan Cox
d1a6e42ddd If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to check
the page's valid bits.  The page is guaranteed to be fully valid.  (For the
record, this is documented in vm/vm_pager.h's comments.)
2009-06-06 20:13:14 +00:00
Alan Cox
7a122777c9 vm_thread_swapin() needn't validate any pages. The pages are already
validated by vm_pager_get_pages().
2009-06-05 17:06:20 +00:00
Alan Cox
42d9e2c4a6 Simplify contigfree(). 2009-06-05 16:55:10 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Alan Cox
3c33df624c Correct a boundary case error in the management of a page's dirty bits by
shm_dotruncate() and vnode_pager_setsize().  Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared.  Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
2009-06-02 08:02:27 +00:00
John Baldwin
64345f0b57 Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
  added to cdevsw.  This function is called for each mmap() request.
  If it returns ENODEV, then the mmap() request will fall back to using
  the device's device pager object and d_mmap().  Otherwise, the method
  can return a VM object to satisfy this entire mmap() request via
  *object.  It can also modify the starting offset into this object via
  *foff.  This allows device drivers to use the file offset as a cookie
  to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
  mapping V_CHR vnodes.  This avoids duplicating all the cdev mmap
  handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02.  Older device drivers
  using D_VERSION_01 are still supported.

MFC after:	1 month
2009-06-01 21:32:52 +00:00
Alan Cox
461c78604e Eliminate a stale comment and the two remaining uses of the "register"
keyword in this file.
2009-05-30 22:15:55 +00:00
Alan Cox
edd16ab140 Add assertions in two places where a page's valid or dirty bits are changed. 2009-05-30 22:06:58 +00:00
Alan Cox
a28042d1e3 Change vm_object_page_remove() such that it clears the page's dirty bits
when it invalidates the page.

Suggested by:	tegge
2009-05-28 07:26:36 +00:00
Alan Cox
b78ddb0b8a Revise vm_pageout_scan()'s handling of partially dirty pages. Specifically,
rather than unconditionally making partially dirty pages fully dirty, only
make partially dirty pages fully dirty if the pmap says that the page has
been modified.

(This change is also a small optimization.  It eliminate an unnecessary call
to pmap_is_modified() on pages that are mapped read only.)

Suggested by:	tegge
2009-05-28 06:52:14 +00:00
Kip Macy
e95d34711b - back out direct map hack
- it is no longer needed
2009-05-19 01:14:37 +00:00
Alan Cox
47916d0c37 Eliminate a pointless call to pmap_clear_reference() from vm_pageout_scan().
If the page belongs to an object with a reference count of zero, then it
can't have any managed mappings on which to clear a reference bit.
2009-05-17 20:40:41 +00:00
Kip Macy
32237d8492 apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map 2009-05-16 19:17:15 +00:00
Alan Cox
42eb41087c Eliminate unnecessary clearing of the page's dirty mask from various
getpages functions.

Eliminate a stale comment.
2009-05-15 04:33:35 +00:00
Alan Cox
1c1b26f276 Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does.  Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies.  Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by:	tegge
2009-05-13 05:39:39 +00:00
Alan Cox
12aa4fdca9 Eliminate gratuitous clearing of the page's dirty mask. 2009-05-12 05:49:02 +00:00
Alan Cox
0d53a17bde Fix a race involving vnode_pager_input_smlfs(). Specifically, in the case
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed.  Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed.  Reviewed by: tegge

Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs().  Instead, assert that the page is clean.
Reviewed by: tegge

Eliminate some blank lines.

Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old().  The page is not mapped.  Therefore, it cannot have
any page table entries that are modified.

Eliminate an incorrect comment from vnode_pager_generic_getpages().
2009-05-09 08:30:44 +00:00
Alan Cox
d7d9cfed36 Eliminate an incorrect comment. 2009-05-07 05:44:13 +00:00
Alan Cox
3a2cdcb0e3 Eliminate vnode_pager_input_smlfs()'s pointless call to pmap_clear_modify().
The page can't possibly have any modified page table entries because it
isn't even mapped.
2009-05-04 06:30:00 +00:00
Konstantin Belousov
7981aa2431 Use the acquired reference to the vmspace instead of direct dereferencing
of p->p_vmspace in a place where it was missed in r191277.

Noted by:  pluknet gmail com
2009-04-28 11:45:36 +00:00
Konstantin Belousov
8eb5a1cdee Fix typo. 2009-04-28 11:43:35 +00:00
Alan Cox
a80982c113 Eliminate an errant comment.
Discussed with:	tegge
2009-04-26 21:24:50 +00:00
Alan Cox
78cfe1f7bd Eliminate an archaic band-aid. The immediately preceding comment already
explains why the band-aid is unnecessary.

Suggested by:	tegge
2009-04-26 20:54:57 +00:00
Alan Cox
016a3c93b2 Eliminate unnecessary calls to pmap_clear_modify(). Specifically, calling
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access.  Instead, assert that the page is not
mapped or not mapped for write access as appropriate.

Eliminate unnecessary clearing of a page's dirty mask.  Instead, assert
that the page's dirty mask is clear.
2009-04-25 02:59:06 +00:00
Konstantin Belousov
bb2ac86f7d Do not call vm_page_lookup() from the ddb routine, namely from "show
vmopag" implementation. The vm_page_lookup() code modifies splay tree
of the object pages, and asserts that object lock is taken. First issue
could cause kernel data corruption, and second one instantly panics the
INVARIANTS-enabled kernel.

Take the advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.

While there, make the code slightly more style-compliant by moving
variables declarations to the right place.

Discussed with:	jhb, alc
Reviewed by:	alc
MFC after:	2 weeks
2009-04-23 21:09:47 +00:00
Konstantin Belousov
6bed074cd2 In both pageout oom handler and vm_daemon, acquire the reference to
the vmspace of the examined process instead of directly accessing its
vmspace, that may change. Also, as an optimization, check for P_INEXEC
flag before examining the process.

Reported and tested by:	pho (previous version)
Reviewed by:	alc
MFC after:	3 week
2009-04-19 20:53:47 +00:00
Alan Cox
f4b0c119c0 Calling pmap_clear_modify() after calling pmap_remove_write() is pointless.
The latter function already clears the modified status from each of the
page's mappings.
2009-04-19 07:18:08 +00:00
Alan Cox
f9855e177d Allow valid pages to be mapped for read access when they have a non-zero
busy count.  Only mappings that allow write access should be prevented by
a non-zero busy count.

(The prohibition on mapping pages for read access when they have a non-
zero busy count originated in revision 1.202 of i386/i386/pmap.c when
this code was a part of the pmap.)

Reviewed by:	tegge
2009-04-19 00:34:34 +00:00
Alan Cox
b9519926e6 Remove execute permission from the memory allocated by sbrk().
Pre-announced on: -arch (3/31/09)
Discussed with: rwatson
Tested by: marius (sparc64)
2009-04-11 22:34:08 +00:00
Alan Cox
ab5378cf11 Previously, when vm_page_free_toq() was performed on a page belonging to
a reservation, unless all of the reservation's pages were free, the
reservation was moved to the head of the partially-populated reservations
queue, where it would be the next reservation to be broken in case the
free page queues were emptied.  Now, instead, I am moving it to the tail.
Very likely this reservation is in the process of being freed in its
entirety, so placing it at the tail of the queue makes it more likely that
the underlying physical memory will be returned to the free page queues as
one contiguous chunk.  If a reservation must be broken, it will, instead,
be the longest unchanged reservation, which is arguably the reservation
that is least likely to ever achieve promotion or be freed in its entirety.

MFC after:	6 weeks
2009-04-11 09:09:00 +00:00
Konstantin Belousov
6d7e809123 When vm_map_wire(9) is allowed to skip holes in the wired region, skip
the mappings without any of read and execution rights, in particular,
the PROT_NONE entries. This makes mlockall(2) work for the process
address space that has such mappings.

Since protection mode of the entry may change between setting
MAP_ENTRY_IN_TRANSITION and final pass over the region that records
the wire status of the entries, allocate new map entry flag
MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.

Reported and tested by:	Hans Ottevanger <fbsdhackers beasties demon nl>
Reviewed by:	alc
MFC after:	3 weeks
2009-04-10 10:16:03 +00:00
Alan Cox
beb3c3a9c5 Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,
but I see no benefit from it today.

VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission.  On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission.  The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE.  (See vm/vm_mmap.c
revision 1.56.)

Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC.  In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list.  So,
the need for coalescing vm map entries is not as great as it once was.
2009-04-04 23:12:14 +00:00
Alan Cox
a7f9bae19e Eliminate dead code.
Reviewed by:	jhb
2009-04-01 04:36:37 +00:00
John Baldwin
5bd65606f4 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
Alan Cox
5758fe7185 Prior to r188331 a map entry's last read offset was only updated by a hard
fault.  In r188331 this update was relocated because of synchronization
changes to a place where it would occur on both hard and soft faults.  This
change again restricts the update to hard faults.
2009-02-25 07:52:53 +00:00
Konstantin Belousov
655c349022 Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by:	pho
Reviewed by:	alc
2009-02-24 20:57:43 +00:00