Commit Graph

2575 Commits

Author SHA1 Message Date
Alan Cox
12aa4fdca9 Eliminate gratuitous clearing of the page's dirty mask. 2009-05-12 05:49:02 +00:00
Alan Cox
0d53a17bde Fix a race involving vnode_pager_input_smlfs(). Specifically, in the case
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed.  Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed.  Reviewed by: tegge

Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs().  Instead, assert that the page is clean.
Reviewed by: tegge

Eliminate some blank lines.

Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old().  The page is not mapped.  Therefore, it cannot have
any page table entries that are modified.

Eliminate an incorrect comment from vnode_pager_generic_getpages().
2009-05-09 08:30:44 +00:00
Alan Cox
d7d9cfed36 Eliminate an incorrect comment. 2009-05-07 05:44:13 +00:00
Alan Cox
3a2cdcb0e3 Eliminate vnode_pager_input_smlfs()'s pointless call to pmap_clear_modify().
The page can't possibly have any modified page table entries because it
isn't even mapped.
2009-05-04 06:30:00 +00:00
Konstantin Belousov
7981aa2431 Use the acquired reference to the vmspace instead of direct dereferencing
of p->p_vmspace in a place where it was missed in r191277.

Noted by:  pluknet gmail com
2009-04-28 11:45:36 +00:00
Konstantin Belousov
8eb5a1cdee Fix typo. 2009-04-28 11:43:35 +00:00
Alan Cox
a80982c113 Eliminate an errant comment.
Discussed with:	tegge
2009-04-26 21:24:50 +00:00
Alan Cox
78cfe1f7bd Eliminate an archaic band-aid. The immediately preceding comment already
explains why the band-aid is unnecessary.

Suggested by:	tegge
2009-04-26 20:54:57 +00:00
Alan Cox
016a3c93b2 Eliminate unnecessary calls to pmap_clear_modify(). Specifically, calling
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access.  Instead, assert that the page is not
mapped or not mapped for write access as appropriate.

Eliminate unnecessary clearing of a page's dirty mask.  Instead, assert
that the page's dirty mask is clear.
2009-04-25 02:59:06 +00:00
Konstantin Belousov
bb2ac86f7d Do not call vm_page_lookup() from the ddb routine, namely from "show
vmopag" implementation. The vm_page_lookup() code modifies splay tree
of the object pages, and asserts that object lock is taken. First issue
could cause kernel data corruption, and second one instantly panics the
INVARIANTS-enabled kernel.

Take the advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.

While there, make the code slightly more style-compliant by moving
variables declarations to the right place.

Discussed with:	jhb, alc
Reviewed by:	alc
MFC after:	2 weeks
2009-04-23 21:09:47 +00:00
Konstantin Belousov
6bed074cd2 In both pageout oom handler and vm_daemon, acquire the reference to
the vmspace of the examined process instead of directly accessing its
vmspace, that may change. Also, as an optimization, check for P_INEXEC
flag before examining the process.

Reported and tested by:	pho (previous version)
Reviewed by:	alc
MFC after:	3 week
2009-04-19 20:53:47 +00:00
Alan Cox
f4b0c119c0 Calling pmap_clear_modify() after calling pmap_remove_write() is pointless.
The latter function already clears the modified status from each of the
page's mappings.
2009-04-19 07:18:08 +00:00
Alan Cox
f9855e177d Allow valid pages to be mapped for read access when they have a non-zero
busy count.  Only mappings that allow write access should be prevented by
a non-zero busy count.

(The prohibition on mapping pages for read access when they have a non-
zero busy count originated in revision 1.202 of i386/i386/pmap.c when
this code was a part of the pmap.)

Reviewed by:	tegge
2009-04-19 00:34:34 +00:00
Alan Cox
b9519926e6 Remove execute permission from the memory allocated by sbrk().
Pre-announced on: -arch (3/31/09)
Discussed with: rwatson
Tested by: marius (sparc64)
2009-04-11 22:34:08 +00:00
Alan Cox
ab5378cf11 Previously, when vm_page_free_toq() was performed on a page belonging to
a reservation, unless all of the reservation's pages were free, the
reservation was moved to the head of the partially-populated reservations
queue, where it would be the next reservation to be broken in case the
free page queues were emptied.  Now, instead, I am moving it to the tail.
Very likely this reservation is in the process of being freed in its
entirety, so placing it at the tail of the queue makes it more likely that
the underlying physical memory will be returned to the free page queues as
one contiguous chunk.  If a reservation must be broken, it will, instead,
be the longest unchanged reservation, which is arguably the reservation
that is least likely to ever achieve promotion or be freed in its entirety.

MFC after:	6 weeks
2009-04-11 09:09:00 +00:00
Konstantin Belousov
6d7e809123 When vm_map_wire(9) is allowed to skip holes in the wired region, skip
the mappings without any of read and execution rights, in particular,
the PROT_NONE entries. This makes mlockall(2) work for the process
address space that has such mappings.

Since protection mode of the entry may change between setting
MAP_ENTRY_IN_TRANSITION and final pass over the region that records
the wire status of the entries, allocate new map entry flag
MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.

Reported and tested by:	Hans Ottevanger <fbsdhackers beasties demon nl>
Reviewed by:	alc
MFC after:	3 weeks
2009-04-10 10:16:03 +00:00
Alan Cox
beb3c3a9c5 Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,
but I see no benefit from it today.

VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission.  On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission.  The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE.  (See vm/vm_mmap.c
revision 1.56.)

Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC.  In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list.  So,
the need for coalescing vm map entries is not as great as it once was.
2009-04-04 23:12:14 +00:00
Alan Cox
a7f9bae19e Eliminate dead code.
Reviewed by:	jhb
2009-04-01 04:36:37 +00:00
John Baldwin
5bd65606f4 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
Alan Cox
5758fe7185 Prior to r188331 a map entry's last read offset was only updated by a hard
fault.  In r188331 this update was relocated because of synchronization
changes to a place where it would occur on both hard and soft faults.  This
change again restricts the update to hard faults.
2009-02-25 07:52:53 +00:00
Konstantin Belousov
655c349022 Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by:	pho
Reviewed by:	alc
2009-02-24 20:57:43 +00:00
Konstantin Belousov
3a0916b8ea Add the assertion macros for the map locks. Use them in several map
manipulation functions.

Tested by:	pho
Reviewed by:	alc
2009-02-24 20:43:29 +00:00
Konstantin Belousov
e608cc3c8d Update the comment after the r188334.
Reviewed by:	alc
2009-02-24 20:23:16 +00:00
Roman Divacky
af83f5d77c Change the functions to ANSI in those cases where it breaks promotion
to int rule. See ISO C Standard: SS6.7.5.3:15.

Approved by:	kib (mentor)
Reviewed by:	warner
Tested by:	silence on -current
2009-02-24 18:09:31 +00:00
Robert Watson
9309e63c1f Put debug.vm_lowmem sysctl under DIAGNOSTIC.
Submitted by:	sam
MFC after:	3 days
2009-02-23 23:30:17 +00:00
Robert Watson
86f087370b Add a debugging sysctl, debug.vm_lowmem, that when assigned a value of
1 will trigger a pass through the VM's low-memory handlers, such as
protocol and UMA drain routines.  This makes it easier to exercise
these otherwise rarely-invoked code paths.

MFC after:	3 days
2009-02-23 23:00:12 +00:00
Alan Cox
bfd9b137a0 Reduce the scope of the page queues lock in vm_object_page_remove().
MFC after:	1 week
2009-02-21 20:57:25 +00:00
Alan Cox
9d13a605d4 Eliminate stale comments. 2009-02-20 16:19:34 +00:00
Konstantin Belousov
cb61d6987e Comment out the assertion from r188321. It is not valid for nfs.
Reported by:	alc
2009-02-09 11:32:23 +00:00
Alan Cox
c722e407dc Avoid some cases of unnecessary page queues locking by vm_fault's delete-
behind heuristic.
2009-02-09 06:23:21 +00:00
Alan Cox
7b54b1a9f5 Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by:	kib
2009-02-08 22:17:24 +00:00
Konstantin Belousov
2fada4c2b3 Remove no longer valid comment.
Submitted by:	alc
2009-02-08 21:20:13 +00:00
Konstantin Belousov
b0994946c7 Improve comments, correct English.
Submitted by:	alc
2009-02-08 20:52:09 +00:00
Konstantin Belousov
897d81a020 Do not call vm_object_deallocate() from vm_map_delete(), because we
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into the freelist,
that is released by the new helper function vm_map_entry_free_freelist().

Reviewed by:	tegge, alc
Tested by:	pho
2009-02-08 20:39:17 +00:00
Konstantin Belousov
e53fa61bf2 In vm_map_sync(), do not call vm_object_sync() while holding map lock.
Reference object, drop the map lock, and then call vm_object_sync().
The object sync might require vnode lock for OBJT_VNODE type objects.

Reviewed by:	tegge
Tested by:	pho
2009-02-08 20:30:51 +00:00
Konstantin Belousov
d2bf64c309 Do not sleep for vnode lock while holding map lock in vm_fault. Try to
acquire vnode lock for OBJT_VNODE object after map lock is dropped.
Because we have the busy page(s) in the object, sleeping there would
result in deadlock with vnode resize. Try to get lock without sleeping,
and, if the attempt failed, drop the state, lock the vnode, and restart
the fault handler from the start with already locked vnode.

Because the vnode_pager_lock() function is inlined in vm_fault(),
axe it.

Based on suggestion by:	alc
Reviewed by:	tegge, alc
Tested by:	pho
2009-02-08 20:23:46 +00:00
Konstantin Belousov
7fd10fb3c7 Add the comments to vm_map_simplify_entry() and vmspace_fork(),
describing why several calls to vm_deallocate_object() with locked map
do not result in the acquisition of the vnode lock after map lock.

Suggested and reviewed by:	tegge
2009-02-08 20:00:33 +00:00
Konstantin Belousov
1fac7d7f35 Lock the new map in vmspace_fork(). The newly allocated map should not
be accessible outside vmspace_fork() yet, but locking it would satisfy
the protocol of the vm_map_entry_link() and other functions called
from vmspace_fork().

Use trylock that is supposedly cannot fail, to silence WITNESS warning
of the nested acquisition of the sx lock with the same name.

Suggested and reviewed by:	tegge
2009-02-08 19:55:03 +00:00
Konstantin Belousov
705f0a82c2 Assert that vnode is exclusively locked when its vm object is resized.
Reviewed by:	tegge
2009-02-08 19:44:50 +00:00
Konstantin Belousov
9f6acfd1a8 Do not leak the MAP_ENTRY_IN_TRANSITION flag when copying map entry
on fork. Otherwise, copied entry cannot be removed in the child map.

Reviewed by:	tegge
MFC after:	2 weeks
2009-02-08 19:41:08 +00:00
Konstantin Belousov
0d0be82a5d Style. 2009-02-08 19:37:01 +00:00
Jeff Roberson
e20a199fd5 - Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
   This is useful for cases such as NUMA or different layouts for the same
   memory type.
 - Provide a new api for adding new backend kegs to secondary zones.
 - Provide a new flag for adjusting the layout of zones to stagger
   allocations better across cache lines.

Sponsored by:	Nokia
2009-01-25 09:11:24 +00:00
John Baldwin
8a7ef10b71 - Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done
inside the SYSCTL() macros and thus does not need to be done for
  all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
  has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
2009-01-23 22:49:23 +00:00
John Baldwin
fa3de7700c Now that vfs_markatime() no longer requires an exclusive lock due to
the VOP_MARKATIME() changes, use a shared vnode lock for mmap().

Submitted by:	ups
2009-01-21 14:43:35 +00:00
Konstantin Belousov
641e2829b6 Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by:	Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by:	alc [2]
Reviewed by:	alc
MFC after:	2 weeks
2009-01-03 13:24:08 +00:00
Alan Cox
05a8c41419 Resurrect shared map locks allowing greater concurrency during some map
operations, such as page faults.

An earlier version of this change was ...

Reviewed by:	kib
Tested by:	pho
MFC after:	6 weeks
2009-01-01 00:31:46 +00:00
Alan Cox
e2abaaaa2b Update or eliminate some stale comments. 2008-12-31 05:44:05 +00:00
Alan Cox
7438d60b4b Avoid an unnecessary memory dereference in vm_map_entry_splay(). 2008-12-30 21:52:18 +00:00
Alan Cox
095104ac36 Style change to vm_map_lookup(): Eliminate a macro of dubious value. 2008-12-30 20:51:07 +00:00
Alan Cox
4c3ef59e3d Move the implementation of the vm map's fast path on address lookup from
vm_map_lookup{,_locked}() to vm_map_lookup_entry().  Having the fast path
in vm_map_lookup{,_locked}() limits its benefits to page faults.  Moving
it to vm_map_lookup_entry() extends its benefits to other operations on
the vm map.
2008-12-30 19:48:03 +00:00
Robert Noland
e9f541267d Fix printing of KASSERT message missed in r163604.
Approved by:	kib
2008-12-21 16:56:13 +00:00
Konstantin Belousov
6129343d5d Instead of forcing vn_start_write() to reset mp back to NULL for the
failed calls with non-NULL vp, explicitely clear mp after failure.

Tested by:	stass
Reviewed by:	tegge
PR:		123768
MFC after:	1 week
2008-11-16 21:57:54 +00:00
Rafal Jaworowski
8e321b7943 Support kernel crash mini dumps on ARM architecture.
Obtained from:	Juniper Networks, Semihalf
2008-11-06 16:20:27 +00:00
Giorgos Keramidas
2db63c5e38 Various comment nits, and typos. 2008-11-02 00:41:26 +00:00
Robert Watson
556c3162b9 Update mmap() comment: no more block devices, so no more block device
cache coherency questions.

MFC after:	3 days
2008-10-22 16:50:12 +00:00
Attilio Rao
0d7935fd01 Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
Konstantin Belousov
2025d69ba7 Move the code for doing out-of-memory grass from vm_pageout_scan()
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.

Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone
is exhausted.

Reviewed by:	alc
Tested by:	pho, jhb
MFC after:	2 weeks
2008-09-29 19:45:12 +00:00
Ed Maste
a8a478fce6 Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.
2008-09-26 18:44:40 +00:00
Konstantin Belousov
7818e0a545 Save previous content of the td_fpop before storing the current
filedescriptor into it. Make sure that td_fpop is NULL when calling
d_mmap from dev_pager_getpages().

Change guards against td_fpop field being non-NULL with private state
for another device, and against sudden clearing the td_fpop. This
could occur when either a driver method calls another driver through
the filedescriptor operation, or a page fault happen while driver is
writing to a memory backed by another driver.

Noted by:	rwatson
Tested by:	rnoland
MFC after:	3 days
2008-09-26 14:50:49 +00:00
Alan Cox
8d28bf04e2 Prevent an integer overflow in vm_pageout_page_stats() on machines with a
large number of physical pages.

PR:		126158
Submitted by:	Dmitry Tejblum
MFC after:	3 days
2008-09-21 18:01:34 +00:00
Konstantin Belousov
36b907893d Allow the d_mmap driver methods to use cdevpriv KPI during verification
phase of establishing mapping.

Discussed with:	rwatson, jhb, rnoland
Tested by:	rnoland
MFC after:	3 days
2008-09-20 19:56:02 +00:00
Attilio Rao
0359a12ead Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
Antoine Brodin
2f2ea10a07 Remove unused variable nosleepwithlocks.
PR:		126609
Submitted by:	Mateusz Guzik
MFC after:	1 month
X-MFC:		to stable/7 only, this variable is still used in stable/6
2008-08-23 12:40:07 +00:00
Nathan Whitehorn
f620b5bf45 Allow the MD UMA allocator to use VM routines like kmem_*(). Existing code requires MD allocator to be available early in the boot process, before the VM is fully available. This defines a new VM define (UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to become available at the same time as the default UMA allocator.
Approved by:	marcel (mentor)
2008-08-23 01:35:36 +00:00
Julian Elischer
ac957cd271 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
Kip Macy
4b34502e99 Work around differences in page allocation for initial page tables on xen
MFC after:	1 month
2008-08-17 23:40:29 +00:00
Ed Maste
4222358722 Fix REDZONE(9) on amd64 and perhaps other 64 bit targets -- ensure the space
that redzone adds to the allocation for storing its metadata is at least as
large as the metadata that it will store there.

Submitted by:	Nima Misaghian
2008-08-13 17:32:48 +00:00
John Baldwin
da7bbd2c08 If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup().  When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup.  However, this required
grabbing proc0's thread lock to perform the wakeup.  If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked.  The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up.  The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with:	jeff
Glanced at by:	sam
Tested by:	Jurgen Weber  jurgen - ish com au
MFC after:	2 weeks
2008-08-05 20:02:31 +00:00
Tom Rhodes
6bd9cb1c81 Fill in a few sysctl descriptions.
Reviewed by:	alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by:	alc
2008-08-03 14:26:15 +00:00
John Baldwin
2c3b410b3a One more whitespace nit. 2008-07-30 21:23:32 +00:00
John Baldwin
3cca4b6fe8 A few more whitespace fixes. 2008-07-30 21:18:08 +00:00
John Baldwin
3677ad363b If the kernel has run out of metadata for swap, then explicitly panic()
instead of emitting a warning before deadlocking.

MFC after:	1 month
2008-07-30 21:12:15 +00:00
Konstantin Belousov
24bbc85bf6 The behaviour of the lockmgr going back at least to the 4.4BSD-Lite2 was
to downgrade the exclusive lock to shared one when exclusive lock owner
requested shared lock. New lockmgr panics instead.

The vnode_pager_lock function requests shared lock on the vnode backing
the OBJT_VNODE, and can be called when the current thread already holds
an exlcusive lock on the vnode. For instance, it happens when handling
page fault from the VOP_WRITE() uiomove that writes to the file, with
the faulted in page fetched from the vm object backed by the same file.
We then get the situation described above.

Verify whether the vnode is already exclusively locked by the curthread
and request recursed exclusive vnode lock instead of shared, if true.

Reported by:	gallatin
Discussed with:	attilio
2008-07-30 18:16:06 +00:00
Alan Cox
fb272dc841 Eliminate stale comments from kmem_malloc(). 2008-07-18 17:41:31 +00:00
Konstantin Belousov
11041003c6 Use the VM_ALLOC_INTERRUPT for the page requests when allocating memory
for the bio for swapout write. It allows the page allocator to drain
free page list deeper. As result, a deadlock where pageout deamon sleeps
waiting for bio to be allocated for swapout is no more reproducable in
practice.

Alan said that M_USE_RESERVE shall be ressurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.

Tested by:	pho, kris
Reviewed by:	alc
No objections from:	phk
MFC after:	2 weeks (RELENG_7 only)
2008-07-11 11:27:42 +00:00
Alan Cox
b89eaf4e9f Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout
2008-07-05 19:34:33 +00:00
Alan Cox
5cfa90e902 Make preparations for increasing the size of the kernel virtual address space
on the amd64 architecture.  The amd64 architecture requires kernel code and
global variables to reside in the highest 2GB of the 64-bit virtual address
space.  Thus, the memory allocated during bootstrap, before the call to
kmem_init(), starts at KERNBASE, which is not necessarily the same as
VM_MIN_KERNEL_ADDRESS on amd64.
2008-06-22 04:54:27 +00:00
Alan Cox
c1f02198d1 KERNBASE is not necessarily an address within the kernel map, e.g.,
PowerPC/AIM.  Consequently, it should not be used to determine the maximum
number of kernel map entries.  Intead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.

Tested by:	marcel@ (PowerPC/AIM)
2008-06-21 21:02:13 +00:00
Stephan Uphoff
11be8415c9 Fix vm object creation locking to allow SHARED vnode locking for vnode_create_vobject.
(Not currently used)

Noticed by: kib@
2008-06-12 20:46:47 +00:00
Alan Cox
8bcd3b1998 Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE)
work.  (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple.  Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise().  This
function moves the page to the inactive queue.  Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon.  On the other hand, if it is dirty,
it is placed at the tail.  Let's further examine the case in which the page
is clean.  Recall that the page is at the head of the line for processing by
the page daemon.  The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon.  (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.)  The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced.  In effect, the madvise() was for naught.
The case in which the page was dirty is not too different.  Instead of being
laundered, the page is reactivated.

Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field.  So, MADV_FREE is always
executing the clean case above.

This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.

MFC after:	6 weeks
2008-06-06 18:38:43 +00:00
Alan Cox
ba3042115f To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped.  Specifically, it has
returned EINVAL if the entire range is not mapped.  There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement.  This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks
2008-05-24 21:57:16 +00:00
Stephan Uphoff
2ac78f0e1a Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this  change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
2008-05-20 19:05:43 +00:00
Alan Cox
1ec1304bdb Retire pmap_addr_hint(). It is no longer used. 2008-05-18 04:16:57 +00:00
Alan Cox
d0a83a83bf In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping.  Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address.  Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address.  If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated.  In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().
2008-05-17 19:32:48 +00:00
Alan Cox
e46cd4132c Preset a device object's alignment ("pg_color") based upon the
physical address of the device's memory.  This enables
pmap_align_superpage() to propose a virtual address for mapping the
device memory that permits the use of superpage mappings.
2008-05-17 16:26:34 +00:00
Alan Cox
f578838754 Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the
system may panic because there is no reservation structure corresponding to
the physical address of the device memory.

Reported by: Giorgos Keramidas
2008-05-15 18:52:31 +00:00
Alan Cox
6ac3ab7f98 Provide the new argument to kmem_suballoc(). 2008-05-10 23:39:27 +00:00
Alan Cox
3202ed7523 Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find().  (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().
2008-05-10 21:46:20 +00:00
Alan Cox
26c538ffcd Generalize vm_map_find(9)'s parameter "find_space". Specifically, add
support for VMFS_ALIGNED_SPACE, which requests the allocation of an
address range best suited to superpages.  The old options TRUE and FALSE
are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no
immediate need to update all of vm_map_find(9)'s callers.

While I'm here, correct a misstatement about vm_map_find(9)'s return
values in the man page.
2008-05-10 18:55:35 +00:00
Alan Cox
d3249b142b Introduce pmap_align_superpage(). It increases the starting virtual
address of the given mapping if a different alignment might result in more
superpage mappings.
2008-05-09 16:48:07 +00:00
Kip Macy
c8c7ad9260 add malloc flag to blist so that it can be used in ithread context
Reviewed by: alc, bsdimp
2008-05-05 19:48:54 +00:00
Alan Cox
2bc24aa956 Eliminate pointless casts from kmem_suballoc(). 2008-04-28 17:25:27 +00:00
Alan Cox
b8ca4ef2e3 vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.
2008-04-28 05:30:23 +00:00
Jeff Roberson
8df78c41d6 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00
Alan Cox
44aab2c3de Introduce vm_reserv_reclaim_contig(). This function is used by
contigmalloc(9) as a last resort to steal pages from an inactive,
partially-used superpage reservation.

Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and
refactor it so that a separate subroutine is responsible for breaking
the selected reservation.  This subroutine is also used by
vm_reserv_reclaim_contig().
2008-04-06 18:09:28 +00:00
Alan Cox
2fbced6574 Eliminate an unnecessary test from vm_phys_unfree_page(). 2008-04-05 05:02:53 +00:00
Alan Cox
c416972587 Update a comment to vm_map_pmap_enter(). 2008-04-04 19:14:58 +00:00
Alan Cox
7630c26507 Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.)  UMA_SLAB_KERNEL is now required by the jumbo frame
allocators.  Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object.  In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT.  This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL.  This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
2008-04-04 18:41:12 +00:00
Alan Cox
24dedba9f5 Eliminate an unnecessary printf() from kmem_suballoc(). The subsequent
panic() can be extended to convey the same information.
2008-03-30 20:08:59 +00:00
Jeff Roberson
52481a9a9d - Use vm_object_reference_locked() directly from
vm_object_reference().  This is intended to get rid of vget()
   consumers who don't wish to acquire a lock.  This is functionally
   the same as calling vref(). vm_object_reference_locked() already
   uses vref.

Discussed with:	alc
2008-03-29 07:06:13 +00:00