Commit Graph

3299 Commits

Author SHA1 Message Date
attilio
0d3b58aee0 MFC 2013-02-03 20:13:33 +00:00
glebius
e3e319a0b6 Fix typo in debug printf. 2013-01-29 19:06:16 +00:00
zont
b5edc96a84 - Add a system-wide counter of page faults that require I/O.
Reviewed by:	alc
MFC after:	2 weeks
2013-01-28 12:54:53 +00:00
zont
875b69507c - Add sysctls to show number of stats scans.
MFC after:	2 weeks
2013-01-28 12:20:20 +00:00
zont
b3905d7835 - Style.
MFC after:	2 weeks
2013-01-28 12:08:29 +00:00
zont
ee65990ea4 - Get rid of unused function vmspace_wired_count().
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-14 12:12:56 +00:00
zont
3b71bce613 - Improve readability of sys_obreak().
Suggested by:	alc
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-11 09:58:35 +00:00
zont
d2863e4c68 - Reduce kernel size by removing unnecessary pointer indirections.
GENERIC kernel size is reduced by 16 bytes and the RACCT kernel by 336 bytes.

Suggested by:	alc
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-10 12:43:58 +00:00
attilio
f458bac614 Remove vm_radix_lookupn() and its usage in the kernel. 2013-01-10 12:30:58 +00:00
ken
1abc90f894 Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.

This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.

The object handle is cast to a struct cdev *, and passed into
dev_rel().

That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device.  The loser of the race
deallocates its object at the end of the function.

The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address.  This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.

The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor().  That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel().  dev_rel() decrements the reference
count in the assumed struct cdev (a count which happens to be 0),
driving it negative and triggering the assertion in dev_rel() that
the reference count is greater than or equal to 0.

The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.

vm_object.h:	Add a struct cdev pointer to the VM object
		structure.

device_pager.c:	In cdev_pager_allocate(), populate the new cdev
		pointer.

		In dev_pager_dealloc(), use the new cdev pointer
		when calling the object's cdev_pg_dtor() routine.

Reviewed by:	kib
Sponsored by:	Spectra Logic Corporation
MFC after:	1 week
2013-01-09 16:48:38 +00:00
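A rough sketch of the shape of that fix (the devp.dev field name and the hook spelling are inferred from the message above, not quoted from the committed diff):

    /*
     * Sketch only.  Before the fix, the dtor received the object's handle,
     * but the race loser had already renamed its handle to the object's own
     * address, so the "handle" was no longer a struct cdev.
     */
    static void
    dev_pager_dealloc(vm_object_t object)
    {
            /* Old (buggy): object->handle may no longer be the cdev. */
            /* object->un_pager.devp.ops->cdev_pg_dtor(object->handle); */

            /* New: use the cdev pointer stored at allocation time. */
            object->un_pager.devp.ops->cdev_pg_dtor(object->un_pager.devp.dev);
    }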
attilio
fcadd67d75 MFC 2012-12-26 08:20:27 +00:00
glebius
1292747048 Comment fix: there is no ub_ptr; instead, explain the meaning of the uz_count
field verbally.
2012-12-21 10:09:45 +00:00
zont
15b694913e - Fix locked memory accounting for maps with MAP_WIREFUTURE flag.
- Add sysctl vm.old_mlock which may turn such accounting off.

Reviewed by:	avg, trasz
Approved by:	kib (mentor)
MFC after:	1 week
2012-12-18 07:35:01 +00:00
attilio
be719e9167 MFC 2012-12-11 00:07:19 +00:00
alc
02094caa2c In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages.  In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by:	kib
2012-12-09 00:32:38 +00:00
pjd
fc89492084 White-space cleanups. 2012-12-08 09:23:05 +00:00
pjd
a585ca9ec8 Implemented the uma_zone_set_warning(9) function, which sets a warning
that will be printed once the given zone becomes full and cannot allocate an
item.  The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:27:13 +00:00
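A minimal usage sketch of the new API (the zone, struct, and warning text are illustrative):

    /* Hedged sketch; only uma_zone_set_warning() is taken from the commit. */
    #include <sys/param.h>
    #include <vm/uma.h>

    struct foo { int f_val; };

    static uma_zone_t foo_zone;

    static void
    foo_zone_init(void)
    {
            foo_zone = uma_zcreate("foo", sizeof(struct foo),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
            /* Printed at most once every five minutes when the zone is full. */
            uma_zone_set_warning(foo_zone, "out of foo items");
    }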
alc
793b12af79 Add support for the (relatively) new object type OBJT_MGTDEVICE to
vm_object_set_memattr().  Also, add a "safety belt" so that
vm_object_set_memattr() doesn't silently modify undefined object types.

Reviewed by:	kib
MFC after:	10 days
2012-11-28 18:29:34 +00:00
alc
e44badfb9f Make a few small changes to vm_map_pmap_enter():
Add detail to the comment describing this function.  In particular,
describe what MAP_PREFAULT_PARTIAL does.

Eliminate the abrupt change in behavior when the specified address range
grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages.  Instead of
doing nothing, i.e., preloading no mappings whatsoever, map any resident
pages that fall within the start of the specified address range, i.e.,
[addr, addr + ulmin(size, ptoa(MAX_INIT_PT))).

Long ago, the vm object's list of resident pages was not ordered, so
this function had to choose between probing the global hash table of
all resident pages and iterating over the vm object's unordered list of
resident pages.  Now, the list is ordered, so there is no reason for
MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of
resident pages.

MFC after:	14 days
2012-11-25 19:42:36 +00:00
alc
e12b2ad698 Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO
were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the
allocated page when it happened to be pre-zeroed.

MFC after:	5 days
2012-11-21 06:26:18 +00:00
jh
25dd09b996 - Don't pass geom and provider names as format strings.
- Add __printflike() attributes.
- Remove an extra argument for the g_new_geomf() call in swapongeom_ev().

Reviewed by:	pjd
2012-11-20 12:32:18 +00:00
alc
1284a0383d Update a comment to reflect the elimination of the hold queue in r242300. 2012-11-17 04:00:19 +00:00
kib
bc5bfde14d Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h
to vm/vm_phys.h, where it belongs.

Requested and reviewed by:	alc
MFC after:	2 weeks
2012-11-16 05:55:56 +00:00
kib
75f2aa672f Explicitly state, using an assertion, that M_USE_RESERVE requires M_NOWAIT.
Reviewed by:	alc
MFC after:	2 weeks
2012-11-16 05:49:56 +00:00
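The assertion presumably reduces to a fragment like this inside the allocator (a sketch, not the committed text):

    /* Sketch: M_USE_RESERVE is only meaningful for non-sleeping requests. */
    KASSERT((flags & M_USE_RESERVE) == 0 || (flags & M_NOWAIT) != 0,
        ("malloc: M_USE_RESERVE requires M_NOWAIT"));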
kib
e8ae50d444 Flip the semantics of M_NOWAIT to only require that the allocation not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by:	"Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by:	alc, mdf (previous version)
Tested by:	pho (previous version)
MFC after:	2 weeks
2012-11-14 20:01:40 +00:00
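A sketch of what such a centralized translation could look like, following the semantics described above (the exact set of flags handled is an assumption):

    /* Sketch: translate malloc(9) M_* flags to vm_page_alloc() classes. */
    static inline int
    malloc2vm_flags(int malloc_flags)
    {
            int pflags;

            /* M_USE_RESERVE callers may perform the "deep drain". */
            pflags = (malloc_flags & M_USE_RESERVE) != 0 ?
                VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM;
            if ((malloc_flags & M_ZERO) != 0)
                    pflags |= VM_ALLOC_ZERO;
            /* Other M_* translations elided. */
            return (pflags);
    }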
alc
ff7333d33f Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by:	kib
2012-11-13 02:50:39 +00:00
attilio
7efc7fb950 Fix DDB command "show map XXX":
- Check that an argument is always available; otherwise the current map
  printed before recursing is garbage.
- Spit out a message if an argument is not provided.
- Remove unread nlines variable.
- Use an explicit recursive function, disassociated from the
  DB_SHOW_COMMAND() body, in order to make the prototype and recursion
  of the above-mentioned function clear.  The code is now much less
  obscure.

Submitted by:	gianni
2012-11-12 00:30:40 +00:00
kib
f16ea99007 r241025 fixed the case where a binary executed from a nullfs mount
could still be opened for write from the lower filesystem.  There
is a symmetric situation where the binary could already have file
descriptors opened for write, yet still be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Calling the VOPs instead of directly
accessing v_writecount provides the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
alc
60d5a532fb In general, we call pmap_remove_all() before calling vm_page_cache(). So,
the call to pmap_remove_all() within vm_page_cache() is usually redundant.
This change eliminates that call to pmap_remove_all() and introduces a
call to pmap_remove_all() before vm_page_cache() in the one place where
it didn't already exist.

When iterating over a paging queue, if the object containing the current
page has a zero reference count, then the page can't have any managed
mappings.  So, a call to pmap_remove_all() is pointless.

Change a panic() call in vm_page_cache() to a KASSERT().

MFC after:	6 weeks
2012-11-01 16:20:02 +00:00
attilio
d38d7bb245 Rework the known mutexes so that each stays on its own
cache line, avoiding manual frobbing by using
struct mtx_padalign.

The sole exceptions are the nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviewed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
alc
77582e8298 Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose.  When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD.  Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD.  In other words, PQ_HOLD is
used as a flag, not a queue.  So, this change replaces it with a flag.

To accommodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.

Reviewed by:	kib
2012-10-29 06:15:04 +00:00
trasz
edd0d84112 Remove useless check; vm_pindex_t is unsigned on all architectures.
CID:		3701
Found with:	Coverity Prevent
2012-10-28 20:03:57 +00:00
mdf
1bc1b805d7 Const-ify the zone name argument to uma_zcreate(9).
MFC after:	3 days
2012-10-26 17:51:05 +00:00
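The resulting prototype, roughly (see uma.h for the authoritative version):

    uma_zone_t uma_zcreate(const char *name, size_t size, uma_ctor ctor,
        uma_dtor dtor, uma_init uminit, uma_fini fini, int align,
        uint32_t flags);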
andre
a8b2ff5af7 Move each MTX_SYSINIT() next to its struct mtx declaration
to make their relationship more obvious, as done with the other such mutexes.
2012-10-26 17:31:35 +00:00
kib
ddaaa16d8b Commit the actual text provided by Alan, instead of the wrong update
in r242011.

MFC after:	1 week
2012-10-24 18:32:37 +00:00
kib
2a76567642 Dirty the newly copied anonymous pages after the wired region is
forked. Otherwise, pagedaemon might reclaim the page without saving
its content into the swap file, resulting in the valid content
being replaced by zeroes.

Reported and tested by:	pho
Reviewed and comment update by:	alc
MFC after:	1 week
2012-10-24 18:21:59 +00:00
attilio
64eaf39fd7 MFC 2012-10-22 21:26:36 +00:00
kib
560aa751e0 Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change, which does
not result in changes to the interface signatures.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
eadler
23c67e54ce Print flags as hex instead of an integer.
PR:		kern/168210
Submitted by:	linimon
Reviewed by:	alc
Approved by:	cperciva
MFC after:	3 days
2012-10-22 02:11:57 +00:00
alc
1df941a7f3 Move vm_page_requeue() to the only file that uses it.
MFC after:	3 weeks
2012-10-13 20:19:43 +00:00
alc
c84b1820ea Eliminate the conditional for releasing the page queues lock in
vm_page_sleep().  vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held.  These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after:	3 weeks
2012-10-13 18:46:46 +00:00
alc
271cefc5f6 Tidy up a bit:
Update some of the comments.  In particular, use "sleep" in preference to
"block" where appropriate.

Eliminate some unnecessary casts.

Make a few whitespace changes for consistency.

Reviewed by:	kib
MFC after:	3 days
2012-10-03 05:06:45 +00:00
kib
8f845e475e Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
alc
a0349df30f Address a race condition that was introduced in r238212. Unless the page
queues lock is acquired before the page lock is released, there is no
guarantee that the page will still be in that same page queue when
vm_page_requeue() is called.

Reported by:		pho
In collaboration with:	kib
MFC after:	3 days
2012-09-23 17:42:39 +00:00
attilio
be930066f1 MFC 2012-09-21 03:07:34 +00:00
kib
667e5e154a Plug the accounting leak for the wired pages when msync(MS_INVALIDATE)
is performed on a vnode mapping which is wired in another address space.

While there, explicitly assert that the page is unwired and zero the
wire_count instead of subtracting.  The condition is rechecked later in
vm_page_free(_toq) already.

Reported and tested by:	zont
Reviewed by:	alc (previous version)
MFC after:	1 week
2012-09-20 09:52:57 +00:00
glebius
e4b6b754eb If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by:	jeff
2012-09-18 20:28:55 +00:00
eadler
8600cbb5b6 Correct double "the the"
Approved by:	cperciva
MFC after:	3 days
2012-09-14 21:28:56 +00:00
zont
2b9b209471 - Simplify VM code by using vmspace_wired_count() for counting wired
memory of a process.

Reviewed by:	avg
Approved by:	kib (mentor)
MFC after:	2 weeks
2012-09-05 18:19:54 +00:00
des
71dbd73468 Whitespace cleanup. 2012-09-05 12:24:50 +00:00
des
3ed9c078db No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.

(for real this time)
2012-09-04 22:19:33 +00:00
des
dec17a5bb5 Revert previous commit, which was performed in the wrong tree. 2012-09-04 21:06:53 +00:00
des
627c3f1a6e No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.
2012-09-04 19:04:02 +00:00
zont
f93fc1d719 - After r240026 sgrowsiz should be used in a safer manner.
Approved by:	kib (mentor)
MFC after:	1 week
2012-09-03 09:34:46 +00:00
zont
2f4305a824 - Remove accounting of locked memory from vsunlock(9) that I missed in r239818.
Approved by:	kib (mentor)
2012-08-30 08:03:33 +00:00
zont
85dfc3b8b7 - Don't account locked memory to the current process in vslock(9).
There are two consumers of vslock(9): the sysctl code and the drm driver.  These
consumers use locked memory as transient memory; it doesn't belong
to the process's memory.

Suggested by:	avg
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	2 weeks
2012-08-29 11:23:59 +00:00
attilio
d3c5a80b69 MFC 2012-08-27 11:59:04 +00:00
pluknet
91ac59768e Typo in previous change: print half the theoretical maximum as the maximum
recommended amount.

Reported by:	<site freebsd at orientalsensation com>
Reviewed by:	des
2012-08-27 10:59:49 +00:00
glebius
b1ab314c3f Fix function name in keg_cachespread_init() assert. 2012-08-26 09:54:11 +00:00
des
5e88649166 - When running out of swzone, instead of spewing an error message every
tick until the situation is resolved (if ever), just print a single
  message when running out and another when space becomes available.

- When adding more swap, warn if the total amount exceeds half the
  theoretical maximum we can handle.
2012-08-16 08:29:49 +00:00
kib
ce7012daf6 For the old mmap syscall, when executing on amd64 or ia64, enforce
PROT_EXEC if prot is non-zero, the process is 32-bit, and the
kern.elf32.i386_read_exec sysctl is enabled.  This workaround is needed
for old i386 a.out binaries, where the dynamic linker did not specify
PROT_EXEC for the mapping of the text.

The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries,
but I reused the existing knob which already has the needed semantic.

MFC after:	1 week
2012-08-14 12:11:48 +00:00
kib
0d46b47153 Adjust r205536 by allowing a non-zero offset for anonymous
mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD
1.1.5.1 can issue such requests.

Reported and tested by:	Dan Plassche <dplassche@gmail.com>
MFC after:	1 week
2012-08-14 11:47:07 +00:00
kib
a3d0fb0175 Do not leave invalid pages in the object after a short read on
network file systems (not only NFS proper).  Short reads cause pages
other than the requested one, which were not filled by the read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity.  As a result, invalid pages that were not
requested are freed even if the read RPC indicated success.

Noted and reviewed by:	alc
MFC after:	1 week
2012-08-14 11:45:47 +00:00
alc
cd8266338a Never sleep on busy pages in vm_pageout_launder(), always skip them. Long
ago, sleeping on busy pages in vm_pageout_launder() made sense.  The call
to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages
blocked vm_pageout_launder() until the flush had completed.  However, in
CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was
changed to request synchronous I/O, but the sleep on busy pages was not
removed.
2012-08-07 04:48:14 +00:00
kib
cac2fe116f After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull in vm_param.h was removed.  The other big dependency of vm_page.h on
vm_param.h is the PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h in vm_page.h.  Include vm_param.h
explicitly in the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
kib
4259905d31 Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other than the requested one, for VOP_GETPAGES().

Reviewed by:	alc
MFC after:	1 week
2012-08-04 18:16:43 +00:00
attilio
c52a057b19 MFC 2012-08-03 15:58:05 +00:00
alc
5b4712b5a1 Inline vm_page_aflags_clear() and vm_page_aflags_set().
Add comments stating that neither these functions nor the flags that they
are used to manipulate are part of the KBI.
2012-08-03 01:48:15 +00:00
alc
ceefb8bf17 Eliminate an unneeded declaration. (I should have removed this as part
of r227568.)
2012-07-30 20:38:37 +00:00
kib
4f8212948b Do not requeue a held page or a page for which locking failed; just leave
them alone.

Process the act_count updates for the held pages in the vm_pageout
loop over the inactive queue, instead of refusing to do anything with
such pages.

Clarify the intent of the addl_page_shortage counter and change its
use for pages which are not processed in the loop according to the
description.

Reviewed by:	alc
MFC after:	2 weeks
2012-07-26 09:06:48 +00:00
alc
26fd7fb588 Addendum to r238604. If the inactive queue scan isn't restarted, then
the variable "addl_page_shortage_init" isn't needed.

X-MFC after:	r238604
2012-07-24 02:35:30 +00:00
kib
80c3756a6f Do not restart the scan of the inactive queue when a non-inactive page is
found.  Rather, we shall not find such pages on the inactive queue at all.

Requested and reviewed by:	alc
MFC after:    2 weeks
2012-07-18 21:47:50 +00:00
alc
e5949174d4 Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar
code resides.  Rename vm_contig_grow_cache() to vm_pageout_grow_cache().

Reviewed by:	kib
2012-07-18 05:21:34 +00:00
alc
ad2692aed9 Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP. 2012-07-17 02:36:59 +00:00
alc
8af6bec3e3 Various improvements to vm_contig_grow_cache(). Most notably, even when
it can't sleep, it can still move clean pages from the inactive queue to
the cache.  Also, when a page is cached, there is no need to restart the
scan.  The "next" page pointer held by vm_contig_launder() is still
valid.  Finally, add a comment summarizing what vm_contig_grow_cache()
does based upon the value of "tries".

MFC after:	3 weeks
2012-07-16 18:13:43 +00:00
alc
8f708ce433 Correct an off-by-one error in vm_reserv_alloc_contig() that resulted in
the last reservation of a multi-reservation allocation not being
initialized.
2012-07-15 21:46:19 +00:00
mdf
a42ef9b109 Fix a bug with memguard(9) on 32-bit architectures without a
VM_KMEM_MAX_SIZE.

The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit.  The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.

Found by:	Justin Hibbits
2012-07-15 20:29:48 +00:00
alc
2044619bd4 If vm_contig_grow_cache() is allowed to sleep, then invoke the vm_lowmem
handlers.
2012-07-14 20:14:03 +00:00
alc
c0341f5875 Move kmem_alloc_{attr,contig}() to vm/vm_kern.c, where similarly named
functions reside.  Correct the comment describing kmem_alloc_contig().
2012-07-14 18:10:44 +00:00
attilio
26c71bd174 Style. 2012-07-12 11:02:57 +00:00
attilio
34064c9bef Remove unused iterating functions. 2012-07-12 11:02:04 +00:00
attilio
cc7c02f8bf Merge from vmcontention 2012-07-12 02:15:06 +00:00
attilio
675a214708 MFC 2012-07-12 02:07:21 +00:00
attilio
74cb07ed81 Document the object type movements, related to swp_pager_copy(),
in vm_object_collapse() and vm_object_split().

In collaboration with:	alc
MFC after:	3 days
2012-07-11 01:04:59 +00:00
attilio
a23ed68137 - Move VM_RADIX_STACK into vm_object.c because it is the only consumer.
- Fold the check of the return value of vm_radix_lookup() directly
  into the while condition, removing the need for a spurious check.
2012-07-08 23:50:57 +00:00
attilio
593da627a7 MFC 2012-07-08 23:20:15 +00:00
attilio
65bce42709 MFC 2012-07-08 23:17:04 +00:00
kib
aa091fdb2a Avoid vm page queues lock leak after r238212.
Reported and tested by:	Michael Butler <imb protected-networks net>
Reviewed by:	alc
Pointy hat to:	kib
MFC after:	20 days
2012-07-08 18:04:26 +00:00
attilio
e237fbcafa Merge from vmcontention 2012-07-08 16:12:59 +00:00
attilio
9cc8fb25a6 MFC 2012-07-08 14:06:26 +00:00
attilio
ffa3f082ff - Split the cached and resident page trees into 2 distinct ones.
  This makes the RED/BLACK support go away and greatly simplifies the
  vm_radix functions used here.  This works because, with patricia trie
  support, the trie will be small enough that keeping 2 different tries
  will be efficient too.
- Reduce differences with head in places like the backing scan, where the
  optimizations used shuffled the code around a little bit.

Tested by:	flo, Andrea Barberio
2012-07-08 14:01:25 +00:00
kib
80dc0e94a4 Drop the page queues mutex on each iteration of vm_pageout_scan over the
inactive queue, unless a busy page is found.

Dropping the mutex often should allow other lock acquires to
proceed without waiting for the whole inactive scan to finish.  On machines
with a lot of physical memory the scan often needs to iterate a lot before it
finishes or finds a page which requires laundering, causing high
latency for other lock waiters.

Suggested and reviewed by:	alc
MFC after:	3 weeks
2012-07-07 19:39:08 +00:00
attilio
d659f0f6bf MFC 2012-07-07 17:55:27 +00:00
eadler
16452223a2 Add missing sleep stat increase
PR:		kern/168211
Submitted by:	linimon
Reviewed by:	alc
Approved by:	cperciva
MFC after:	3 days
2012-07-07 17:46:11 +00:00
kib
dbb42d9c5d Style.
Reviewed by:	alc (previous version)
MFC after:	1 week
2012-07-06 20:13:16 +00:00
jhb
ab100847da Honor db_pager_quit in 'show uma' and 'show malloc'.
MFC after:	1 month
2012-07-02 16:14:52 +00:00
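The usual DDB pattern being applied, roughly (the loop body is elided and the keg list name is an assumption):

    /* Sketch: stop iterating when the user quits the DDB pager. */
    DB_SHOW_COMMAND(uma, db_show_uma)
    {
            uma_keg_t kz;

            LIST_FOREACH(kz, &uma_kegs, uk_link) {
                    /* ... print one line per keg/zone ... */
                    if (db_pager_quit)
                            return;
            }
    }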
alc
c5e6daff9d Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
attilio
5e431fc401 Fix an inverted logical or.
Reported by:	pho
2012-06-24 01:28:36 +00:00
attilio
69fc91bff9 Fix a bug in the logic; otherwise exhausted is set every time that
outidx == cnt.

Reported by:	pho
2012-06-23 14:09:52 +00:00
attilio
5d9dd820d8 MFC 2012-06-23 02:08:15 +00:00
attilio
a8dc1f5270 Give vm_radix_lookupn() a way to specify that the whole range has been
exhausted while searching and when a "maximum" value is passed as end
(or end == 0).
This allows avoiding start address overflow while searching
through, and avoids a livelock with "start" wrapping around to "end".

Reported by:	pho (supposedly)
2012-06-23 01:30:51 +00:00
attilio
1053e2db44 Restart the scan from the busy page rather than 0. 2012-06-23 01:08:46 +00:00
attilio
295c086a4e Fix a bug where the start address is not correctly pointing to the
"next" index, scanning 2 times in a row the same object.
This was hidden because when cache and resident tries are merged
together there is a check to skip different objects in all the
vm_radix_lookupn() usages, in order to fix a race with RED nodes.
2012-06-22 22:45:34 +00:00
attilio
a7497af6fd - Add a comment explaining the locking of the cached pages pool held
by vm_objects.
- Add flags for the per-object lock and free pages queue mutex lock.
  Use the newly added flags to mark the cache root within the vm_object
  structure.

Please note that other vm_object members should be marked with correct
locking but they are left for other commits.

In collaboration with:	alc

MFC after:	3 days
2012-06-22 18:34:11 +00:00
alc
4d96d753fe Selectively inline vm_page_dirty(). 2012-06-20 23:25:47 +00:00
jhb
12f0fa63b4 Move the per-thread deferred user map entries list into a private list
in vm_map_process_deferred() which is then iterated to release map entries.
This avoids having a nested vm map unlock operation called from the loop
body attempt to recurse into vm_map_process_deferred().  This can happen if
the vm_map_remove() triggers the OOM killer.

Reviewed by:	alc, kib
MFC after:	1 week
2012-06-20 18:00:26 +00:00
attilio
5d0dc848b7 Do a more targeted check on the page cache and avoid checking the cache
pointer directly in vnode_pager_setsize() by using the newly introduced
vm_page_is_cached() function.

Reviewed by:	alc
MFC after:	2 weeks
X-MFC:		r234039,234064
2012-06-16 21:39:00 +00:00
alc
6eeaee04e4 The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer.  This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by:	kib
MFC after:	6 weeks
2012-06-16 18:56:19 +00:00
attilio
d692a159e6 MFC 2012-06-16 17:05:09 +00:00
kib
a280ada6e7 Use the previous stack entry protection and max protection to correctly
propagate the stack execution permissions when stack is grown down.

First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack
protection for current ABI, so the new stack entry was typically marked
executable always. Second, for non-main stack MAP_STACK mapping,
the PROT_ flags should be used which were specified at the mmap(2) call
time, and not sv_stackprot.

MFC after:	1 week
2012-06-10 11:31:50 +00:00
attilio
3fc8e26872 Introduce a new tree for dealing with cached pages separately and
remove the RED/BLACK concept.
This is based on the assumption that path-compressed tries will be
small and fast enough that a separate trie for cached pages will make
sense and will leave the trie code simple enough (along with removing
a lot of differences in the user-end code).
2012-06-09 12:25:30 +00:00
attilio
807db03f96 Revert r231027 and fix the prototype for vm_radix_remove().
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
2012-06-08 18:44:54 +00:00
attilio
56a6aa288e Revert r234033, 234007.
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
2012-06-08 18:11:21 +00:00
attilio
7b7f4887b9 Revert r236367.
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
2012-06-08 18:08:31 +00:00
attilio
85fb4c5799 MFC 2012-06-07 22:47:53 +00:00
eadler
f5ab1922a6 Revert r236380
PR:		kern/166780
Requested by:	many
Approved by:	cperciva (implicit)
2012-06-01 18:58:50 +00:00
attilio
e761e0c4bc MFC 2012-06-01 14:57:55 +00:00
eadler
01b578179f Add sysctl to query amount of swap space free
PR:		kern/166780
Submitted by:	Radim Kolar <hsn@sendmail.cz>
Approved by:	cperciva
MFC after:	1 week
2012-06-01 04:42:52 +00:00
attilio
ab9d63eba7 Simplify the insert path by using the same logic as vm_radix_remove() for
the recovery path. The bulk of vm_radix_remove() is put into a generic
function vm_radix_sweep() which allows 2 different modes (hard and soft):
the soft one will deal with half-constructed paths by cleaning them up.

Ideally all these complications should go away once a way to pre-allocate
is implemented, possibly by implementing path compression.

Requested and discussed with:	jeff
Tested by:	pho
2012-05-31 22:54:08 +00:00
emax
0984d7ec39 Tweak the condition for disabling allocation from per-CPU buckets in
a low memory situation.  I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stable at a value lower
than cnt.v_free_min and that caused massive performance drop.

Reviewed by:	alc
MFC after:	1 week
2012-05-23 18:56:29 +00:00
kib
187a8c5cd6 Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by:	Andrey Zonov <andrey zonov org>
MFC after:	1 week
2012-05-23 18:10:54 +00:00
avg
ea96526248 vm_pager_object_lookup: small performance optimization
do not needlessly lock an object if its handle doesn't match

Reviewed by:	kib, alc
MFC after:	1 week
2012-05-23 12:51:49 +00:00
andrew
39fa59acb9 Fix booting on ARM.
In PHYS_TO_VM_PAGE(), when VM_PHYSSEG_DENSE is set, the check for whether we are
past the end of vm_page_array was incorrect, causing it to return NULL.  This
value is then used in vm_phys_add_page causing a data abort.

Reviewed by:	alc, kib, imp
Tested by:	stas
2012-05-22 07:04:23 +00:00
nwhitehorn
e83623fb1f Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies
range operations like pmap_remove() and pmap_protect() as well as allowing
simple operations like pmap_extract() not to involve any global state.
This substantially reduces lock coverages for the global table lock and
improves concurrency.
2012-05-20 14:33:28 +00:00
kib
ac67bd3aa8 Do not double-reference the found vm object in cdev_pager_lookup().
vm_pager_object_lookup() already referenced the object.

Note that there are no in-tree consumers of cdev_pager_lookup().  The
only known user of the function is the i915 gem driver, which is not yet
imported. This should make the KPI change minor.

Submitted by:	avg
MFC after:	1 week
2012-05-18 10:23:47 +00:00
kib
81841e52c6 Add a new pager type, OBJT_MGTDEVICE.  It provides the device pager
which carries fictitious managed pages.  In particular, the consumers of
the new object type can remove all mappings of the device page with
pmap_remove_all().

The range of physical addresses used for fake page allocation shall be
registered with vm_phys_fictitious_reg_range() interface to allow the
PHYS_TO_VM_PAGE() to work in pmap.

Most likely, only i386 and amd64 pmaps can handle fictitious managed
pages right now.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:49:58 +00:00
kib
9ff1ec42a4 Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns a proper fictitious vm_page_t.  The range should be de-registered
after the consumer stops using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.

A hash container might be developed instead of range registration
interface, and fake pages could be put automatically into the hash,
where PHYS_TO_VM_PAGE() could look them up later.  This should be
considered before the MFC of the commit is done.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:42:56 +00:00
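A hedged usage sketch of the registration interface (addresses and the memory attribute are illustrative):

    /* Make PHYS_TO_VM_PAGE() return fake pages for a device aperture. */
    static int
    foo_attach_aperture(vm_paddr_t start, vm_paddr_t end)
    {
            return (vm_phys_fictitious_reg_range(start, end,
                VM_MEMATTR_WRITE_COMBINING));
    }

    /* De-register once the consumer has stopped using the range. */
    static void
    foo_detach_aperture(vm_paddr_t start, vm_paddr_t end)
    {
            vm_phys_fictitious_unreg_range(start, end);
    }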
kib
1dfd5258de Split the code that initializes the fake page's struct vm_page out of
vm_page_getfake() into a new interface, vm_page_initfake().  Handle the case of
fake page re-initialization with a changed memattr.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:34:22 +00:00
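A sketch of the getfake/initfake pairing described above (addresses and memattr values are illustrative):

    /* Allocate one fake page, then re-point it with a changed memattr. */
    static vm_page_t
    foo_fake_page(vm_paddr_t paddr, vm_paddr_t new_paddr)
    {
            vm_page_t m;

            m = vm_page_getfake(paddr, VM_MEMATTR_UNCACHEABLE);
            /* Re-initialization is now handled even if memattr differs. */
            vm_page_initfake(m, new_paddr, VM_MEMATTR_WRITE_COMBINING);
            return (m);
    }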
kib
e5df51c93a Assert that the page passed to vm_page_putfake() is unmanaged.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:27:51 +00:00
kib
af33014c48 Assert that fictitious or unmanaged pages do not appear on
active/inactive lists.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:24:46 +00:00
kib
d67fed001c Commit the change forgotten in r235356.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:10:18 +00:00
kib
84b22557f6 Make vm_page_array_size long.  Remove redundant zero initialization
for vm_page_array_size and nearby variables.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:03:06 +00:00
attilio
c72fe43a63 Add braces. 2012-05-12 19:54:57 +00:00
attilio
e5220032ec On 32-bit architectures KTR has a bug: it cannot correctly grok
64-bit numbers.  ktr_tracepoint() in fact casts all the passed values to
u_long, as that is what the ktr entries can handle.

However, we have to work a lot with vm_pindex_t, which is always 64 bits,
even on 32-bit architectures (the most notable case being i386).

Use macros to split the 64-bit printing into 32-bit chunks which
KTR can correctly handle.

Reported and tested by:	flo
2012-05-12 19:52:59 +00:00
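The splitting trick plausibly reduces to macros like these (the macro names and the CTR3 call are illustrative, not the committed ones):

    /* Sketch: pass a 64-bit pindex through KTR as two u_long halves. */
    #define KTR_64_HI(v)    ((u_long)((uint64_t)(v) >> 32))
    #define KTR_64_LO(v)    ((u_long)((uint64_t)(v) & 0xffffffff))

    CTR3(KTR_VM, "insert: obj %p pindex %lx:%lx",
        object, KTR_64_HI(pindex), KTR_64_LO(pindex));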
attilio
1b73802874 MFC 2012-05-12 19:26:15 +00:00
attilio
3bd53aaf3c - Fix a bug where lookupn can wrap around while looking for the pages
  to scan, returning an incorrect, very low address again.
- Stub out vm_lookup_foreach as it is not used.
2012-05-12 19:22:57 +00:00
alc
323d529ebe Give vm_fault()'s sequential access optimization a makeover.
There are two aspects to the sequential access optimization: (1) read ahead
of pages that are expected to be accessed in the near future and (2) unmap
and cache behind of pages that are not expected to be accessed again.  This
revision changes both aspects.

The read ahead optimization is now more effective.  It starts with the same
initial read window as before, but arithmetically grows the window on
sequential page faults.  This can yield increased read bandwidth.  For
example, on one of my machines, a program using mmap() to read a file that
is several times larger than the machine's physical memory takes about 17%
less time to complete.

The unmap and cache behind optimization is now more selectively applied.
The read ahead window must grow to its maximum size before unmap and cache
behind is performed.  This significantly reduces the number of times that
pages are unmapped and cached only to be reactivated a short time later.

The unmap and cache behind optimization now clears each page's referenced
flag.  Previously, in the case of dirty pages, if the containing file was
still mapped at the time that the page daemon examined the dirty pages,
they would be reactivated.

From a stylistic standpoint, this revision also cleanly separates the
implementation of the read ahead and unmap/cache behind optimizations.

Glanced at:	kib
MFC after:	2 weeks
2012-05-10 15:16:42 +00:00
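In outline, the arithmetic window growth might behave like this (every name and constant below is illustrative, not the committed code):

    /* Sketch: grow the read-ahead window on each sequential fault. */
    if (vaddr == entry->next_read) {                /* sequential fault */
            ahead = ulmin(ahead + READ_AHEAD_STEP, READ_AHEAD_MAX);
            /* Unmap/cache behind only once the window is at its maximum. */
            if (ahead == READ_AHEAD_MAX)
                    cache_behind(entry, vaddr);
    } else
            ahead = READ_AHEAD_INIT;                /* reset on random access */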
attilio
c7b668e647 MFC 2012-05-05 21:40:32 +00:00
nwhitehorn
d394621163 Avoid a lock order reversal in pmap_extract_and_hold() from relocking
the page. This PMAP requires an additional lock besides the PMAP lock
in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.

Suggested by:	kib
MFC after:	4 days
2012-04-22 17:58:30 +00:00
kib
bc03240549 When MAP_STACK mapping is created, the map entry is created only to
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.

Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.

Reported and tested by:	Sushanth Rai <sushanth_rai yahoo com>
Reviewed by:	alc
MFC after:	1 week
2012-04-21 18:36:53 +00:00
alc
ad5c747d1d As documented in vm_page.h, updates to the vm_page's flags no longer
require the page queues lock.

MFC after:	1 week
2012-04-21 18:26:16 +00:00
attilio
b721606b1f Fix a brain-o.
vm_page_lookup_cache() is not exported and cannot be used arbitrarily.

Reported by:	pho
2012-04-21 00:32:56 +00:00
attilio
af795c2c10 MFC 2012-04-11 14:54:06 +00:00
attilio
628004ddfb - Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use the new vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by:	alc
MFC after:	2 weeks
X-MFC:		r234039
2012-04-09 17:05:18 +00:00
alc
e30063988b Fix mincore(2) so that it reports PG_CACHED pages as resident.
MFC after:	2 weeks
2012-04-08 18:25:12 +00:00
alc
a317fe0618 If a page belonging a reservation is cached, then mark the reservation so
that it will be freed to the cache pool rather than the default pool.
Otherwise, the cached pages within the reservation may be recycled sooner
than necessary.

Reported by:	Andrey Zonov
2012-04-08 17:00:46 +00:00
attilio
c85d28efad Fix a typo.
Reported by:	pho
2012-04-08 11:04:08 +00:00
attilio
b07157326c MFC 2012-04-08 00:14:15 +00:00
attilio
a539d5c569 If we already had a page with pindex == 0 in the device pager object
(if not already fictitious), the code can panic when trying to first insert
a fictitious page because of the overridden pindex.

Fix this by applying the same spinning pattern as vm_page_rename().

Reported by:	pho
2012-04-07 23:58:49 +00:00
attilio
4bd8f1d188 Staticize vm_page_cache_remove().
Reviewed by:	alc
2012-04-06 20:34:00 +00:00
attilio
a729154742 MFC 2012-04-06 19:59:37 +00:00
attilio
ed06f155e1 Free a cached page rather than only removing it.
vm_page_cache_remove() should only be used in very few, specific
cases (and likely marked as static) where the caller is also going to take
care of the page flags appropriately; otherwise one can end up
with a corrupted page.

Reported by:	pho
2012-04-06 19:49:45 +00:00
nwhitehorn
9c79ae8eea Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction
caches, by invalidating kernel icaches only when needed and not flushing
user caches for shared pages.

Suggested by:	kib
MFC after:	2 weeks
2012-04-06 16:03:38 +00:00
jhb
5829de48d9 Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take.  The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.

Reviewed by:	kib
MFC after:	2 weeks
2012-04-05 17:13:14 +00:00
attilio
3a71812376 The cached_pages_counter is not reset even when shared pages are present
if the obj is not a vnode.

Fix this.

Reported and tested by:	flo
2012-04-03 20:06:07 +00:00
attilio
081936d842 MFC 2012-03-30 16:54:21 +00:00
mckusick
9a7982e5a0 Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib
2012-03-28 20:49:11 +00:00
alc
e02fd6b842 Handle spurious page faults that may occur in no-fault sections of the
kernel.

When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB.  In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB.  This is exactly as
recommended in AMD's documentation.  In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry.  In short, this saves us from
having to perform potentially costly TLB flushes.  In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry.  Usually, such spurious page faults are handled
by vm_fault() without incident.  However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault().  This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.

In collaboration with:	kib
Tested by:		gibbs (an earlier version)

I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.

MFC after:	1 week
2012-03-22 04:52:51 +00:00
jhb
102d1a0777 Bah, just revert my earlier change entirely. (Missed alc's request to do
this earlier.)

Requested by:	alc
2012-03-19 19:06:40 +00:00
attilio
7ee59361d5 MFC 2012-03-19 18:54:01 +00:00
jhb
9628d3dbf8 Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB.  Specifically, the inlined version of 'ptoa' applied to the 'int'
count of pages overflowed on 64-bit platforms.  While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
2012-03-19 18:47:34 +00:00
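The overflow is easy to reproduce in miniature (illustrative, not the committed diff):

    /* Sketch: why madvise(MADV_WILLNEED) broke past-4GB mappings. */
    static void
    ptoa_overflow_example(void)
    {
            int count = 0x100000;                   /* 2^20 pages = 4GB */
            vm_offset_t bad, good;

            bad = ptoa(count);                      /* int shift: high bits lost */
            good = ptoa((vm_pindex_t)count);        /* widen before shifting */
    }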
jhb
2e0db42a5f Alter the previous commit to use vm_size_t instead of vm_pindex_t.
vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t,
but a page index instead of a byte offset.
2012-03-19 18:43:44 +00:00
kib
2963c3c979 In vm_object_page_clean(), do not clear the OBJ_MIGHTBEDIRTY object flag
if the filesystem performed a short write and we are skipping the page
due to this.

Propagate the write error from the pager back to the callers of
vm_pageout_flush().  Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR:	kern/165927
Reviewed by:	alc
MFC after:	2 weeks
2012-03-17 23:00:32 +00:00
attilio
5725f63f57 MFC 2012-03-16 15:46:44 +00:00
attilio
f9319cf885 Fix the nodes allocator in architectures without direct-mapping:
- Fix bugs in the free path where the pages were not unwired and
  relevant locking wasn't acquired.
- Introduce the rnode_map, submap of kernel_map, where to allocate from.
  The reason is that, in architectures without direct-mapping,
  kmem_alloc*() will try to insert the newly created mapping while
  holding the vm_object lock introducing a LOR or lock recursion.
  rnode_map is however a leaf submap, thus there cannot be any
  deadlock.
  Notes: Size the submap to be, by default, around 64 MB and
  decrease the size of the nodes as the allocation will be much smaller
  (and when the compacting code in the vm_radix will be implemented this
  will aim for much less space to be used).  However note that the
  size of the submap can be changed at boot time via the
  hw.rnode_map_scale scaling factor.
- Use uma_zone_set_max() covering the size of the submap.

Tested by:	flo
2012-03-16 15:41:07 +00:00
jhb
1e515523f3 Pedantic nit: use vm_pindex_t instead of long for a count of pages. 2012-03-14 20:57:48 +00:00
flo
83ea608b00 IFC at r232948
Approved by:	attilio
2012-03-14 00:41:37 +00:00
jhb
19feaba08b Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
attilio
86fae10111 MFC 2012-03-07 11:18:38 +00:00
attilio
9e63566650 Fix a compile-time bug by adding a check just after the struct
definition.
2012-03-06 23:37:53 +00:00
alc
9181f45b9f Eliminate stale incorrect ARGSUSED comments.
Submitted by:	bde
2012-03-02 17:33:51 +00:00
attilio
df89a6a2db - Exclude vm_radix_shrink() from the interface but still retain the code,
as it can be useful.
- Make most of the interface private, as it does not need to be public right
  now.  This will help in making the nodes change with the arch while still
  avoiding namespace pollution.
2012-03-01 00:54:08 +00:00
attilio
3c5fbc2c09 MFC 2012-03-01 00:27:51 +00:00
alc
54c1d2e89a Simplify kmem_alloc() by eliminating code that existed on account of
external pagers in Mach.  FreeBSD doesn't implement external pagers.
Moreover, it doesn't page out the kernel object.  So, the reasons for
having this code don't hold.

Reviewed by:	kib
MFC after:	6 weeks
2012-02-29 05:41:29 +00:00
alc
867c58a8cb Simplify vm_mmap()'s control flow.
Add a comment describing what vm_mmap_to_errno() does.

Reviewed by:	kib
MFC after:	3 weeks
X-MFC after:	r232071
2012-02-25 21:06:39 +00:00
attilio
d4c43cbb8b MFC 2012-02-25 18:24:45 +00:00
alc
7d737c65b5 Simplify vmspace_fork()'s control flow by copying immutable data before
the vm map locks are acquired.  Also, eliminate redundant initialization
of the new vm map's timestamp.

Reviewed by:	kib
MFC after:	3 weeks
2012-02-25 17:49:59 +00:00
kib
8c39852ba5 Place the if() at the right location, to activate the v_writecount
accounting for shared writeable mappings for all filesystems, not only
for the bypass layers.

Submitted by:	alc
Pointy hat to:	kib
MFC after:	20 days
2012-02-24 10:41:58 +00:00
kib
f315a59476 Account the writeable shared mappings backed by file in the vnode
v_writecount.  Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry.  During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBUSY, as it
should be.

Noted by:	tegge
Reviewed by:	tegge (long time ago, early version), alc
Tested by:	pho
MFC after:	3 weeks
2012-02-23 21:07:16 +00:00
kib
ac9e4627bb Remove wrong comment.
Discussed with:	alc
MFC after:	3 days
2012-02-22 20:01:38 +00:00
alc
506ef17e7f When vm_mmap() is used to map a vm object into a kernel vm_map, it
makes no sense to check the size of the kernel vm_map against the
user-level resource limits for the calling process.

Reviewed by:	kib
2012-02-16 06:45:51 +00:00
attilio
d58aaf5046 MFC 2012-02-14 19:58:00 +00:00
kib
dacbfe950a Close a race due to dropping of the map lock between creating map entry
for a shared mapping and marking the entry for inheritance.
Other thread might execute vmspace_fork() in between (e.g. by fork(2)),
resulting in the mapping becoming private.

Noted and reviewed by:	alc
MFC after:	1 week
2012-02-11 17:29:07 +00:00
attilio
8fe59ec45c MFC 2012-02-10 18:31:35 +00:00
ed
28b4a002d6 Remove direct access to si_name.
Code should just use the devtoname() function to obtain the name of a
character device. Also add const keywords to pieces of code that need it
to build properly.

MFC after:	2 weeks
2012-02-10 12:35:57 +00:00
flo
1e497814c3 fix KTR consistency
I'm committing this on behalf of Attilio as he cannot access svn right now.
2012-02-05 18:55:20 +00:00
attilio
6587a6afdd Remove the panic from vm_radix_insert() and propagate the error to the
callers of vm_page_insert().

The default action for every caller is to unwind the operation,
except for vm_page_rename(), where this has proven impossible to do.
For that case, it just spins until the page can be allocated.
However, since vm_page_rename() is mostly rare (and this panic has
never been hit in the past), it is thought to be a very seldom thing
and not a potential performance factor.

The patch has been tested with an atomic counter returning NULL from
the zone allocator every 1/100000 allocations.  Via printf, I've verified
that a typical buildkernel could trigger this 30 times.  The patch
survived 2 hours of repeated buildkernel/world.

Several technical notes:
- The vm_page_insert() is moved, in several callers, closer to failure
  points.  This could be committed separately before vmcontention hits
  the tree just to verify -CURRENT is happy with it.
- vm_page_rename() does not need to have the page lock in the callers
  as it hides that as an implementation detail.  Do the locking internally.
- now vm_page_insert() returns an int, with 0 meaning everything was ok,
  thus KPI is broken by this patch.
2012-02-05 17:37:26 +00:00
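Under the new KPI, a caller would unwind roughly like this (the unwind action shown is illustrative):

    /* Sketch: vm_page_insert() can now fail instead of panicking. */
    if (vm_page_insert(m, object, pindex) != 0) {
            /* Radix node allocation failed: unwind the operation. */
            vm_page_free(m);
            return (NULL);
    }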
attilio
3454102b5b MFC 2012-02-04 17:18:16 +00:00
mav
3cc9e27b24 Fix NULL dereference panic on attempt to turn off (on system shutdown)
disconnected swap device.

This is a quick and imperfect solution, as the swap device will still be opened
and GEOM will not be able to destroy it.  The proper solution would be to
automatically turn off and close the disconnected swap device, but with the
existing code that would cause a panic if there is at least one page on the
device, even if it is an unimportant page of a user-level process.  It needs
some work.

Reviewed by:	kib@
MFC after:	1 week
2012-02-01 20:12:44 +00:00
attilio
1b454e6b83 Fix a bug in vm_radix_leaf() where the shifting start address can
wrap around at some point.
This bug is triggered very easily by indirect blocks in UFS, which grow
negative, resulting in very high counts.

In collaboration with:	flo
2012-01-29 16:44:21 +00:00
attilio
8bc5caadc8 Fix the format strings for the pindex members, as they should be treated
as uintmax_t for compatibility across 32- and 64-bit architectures.
2012-01-29 16:29:06 +00:00
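The portable idiom this implies:

    /* Cast to uintmax_t and print with %ju so 32- and 64-bit builds agree. */
    printf("pindex %ju\n", (uintmax_t)m->pindex);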
attilio
810afc9780 Make an assertion stronger and improve the printout for easier bug
catching when it is not possible to dump.
2012-01-29 16:11:25 +00:00
kmacy
84d434965a Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA, now entails the addition of
a single flag to kmem_alloc or uma zone creation.

Reviewed by:	alc, avg
MFC after:	2 weeks
2012-01-27 20:18:31 +00:00
nwhitehorn
bf2ee27f25 Revert r212360 now that PowerPC can handle large sparse arguments to
pmap_remove() (changed in r228412).

MFC after:	2 weeks
2012-01-17 00:31:09 +00:00
kib
247e21eaf0 Change the type of the paging_in_progress refcounter from u_short to
u_int.  With the auto-sized buffer cache on modern machines, UFS
metadata can generate more than 65535 pages belonging to the buffers
undergoing i/o, overflowing the counter.

Reported and tested by:	jimharris
Reviewed by:	alc
MFC after:	1 week
2012-01-10 18:05:44 +00:00
attilio
abdebe75b2 MFC 2012-01-05 23:12:19 +00:00
kib
aec33a2b90 Do not restart the scan in vm_object_page_clean() on the object
generation change if the requested mode is async.  The object generation is
only changed when the object is marked as OBJ_MIGHTBEDIRTY.  For async
mode it is enough to write each dirty page, not to guarantee that
all pages are clean after vm_object_page_clean() returns.

Diagnosed by:	truckman
Tested by:	flo
Reviewed by:	alc, truckman
MFC after:	2 weeks
2012-01-04 16:04:20 +00:00
attilio
7ba6fbeca7 Fix a spot missed during the last merge. 2012-01-01 21:46:16 +00:00
attilio
a4ffaeb982 MFC 2012-01-01 20:18:40 +00:00
alc
7f817ed8c5 Optimize vm_object_split()'s handling of reservations. 2011-12-28 20:27:18 +00:00
kib
c2a7f10253 Optimize the common case of msyncing the whole file mapping with
MS_SYNC flag. The system must guarantee that all writes are finished
before syscalls returned. Schedule the writes in async mode, which is
much faster and allows the clustering to occur. Wait for writes using
VOP_FSYNC(), since we are syncing the whole file mapping.

Potentially, the restriction to only apply the optimization can be
relaxed by not requiring that the mapping cover whole file, as it is
done by other OSes.

Reported and tested by:	 az
Reviewed by: alc
MFC after:   2 weeks
2011-12-23 09:09:42 +00:00
kib
cfa70889cb Move kstack_cache_entry into the private header, and make the
stack cache list header accessible outside vm_glue.c.

MFC after:	1 week
2011-12-16 10:56:16 +00:00
eadler
d29af4d1bc - The previous commit (r228449) accidentally moved the vm.stats.vm.* sysctls
to vm.stats.sys.  Move them back.

Noticed by:		pho
Reviewed by:	bde (earlier version)
Approved by:	bz
MFC after:		1 week
Pointy hat to:	me
2011-12-14 13:25:00 +00:00
eadler
3072a90209 Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR:		kern/155491
PR:		kern/155490
PR:		kern/155489
Submitted by:	Galimov Albert <wtfcrap@mail.ru>
Approved by:	bde
Reviewed by:	jhb
MFC after:	1 week
2011-12-13 00:38:50 +00:00
kib
2893f40afd Fix printf.
Submitted by:	az
MFC after:	1 week
2011-12-12 10:04:04 +00:00
attilio
1f27e97ae5 Use atomics for rn_count on leaf nodes because RED operations happen
without the VM_OBJECT_LOCK held, and thus can be concurrent with BLACK ones.
However, also use a write memory barrier in order to not reorder the
operation of decrementing rn_count with respect to fetching the pointer.

Discussed with:	jeff
2011-12-06 22:57:48 +00:00
attilio
2436e63a9c - Make rn_count 32 bits, as it will naturally pad for 32-bit arches.
- Avoid using atomics to manipulate it at level 0 because it seems
  unneeded and introduces a bug on big-endian architectures where only
  the top half (2 bytes) of the double-words is written (as sparc64,
  for example, doesn't support atomics at 16 bits), leading to wrong
  handling of rn_count.

Reported by:	flo, andreast
Found by:	marius
No answer by:	jeff
2011-12-06 19:04:45 +00:00
alc
a8855af4c0 Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to
use superpage reservations.  So, for the first time, kernel virtual memory
that is allocated by contigmalloc(), kmem_alloc_attr(), and
kmem_alloc_contig() can be promoted to superpages.  In fact, even a series
of small contigmalloc() allocations may collectively result in a promoted
superpage.
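
For example (sizes and flags illustrative), an allocation like the
following can now be backed by a superpage reservation:

    buf = contigmalloc(2 * 1024 * 1024, M_DEVBUF, M_WAITOK,
        0, ~(vm_paddr_t)0, PAGE_SIZE, 0);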

Eliminate some duplication of code in vm_reserv_alloc_page().

Change the type of vm_reserv_reclaim_contig()'s first parameter in order
that it be consistent with other vm_*_contig() functions.

Tested by:	marius (sparc64)
2011-12-05 18:29:25 +00:00
andreast
8c385e6008 Fix compilation issue on 32-bit targets.
Reviewed by:	attilio
2011-12-05 16:06:12 +00:00
attilio
8fbad61ab4 Revert a change that sneaked in during the last MFC. 2011-12-02 23:21:59 +00:00
attilio
b2701fb716 MFC 2011-12-02 21:45:46 +00:00
kib
d326d5565d Rename vm_page_set_valid() to vm_page_set_valid_range().
vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by:	attilio, alc
2011-11-30 17:39:00 +00:00
kib
a441abaf37 Hide the internals of vm_page_lock(9) from the loadable modules.
Since the address of the vm_page lock mutex depends on the kernel
options, it is easy for a module to get out of sync with the kernel.

No vm_page_lockptr() accessor is provided for modules.  It can be added
later if needed, unless a proper KPI is developed to serve the need.

Reviewed by:	attilio, alc
MFC after:	3 weeks
2011-11-29 13:07:32 +00:00
attilio
984f9ddc2b - Remove unnecessary checks on rnode in KTR prints
- Track rn_count in KTR prints
- Improve the KTR prints so that they better fit rn_count tracking
2011-11-29 02:07:07 +00:00
attilio
27cea0290f Fix compile.
Submitted by:	flo
2011-11-28 19:14:38 +00:00
attilio
77442309d8 Improve the diagnostic in the remove case. 2011-11-28 17:26:19 +00:00
attilio
4391f6c279 Fix a bug where the 'rnode' pointer can be NULL and we try to track
the children.  This helps in the debugging case.

Reported by:	flo
2011-11-26 14:26:37 +00:00
attilio
aba6408184 MFC 2011-11-26 13:54:55 +00:00
attilio
b95134ea01 Introduce the same mutex-wise fix in r227758 for sx locks.
The functions that offer file and line specifications are:
- sx_assert_
- sx_downgrade_
- sx_slock_
- sx_slock_sig_
- sx_sunlock_
- sx_try_slock_
- sx_try_xlock_
- sx_try_upgrade_
- sx_unlock_
- sx_xlock_
- sx_xlock_sig_
- sx_xunlock_

Now vm_map locking is fully converted and can avoid knowing the
specifics of the locking procedures.
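
For instance (a hypothetical wrapper, not the committed vm_map code),
an interface built on top of sx can now forward file and line directly:

    #define my_map_lock(map)    \
        sx_xlock_(&(map)->lock, LOCK_FILE, LOCK_LINE)
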
Reviewed by:	kib
MFC after:	1 month
2011-11-21 12:59:52 +00:00
attilio
6a69e947d3 Introduce macro stubs in the mutex implementation that will always be
defined and will allow consumers who want to provide options, file and
line to locking requests not to worry about options redefining the
interfaces.
This is typically useful when there is a need to build another
locking interface on top of the mutex one.

The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_
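
As an example (hypothetical consumer, not from the commit), a wrapper
built on top of the mutex interface can pass its caller's file and
line through:

    #define foo_lock(fm)    \
        mtx_lock_flags_(&(fm)->fm_mtx, 0, LOCK_FILE, LOCK_LINE)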

Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specifications in the
  ppbus code by using the same macros as done in this patch (but this is
  left to the ppbus maintainer)
- All the other locking interfaces may require a similar cleanup, where
  the most notable case is sx, which will allow a further cleanup of
  the vm_map locking facilities
- The patch should be fully compatible with older branches, thus an MFC
  is planned (in fact it uses all the underlying mechanisms already
  present).

Comments review by:	eadler, Ben Kaduk
Discussed with:		kib, jhb
MFC after:	1 month
2011-11-20 16:33:09 +00:00
attilio
062b8ecde4 Add more KTR points for failure in vm_radix_insert(). 2011-11-20 14:51:27 +00:00
attilio
c68bde9dbe MFC 2011-11-18 09:54:14 +00:00
alc
e6f2f7a89d Eliminate end-of-line white space. 2011-11-17 06:54:49 +00:00
alc
3692d01659 Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig().  This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig().  For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space.  It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages.  Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary.  That said, at present, this new function can't be applied to all
types of vm objects.  However, that restriction will be eliminated in the
coming weeks.

From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't.  Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations.  Now, vm_page_alloc_contig() is responsible
for these things.
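
An illustrative call (parameter values hypothetical): allocate 16
wired, zero-filled, physically contiguous pages below 4GB:

    m = vm_page_alloc_contig(object, pindex,
        VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_ZERO, 16,
        0, (vm_paddr_t)0xffffffff, PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);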

Reviewed by:	kib
Discussed with:	jhb
2011-11-16 16:46:09 +00:00
attilio
2cf76e3d27 Fix compilation for userland:
- Use CTASSERT() only in the kernel.
- The root pointer is required by struct vm_object, which is accessible
  (maybe incorrectly?) by userland.
2011-11-15 23:37:15 +00:00
attilio
b13f89ff5b MFC 2011-11-15 23:32:30 +00:00
kib
592f122323 Update the device pager interface, while keeping the compatibility
layer for the old KPI and KBI.  The new interface should be used
together with the d_mmap_single cdevsw method.

A device pager can be allocated with the cdev_pager_allocate(9)
function, which takes a struct cdev_pager_ops containing
constructor/destructor and page fault handler methods supplied by the
driver.

The constructor and destructor, called at pager allocation and
deallocation time, allow the driver to handle per-object private data.

The pager handler is called to handle a page fault on a vm map entry
backed by the driver pager.  The driver shall return either the
vm_page_t that should be mapped or an error code (which no longer
causes a kernel panic).  The page handler interface has a placeholder
to specify the access mode causing the fault, but currently PROT_READ
is always passed there.
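
A hedged skeleton of a driver using the new KPI (all 'mydrv' names and
the error handling are hypothetical):

    static int
    mydrv_pg_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
        vm_ooffset_t foff, struct ucred *cred, u_short *color)
    {
        *color = 0;    /* no special page coloring */
        return (0);
    }

    static void
    mydrv_pg_dtor(void *handle)
    {
        /* release per-object private data */
    }

    static int
    mydrv_pg_fault(vm_object_t vm_obj, vm_ooffset_t offset, int prot,
        vm_page_t *mres)
    {
        /* Find or create the page backing 'offset', store it in
           *mres and return VM_PAGER_OK; an error no longer panics. */
        return (VM_PAGER_ERROR);
    }

    static struct cdev_pager_ops mydrv_pager_ops = {
        .cdev_pg_ctor  = mydrv_pg_ctor,
        .cdev_pg_dtor  = mydrv_pg_dtor,
        .cdev_pg_fault = mydrv_pg_fault,
    };

    obj = cdev_pager_allocate(handle, OBJT_DEVICE, &mydrv_pager_ops,
        size, nprot, foff, td->td_ucred);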

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2011-11-15 14:40:00 +00:00
kib
93c04deafa Remove the condition that is always true.
Submitted by:	alc
MFC after:	1 week
2011-11-15 14:09:53 +00:00
attilio
6e9e854884 MFC 2011-11-08 11:08:40 +00:00
ed
0c56cf839d Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
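
For example (node name hypothetical), when no other file declares the
node with SYSCTL_DECL(), it can simply be:

    static SYSCTL_NODE(_vm, OID_AUTO, example, CTLFLAG_RD, 0,
        "Example subtree");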
2011-11-07 15:43:11 +00:00
alc
eb6224d517 Wake up the page daemon in vm_page_alloc_freelist() if it couldn't
allocate the requested page because too few pages are cached or free.

Document the VM_ALLOC_COUNT() option to vm_page_alloc() and
vm_page_alloc_freelist().
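
An illustrative use of VM_ALLOC_COUNT() (names hypothetical): pass the
number of pages the caller still intends to allocate, so the page
daemon can be woken up early enough:

    m = vm_page_alloc(object, pindex,
        VM_ALLOC_NORMAL | VM_ALLOC_COUNT(npages - i));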

Make style changes to vm_page_alloc() and vm_page_alloc_freelist(),
such as using a variable name that more closely corresponds to the
comments.
2011-11-06 02:03:27 +00:00
kib
a527ed8008 Remove redundant definitions. The chunk was missed from r227102.
MFC after:	2 weeks
2011-11-05 09:03:18 +00:00
kib
91c9020e20 Provide typedefs for the type of the bit masks for the page bits.
Use the defined types instead of int when manipulating masks.
Supposedly, this could fix support for a 32KB page size in the
machine-independent VM layer.
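
The reasoning: the valid/dirty masks carry one bit per DEV_BSIZE
(512-byte) block, so a 32KB page needs 32768 / 512 = 64 bits, which
does not fit in an int.  A hedged illustration of such a typedef:

    #if PAGE_SIZE == 32768
    typedef uint64_t vm_page_bits_t;
    #else
    typedef uint32_t vm_page_bits_t;
    #endif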

Reviewed by:	alc
MFC after:	2 weeks
2011-11-05 08:20:32 +00:00
attilio
65129c5c3e MFC 2011-11-04 06:56:59 +00:00
alc
89a5842600 Simplify the implementation of the failure case in kmem_alloc_attr(). 2011-11-04 04:41:58 +00:00
jhb
78c075174e Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provides hints about data access
patterns and is used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hint tells the OS that data will or will not be
used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.
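
A userland usage sketch; note that posix_fadvise(2) returns an error
number instead of setting errno:

    error = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (error != 0)
        errc(1, error, "posix_fadvise");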

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
attilio
11fa9a13f6 MFC 2011-11-03 21:57:02 +00:00
alc
6dbf0412bb Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist()
and use these new options in the mips pmap.

Wake up the page daemon in vm_page_alloc_freelist() if the number of free
and cached pages becomes too low.

Tidy up vm_page_alloc_init().  In particular, add a comment about an
important restriction on its use.
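
An illustrative call (freelist index hypothetical):

    m = vm_page_alloc_freelist(VM_FREELIST_DEFAULT,
        VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_ZERO);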

Tested by:	jchandra@
2011-11-02 05:42:51 +00:00
jeff
31902d5a7c - Add some convenience inlines.
- Update the copyright.
2011-11-01 04:21:57 +00:00
attilio
f8c5162413 Add kernel protection to the header file for vmradix.
Likely this file needs some more restructuring (and we should
make a lot of macros private to the radix implementation), but leave
them as they are for now because we may enrich the KPI much further.
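
That is, wrap the kernel-only declarations (sketch):

    #ifdef _KERNEL
    /* radix KPI declarations */
    #endif /* _KERNEL */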
2011-11-01 03:53:10 +00:00
attilio
7272c08497 vm_object_terminate() doesn't actually free the pages in the splay
tree.
Reclaim all the nodes related to the radix tree for a specified
vm_object when calling vm_object_terminate() via the newly added
interface vm_radix_reclaim_nodes().
The function is recursive, but we have a well-defined maximum depth,
thus the amount of necessary stack can be easily calculated.
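
A hedged sketch of the bounded recursion (names and the fanout
constant are illustrative, not the committed code):

    static void
    reclaim_nodes(struct vm_radix_node *rnode, int level)
    {
        int i;

        if (level > 0)
            for (i = 0; i < RADIX_FANOUT; i++)
                if (rnode->rn_child[i] != NULL)
                    reclaim_nodes(rnode->rn_child[i], level - 1);
        uma_zfree(rnode_zone, rnode);
    }

The depth is bounded by the tree height, roughly 64 / log2(fanout)
levels for a 64-bit index, so the worst-case stack use follows
directly.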

Reported by:	alc
Discussed and reviewed by:	jeff
2011-11-01 03:40:38 +00:00
jeff
146842f2d2 - Extract part of vm_radix_lookupn() into a function that just finds the
   first leaf page in a specified range.  This permits us to make many
   search & operate functions without much code duplication.
 - Make a generic iterator for radix items.
2011-10-30 22:57:42 +00:00
attilio
44bf71a91c MFC 2011-10-30 11:43:12 +00:00
jeff
9e2e6a2980 - Support two types of nodes, red and black, within the same radix tree
   (see the sketch after this list).  Black nodes support standard
   active pages and red nodes support cached pages.  Red nodes may be
   removed without the object lock but will not collapse unused tree
   nodes.  Red nodes may not be directly inserted; instead a new
   function is supplied to convert between black and red.
 - Handle cached pages and active pages in the same loop in vm_object_split,
   vm_object_backing_scan, and vm_object_terminate.
 - Retire the splay page handling as the ifdefs are too difficult to
   maintain.
 - Slightly optimize the vm_radix_lookupn() function.
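
One plausible encoding for the red/black distinction (illustrative
only, not spelled out in this log): tag the low bit of the leaf
pointer, which is free due to pointer alignment:

    #define RED_TAG       ((uintptr_t)1)
    #define IS_RED(p)     (((uintptr_t)(p) & RED_TAG) != 0)
    #define AS_RED(p)     ((void *)((uintptr_t)(p) | RED_TAG))
    #define AS_BLACK(p)   ((void *)((uintptr_t)(p) & ~RED_TAG))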
2011-10-30 11:11:04 +00:00
alc
57e8705396 Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at
eliminating duplicated code in the various pmap implementations.

Micro-optimize vm_phys_free_pages().

Introduce vm_phys_free_contig().  It is a fast routine for freeing an
arbitrary number of physically contiguous pages.  In particular, it
doesn't require the number of pages to be a power of two.
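
A sketch of the idea (simplified, not the committed code): carve the
run into maximal aligned power-of-two chunks for the buddy queues,
e.g. 6 pages starting at an even page index become 2 + 4:

    while (npages > 0) {
        order = (pindex == 0) ? flsl(npages) - 1 :
            min(ffsl(pindex) - 1, flsl(npages) - 1);
        /* free the (1UL << order) pages starting at pindex */
        pindex += 1UL << order;
        npages -= 1UL << order;
    }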

Use "u_long" instead of "unsigned long".

Bruce Evans (bde@) has convinced me that the "boundary" parameters
to kmem_alloc_contig(), vm_phys_alloc_contig(), and
vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not
"u_long".  Make this change.
2011-10-30 05:06:14 +00:00
alc
15734d833a Use "u_long" instead of "unsigned long". 2011-10-28 22:36:15 +00:00
jeff
1f2c60154d - Use a single uintptr_t for the root of the radix node that encodes the
   height and a pointer so that the update to the root is atomic.  This
   permits safe lookups in parallel with tree expansion.  Shrinking the
   space requirements is a small bonus.
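
A sketch of the encoding (mask width illustrative): the node alignment
frees the low bits of the root word for the height, and one store of
'root' publishes both atomically:

    #define RT_HEIGHT_MASK ((uintptr_t)0xf)

    node = (struct vm_radix_node *)(root & ~RT_HEIGHT_MASK);
    height = (int)(root & RT_HEIGHT_MASK);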
2011-10-28 03:42:41 +00:00
attilio
22f6379eef Fix compilation. 2011-10-28 03:07:23 +00:00
attilio
963320ca59 MFC 2011-10-28 02:54:07 +00:00
attilio
588d89046c Use a UMA zone for the radix nodes.  This avoids the need to check
for kernel_map/kmem_map recursion because it uses the direct mapping
provided by amd64 to avoid the object and map search and recursion.

Probably all the other architectures using UMA_MD_SMALL_ALLOC are also
fixed by this, but others remain, where the most notable case is i386.
For it a solution still has to be determined.  A way to do this would
be to have a reserved map just for the radix nodes and mark all
accesses to its lock as witness safe, but that would still be
suboptimal due to the large amount of virtual address space needed to
cater for the whole tree.
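
Sketch of the zone setup (name and flags illustrative):

    rnode_zone = uma_zcreate("RADIX NODE", sizeof(struct vm_radix_node),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM);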
2011-10-28 01:56:36 +00:00