Commit Graph

628 Commits

Author SHA1 Message Date
kib
6796656333 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
kib
889b9d0e0b If the last page of the file is partially full and whole valid
portion is invalidated, invalidate the whole page.  Otherwise,
partially valid page appears on a page queue, which is wrong.  This
could only happen for the last page, because only then buffer which
triggered invalidation could not cover the whole page.

Reported and tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-14 10:11:38 +00:00
kib
7ab18d4990 The vm_page_trysbusy() should not fail when shared busy counter or
VPB_BIT_WAITERS flag were changed between reading of busy_lock and the
cas.  The vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and do not mean that the current page state
prevents sbusying.

Retry the operation inside vm_page_trysbusy() if cas failed, only
return a failure when VPB_BIT_SHARED is cleared.

Reported and tested by:	pho
Reviewed by:	attilio
Sponsored by:	The FreeBSD Foundation
2013-09-05 12:54:40 +00:00
alc
aa9a7bb9e6 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
glebius
088bcbe3ed Remove comment that is no longer relevant since r254182. 2013-08-26 14:14:25 +00:00
alc
1a535523cd Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop().  Detect this scenario so that the vnode's hold count is
correctly maintained.  Otherwise, we panic.

Reported by:	scottl
Tested by:	pho
Discussed with:	attilio, jeff, kib
2013-08-23 17:27:12 +00:00
kib
ba12eedccd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
alc
42d76a02b5 Addendum to r254141: Allow recursion on the free pages queues lock in
vm_page_alloc_freelist().

Reported and tested by:	sbruno
Sponsored by:	EMC / Isilon Storage Division
2013-08-21 15:31:43 +00:00
attilio
ae49aeaba6 On the recovery path for vm_page_alloc(), if a page had been requested
wired, unwind back the wiring bits otherwise we can end up freeing a
page that is considered wired.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc
2013-08-15 11:01:25 +00:00
jeff
bc00d6df57 Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

 - Change v_free_target to include the quantity previously represented by
   v_cache_min so we don't need to add them together everywhere we use them.
 - Add a pageout_wakeup_thresh that sets the free page count trigger for
   waking the page daemon.  Set this 10% above v_free_min so we wakeup before
   any phase transitions in vm users.
 - Adjust down v_free_target now that we're willing to accept more pagedaemon
   wakeups.  This means we process fewer pages in one iteration as well,
   leading to shorter lock hold times and less overall disruption.
 - Eliminate vm_pageout_page_stats().  This was a minor variation on the
   PQ_ACTIVE segment of the normal pageout daemon.  Instead we now process
   1 / vm_pageout_update_period pages every second.  This causes us to visit
   the whole active list every 60 seconds.  Previously we would only maintain
   the active LRU when we were short on pages which would mean it could be
   woefully out of date.

Reviewed by:	alc (slight variant of this)
Discussed with:	alc, kib, jhb
Sponsored by:	EMC / Isilon Storage Division
2013-08-13 21:56:16 +00:00
attilio
f2a180739c Correct the recovery logic in vm_page_alloc_contig:
what is really needed on this code snipped is that all the pages that
are already fully inserted gets fully freed, while for the others the
object removal itself might be skipped, hence the object might be set to
NULL.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc, kib
Reviewed by:	alc
2013-08-11 21:15:04 +00:00
kib
4675fcfce0 Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue.  Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked.  See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members.  Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-10 17:36:42 +00:00
jhb
8f3909e991 Revert the addition of VPO_BUSY and instead update vm_page_replace() to
properly unbusy the page.

Submitted by:	alc
2013-08-09 21:14:55 +00:00
attilio
e9f37cac74 On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
attilio
16c7563cf4 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
kib
8de1718b60 Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain.  The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass.  This is not optimal, since it
could cause excessive page deactivation and freeing.  The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues.  Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
kib
c2cfac4ffc In the vm_page_set_invalid() function, do not assert that the page is
not busy, since its only caller brelse() can legitimately call it on
busy page.  This happens for VOP_PUTPAGES() on filesystems that use
buffers and which VOP_WRITE() method marked the buffer containing page
as non-cacheable.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:38:39 +00:00
neel
44e6cbd7c8 vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be
reset by pmap_page_init() right after being initialized in vm_page_initfake().

The statement above is with reference to the amd64 implementation of
pmap_page_init().

Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing
the 'memattr'.

Reviewed by:	kib
MFC after:	2 weeks
2013-07-03 23:38:37 +00:00
glebius
fd69e3f2df Typo in comment. 2013-06-24 13:36:16 +00:00
alc
53ffec3a56 Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by:	EMC / Isilon Storage Division
2013-06-10 01:48:21 +00:00
kib
dba0637421 Remove irrelevant comments.
Discussed with:	alc
MFC after:	3 days
2013-06-03 17:30:40 +00:00
alc
17993ced1b Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag.  Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag.  So, in fact, this revision only changes some
comments and an assertion.  Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear().  (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().)  Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-03 01:22:54 +00:00
alc
ece39b8d03 Now that access to the page's "act_count" field is synchronized by the page
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)

Sponsored by:	EMC / Isilon Storage Division
2013-06-01 20:32:34 +00:00
attilio
fdf82ef9cf o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
kib
f8c66c9055 Add ddb command 'show pginfo' which provides useful information about
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:04:00 +00:00
alc
585b1bf4b4 Relax the object locking assertion in vm_page_lookup(). Now that a radix
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.

Reviewed by:    attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-17 18:49:43 +00:00
peter
46ecea3da0 Bandaid for compiling with gcc, which happens to be the default compiler
for a number of platforms still.
2013-05-13 07:09:31 +00:00
alc
7d20e37fb6 Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once.  This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.

Sponsored by:	EMC / Isilon Storage Division
2013-05-12 16:50:18 +00:00
alc
9e48bd7ba9 Most allocation of pages to objects proceeds from lower to higher
indices.  Consequentially, vm_page_insert() should use
vm_radix_lookup_le() instead of vm_radix_lookup_ge().  Here's why.  In
the expected case, vm_radix_lookup_le() will quickly find a page less
than the specified key at the same radix node.  In contrast,
vm_radix_lookup_ge() is expected to return NULL, but to do that it must
examine every slot in the radix tree that is greater than the key.

Prior to this change, the average cost of a vm_page_insert() call on my
test machine was 992 cycles.  After this change, the average cost is only
532 cycles, a reduction of 46%.

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-03-17 16:23:19 +00:00
alc
b346e448af Simplify the interface to vm_radix_insert() by eliminating the parameter
"index".  The content of a radix tree leaf, or at least its "key", is not
opaque to the other radix tree operations.  Specifically, they know how to
extract the "key" from a leaf.  So, eliminating the parameter "index" isn't
breaking the abstraction.  Moreover, eliminating the parameter "index"
effectively prevents the caller from passing an inconsistent "index" and
leaf to vm_radix_insert().

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-03-17 16:06:03 +00:00
attilio
3c52979cb4 MFC 2013-03-12 13:26:12 +00:00
alc
07fc599921 When transferring the page from one object to another, don't insert the
page into its new object until the page's pindex has been updated.
Otherwise, one code path within vm_radix_insert() may use the wrong
pindex value.

Sponsored by:	EMC / Isilon Storage Division
2013-03-12 06:14:31 +00:00
attilio
32a3275e77 MFC 2013-03-11 10:49:02 +00:00
alc
2c9c761886 Introduce vm_radix_is_empty(), and use it in place of
vm_object_cache_is_empty() where the caller is aware of the page cache's
implementation as a radix trie.

Sponsored by:	EMC / Isilon Storage Division
2013-03-10 17:30:57 +00:00
alc
854d4fd5e6 Update a comment: The object lock is no longer a mutex. 2013-03-09 21:32:24 +00:00
attilio
76954ad68a Merge from vmcontention. 2013-03-09 03:19:53 +00:00
attilio
16a80466e5 MFC 2013-03-09 02:51:51 +00:00
attilio
72f7f3e528 Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
attilio
754f3790b8 Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object.  This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	alc (vm_radix based version)
Tested by:	flo, pho, jhb, davide
2013-03-09 02:05:29 +00:00
attilio
709ad55889 Evaluations on the likelyhood of empty object cache cannot be made in
general way but must be evaluated case by case.
Embedd the decision in the caller themselves rather than in a
general purpose KPI.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc
Reviewed by:	alc
2013-03-04 12:33:40 +00:00
alc
a855741cf1 A Boolean is more appropriate than an int here. Use what I think is a
slightly better variable name.

Sponsored by:	EMC / Isilon Storage Division
2013-03-04 07:20:59 +00:00
alc
317a9584fb Two out of three times that vm_page_find_least() is called, it's going to
return the vm object's first page.  In those cases, there is no need to
traverse the trie.

Sponsored by:	EMC / Isilon Storage Division
2013-03-03 01:36:31 +00:00
attilio
210a93e7f7 Merge from vmcontention 2013-02-26 18:18:39 +00:00
attilio
134623836d MFC 2013-02-26 18:11:43 +00:00
attilio
49f99b7251 Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-26 17:22:08 +00:00
attilio
e590e8091e As VM_OBJECT_SLEEP() is a vm_object_t specific function, make
the passed object as the first argument of the function for consistency.

Sponsored by:	EMC / Isilon storage revision
2013-02-26 01:38:12 +00:00
attilio
fc0ecac2f8 Revert wrongly added asserts: lookup and remove from the collection
of cached pages doesn't require the object lock to be held.

Sponsored by:	EMC / Isilon storage division
2013-02-26 00:34:52 +00:00
attilio
905e648d42 Hide the details for the assertion for VM_OBJECT_LOCK operations.
Rename current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
VM_OBJECT_ASSERT_WLOCKED(foo)

Sponsored by:	EMC / Isilon storage division
Requested by:	alc
2013-02-21 21:54:53 +00:00
attilio
15bf891afe Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
their "write" versions.

Sponsored by:	EMC / Isilon storage division
2013-02-20 12:03:20 +00:00
attilio
658534ed5a Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
  get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
  files require including directly rwlock.h
2013-02-20 10:38:34 +00:00
attilio
47d4527f8f Remove an unuseful check as looking up into an empty trie should be
as fast as checking a NULL ptr.

Sponsored by:	EMC / Isilon storage division
2013-02-15 17:22:57 +00:00
attilio
ea8f6ef283 Remove whitespace. 2013-02-15 17:21:41 +00:00
attilio
af55a9fb46 - Fix style in vm_page_lookup(): there is no whiteline between
assertions and other code in this file.
- Reinsert some comments that were lost during the work but which are
  actual yet, reducing differences with HEAD.

Sponsoed by:	EMC / Isilon storage division
2013-02-15 12:35:16 +00:00
attilio
642ba82fdf Remove an unuseful check on resident_page_count.
vm_radix_lookup_ge() of an empty trie is as fast as checking a
NULL pointer.
2013-02-14 15:25:31 +00:00
attilio
eafe26c8a6 The radix preallocation pages can overfow the biggestone segment, so
use a different scheme for preallocation: reserve few KB of nodes to be
used to cater page allocations before the memory can be efficiently
pre-allocated by UMA.

This at all effects remove boot_pages further carving and along with
this modifies to the boot_pages allocation system and necessity to
initialize the UMA zone before pmap_init().

Reported by:	pho, jhb
2013-02-14 15:23:00 +00:00
attilio
53f78d1a7d Implement a new algorithm for managing the radix trie which also
includes path-compression. This greatly helps with sparsely populated
tries, where an uncompressed trie may end up by having a lot of
intermediate nodes for very little leaves.

The new algorithm introduces 2 main concepts: the node level and the
node owner.  Every node represents a branch point where the leaves share
the key up to the level specified in the node-level (current level
excluded, of course).  Such key partly shared is the one contained in
the owner.  Of course, the root branch is exempted to keep a valid
owner, because theoretically all the keys are contained in the space
designed by the root branch node.  The search algorithm seems very
intuitive and that is where one should start reading to understand the
full approach.

In the end, the algorithm ends up by demanding only one node per insert
and this is not necessary in all the cases.  To stay safe, we basically
preallocate as many nodes as the number of physical pages are in the
system, using uma_preallocate().  However, this raises 2 concerns:
* As pmap_init() needs to kmem_alloc(), the nodes must be pre-allocated
  when vm_radix_init() is currently called, which is much before UMA
  is fully initialized.  This means that uma_prealloc() will dig into the
  UMA_BOOT_PAGES pool of pages, which is often not enough to keep track
  of such large allocations.
  In order to fix this, change a bit the concept of UMA_BOOT_PAGES and
  vm.boot_pages. More specifically make the UMA_BOOT_PAGES an initial "value"
  as long as vm.boot_pages and extend the boot_pages physical area by as
  many bytes as needed with the information returned by
  vm_radix_allocphys_size().
* A small amount of pages will be held in per-cpu buckets and won't be
  accessible from curcpu, so the vm_radix_node_get() could really panic
  when the pre-allocation pool is close to be exhausted.
  In theory we could pre-allocate more pages than the number of physical
  frames to satisfy such request, but as many insert would happen without
  a node allocation anyway, I think it is safe to assume that the
  over-allocation is already compensating for such problem.
  On the field testing can stand me correct, of course.  This could be
  further helped by the case where we allow a single-page insert to not
  require a complete root node.

The use of pre-allocation gets rid all the non-direct mapping trickery
and introduced lock recursion allowance for vm_page_free_queue.

The nodes children are reduced in number from 32 -> 16 and from 16 -> 8
(for respectively 64 bits and 32 bits architectures).
This would make the children to fit into cacheline for amd64 case,
for example, and in general spawn less cacheline, which may be
helpful in lookup_ge() case.
Also, path-compression cames to help in cases where there are many levels,
making the fallouts of such change less hurting.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff (partially)
Tested by:	flo
2013-02-13 01:19:31 +00:00
attilio
87d8d2eec7 Fix style. 2013-02-10 16:00:14 +00:00
attilio
59f669fb82 Style. 2013-02-07 15:06:04 +00:00
attilio
4c22b4bafe Cleanup vm_radix KPI:
- Avoid the return value for vm_radix_insert()
- Name the functions argument per-style(9)
- Avoid to get and return opaque objects but use vm_page_t as vm_radix is
  thought to not really be general code but to cater specifically page
  cache and resident cache.
2013-02-06 18:37:46 +00:00
attilio
d3fb98bfb4 Enrich comments on newly added assertions. 2013-02-06 17:47:24 +00:00
attilio
22b0de04b3 Reduce diffs against HEAD. 2013-02-06 17:17:11 +00:00
attilio
9d88c5279c Now that vm_page_cache_free() and vm_page_cache_transfer() are
reimplemented as ranged operations, sync vm_page_is_cached() semantic
with HEAD.
2013-02-06 14:50:34 +00:00
attilio
44f85cd1a5 Reduce diffs against HEAD:
Reimplement vm_page_cache_free() as a range operation.
2013-02-06 14:29:05 +00:00
attilio
439c0b8cf1 Reduce diffs against HEAD:
- Reimplement vm_page_cache_transfer() properly
- Remove vm_page_cache_rename() as a subsequent change
2013-02-05 00:09:33 +00:00
attilio
d61cd60feb Reduce differences with HEAD. 2013-02-04 22:05:22 +00:00
attilio
b972b67ed7 Merge from vmcontention 2013-02-04 15:44:42 +00:00
attilio
be719e9167 MFC 2012-12-11 00:07:19 +00:00
alc
02094caa2c In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages.  In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by:	kib
2012-12-09 00:32:38 +00:00
alc
e12b2ad698 Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO
were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the
allocated page when it happened to be pre-zeroed.

MFC after:	5 days
2012-11-21 06:26:18 +00:00
alc
ff7333d33f Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by:	kib
2012-11-13 02:50:39 +00:00
alc
60d5a532fb In general, we call pmap_remove_all() before calling vm_page_cache(). So,
the call to pmap_remove_all() within vm_page_cache() is usually redundant.
This change eliminates that call to pmap_remove_all() and introduces a
call to pmap_remove_all() before vm_page_cache() in the one place where
it didn't already exist.

When iterating over a paging queue, if the object containing the current
page has a zero reference count, then the page can't have any managed
mappings.  So, a call to pmap_remove_all() is pointless.

Change a panic() call in vm_page_cache() to a KASSERT().

MFC after:	6 weeks
2012-11-01 16:20:02 +00:00
attilio
d38d7bb245 Rework the known mutexes to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct mtx_padalign.

The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviwed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
alc
77582e8298 Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose.  When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD.  Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD.  In other words, PQ_HOLD is
used as a flag, not a queue.  So, this change replaces it with a flag.

To accomodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.

Reviewed by:	kib
2012-10-29 06:15:04 +00:00
attilio
64eaf39fd7 MFC 2012-10-22 21:26:36 +00:00
alc
1df941a7f3 Move vm_page_requeue() to the only file that uses it.
MFC after:	3 weeks
2012-10-13 20:19:43 +00:00
alc
c84b1820ea Eliminate the conditional for releasing the page queues lock in
vm_page_sleep().  vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held.  These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after:	3 weeks
2012-10-13 18:46:46 +00:00
alc
271cefc5f6 Tidy up a bit:
Update some of the comments.  In particular, use "sleep" in preference to
"block" where appropriate.

Eliminate some unnecessary casts.

Make a few whitespace changes for consistency.

Reviewed by:	kib
MFC after:	3 days
2012-10-03 05:06:45 +00:00
attilio
d3c5a80b69 MFC 2012-08-27 11:59:04 +00:00
kib
a3d0fb0175 Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by:	alc
MFC after:	1 week
2012-08-14 11:45:47 +00:00
kib
4259905d31 Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by:	alc
MFC after:	1 week
2012-08-04 18:16:43 +00:00
attilio
c52a057b19 MFC 2012-08-03 15:58:05 +00:00
alc
5b4712b5a1 Inline vm_page_aflags_clear() and vm_page_aflags_set().
Add comments stating that neither these functions nor the flags that they
are used to manipulate are part of the KBI.
2012-08-03 01:48:15 +00:00
alc
ad2692aed9 Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP. 2012-07-17 02:36:59 +00:00
attilio
e237fbcafa Merge from vmcontention 2012-07-08 16:12:59 +00:00
attilio
ffa3f082ff - Split the cached and resident pages tree into 2 distinct ones.
This makes the RED/BLACK support go away and simplifies a lot vmradix
  functions used here. This happens because with patricia trie support
  the trie will be little enough that keeping 2 diffetnt will be
  efficient too.
- Reduce differences with head, in places like backing scan where the
  optimizazions used shuffled the code a little bit around.

Tested by:	flo, Andrea Barberio
2012-07-08 14:01:25 +00:00
attilio
d659f0f6bf MFC 2012-07-07 17:55:27 +00:00
alc
c5e6daff9d Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
attilio
5d9dd820d8 MFC 2012-06-23 02:08:15 +00:00
alc
4d96d753fe Selectively inline vm_page_dirty(). 2012-06-20 23:25:47 +00:00
alc
6eeaee04e4 The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer.  This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by:	kib
MFC after:	6 weeks
2012-06-16 18:56:19 +00:00
attilio
807db03f96 Revert r231027 and fix the prototype for vm_radix_remove().
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.
2012-06-08 18:44:54 +00:00
attilio
e761e0c4bc MFC 2012-06-01 14:57:55 +00:00
andrew
39fa59acb9 Fix booting on ARM.
In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past
the end of vm_page_array was incorrect causing it to return NULL. This
value is then used in vm_phys_add_page causing a data abort.

Reviewed by:	alc, kib, imp
Tested by:	stas
2012-05-22 07:04:23 +00:00
nwhitehorn
e83623fb1f Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies
range operations like pmap_remove() and pmap_protect() as well as allowing
simple operations like pmap_extract() not to involve any global state.
This substantially reduces lock coverages for the global table lock and
improves concurrency.
2012-05-20 14:33:28 +00:00
kib
9ff1ec42a4 Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns proper fictitious vm_page_t. The range should be de-registered
after consumer stopped using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.

A hash container might be developed instead of range registration
interface, and fake pages could be put automatically into the hash,
were PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:42:56 +00:00
kib
1dfd5258de Split the code from vm_page_getfake() to initialize the fake page struct
vm_page into new interface vm_page_initfake(). Handle the case of fake
page re-initialization with changed memattr.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:34:22 +00:00
kib
e5df51c93a Assert that the page passed to vm_page_putfake() is unmanaged.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:27:51 +00:00
kib
84b22557f6 Make the vm_page_array_size long. Remove redundand zero initialization
for vm_page_array_size and nearby variablees.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:03:06 +00:00
attilio
c7b668e647 MFC 2012-05-05 21:40:32 +00:00
nwhitehorn
d394621163 Avoid a lock order reversal in pmap_extract_and_hold() from relocking
the page. This PMAP requires an additional lock besides the PMAP lock
in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.

Suggested by:	kib
MFC after:	4 days
2012-04-22 17:58:30 +00:00