Commit Graph

3233 Commits

Author SHA1 Message Date
Jeff Roberson
fc03d22b17 Refine UMA bucket allocation to reduce space consumption and improve
performance.

 - Always free to the alloc bucket if there is space.  This gives LIFO
   allocation order to improve hot-cache performance.  This also allows
   for zones with a single bucket per-cpu rather than a pair if the entire
   working set fits in one bucket.
 - Enable per-cpu caches of buckets.  To prevent recursive bucket
   allocation one bucket zone still has per-cpu caches disabled.
 - Pick the initial bucket size based on a table driven maximum size
   per-bucket rather than the number of items per-page.  This gives
   more sane initial sizes.
 - Only grow the bucket size when we face contention on the zone lock, this
   causes bucket sizes to grow more slowly.
 - Adjust the number of items per-bucket to account for the header space.
   This packs the buckets more efficiently per-page while making them
   not quite powers of two.
 - Eliminate the per-zone free bucket list.  Always return buckets back
   to the bucket zone.  This ensures that as zones grow into larger
   bucket sizes they eventually discard the smaller sizes.  It persists
   fewer buckets in the system.  The locking is slightly trickier.
 - Only switch buckets in zalloc, not zfree, this eliminates pathological
   cases where we ping-pong between two buckets.
 - Ensure that the thread that fills a new bucket gets to allocate from
   it to give a better upper bound on allocation time.

Sponsored by:	EMC / Isilon Storage Division
2013-06-18 04:50:20 +00:00
Jeff Roberson
0095a78419 - Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
   pointer items.  These zones have no kegs.
 - Convert the regular keg based allocator to use the new import/release
   functions.
 - Move some stats to be atomics since they would require excessive zone
   locking/unlocking with the new import/release paradigm.  Make
   zone_free_item simpler now that callers can manage more stats.
 - Check for these cache-only zones in the public APIs and debugging
   code by checking zone_first_keg() against NULL.

Sponsored by:	EMC / Isilong Storage Division
2013-06-17 03:43:47 +00:00
Jeff Roberson
ef72505e6d - Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset.  This is much simpler, has lower space
   overhead and is cheaper in most cases.
 - Use a second bitmap for invariants asserts and improve the quality of
   the asserts as well as the number of erroneous conditions that we will
   catch.
 - Drastically simplify sizing code.  Special case refcnt zones since they
   will be going away.
 - Update stale comments.

Sponsored by:	EMC / Isilon Storage Division
2013-06-13 21:05:38 +00:00
Alan Cox
2051980f97 Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by:	EMC / Isilon Storage Division
2013-06-10 01:48:21 +00:00
Gleb Smirnoff
995d706909 Make sys_mlock() function just a wrapper around vm_mlock() function
that does all the job.

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:13:40 +00:00
Attilio Rao
002f377ab2 Complete r251452:
Avoid to busy/unbusy a page in cases where there is no need to drop the
vm_obj lock, more nominally when the page is full valid after
vm_page_grab().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-06 18:19:26 +00:00
Attilio Rao
dfd55c0c7b In vm_object_split(), busy and consequently unbusy the pages only when
swap_pager_copy() is invoked, otherwise there is no reason to do so.
This will eliminate the necessity to busy pages most of the times.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-04 22:47:01 +00:00
Alan Cox
da38420832 Update a comment. 2013-06-04 05:44:52 +00:00
Alan Cox
e23b0a193e Relax the object locking in vm_pageout_map_deactivate_pages() and
vm_pageout_object_deactivate_pages().  A read lock suffices.

Sponsored by:	EMC / Isilon Storage Division
2013-06-04 02:28:47 +00:00
Konstantin Belousov
be6ec55376 Remove irrelevant comments.
Discussed with:	alc
MFC after:	3 days
2013-06-03 17:30:40 +00:00
Alan Cox
b417181250 Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag.  Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag.  So, in fact, this revision only changes some
comments and an assertion.  Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear().  (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().)  Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-03 01:22:54 +00:00
Alan Cox
b4e498071d Now that access to the page's "act_count" field is synchronized by the page
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)

Sponsored by:	EMC / Isilon Storage Division
2013-06-01 20:32:34 +00:00
Alan Cox
ef5ba5a31d Simplify the definition of vm_page_lock_assert(). There is no compelling
reason to inline the implementation of vm_page_lock_assert() in the
!KLD_MODULES case.  Use the same implementation for both KLD_MODULES and
!KLD_MODULES.

Reviewed by:	kib
2013-05-31 16:00:42 +00:00
Konstantin Belousov
7560005c41 After the object lock was dropped, the object' reference count could
change.  Retest the ref_count and return from the function to not
execute the further code which assumes that ref_count == 1 if it is
not.  Also, do not leak vnode lock if other thread cleared OBJ_TMPFS
flag meantime.

Reported by:	bdrewery
Tested by:	bdrewery, pho
Sponsored by:	The FreeBSD Foundation
2013-05-30 20:00:19 +00:00
Konstantin Belousov
782d4a636b Remove the capitalization in the assertion message. Print the address
of the object to get useful information from optimizated kernels dump.
2013-05-30 19:53:31 +00:00
Attilio Rao
c25673ffd6 o Change the locking scheme for swp_bcount.
It can now be accessed with a write lock on the object containing it OR
  with a read lock on the object containing it along with the swhash_mtx.
o Remove some duplicate assertions for swap_pager_freespace() and
  swap_pager_unswapped() but keep the object locking references for
  documentation.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-28 22:07:23 +00:00
Attilio Rao
83b375ea16 Acquire read lock on the src object for vm_fault_copy_entry().
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-22 15:11:00 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
4fab678be2 Add ddb command 'show pginfo' which provides useful information about
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:04:00 +00:00
Alan Cox
c141ae7f49 Relax the object locking in vm_fault_prefault(). A read lock suffices.
Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-05-17 19:02:36 +00:00
Alan Cox
767a6420bc Relax the object locking assertion in vm_page_lookup(). Now that a radix
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.

Reviewed by:    attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-17 18:49:43 +00:00
Attilio Rao
7e226537c7 o Add accessor functions to add and remove pages from a specific
freelist.
o Split the pool of free pages queues really by domain and not rely on
  definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
  function that is called when calculating the allocation domain.
  The RR counter is kept, currently, per-thread.
  In the future it is expected that such function evolves in a real
  policy decision referee, based on specific informations retrieved by
  per-thread and per-vm_object attributes.
o Add the concept of "probed domains" under the form of vm_ndomains.
  It is responsibility for every architecture willing to support multiple
  memory domains to correctly probe vm_ndomains along with mem_affinity
  segments attributes.  Those two values are supposed to remain always
  consistent.
  Please also note that vm_ndomains and td_dom_rr_idx are both int
  because segments already store domains as int.  Ideally u_int would
  have much more sense. Probabilly this should be cleaned up in the
  future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().

Sponsored by:	EMC / Isilon storage division
Partly obtained from:	jeff
Reviewed by:	alc
Tested by:	jeff
2013-05-13 15:40:51 +00:00
Peter Wemm
df839389c5 Bandaid for compiling with gcc, which happens to be the default compiler
for a number of platforms still.
2013-05-13 07:09:31 +00:00
Alan Cox
404eb1b3fd Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once.  This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.

Sponsored by:	EMC / Isilon Storage Division
2013-05-12 16:50:18 +00:00
Alan Cox
9f2e600890 To reduce the amount of arithmetic performed in the various radix tree
functions, reverse the numbering scheme for the levels.  The highest
numbered level in the tree now appears near the root instead of the leaves.

Sponsored by:	EMC / Isilon Storage Division
2013-05-11 18:01:41 +00:00
Attilio Rao
d0b5855eb2 Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of
MAXMEMDOM.
This unbreak builds.

Sponsored by:	EMC / Isilon storage division
Reported by:	adrian, jeli
2013-05-08 10:55:39 +00:00
Attilio Rao
941646f5ec Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept.  The change should also be useful
for consolidation and consistency.

Sponsored by:	EMC / Isilon storage division
Obtained from:	jeff
Reviewed by:	alc
2013-05-07 22:46:24 +00:00
Alan Cox
bb0e1de4ab Remove a redundant call to panic() from vm_radix_keydiff(). The assertion
before the loop accomplishes the same thing.

Sponsored by:	EMC / Isilon Storage Division
2013-05-07 18:45:34 +00:00
Alan Cox
2d4b9a6438 Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically,
change the way that these functions ascend the tree when the search for a
matching leaf fails at an interior node.  Rather than returning to the root
of the tree and repeating the lookup with an updated key, maintain a stack
of interior nodes that were visited during the descent and use that stack
to resume the lookup at the closest ancestor that might have a matching
descendant.

Sponsored by:	EMC / Isilon Storage Division
Reviewed by:	attilio
Tested by:	pho
2013-05-04 22:50:15 +00:00
John Baldwin
f5c4b077be Fix two bugs in the current NUMA-aware allocation code:
- vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist()
  to allocate a page from a specific freelist.  In the NUMA case it did not
  properly map the public VM_FREELIST_* constants to the correct backing
  freelists, nor did it try all NUMA domains for allocations from
  VM_FREELIST_DEFAULT.
- vm_phys_alloc_pages() did not pin the thread and each call to
  vm_phys_alloc_freelist_pages() fetched the current domain to choose
  which freelist to use.  If a thread migrated domains during the loop
  in vm_phys_alloc_pages() it could skip one of the freelists.  If the
  other freelists were out of memory then it is possible that
  vm_phys_alloc_pages() would fail to allocate a page even though pages
  were available resulting in a panic in vm_page_alloc().

Reviewed by:	alc
MFC after:	1 week
2013-05-03 18:58:37 +00:00
Konstantin Belousov
53f5f8a0e1 Add a hint suggesting why tmpfs does not need a special case there. 2013-05-02 18:35:12 +00:00
Konstantin Belousov
6f2af3fcf3 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
Konstantin Belousov
e5f299ff76 Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode'
v_object of non OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:25:09 +00:00
Konstantin Belousov
9b8851faae Assert that the object type for the vnode' non-NULL v_object, passed
to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was
already reclaimed, OBJT_DEAD.  Note that the later is only possible
due to some filesystems, in particular, nfsiods from nfs clients, call
vnode_pager_setsize() with unlocked vnode.

More, if the object is terminated, do not perform the resizing
operation.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:19:26 +00:00
Konstantin Belousov
6ded84276d Convert panic() into KASSERT().
Reviewed by:	alc
MFC after:	1 week
2013-04-28 18:40:55 +00:00
Alan Cox
82af926a57 Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le().
This call is clearing bits from the key that will be set again by the next
line.

Sponsored by:	EMC / Isilon Storage Division
2013-04-28 08:29:00 +00:00
Alan Cox
40076ebc5c Avoid some lookup restarts in vm_radix_lookup_{ge,le}().
Sponsored by:	EMC / Isilon Storage Division
2013-04-27 16:44:59 +00:00
Gleb Smirnoff
08a3102c0b Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus
isn't yet initialized. Otherwise we will panic at first allocation later.

Sponsored by:	Nginx, Inc.
2013-04-22 09:02:23 +00:00
Alan Cox
384875a3a6 Simplify vm_radix_{add,dec}lev().
Sponsored by:	EMC / Isilon Storage Division
2013-04-22 01:26:13 +00:00
Alan Cox
880659fe81 When calculating the number of reserved nodes, discount the pages that will
be used to store the nodes.

Sponsored by:	EMC / Isilon Storage Division
2013-04-18 05:34:33 +00:00
Alan Cox
a08f2cf69e Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child.  This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes.  Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions.  For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.

Reviewed by:	attilio, jeff (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-04-15 06:12:00 +00:00
Alan Cox
6f9c0b15bb Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child.  This change is the first step in
eliminating those unnecessary level zero interior nodes.  Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.

Reviewed by:	attilio, jeff (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-04-12 20:21:28 +00:00
Gleb Smirnoff
85dcf349c1 Convert UMA code to C99 uintXX_t types. 2013-04-09 17:43:48 +00:00
Gleb Smirnoff
04fc5741e0 Swap us_freecount and us_flags, achieving same structure size
as before previous commit.

Submitted by:	alc
2013-04-09 17:25:15 +00:00
Gleb Smirnoff
8cf455b8d9 Since now we support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but growth isn't
significant. Taking kmem zones as example, only the 32 byte
zone is affected, ipers is reduced from 113 to 112.

In collaboration with:	kib
2013-04-09 15:15:52 +00:00
Gleb Smirnoff
025071f2af Fix KASSERTs: maximum number of items per slab is 256. 2013-04-09 12:20:44 +00:00
Konstantin Belousov
b9781cf650 Fix the assertions for the state of the object under the map entry
with the MAP_ENTRY_VN_WRITECNT flag:
- Move the assertion that verifies the state of the v_writecount and
  vnp.writecount, under the block where the object is locked.
- Check that the object type is OBJT_VNODE before asserting.

Reported by:	avg
Reviewed by:	alc
MFC after:	1 week
2013-04-09 10:04:10 +00:00
Attilio Rao
a15f7df5de The per-page act_count can be made very-easily protected by the
per-page lock rather than vm_object lock, without any further overhead.
Make the formal switch.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-04-08 20:02:27 +00:00
Gleb Smirnoff
ad97af7ebd Merge from projects/counters: UMA_ZONE_PCPU zones.
These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
zone would have a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance would allow us to make some optimizations later.

  To address private item from a CPU simple arithmetics should be used:

  item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

  These arithmetics are available as zpcpu_get() macro in pcpu.h.

  To introduce non-page size slabs a new field had been added to uma_keg
uk_slabsize. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, uma_keg fields
were a bit rearranged and least frequently used uk_name and uk_link moved
down to the fourth cache line. All other fields, that are dereferenced
frequently fit into first three cache lines.

Sponsored by:	Nginx, Inc.
2013-04-08 19:10:45 +00:00
Alan Cox
2c899fede2 Micro-optimize the order of struct vm_radix_node's fields. Specifically,
arrange for all of the fields to start at a short offset from the
beginning of the structure.

Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in
vm_radix_getroot().

Sponsored by:	EMC / Isilon Storage Division
2013-04-07 01:30:51 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Alan Cox
c1c82b36ad Simplify vm_radix_keybarr().
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 18:04:35 +00:00
Alan Cox
72abda6466 Simplify vm_radix_insert().
Reviewed by:	attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 06:02:55 +00:00
Alan Cox
96f1a84272 Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and
vm_radix_topage().  This transformation eliminates some unnecessary
conditional branches from the inner loops of vm_radix_insert(),
vm_radix_lookup{,_ge,_le}(), and vm_radix_remove().

Simplify the control flow of vm_radix_lookup_{ge,le}().

Reviewed by:	attilio (an earlier version)
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-04-03 06:37:25 +00:00
Konstantin Belousov
bafa6cfc93 Release the v_writecount reference on the vnode in case of error,
before the vnode is vput() in vm_mmap_vnode().  Error return means
that there is no use reference on the vnode from the vm object
reference, and failing to restore v_writecount breaks the invariant
that v_writecount is less or equal to the usecount.

The situation observed when nfs client returns ESTALE for
VOP_GETATTR() after the open.

In collaboration with:	pho
MFC after:	1 week
2013-03-28 06:39:27 +00:00
Alan Cox
3fc10b7363 Introduce vm_radix_isleaf() and use it in a couple places. As compared to
using vm_radix_node_page() == NULL, the compiler is able to generate one
less conditional branch when vm_radix_isleaf() is used.  More use cases
involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(),
and vm_radix_remove() will follow.

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-03-26 17:30:40 +00:00
Alan Cox
652615dcb7 Micro-optimize the control flow in a few places. Eliminate a panic call
that could never be reached in vm_radix_insert().  (If the pointer being
checked by the panic call were ever NULL, the immmediately preceding loop
would have already crashed on a NULL pointer dereference.)

Reviewed by:	attilio (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-03-24 16:43:07 +00:00
Konstantin Belousov
7db07e1c85 Only size and create the bio_transient_map when unmapped buffers are
enabled.  Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by:	The FreeBSD Foundation
2013-03-21 07:28:15 +00:00
Konstantin Belousov
6991ee13a6 Fix the logic inversion in the r248512.
Noted by:	mckay
2013-03-20 09:44:23 +00:00
Konstantin Belousov
2cc718a11c Do not map the swap i/o pbufs if the geom provider for the swap
partition accepts unmapped requests.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:39:27 +00:00
Konstantin Belousov
6ce697dc73 Pass unmapped buffers for page in requests if the filesystem indicated support
for the unmapped i/o.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:36:28 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
Attilio Rao
774d251d99 Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
  resident/cached pages collections as the per-vm_page_t splay iterators
  are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations.  In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone.  This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves.  However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan.  It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by:	EMC / Isilon storage division
In collaboration with:	alc, jeff
Tested by:	flo, pho, jhb, davide
Tested by:	ian (arm)
Tested by:	andreast (powerpc)
2013-03-18 00:25:02 +00:00
Konstantin Belousov
70e198dd07 Some style fixes.
Sponsored by:	The FreeBSD Foundation
2013-03-14 20:31:39 +00:00
Konstantin Belousov
e8a4a618cf Add pmap function pmap_copy_pages(), which copies the content of the
pages around, taking array of vm_page_t both for source and
destination.  Starting offsets and total transfer size are specified.

The function implements optimal algorithm for copying using the
platform-specific optimizations.  For instance, on the architectures
were the direct map is available, no transient mappings are created,
for i386 the per-cpu ephemeral page frame is used.  The code was
typically borrowed from the pmap_copy_page() for the same
architecture.

Only i386/amd64, powerpc aim and arm/arm-v6 implementations were
tested at the time of commit. High-level code, not committed yet to
the tree, ensures that the use of the function is only allowed after
explicit enablement.

For sparc64, the existing code has known issues and a stab is added
instead, to allow the kernel linking.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho (i386, amd64), scottl (amd64), ian (arm and arm-v6)
MFC after:	2 weeks
2013-03-14 20:18:12 +00:00
Konstantin Belousov
e7788a47e3 Remove excessive and inconsistent initializers for the various kernel
maps and submaps.

MFC after:	2 weeks
2013-03-14 19:50:09 +00:00
Attilio Rao
4bc80a3402 Simplify vm_page_is_valid().
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-03-12 12:20:49 +00:00
Alan Cox
34496b53ee Update a comment: The object lock is no longer a mutex. 2013-03-09 21:32:24 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Attilio Rao
c934116100 Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object.  This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	alc (vm_radix based version)
Tested by:	flo, pho, jhb, davide
2013-03-09 02:05:29 +00:00
Andre Oppermann
15ae0c9af9 Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.

Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.

kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.

Reviewed by:	davide
2013-03-08 10:37:17 +00:00
Attilio Rao
198da1b2fa Merge from vmcontention:
As vm objects are type-stable there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate() but this could be done in the init UMA zone
handler.

The destructor UMA zone handler, will further check if the condition is
retained at every destruction and catch for bugs.

Sponsored by:	EMC / Isilon storage division
Submitted by:	alc
2013-03-04 13:10:59 +00:00
Alan Cox
55f33f2caf The value held by the vm object's field pg_color is only considered
valid if the flag OBJ_COLORED is set.  Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.

Sponsored by:	EMC / Isilon Storage Division
2013-03-02 18:07:29 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Attilio Rao
dc1558d1cd Merge from vmobj-rwlock:
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-27 18:12:13 +00:00
Attilio Rao
a4915c21d9 Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva().  The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ.  More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
  allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
  serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
  by uma_small_alloc() rather than relying on the KVA / offset
  combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed.  This function is replaced by
   direct calls to mtx_init() as there is no need to export it anymore
   and the calls aren't either homogeneous anymore: there are now small
   differences between arguments passed to mtx_init().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (which also offered almost all the comments)
Tested by:	pho, jhb, davide
2013-02-26 23:35:27 +00:00
Attilio Rao
64a3476f0c Remove white spaces.
Sponsored by:	EMC / Isilon storage division
2013-02-26 20:35:40 +00:00
Attilio Rao
0dde287b20 Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-26 17:22:08 +00:00
Alan Cox
fc23011bc3 On arm, like sparc64, the end of the kernel map varies from one type of
machine to another.  Therefore, VM_MAX_KERNEL_ADDRESS can't be a constant.
Instead, #define it to be a variable, vm_max_kernel_address, just like we
do on sparc64.

Reviewed by:	kib
Tested by:	ian
2013-02-18 01:02:48 +00:00
John Baldwin
174b5f3850 Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel
config file.

Requested by:	phk (ages ago)
MFC after:	1 month
2013-02-14 19:38:04 +00:00
Marius Strobl
94bfd5b1a0 Try to improve r242655 take III: move these SYSCTLs describing the kernel
map, which is defined and initialized in vm/vm_kern.c, to the latter.

Submitted by:	alc
2013-02-04 09:35:48 +00:00
Gleb Smirnoff
3caae6ca60 Fix typo in debug printf. 2013-01-29 19:06:16 +00:00
Andrey Zonov
b3a01bdf1f - Add system wide page faults requiring I/O counter.
Reviewed by:	alc
MFC after:	2 weeks
2013-01-28 12:54:53 +00:00
Andrey Zonov
536368691a - Add sysctls to show number of stats scans.
MFC after:	2 weeks
2013-01-28 12:20:20 +00:00
Andrey Zonov
4a36532940 - Style.
MFC after:	2 weeks
2013-01-28 12:08:29 +00:00
Andrey Zonov
1cc20081df - Get rid of unused function vmspace_wired_count().
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-14 12:12:56 +00:00
Andrey Zonov
cde4a72547 - Improve readability of sys_obreak().
Suggested by:	alc
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-11 09:58:35 +00:00
Andrey Zonov
3ac7d29722 - Reduce kernel size by removing unnecessary pointer indirections.
GENERIC kernel size reduced in 16 bytes and RACCT kernel in 336 bytes.

Suggested by:	alc
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-10 12:43:58 +00:00
Kenneth D. Merry
43ab9660c5 Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.

This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.

The object handle is cast to a struct cdev *, and passed into
dev_rel().

That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device.  The loser of the race
deallocates its object at the end of the function.

The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address.  This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.

The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor().  That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel().  dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.

The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.

vm_object.h:	Add a struct cdev pointer to the VM object
		structure.

device_pager.c:	In cdev_pager_allocate(), populate the new cdev
		pointer.

		In dev_pager_dealloc(), use the new cdev pointer
		when calling the object's cdev_pg_dtor() routine.

Reviewed by:	kib
Sponsored by:	Spectra Logic Corporation
MFC after:	1 week
2013-01-09 16:48:38 +00:00
Gleb Smirnoff
936c747be0 Comment fix: there is no ub_ptr, instead explain meaning of uz_count
field verbally.
2012-12-21 10:09:45 +00:00
Andrey Zonov
7e19eda4aa - Fix locked memory accounting for maps with MAP_WIREFUTURE flag.
- Add sysctl vm.old_mlock which may turn such accounting off.

Reviewed by:	avg, trasz
Approved by:	kib (mentor)
MFC after:	1 week
2012-12-18 07:35:01 +00:00
Alan Cox
2863482058 In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages.  In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by:	kib
2012-12-09 00:32:38 +00:00
Pawel Jakub Dawidek
b0ae014466 White-space cleanups. 2012-12-08 09:23:05 +00:00
Pawel Jakub Dawidek
2f891cd504 Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:27:13 +00:00
Alan Cox
96b0b92ac1 Add support for the (relatively) new object type OBJT_MGTDEVICE to
vm_object_set_memattr().  Also, add a "safety belt" so that
vm_object_set_memattr() doesn't silently modify undefined object types.

Reviewed by:	kib
MFC after:	10 days
2012-11-28 18:29:34 +00:00
Alan Cox
a922d312b0 Make a few small changes to vm_map_pmap_enter():
Add detail to the comment describing this function.  In particular,
describe what MAP_PREFAULT_PARTIAL does.

Eliminate the abrupt change in behavior when the specified address range
grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages.  Instead of
doing nothing, i.e., preloading no mappings whatsoever, map any resident
pages that fall within the start of the specified address range, i.e.,
[addr, addr + ulmin(size, ptoa(MAX_INIT_PT))).

Long ago, the vm object's list of resident pages was not ordered, so
this function had to choose between probing the global hash table of
all resident pages and iterating over the vm object's unordered list of
resident pages.  Now, the list is ordered, so there is no reason for
MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of
resident changes.

MFC after:	14 days
2012-11-25 19:42:36 +00:00
Alan Cox
0d69690e8f Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO
were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the
allocated page when it happened to be pre-zeroed.

MFC after:	5 days
2012-11-21 06:26:18 +00:00
Jaakko Heinonen
02c62349c9 - Don't pass geom and provider names as format strings.
- Add __printflike() attributes.
- Remove an extra argument for the g_new_geomf() call in swapongeom_ev().

Reviewed by:	pjd
2012-11-20 12:32:18 +00:00
Alan Cox
969a0af09d Update a comment to reflect the elimination of the hold queue in r242300. 2012-11-17 04:00:19 +00:00
Konstantin Belousov
43f48b65c0 Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h
to vm/vm_phys.h, where it belongs.

Requested and reviewed by:	alc
MFC after:	2 weeks
2012-11-16 05:55:56 +00:00
Konstantin Belousov
962b064afe Explicitely state that M_USE_RESERVE requires M_NOWAIT, using assertion.
Reviewed by:	alc
MFC after:	2 weeks
2012-11-16 05:49:56 +00:00
Konstantin Belousov
b32ecf44bc Flip the semantic of M_NOWAIT to only require the allocation to not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by:	"Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by:	alc, mdf (previous version)
Tested by:	pho (previous version)
MFC after:	2 weeks
2012-11-14 20:01:40 +00:00
Alan Cox
8d22020384 Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by:	kib
2012-11-13 02:50:39 +00:00
Attilio Rao
2ebcd458e3 Fix DDB command "show map XXX":
- Check that an argument is always available, otherwise current map
  printing before to recurse is garbage.
- Spit out a message if an argument is not provided.
- Remove unread nlines variable.
- Use an explicit recursive function, disassociated from the
  DB_SHOW_COMMAND() body, in order to make clear prototype and recursion
  of the above mentioned function.  The code results now much less
  obscure.

Submitted by:	gianni
2012-11-12 00:30:40 +00:00
Konstantin Belousov
140dedb81c The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Alan Cox
9fc4739d2a In general, we call pmap_remove_all() before calling vm_page_cache(). So,
the call to pmap_remove_all() within vm_page_cache() is usually redundant.
This change eliminates that call to pmap_remove_all() and introduces a
call to pmap_remove_all() before vm_page_cache() in the one place where
it didn't already exist.

When iterating over a paging queue, if the object containing the current
page has a zero reference count, then the page can't have any managed
mappings.  So, a call to pmap_remove_all() is pointless.

Change a panic() call in vm_page_cache() to a KASSERT().

MFC after:	6 weeks
2012-11-01 16:20:02 +00:00
Attilio Rao
4ceaf45de5 Rework the known mutexes to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct mtx_padalign.

The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviwed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
Alan Cox
081a488159 Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose.  When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD.  Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD.  In other words, PQ_HOLD is
used as a flag, not a queue.  So, this change replaces it with a flag.

To accomodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.

Reviewed by:	kib
2012-10-29 06:15:04 +00:00
Edward Tomasz Napierala
a406d8c319 Remove useless check; vm_pindex_t is unsigned on all architectures.
CID:		3701
Found with:	Coverity Prevent
2012-10-28 20:03:57 +00:00
Matthew D Fleming
bb196eb480 Const-ify the zone name argument to uma_zcreate(9).
MFC after:	3 days
2012-10-26 17:51:05 +00:00
Andre Oppermann
25c1e16409 Move the corresponding MTX_SYSINIT() next to their struct mtx declaration
to make their relationship more obvious as done with the other such mutexs.
2012-10-26 17:31:35 +00:00
Konstantin Belousov
ef45823eba Commit the actual text provided by Alan, instead of the wrong update
in r242011.

MFC after:	1 week
2012-10-24 18:32:37 +00:00
Konstantin Belousov
bc79b37f2c Dirty the newly copied anonymous pages after the wired region is
forked. Otherwise, pagedaemon might reclaim the page without saving
its content into the swap file, resulting in the valid content
replaced by zeroes.

Reported and tested by:	pho
Reviewed and comment update by:	alc
MFC after:	1 week
2012-10-24 18:21:59 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Eitan Adler
0b80c1e400 Print flags as hex instead of an integer.
PR:		kern/168210
Submitted by:	linimon
Reviewed by:	alc
Approved by:	cperciva
MFC after:	3 days
2012-10-22 02:11:57 +00:00
Alan Cox
7ecfabc7bb Move vm_page_requeue() to the only file that uses it.
MFC after:	3 weeks
2012-10-13 20:19:43 +00:00
Alan Cox
9af47af64a Eliminate the conditional for releasing the page queues lock in
vm_page_sleep().  vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held.  These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after:	3 weeks
2012-10-13 18:46:46 +00:00
Alan Cox
4db2c4b8c7 Tidy up a bit:
Update some of the comments.  In particular, use "sleep" in preference to
"block" where appropriate.

Eliminate some unnecessary casts.

Make a few whitespace changes for consistency.

Reviewed by:	kib
MFC after:	3 days
2012-10-03 05:06:45 +00:00
Konstantin Belousov
877d24ac8a Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
Alan Cox
1f11f2bff4 Address a race condition that was introduced in r238212. Unless the page
queues lock is acquired before the page lock is released, there is no
guarantee that the page will still be in that same page queue when
vm_page_requeue() is called.

Reported by:		pho
In collaboration with:	kib
MFC after:	3 days
2012-09-23 17:42:39 +00:00
Konstantin Belousov
5f9c767b19 Plug the accounting leak for the wired pages when msync(MS_INVALIDATE)
is performed on the vnode mapping which is wired in other address space.

While there, explicitely assert that the page is unwired and zero the
wire_count instead of substract. The condition is rechecked later in
vm_page_free(_toq) already.

Reported and tested by:	zont
Reviewed by:	alc (previous version)
MFC after:	1 week
2012-09-20 09:52:57 +00:00
Gleb Smirnoff
2864dbbfc1 If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by:	jeff
2012-09-18 20:28:55 +00:00
Eitan Adler
96240c89f0 Correct double "the the"
Approved by:	cperciva
MFC after:	3 days
2012-09-14 21:28:56 +00:00
Andrey Zonov
c4e357e8d3 - Simplify VM code by using vmspace_wired_count() for counting wired
memory of a process.

Reviewed by:	avg
Approved by:	kib (mentor)
MFC after:	2 weeks
2012-09-05 18:19:54 +00:00
Dag-Erling Smørgrav
f379b823bc Whitespace cleanup. 2012-09-05 12:24:50 +00:00
Dag-Erling Smørgrav
dc1b35b525 No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.

(for real this time)
2012-09-04 22:19:33 +00:00
Dag-Erling Smørgrav
22a5e6b972 Revert previous commit, which was performed in the wrong tree. 2012-09-04 21:06:53 +00:00
Dag-Erling Smørgrav
db0390e833 No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.
2012-09-04 19:04:02 +00:00
Andrey Zonov
cfe52ecf0e - After r240026 sgrowsiz should be used in a safer maner.
Approved by:	kib (mentor)
MCF after:	1 week
2012-09-03 09:34:46 +00:00
Andrey Zonov
e145130e71 - Remove accounting of locked memory from vsunlock(9) that I missed in r239818.
Approved by:	kib (mentor)
2012-08-30 08:03:33 +00:00
Andrey Zonov
126a63ce6c - Don't take an account of locked memory for current process in vslock(9).
There are two consumers of vslock(9): sysctl code and drm driver.  These
consumers are using locked memory as transient memory, it doesn't belong
to a process's memory.

Suggested by:	avg
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	2 weeks
2012-08-29 11:23:59 +00:00
Sergey Kandaurov
9462305cbe Typo in previous change: print half the theoretical maximum as maximum
recommended amount.

Reported by:	<site freebsd at orientalsensation com>
Reviewed by:	des
2012-08-27 10:59:49 +00:00
Gleb Smirnoff
42321809c4 Fix function name in keg_cachespread_init() assert. 2012-08-26 09:54:11 +00:00
Dag-Erling Smørgrav
3ff863f1aa - When running out of swzone, instead of spewing an error message every
tick until the situation is resolved (if ever), just print a single
  message when running out and another when space becomes available.

- When adding more swap, warn if the total amount exceeds half the
  theoretical maximum we can handle.
2012-08-16 08:29:49 +00:00
Konstantin Belousov
ee4116b8f7 For old mmap syscall, when executing on amd64 or ia64, enforce the
PROT_EXEC if prot is non-zero, process is 32bit and
kern.elf32.i386_read_exec syscal is enabled. This workaround is needed
for old i386 a.out binaries, where dynamic linker did not specified
PROT_EXEC for mapping of the text.

The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries,
but I reused the existing knob which already has the needed semantic.

MFC after:	1 week
2012-08-14 12:11:48 +00:00
Konstantin Belousov
7707ccabfb Adjust the r205536, by allowing a non-zero offset for anonymous
mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD
1.1.5.1 can issue such requests.

Reported and tested by:	Dan Plassche <dplassche@gmail.com>
MFC after:	1 week
2012-08-14 11:47:07 +00:00
Konstantin Belousov
b6c00483e9 Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by:	alc
MFC after:	1 week
2012-08-14 11:45:47 +00:00
Alan Cox
18f55b0171 Never sleep on busy pages in vm_pageout_launder(), always skip them. Long
ago, sleeping on busy pages in vm_pageout_launder() made sense.  The call
to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages
blocked vm_pageout_launder() until the flush had completed.  However, in
CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was
changed to request synchronous I/O, but the sleep on busy pages was not
removed.
2012-08-07 04:48:14 +00:00
Konstantin Belousov
1c771f9222 After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed.  Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
Konstantin Belousov
0055cbd3c5 Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by:	alc
MFC after:	1 week
2012-08-04 18:16:43 +00:00
Alan Cox
369763e31a Inline vm_page_aflags_clear() and vm_page_aflags_set().
Add comments stating that neither these functions nor the flags that they
are used to manipulate are part of the KBI.
2012-08-03 01:48:15 +00:00
Alan Cox
d26a90a8b2 Eliminate an unneeded declaration. (I should have removed this as part
of r227568.)
2012-07-30 20:38:37 +00:00
Konstantin Belousov
311e34e260 Do not requeue held page or page for which locking failed, just leave
them alone.

Process the act_count updates for the held pages in the vm_pageout
loop over the inactive queue, instead of refusing to do anything with
such page.

Clarify the intent of the addl_page_shortage counter and change its
use for pages which are not processed in the loop according to the
description.

Reviewed by:	alc
MFC after:	2 weeks
2012-07-26 09:06:48 +00:00
Alan Cox
2cba1ccd0e Addendum to r238604. If the inactive queue scan isn't restarted, then
the variable "addl_page_shortage_init" isn't needed.

X-MFC after:	r238604
2012-07-24 02:35:30 +00:00
Konstantin Belousov
d4961bcb3a Do not restart scan of the inactive queue when non-inactive page is
found. Rather, we shall not find such pages on inactive queue at all.

Requested and reviewed by:	alc
MFC after:    2 weeks
2012-07-18 21:47:50 +00:00
Alan Cox
85eeca35b9 Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar
code resides.  Rename vm_contig_grow_cache() to vm_pageout_grow_cache().

Reviewed by:	kib
2012-07-18 05:21:34 +00:00
Alan Cox
da1ab8a4a0 Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP. 2012-07-17 02:36:59 +00:00
Alan Cox
907e4524dc Various improvements to vm_contig_grow_cache(). Most notably, even when
it can't sleep, it can still move clean pages from the inactive queue to
the cache.  Also, when a page is cached, there is no need to restart the
scan.  The "next" page pointer held by vm_contig_launder() is still
valid.  Finally, add a comment summarizing what vm_contig_grow_cache()
does based upon the value of "tries".

MFC after:	3 weeks
2012-07-16 18:13:43 +00:00
Alan Cox
476f7f2423 Correct an off-by-one error in vm_reserv_alloc_contig() that resulted in
the last reservation of a multi-reservation allocation not being
initialized.
2012-07-15 21:46:19 +00:00
Matthew D Fleming
f806cdcf99 Fix a bug with memguard(9) on 32-bit architectures without a
VM_KMEM_MAX_SIZE.

The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit.  The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.

Found by:	Justin Hibbits
2012-07-15 20:29:48 +00:00
Alan Cox
9757857c4f If vm_contig_grow_cache() is allowed to sleep, then invoke the vm_lowmem
handlers.
2012-07-14 20:14:03 +00:00
Alan Cox
0ff0fc84c2 Move kmem_alloc_{attr,contig}() to vm/vm_kern.c, where similarly named
functions reside.  Correct the comment describing kmem_alloc_contig().
2012-07-14 18:10:44 +00:00
Attilio Rao
571a1e92aa Document the object type movements, related to swp_pager_copy(),
in vm_object_collapse() and vm_object_split().

In collabouration with:	alc
MFC after:	3 days
2012-07-11 01:04:59 +00:00
Konstantin Belousov
a4156419ca Avoid vm page queues lock leak after r238212.
Reported and tested by:	Michael Butler <imb protected-networks net>
Reviewed by:	alc
Pointy hat to:	kib
MFC after:	20 days
2012-07-08 18:04:26 +00:00
Konstantin Belousov
48cc2fc774 Drop page queues mutex on each iteration of vm_pageout_scan over the
inactive queue, unless busy page is found.

Dropping the mutex often should allow the other lock acquires to
proceed without waiting for whole inactive scan to finish. On machines
with lot of physical memory scan often need to iterate a lot before it
finishes or finds a page which requires laundring, causing high
latency for other lock waiters.

Suggested and reviewed by:	alc
MFC after:	3 weeks
2012-07-07 19:39:08 +00:00
Eitan Adler
c288b54837 Add missing sleep stat increase
PR:		kern/168211
Submitted by:	linimon
Reviewed by:	alc
Approved by:	cperciva
MFC after:	3 days
2012-07-07 17:46:11 +00:00
Konstantin Belousov
5d10ef2096 Style.
Reviewed by:	alc (previous version)
MFC after:	1 week
2012-07-06 20:13:16 +00:00
John Baldwin
687c94aac9 Honor db_pager_quit in 'show uma' and 'show malloc'.
MFC after:	1 month
2012-07-02 16:14:52 +00:00
Alan Cox
e30df26e7b Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
Attilio Rao
ffdd0c7db3 - Add a comment explaining the locking of the cached pages pool held
by vm_objects.
- Add flags for the per-object lock and free pages queue mutex lock.
  Use the newly added flags to mark the cache root within the vm_object
  structure.

Please note that other vm_object members should be marked with correct
locking but they are left for other commits.

In collabouration with:	alc

MFC after:	3 days3 days3 days
2012-06-22 18:34:11 +00:00
Alan Cox
eddc92918e Selectively inline vm_page_dirty(). 2012-06-20 23:25:47 +00:00
John Baldwin
6fbe60fa8b Move the per-thread deferred user map entries list into a private list
in vm_map_process_deferred() which is then iterated to release map entries.
This avoids having a nested vm map unlock operation called from the loop
body attempt to recuse into vm_map_process_deferred().  This can happen if
the vm_map_remove() triggers the OOM killer.

Reviewed by:	alc, kib
MFC after:	1 week
2012-06-20 18:00:26 +00:00
Attilio Rao
db9ba57895 Do a more targeted check on the page cache and avoid to check the cache
pointer directly in vnode_pager_setsize() by using newly introduced
vm_page_is_cached() function.

Reviewed by:	alc
MFC after:	2 weeks
X-MFC:		r234039,234064
2012-06-16 21:39:00 +00:00
Alan Cox
6031c68de4 The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer.  This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by:	kib
MFC after:	6 weeks
2012-06-16 18:56:19 +00:00
Konstantin Belousov
83ce08538a Use the previous stack entry protection and max protection to correctly
propagate the stack execution permissions when stack is grown down.

First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack
protection for current ABI, so the new stack entry was typically marked
executable always. Second, for non-main stack MAP_STACK mapping,
the PROT_ flags should be used which were specified at the mmap(2) call
time, and not sv_stackprot.

MFC after:	1 week
2012-06-10 11:31:50 +00:00
Eitan Adler
0a4a2b8e62 Revert r236380
PR:		kern/166780
Requested by:	many
Approved by:	cperciva (implicit)
2012-06-01 18:58:50 +00:00
Eitan Adler
71ee98c97c Add sysctl to query amount of swap space free
PR:		kern/166780
Submitted by:	Radim Kolar <hsn@sendmail.cz>
Approved by:	cperciva
MFC after:	1 week
2012-06-01 04:42:52 +00:00
Maksim Yevmenkin
251386b4b2 Tweak condition for disabling allocation from per-CPU buckets in
low memory situation. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stable at a value lower
than cnt.v_free_min and that caused massive performance drop.

Reviewed by:	alc
MFC after:	1 week
2012-05-23 18:56:29 +00:00
Konstantin Belousov
4d34e019c4 Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by:	Andrey Zonov <andrey zonov org>
MFC after:	1 week
2012-05-23 18:10:54 +00:00
Andriy Gapon
b6062382be vm_pager_object_lookup: small performance optimization
do not needlessly lock an object if its handle doesn't match

Reviewed by:	kib, alc
MFC after:	1 week
2012-05-23 12:51:49 +00:00
Andrew Turner
c415e17250 Fix booting on ARM.
In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past
the end of vm_page_array was incorrect causing it to return NULL. This
value is then used in vm_phys_add_page causing a data abort.

Reviewed by:	alc, kib, imp
Tested by:	stas
2012-05-22 07:04:23 +00:00
Nathan Whitehorn
ccc4a5c761 Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies
range operations like pmap_remove() and pmap_protect() as well as allowing
simple operations like pmap_extract() not to involve any global state.
This substantially reduces lock coverages for the global table lock and
improves concurrency.
2012-05-20 14:33:28 +00:00
Konstantin Belousov
df2f557df6 Do not double-reference the found vm object in cdev_pager_lookup().
vm_pager_object_lookup() already referenced the object.

Note that there is no in-tree consumers of cdev_pager_lookup(). The
only known user of the function is i915 gem driver, which is not yet
imported. This should make the KPI change minor.

Submitted by:	avg
MFC after:	1 week
2012-05-18 10:23:47 +00:00
Konstantin Belousov
b7ac5a8571 Add new pager type, OBJT_MGTDEVICE. It provides the device pager
which carries fictitous managed pages. In particular, the consumers of
the new object type can remove all mappings of the device page with
pmap_remove_all().

The range of physical addresses used for fake page allocation shall be
registered with vm_phys_fictitious_reg_range() interface to allow the
PHYS_TO_VM_PAGE() to work in pmap.

Most likely, only i386 and amd64 pmaps can handle fictitious managed
pages right now.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:49:58 +00:00
Konstantin Belousov
b6de32bd9b Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns proper fictitious vm_page_t. The range should be de-registered
after consumer stopped using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.

A hash container might be developed instead of range registration
interface, and fake pages could be put automatically into the hash,
were PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:42:56 +00:00
Konstantin Belousov
e461aae747 Split the code from vm_page_getfake() to initialize the fake page struct
vm_page into new interface vm_page_initfake(). Handle the case of fake
page re-initialization with changed memattr.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:34:22 +00:00
Konstantin Belousov
116c213502 Assert that the page passed to vm_page_putfake() is unmanaged.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:27:51 +00:00
Konstantin Belousov
7900f95d88 Assert that fictitious or unmanaged pages do not appear on
active/inactive lists.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:24:46 +00:00
Konstantin Belousov
13a0b7bcc4 Commit the change forgotten in r235356.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:10:18 +00:00
Konstantin Belousov
0c26bb71f6 Make the vm_page_array_size long. Remove redundand zero initialization
for vm_page_array_size and nearby variablees.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2012-05-12 20:03:06 +00:00
Alan Cox
13458803f4 Give vm_fault()'s sequential access optimization a makeover.
There are two aspects to the sequential access optimization: (1) read ahead
of pages that are expected to be accessed in the near future and (2) unmap
and cache behind of pages that are not expected to be accessed again.  This
revision changes both aspects.

The read ahead optimization is now more effective.  It starts with the same
initial read window as before, but arithmetically grows the window on
sequential page faults.  This can yield increased read bandwidth.  For
example, on one of my machines, a program using mmap() to read a file that
is several times larger than the machine's physical memory takes about 17%
less time to complete.

The unmap and cache behind optimization is now more selectively applied.
The read ahead window must grow to its maximum size before unmap and cache
behind is performed.  This significantly reduces the number of times that
pages are unmapped and cached only to be reactivated a short time later.

The unmap and cache behind optimization now clears each page's referenced
flag.  Previously, in the case of dirty pages, if the containing file was
still mapped at the time that the page daemon examined the dirty pages,
they would be reactivated.

From a stylistic standpoint, this revision also cleanly separates the
implementation of the read ahead and unmap/cache behind optimizations.

Glanced at:	kib
MFC after:	2 weeks
2012-05-10 15:16:42 +00:00
Nathan Whitehorn
0b852c03eb Avoid a lock order reversal in pmap_extract_and_hold() from relocking
the page. This PMAP requires an additional lock besides the PMAP lock
in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.

Suggested by:	kib
MFC after:	4 days
2012-04-22 17:58:30 +00:00
Konstantin Belousov
1472f4f4b9 When MAP_STACK mapping is created, the map entry is created only to
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.

Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.

Reported and tested by:	Sushanth Rai <sushanth_rai yahoo com>
Reviewed by:	alc
MFC after:	1 week
2012-04-21 18:36:53 +00:00
Alan Cox
2aa163dc57 As documented in vm_page.h, updates to the vm_page's flags no longer
require the page queues lock.

MFC after:	1 week
2012-04-21 18:26:16 +00:00
Attilio Rao
a0f2c37b6f - Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by:	alc
MFC after:	2 weeks
X-MFC:		r234039
2012-04-09 17:05:18 +00:00
Alan Cox
1c8279e4e7 Fix mincore(2) so that it reports PG_CACHED pages as resident.
MFC after:	2 weeks
2012-04-08 18:25:12 +00:00
Alan Cox
908e3da10e If a page belonging a reservation is cached, then mark the reservation so
that it will be freed to the cache pool rather than the default pool.
Otherwise, the cached pages within the reservation may be recycled sooner
than necessary.

Reported by:	Andrey Zonov
2012-04-08 17:00:46 +00:00
Attilio Rao
d1aa86e151 Staticize vm_page_cache_remove().
Reviewed by:	alc
2012-04-06 20:34:00 +00:00
Nathan Whitehorn
57bd5cce62 Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction
caches, by invalidating kernel icaches only when needed and not flushing
user caches for shared pages.

Suggested by:	kib
MFC after:	2 weeks
2012-04-06 16:03:38 +00:00
John Baldwin
35818d2e94 Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take.  The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.

Reviewed by:	kib
MFC after:	2 weeks
2012-04-05 17:13:14 +00:00
Kirk McKusick
1faacf5d09 Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib
2012-03-28 20:49:11 +00:00
Alan Cox
5730afc9b6 Handle spurious page faults that may occur in no-fault sections of the
kernel.

When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB.  In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB.  This is exactly as
recommended in AMD's documentation.  In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry.  In short, this saves us from
having to perform potentially costly TLB flushes.  In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry.  Usually, such spurious page faults are handled
by vm_fault() without incident.  However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault().  This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.

In collaboration with:	kib
Tested by:		gibbs (an earlier version)

I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.

MFC after:	1 week
2012-03-22 04:52:51 +00:00
John Baldwin
d6e9b97b3f Bah, just revert my earlier change entirely. (Missed alc's request to do
this earlier.)

Requested by:	alc
2012-03-19 19:06:40 +00:00
John Baldwin
92a5994685 Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB.  Specifically, the inlined version of 'ptoa' of the the 'int'
count of pages overflowed on 64-bit platforms.  While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
2012-03-19 18:47:34 +00:00
John Baldwin
8407f69657 Alter the previous commit to use vm_size_t instead of vm_pindex_t.
vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t,
but a page index instead of a byte offset.
2012-03-19 18:43:44 +00:00
Konstantin Belousov
126d60823a In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propogate write error from the pager back to the callers of
vm_pageout_flush().  Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR:	kern/165927
Reviewed by:	alc
MFC after:	2 weeks
2012-03-17 23:00:32 +00:00
John Baldwin
df96bc9713 Pedantic nit: use vm_pindex_t instead of long for a count of pages. 2012-03-14 20:57:48 +00:00
John Baldwin
b47f624183 Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
Alan Cox
83cbe16ff4 Eliminate stale incorrect ARGSUSED comments.
Submitted by:	bde
2012-03-02 17:33:51 +00:00
Alan Cox
9ed54e79b5 Simplify kmem_alloc() by eliminating code that existed on account of
external pagers in Mach.  FreeBSD doesn't implement external pagers.
Moreover, it don't pageout the kernel object.  So, the reasons for
having code don't hold.

Reviewed by:	kib
MFC after:	6 weeks
2012-02-29 05:41:29 +00:00