Commit Graph

353 Commits

Author SHA1 Message Date
alc
694dd903a3 Change the type of the map entry's next_read field from a vm_pindex_t to a
vm_offset_t.  (This field is used to detect sequential access to the virtual
address range represented by the map entry.)  There are three reasons to
make this change.  First, a vm_offset_t is smaller on 32-bit architectures.
Consequently, a struct vm_map_entry is now smaller on 32-bit architectures.
Second, a vm_offset_t can be written atomically, whereas it may not be
possible to write a vm_pindex_t atomically on a 32-bit architecture.  Third,
using a vm_pindex_t makes the next_read field dependent on which object in
the shadow chain is being read from.

Replace an "XXX" comment.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	EMC / Isilon Storage Division
2016-07-07 20:58:16 +00:00
kib
fb6d455926 Change type of the 'dead' variable to boolean.
Requested by:	alc
MFC after:	1 week
Approved by:	re (gjb)
2016-07-03 00:08:17 +00:00
kib
8ecc9fc9fd If the vm_fault() handler raced with the vm_object_collapse()
sleepable scan, iteration over the shadow chain looking for a page
could find an OBJ_DEAD object.  Such state of the mapping is only
transient, the dead object will be terminated and removed from the
chain shortly.  We must not return KERN_PROTECTION_FAILURE unless the
object type is changed to OBJT_DEAD in the chain, indicating that
paging on this address is really impossible.  Returning
KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered
to processes, or kernel accesses to UVA spuriously failing with
EFAULT.

If the object with OBJ_DEAD flag is found, only return
KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD.
Otherwise, sleep a tick and retry the fault handling.

Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by
waiting until the paging on this object is finished.  But to do so, we
need to reference the dead object, while vm_object_collapse() insists
on owning the final reference on the collapsed object.  This could be
fixed by e.g. changing the assert to shared reference release between
vm_fault() and vm_object_collapse(), but it seems to be too much
complications for rare boundary condition.

PR:	204426
Tested by:    pho
Reviewed by:  alc
Sponsored by: The FreeBSD Foundation
X-Differential revision:	https://reviews.freebsd.org/D6085
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-27 21:54:19 +00:00
alc
2cbd677fae Use vm_page_replace_checked() instead of vm_page_rename() for implementing
optimized copy-on-write faults.  This has two advantages: (1) one less radix
tree operation is performed and (2) vm_page_replace_checked() cannot fail,
making the code simpler.

Submitted by:	Ryan Libby
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4478
2016-05-27 06:05:12 +00:00
alc
392a7dbfbf Correct an error in a comment: One of the conditions for page allocation
is actually the opposite of that stated in the comment.

Remove an unnecessary assignment.  Use an assertion to document the fact
that no assignment is needed.

Rewrite another comment to clarify that the page is not completely valid.

Reviewed by:	kib
2016-05-23 16:59:05 +00:00
alc
1bed8c5452 When descending a shadow chain of objects, it makes no sense to update
the current offset (spelled: "fs.pindex") until it is known whether a
backing object exists.  In fact, if not for the fact that the backing
object offset is zero when there is no backing object, this update would
produce a broken offset.

Reviewed by:	kib
2016-05-21 23:18:23 +00:00
alc
6dc0fa9057 Clean up the handling of errors from vm_pager_get_pages(). Mostly, this
cleanup consists of fixes to comments.  However, there is one change to
code: Remove special-case handling of errors involving the kernel map.
We do not perform I/O on the kernel map, so there is no need for this
special case.

Reviewed by:	kib (an earlier version)
2016-05-19 19:27:33 +00:00
trasz
825d80e01c Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
glebius
63cd1c131a A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
cem
f219735807 vm_fault_hold: handle vm_page_rename failure
On vm_page_rename failure, fix a missing object unlock and a double free of
a page.

First remove the old page, then rename into other page into first_object,
then free the old page.  This avoids the problem on rename failure.  This is
a little ugly but seems to be the most straightforward solution.

Tested with:
  $ sysctl debug.fail_point.uma_zalloc_arg="1%return"
  $ kyua test -k /usr/tests/sys/Kyuafile

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	kib
Seen by:	alc
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4326
2015-12-06 17:46:12 +00:00
alc
57edafe0ad Refinements to r281079's sequential access optimization: Prefetched pages,
which constitute the majority of the pages that are processed by
vm_fault_dontneed(), are already near the tail of the inactive queue.  Only
the pages at faulting virtual addresses are actually moved by
vm_page_advise(..., MADV_DONTNEED).  However, vm_page_advise(...,
MADV_DONTNEED) is simultaneously too aggressive and passive for the moved
pages.  It makes most of these pages too easily reclaimable, and at the same
time it leaves enough pages in the active queue to trigger pageouts by the
page daemon.  Instead, with this change, the pages at faulting virtual
addresses are moved to the tail of the inactive queue, where they are
relatively close to the pages prefetched by the same page fault.

Discussed with:	jeff
Sponsored by:	EMC / Isilon Storage Division
2015-08-03 20:30:27 +00:00
kib
34996cf7cd Do not pretend that vm_fault(9) supports unwiring the address. Rename
the VM_FAULT_CHANGE_WIRING flag to VM_FAULT_WIRE.  Assert that the
flag is only passed when faulting on the wired map entry.  Remove the
vm_page_unwire() call, which should be never reachable.

Since VM_FAULT_WIRE flag implies wired map entry, the TRYPAGER() macro
is reduced to the testing of the fs.object having a default pager.
Inline the check.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
MFC after:	1 week
2015-07-30 18:28:34 +00:00
glebius
519f1ccd36 Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
kib
dfce070d88 Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with
the vnode locked.

Review:	https://reviews.freebsd.org/D2381
Submitted by:	Conrad Meyer, Attilio Rao
MFC after:	1 week
2015-04-28 08:20:23 +00:00
alc
7028695829 Until the lock assertions in vm_page_advise() are properly reevaluated,
vm_fault_dontneed() should acquire a write lock on the first object in
the shadow chain.

Reported by:	gleb, David Wolfskill
2015-04-05 20:07:33 +00:00
alc
e8bbef8d0b Replace vm_fault()'s heuristic for automatic cache behind with a heuristic
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case.  The infrequent case being reading a single file that
is much larger than memory using mmap(2).  And, in this case, the page
daemon isn't the bottleneck; it's the I/O.

In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused.  To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes.  In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times.  With the new heuristic, I see fewer
false positives and they have a much lower cost.

I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic.  However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise().  In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.

Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.

Reviewed by:	jeff, kib
Sponsored by:	EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
alc
b131a2abc8 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
alc
2c4b57d486 Fix the root cause of the "vm_reserv_populate: reserv <address> is already
promoted" panics.  The sequence of events that leads to a panic is rather
long and circuitous.  First, suppose that process P has a promoted
superpage S within vm object O that it can write to.  Then, suppose that P
forks, which leads to S being write protected.  Now, before P's child
exits, suppose that P writes to another virtual page within O.  Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page.  Then, before
P can fault on S, P's child exists.  Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page.  Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way.  However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics!  Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O.  This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted.  Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken.  It makes little sense.  Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.

The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier.  There, we will quietly break the promoted reservation.  While
illogical and unintended, breaking the reservation is essentially
harmless.

PR:		198163
Reviewed by:	kib
Tested by:	pho
X-MFC after:	r267213
Sponsored by:	EMC / Isilon Storage Division
2015-03-19 01:40:43 +00:00
kib
19abfd4698 Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync.  Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied.  Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-28 10:37:23 +00:00
alc
fb32c103c7 Revamp the default page clustering strategy that is used by the page fault
handler.  For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on.  Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.

The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library.  (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.)  Suppose that we have a
code section consisting of 32 pages.  Further, suppose that we access
pages 4, 28, and 16 in that order.  Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages.  In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages.  In general, truncated
clusters are more common than full clusters.

To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size.  This results in many fewer
truncated clusters.  Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will now include page 16.  So, the access to page 16 will
no longer page fault and perform an I/O operation.

Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed.  However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed.  Moreover, the extra resident
pages allowed for many more superpage mappings.  For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%.  Finally, this leads to a small but
measureable reduction in execution time.

In collaboration with:	Emily Pettigrew <ejp1@rice.edu>
Differential Revision:	https://reviews.freebsd.org/D1500
Reviewed by:	jhb, kib
MFC after:	6 weeks
2015-01-16 18:17:09 +00:00
kib
79db3369f9 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
kib
2b6232c7be Make MAP_NOSYNC handling in the vm_fault() read-locked object path
compatible with write-locked path.  Test for MAP_ENTRY_NOSYNC and set
VPO_NOSYNC for pages with dirty mask zero (this does not exclude a
possibility that the page is dirty, e.g. due to read fault on
writeable mapping and consequent write; the same issue exists in the
slow path).

Use helper vm_fault_dirty() to unify fast and slow path handling of
VPO_NOSYNC and setting the dirty mask.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2014-10-10 19:27:36 +00:00
alc
5690b644d4 Relax one of the conditions for mapping a page on the fast path.
Reviewed by:	kib
X-MFC with:	r270011
Sponsored by:	EMC / Isilon Storage Division
2014-08-23 05:24:31 +00:00
kib
ff74afd0bd Implement 'fast path' for the vm page fault handler. Or, it could be
called a scalable path.  When several preconditions hold, the vm
object lock for the object containing the faulted page is taken in
read mode, instead of write, which allows parallel faults processing
in the region.

Namely, the fast path is taken when the faulted page already exists
and does not need copy on write, is already fully valid, and not busy.
For technical reasons, fast path is avoided when the fault is the
first write on the vnode object, or when the fault is for wiring or
debugger read or write.

On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag,
since object lock is kept.  Pmap might fail to create the entry, in
which case the fallback to slow path is performed.

Reviewed by:	alc
Tested by:	pho (previous version)
Hardware provided and hosted by:	The FreeBSD Foundation and
	 Sentex Data Communications
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
2014-08-15 07:30:14 +00:00
alc
e11aaa26b0 Avoid pointless (but harmless) actions on unmanaged pages.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-08-14 15:46:15 +00:00
kib
094158b3f2 Change pmap_enter(9) interface to take flags parameter and superpage
mapping size (currently unused).  The flags includes the fault access
bits, wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep.

For powerpc aim both 32 and 64 bit, fix implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified, in particular, wait for the available memory required to
proceed.

In collaboration with:	alc
Tested by:	nwhitehorn (ppc aim32 and booke)
Sponsored by:	The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after:	2 weeks
2014-08-08 17:12:03 +00:00
alc
f873c17deb Handle wiring failures in vm_map_wire() with the new functions
pmap_unwire() and vm_object_unwire().

Retire vm_fault_{un,}wire(), since they are no longer used.

(See r268327 and r269134 for the motivation behind this change.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-08-02 16:10:24 +00:00
alc
2b42c1ddc8 When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap.  If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time.  Moreover, by operating on
a range, it is superpage friendly.  It doesn't waste time performing
unnecessary demotions.

Reported by:	markj
Reviewed by:	kib
Tested by:	pho, jmg (arm)
Sponsored by:	EMC / Isilon Storage Division
2014-07-26 18:10:18 +00:00
attilio
2802c525ad - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
kib
62a5a6e7ef Remove redundand loop. The inner goto restarts the whole page
handling in the situation identical to the loop condition.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-21 08:19:04 +00:00
kib
ba20870cbd Fix locking. The dst_object must remain locked on the retry of the
loop iteration.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2014-05-11 18:07:07 +00:00
kib
7798d7f7f4 For the upgrade case in vm_fault_copy_entry(), when the entry does not
need COW and is writeable (i.e. becoming writeable due to the
mprotect(2) operation), do not create a new backing object for the
entry.  The caller of the function is vm_map_protect(), the call is
made to ensure that wired entry has all pages resident and wired in
the top level object and to enable the write.  We might need to copy
read-only page from some backing objects into the top object or remap
the page with the write allowed.

This fixes the issue with mishandling of the swap accounting when
read-only wired mapping is upgraded to write-enabled after fork.  The
previous code path did not accounted the new object, but it creation
is redundand anyway and the change provides an optimization for the
non-common situation.

Reported by:	markj
Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 17:03:33 +00:00
kib
c36743d23f When vm_fault_copy_entry() is called from vm_map_protect() for a wired
entry and performs the upgrade of the entry permissions from read-only
to read-write, we must allow to search for the source pages in the
backing object, like we do in the case of forking the read-only wired
entry. For the fork case, the behaviour is allowed by src_readonly
boolean, which in fact is only used to assert that read-write case
provides all source pages in the top-level object.

Eliminate the src_readonly variable.  Allow for the copy loop to look
into the backing objects, add explicit asserts to ensure that only
read-only and upgrade case actually does.

Expand comments. Change the panic call into assert.

Reported by:	markj
Tested by:	markj, pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-27 05:19:01 +00:00
kib
24c4e4a548 Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, for accesses to direct map region should check for the limit by
which direct map is instantiated.

Second, for accesses to the kernel map, success returned from the
kernacc(9) does not guarantee that consequent attempt to read or write
to the checked address succeed, since other thread might invalidate
the address meantime.  Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return error when fault happens on the
MAP_ENTRY_NOFAULT entry, instead of panicing.  The trap handler would
then see a page fault from access, and recover in normal way, making
/dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by:	clusteradm (gjb, sbruno)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 14:25:09 +00:00
alc
43e9da37b5 Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time.
Successful prefaults after a zero-fill fault are extremely rare.
2014-02-09 01:59:52 +00:00
alc
cf63b11b17 Make prefaulting more aggressive on hard faults. Previously, we would only
map a fraction of the pages that were fetched by vm_pager_get_pages() from
secondary storage.  Now, we map them all in order to avoid future soft
faults.  This effect is most evident when a memory-mapped file is accessed
sequentially.  Previously, there were 6 soft faults for every hard fault.
Now, these soft faults are eliminated.

Sponsored by:	EMC / Isilon Storage Division
2014-02-02 20:21:53 +00:00
kib
3c8b0e8428 Revert back to use int for the page counts. In vn_io_fault(), the i/o
is chunked to pieces limited by integer io_hold_cnt tunable, while
vm_fault_quick_hold_pages() takes integer max_count as the upper bound.

Rearrange the checks to correctly handle overflowing address arithmetic.

Submitted by:	bde
Tested by:	pho
Discussed with:	alc
MFC after:	1 week
2013-11-20 08:45:26 +00:00
kib
3cdb6ad1ff Avoid overflow for the page counts.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-12 08:47:58 +00:00
kib
6796656333 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
attilio
e9f37cac74 On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
attilio
16c7563cf4 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
attilio
899ab64514 Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
attilio
19b2ea9f81 The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
kib
bea7bbed5f The vm_fault() should not be allowed to proceed on the map entry which
is being wired now.  The entry wired count is changed to non-zero in
advance, before the map lock is dropped.  This makes the vm_fault() to
perceive the entry as wired, and breaks the fragment which moves the
wire count from the shadowed page, to the upper page, making the code
unwiring non-wired page.

On the other hand, the vm_fault() calls from vm_fault_wire() should be
allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from
vm_fault() when wiring_thread is not current.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:58:28 +00:00
attilio
2df95f4b3a Acquire read lock on the src object for vm_fault_copy_entry().
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-22 15:11:00 +00:00
alc
d5c05f4a92 Relax the object locking in vm_fault_prefault(). A read lock suffices.
Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-05-17 19:02:36 +00:00
attilio
905e648d42 Hide the details for the assertion for VM_OBJECT_LOCK operations.
Rename current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
VM_OBJECT_ASSERT_WLOCKED(foo)

Sponsored by:	EMC / Isilon storage division
Requested by:	alc
2013-02-21 21:54:53 +00:00
attilio
15bf891afe Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
their "write" versions.

Sponsored by:	EMC / Isilon storage division
2013-02-20 12:03:20 +00:00
attilio
658534ed5a Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
  get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
  files require including directly rwlock.h
2013-02-20 10:38:34 +00:00
zont
b5edc96a84 - Add system wide page faults requiring I/O counter.
Reviewed by:	alc
MFC after:	2 weeks
2013-01-28 12:54:53 +00:00