Commit Graph

654 Commits

Author SHA1 Message Date
Mark Johnston
2934eb8a22 Fix a logic error in the item size calculation for internal UMA zones.
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.
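As a rough illustration of why the header matters, the sketch below counts how many items fit in a slab with and without an in-slab header. The 4096-byte slab size and the function name are hypothetical; UMA's real keg layout code is more involved.

```c
#include <assert.h>

/* Hypothetical slab size for illustration; real UMA slab sizes vary. */
#define SLAB_SIZE 4096u

/*
 * Number of items of item_size bytes that fit in one slab.  When the
 * slab header is stored inside the slab (always the case for internal
 * zones, per the commit), the header's bytes are not available for
 * items and must be subtracted first.
 */
static unsigned
items_per_slab(unsigned item_size, unsigned hdr_size, int hdr_in_slab)
{
	unsigned usable = SLAB_SIZE;

	assert(item_size > 0);
	if (hdr_in_slab) {
		assert(hdr_size < SLAB_SIZE);
		usable -= hdr_size;
	}
	return (usable / item_size);
}
```

With a 64-byte header and 1024-byte items, the in-slab layout fits 3 items instead of 4, so a size computation that ignores the header overestimates the slab's capacity.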

Reported and tested by:	ae, rakuco
Reviewed by:	avg
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12342
2017-09-13 15:44:54 +00:00
Konstantin Belousov
93c5d3a46a Add a vm_page_change_lock() helper, containing the common code that avoids
relocking the page lock if both the old and new pages use the same underlying
lock.  Convert existing places to use the helper instead of inlining it.  Use
the optimization in vm_object_page_remove().
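The relocking optimization can be modeled in user space. This is a sketch, not the kernel helper: `toy_mtx`, `page_lockptr()` and `change_lock()` are hypothetical, and page locks are assumed to live in a small array selected by hashing the page, so distinct pages can share one lock.

```c
#include <assert.h>
#include <stddef.h>

/* Toy mutex standing in for a kernel mutex; it only tracks ownership. */
struct toy_mtx {
	int locked;
};

#define NLOCKS 4	/* hypothetical size of the hashed page-lock array */

static struct toy_mtx page_locks[NLOCKS];
static unsigned relocks;	/* number of unlock/lock switches performed */

/* Several pages share each lock: hash the page index into the array. */
static struct toy_mtx *
page_lockptr(unsigned long pindex)
{
	return (&page_locks[pindex % NLOCKS]);
}

/*
 * Make *held the lock of page pindex.  If it already is, do nothing;
 * otherwise drop the old lock (if any) and take the new one.
 */
static void
change_lock(unsigned long pindex, struct toy_mtx **held)
{
	struct toy_mtx *want = page_lockptr(pindex);

	if (*held == want)
		return;
	if (*held != NULL) {
		assert((*held)->locked);
		(*held)->locked = 0;
	}
	assert(!want->locked);
	want->locked = 1;
	*held = want;
	relocks++;
}
```

Walking pages 0, 4 and 1 performs only two lock switches, because pages 0 and 4 hash to the same lock in this model.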

Suggested and reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-09 17:35:19 +00:00
Mark Johnston
f93f7cf199 Speed up vm_page_array initialization.
We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.

Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.
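Bulk insertion into a buddy allocator depends on splitting a contiguous run of pages into naturally aligned power-of-two blocks. A minimal sketch of that split, assuming a hypothetical MAX_ORDER of 10 and page frame numbers as plain integers; it is not the actual vm_phys_free_contig() code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 10	/* hypothetical: largest block is 2^(MAX_ORDER-1) pages */

struct block {
	unsigned long start;	/* first page frame number */
	int order;		/* block covers 2^order pages */
};

/*
 * Split [start, start + npages) into naturally aligned power-of-two
 * blocks, choosing the largest block that is aligned and fits at each
 * step.  Returns the number of blocks written to out.
 */
static size_t
split_contig(unsigned long start, unsigned long npages,
    struct block *out, size_t maxout)
{
	size_t n = 0;

	while (npages > 0) {
		int order = 0;

		/* Grow the block while it stays aligned and still fits. */
		while (order + 1 < MAX_ORDER &&
		    (start & ((1UL << (order + 1)) - 1)) == 0 &&
		    (1UL << (order + 1)) <= npages)
			order++;
		assert(n < maxout);
		out[n].start = start;
		out[n].order = order;
		n++;
		start += 1UL << order;
		npages -= 1UL << order;
	}
	return (n);
}
```

For example, 13 pages starting at frame 3 become three inserts (1 page at 3, 4 pages at 4, 8 pages at 8) instead of 13 single-page inserts.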

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D12248
2017-09-07 21:43:39 +00:00
Mateusz Guzik
fe933c1d88 Start annotating global _padalign locks with __exclusive_cache_line
While these locks are guaranteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separated by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants, so as not to affect other architectures.

MFC after:	1 week
2017-09-06 20:28:18 +00:00
Mark Johnston
33fff5d536 Add vm_page_alloc_after().
This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with the largest index that is smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.

Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().
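The relationship between the two functions can be sketched with a sorted list standing in for the object's pages. All names here are hypothetical; the kernel uses a radix trie plus the object's page queue, and the lookup counter exists only to show which calls are saved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct page {
	unsigned long pindex;
	struct page *next;	/* object's page list, sorted by pindex */
};

struct object {
	struct page *head;
	unsigned lookups;	/* counts predecessor searches */
};

/* Find the resident page with the largest index < pindex (or NULL). */
static struct page *
find_pred(struct object *obj, unsigned long pindex)
{
	struct page *p, *pred = NULL;

	obj->lookups++;
	for (p = obj->head; p != NULL && p->pindex < pindex; p = p->next)
		pred = p;
	return (pred);
}

/* Allocate a page at pindex, linking it after the caller-supplied mpred. */
static struct page *
alloc_after(struct object *obj, unsigned long pindex, struct page *mpred)
{
	struct page *m = malloc(sizeof(*m));

	if (m == NULL)
		return (NULL);
	assert(mpred == NULL || mpred->pindex < pindex);
	m->pindex = pindex;
	if (mpred == NULL) {
		m->next = obj->head;
		obj->head = m;
	} else {
		m->next = mpred->next;
		mpred->next = m;
	}
	return (m);
}

/* The plain allocator becomes a trivial wrapper: find mpred, then delegate. */
static struct page *
alloc(struct object *obj, unsigned long pindex)
{
	return (alloc_after(obj, pindex, find_pred(obj, pindex)));
}
```

A caller filling consecutive indices, as in kmem_back(), already holds the previous page, so only the first allocation needs a lookup.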

Suggested by:	alc
Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11984
2017-08-15 16:39:49 +00:00
Mark Johnston
9df950b35d Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.
This will allow its use in sendfile_swapin().

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:29:22 +00:00
Mark Johnston
2c642ec1e7 Make vm_page_sunbusy() assert that the page is unlocked.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11946
2017-08-10 22:43:38 +00:00
Alan Cox
5471caf6f1 Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices.  Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by:	kib, markj
Tested by:	pho (an earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11926
2017-08-09 04:23:04 +00:00
Alan Cox
1d3b9818e7 In vm_page_ps_test(), always check that the base pages within the specified
superpage all belong to the same object.  To date, that check has not been
needed, but upcoming changes require it.  (See the Differential Revision.)

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 05:54:56 +00:00
Alan Cox
8830260128 Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by:	kib, markj
MFC after:	1 week
2017-07-14 02:15:48 +00:00
John Baldwin
4bd7e351f1 Fix an off-by-one error in the VM page array on some systems.
r31386 changed how the size of the VM page array was calculated to be
less wasteful.  For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages.  However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array.  Handle this case by explicitly excluding the page from
phys_avail[].
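The arithmetic behind the off-by-one can be sketched as follows, assuming 4 KB pages and a hypothetical 104-byte struct vm_page (the real size depends on the architecture and kernel configuration). Both function names are illustrative.

```c
#include <assert.h>

#define PAGE_SIZE	4096UL
#define VM_PAGE_BYTES	104UL	/* hypothetical sizeof(struct vm_page) */

/*
 * Each usable page costs PAGE_SIZE bytes of data plus one vm_page
 * structure, so `bytes` of memory can support this many pages.
 */
static unsigned long
page_array_pages(unsigned long bytes)
{
	return (bytes / (PAGE_SIZE + VM_PAGE_BYTES));
}

/*
 * If at least PAGE_SIZE bytes remain after carving out the pages and
 * their structures, a whole page of data is still present but has no
 * vm_page entry, so it must be removed from phys_avail[].
 */
static int
leftover_needs_trim(unsigned long bytes)
{
	unsigned long npages = page_array_pages(bytes);
	unsigned long rem = bytes - npages * (PAGE_SIZE + VM_PAGE_BYTES);

	return (rem >= PAGE_SIZE);
}
```

With 46146 bytes, 10 pages fit (42000 bytes) and 4146 bytes remain: enough for another page of data but not for its struct vm_page, which is exactly the case the fix excludes.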

Reviewed by:	alc
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D11000
2017-06-08 16:18:41 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
  in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
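The early-boot trick amounts to pointing every counter at a harmless scratch slot until real storage exists. A single-threaded sketch that ignores the per-CPU nature of counter(9); `early_dummy_counter`, `v_faults` and `counters_startup()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for pcpu[0].pc_early_dummy_counter: shared scratch space. */
static uint64_t early_dummy_counter;

/* A statistic starts out pointing at the dummy slot, so writes are safe. */
static uint64_t *v_faults = &early_dummy_counter;

static void
counter_add(uint64_t *c, uint64_t n)
{
	*c += n;
}

/* Run once the allocator works (counter_u64_alloc() in the kernel). */
static int
counters_startup(void)
{
	uint64_t *c = calloc(1, sizeof(*c));

	if (c == NULL)
		return (-1);
	v_faults = c;	/* later updates go to real storage */
	return (0);
}
```

In this model, updates made before counters_startup() land in the shared dummy slot and are not carried over; the point is only that they never write through an invalid pointer.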

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter is now hidden under _KERNEL; vmstat(1) is the only exception.

This is based on benno@'s 4-year-old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Alan Cox
8a99f1cc59 Over the years, the code and comments in vm_page_startup() have diverged in
one respect.  When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory.  This revision
changes the code to match the comments.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9081
2017-02-04 05:23:10 +00:00
Mark Johnston
c2655a40a7 Avoid unnecessary page lookups in vm_object_madvise().
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9098
2017-01-15 03:50:08 +00:00
Gleb Smirnoff
bfc8c24c73 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
Mark Johnston
b1fd102ee7 Add a page queue for holding dirty anonymous unswappable pages.
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.
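The policy can be summarized as two small decisions. This is a hypothetical model of the commit message, not kernel code; the enum values are illustrative, and it assumes that with swap configured a failed laundering still reactivates the page as before.

```c
#include <assert.h>
#include <stdbool.h>

enum queue { PQ_ACTIVE, PQ_INACTIVE, PQ_LAUNDRY, PQ_UNSWAPPABLE };

/* Where an anonymous page goes after a failed laundering attempt. */
static enum queue
after_failed_launder(bool swap_configured)
{
	return (swap_configured ? PQ_ACTIVE : PQ_UNSWAPPABLE);
}

/* The page daemon skips PQ_UNSWAPPABLE until a swap device exists. */
static bool
should_scan(enum queue q, bool swap_configured)
{
	return (q != PQ_UNSWAPPABLE || swap_configured);
}
```

Parking such pages in a queue that is never scanned is what removes them from the page daemon's repeated, futile work.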

Reviewed by:	alc
2017-01-03 00:05:44 +00:00
Konstantin Belousov
0c8bd6a7d8 Assert that the pages found on the object queue by vm_page_next() and
vm_page_prev() have correct ownership.

In collaboration with:	alc
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2016-12-30 17:37:06 +00:00
Alan Cox
920da7e4d2 Relax the object type restrictions on vm_page_alloc_contig(). Specifically,
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.

Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc().  Also, eliminate the radix trie lookup performed with the
free page queues lock held.

Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8878
2016-12-28 18:32:13 +00:00
Alan Cox
3453bca864 Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
Alan Cox
e94965d82e Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.
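The new contract can be sketched with a small array standing in for the radix trie; `radix_remove()` and the array are hypothetical. The caller no longer needs a lookup first, because a miss simply returns NULL.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 16	/* toy index space standing in for the radix trie */

struct page {
	unsigned long pindex;
};

static struct page *trie[NSLOTS];

/* Remove and return the page stored at index, or NULL if there is none. */
static struct page *
radix_remove(unsigned long index)
{
	struct page *m;

	assert(index < NSLOTS);
	m = trie[index];
	trie[index] = NULL;
	return (m);
}
```

A pmap that previously wrote "if (lookup(i) != NULL) remove(i);" can now write just "m = remove(i);" and test m.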

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
Alan Cox
ba67369628 Recursion on the free page queue mutex occurred when UMA needed to allocate
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache().  Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc().  This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.

Improve nearby comments.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8628
2016-11-27 01:42:53 +00:00
Alan Cox
bba39b9ae3 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Alan Cox
ebcddc7217 Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specifically, dirty pages that have passed once through the inactive
queue.  A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them.  The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time.  In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low.  Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system.  Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages.  This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHED pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s).  A request is raised by setting the variable
vm_laundry_request and waking the laundry thread.  We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering").  When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering.  If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s.  Otherwise, the laundry thread goes back
to sleep without doing any work.  When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.
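The commit does not give the exact background-laundering formula, so the trigger below is a hypothetical linear reading of "large enough relative to the ratio of the laundry and inactive queue sizes"; the kernel's actual test may scale the wakeup count differently. The rate helper just converts vm_background_launder_rate-style KB/s into pages per second.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical trigger: launder once the number of page daemon wakeups
 * since the last laundering, multiplied by the laundry queue length,
 * reaches the inactive queue length.  A laundry queue that is large
 * relative to the inactive queue needs fewer wakeups before laundering.
 */
static bool
should_background_launder(unsigned long wakeups, unsigned long nlaundry,
    unsigned long ninactive)
{
	if (wakeups == 0 || nlaundry == 0)
		return (false);
	return (nlaundry * wakeups >= ninactive);
}

/* Convert a laundering rate in KB/s into pages per second. */
static unsigned long
launder_pages_per_sec(unsigned long rate_kbps, unsigned long page_kb)
{
	return (rate_kbps / page_kb);
}
```

With 100 laundry pages and 1000 inactive pages, this sketch waits for 10 wakeups before laundering; at 4096 KB/s and 4 KB pages it would then launder 1024 pages per second.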

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target.  In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced.  In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache().  The new meaning
of vm_cnt.v_reactivated now better reflects its name.  It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with:	markj
Reviewed by:	kib
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
Konstantin Belousov
1771e987ca Do not sleep in vm_wait() if the pagedaemon has not yet started. Panic instead.
Requests which cannot be satisfied by allocators at boot time often
have unrealizable parameters.  Waiting for the pagedaemon's start would
hang the boot if done in the thread0 context and just never succeed if
executed from another thread.  In fact, for very early stages, a sleep
attempt panics with an obscure diagnostic about the scheduler state, and
an explicit panic in vm_wait() makes the investigation much shorter by
cutting off the examination of the thread and scheduler.

Theoretically, some subsystem might grab a resource to exhaustion, and
free it later in the boot process.  If this unlikely scenario does
appear for real, the way to diagnose the trouble can be revisited.

Reported by:	emaste
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8421
2016-11-04 12:58:50 +00:00
Konstantin Belousov
bd9546a21c Export vm_page_xunbusy_maybelocked().
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D8197
2016-10-17 08:14:23 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(m, "vmopax");
		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there are still no waiters recorded for the busy state,
the xbusy owner did not acquire the page lock, so it proceeded.

Moreover, suppose that some other thread happens to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update the check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.
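The corrected test can be modeled as a pure function of the loaded busy_lock word. The bit layout is modeled on vm_page.h of that era and should be treated as an assumption; `must_sleep()` and its `xbusy_only` flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Busy-lock word layout (assumed, modeled on vm_page.h). */
#define VPB_BIT_SHARED		0x01u
#define VPB_BIT_EXCLUSIVE	0x02u
#define VPB_BIT_WAITERS		0x04u
#define VPB_SHARERS_SHIFT	3
#define VPB_SHARERS_WORD(x)	((x) << VPB_SHARERS_SHIFT | VPB_BIT_SHARED)
#define VPB_UNBUSIED		VPB_SHARERS_WORD(0)

/*
 * Decide whether to sleep given the busy_lock value just loaded.
 * xbusy_only is true when the caller only waits for the exclusive busy
 * to pass and can accept a shared-busied page.
 */
static bool
must_sleep(unsigned busy_lock, bool xbusy_only)
{
	if (busy_lock == VPB_UNBUSIED)
		return (false);
	if (xbusy_only && (busy_lock & VPB_BIT_SHARED) != 0)
		return (false);	/* xbusy already passed; sharers are fine */
	return (true);
}
```

The race in the commit corresponds to loading VPB_SHARERS_WORD(1): the old equality-only check would sleep there, while this version returns immediately for an xbusy-only waiter.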

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads; right now the only way to have more than one sbusy owner
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Konstantin Belousov
267ed8e2f7 When downgrading an exclusively busied page to the shared-busy state,
wake up waiters.  Otherwise, owners of the shared-busy state are left blocked
and might get into a deadlock.

Note that the vm_page_busy_downgrade() function is not used in the
tree right now.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8195
2016-10-11 18:09:37 +00:00
Alan Cox
70cf3ced3c Make the page daemon's notion of what kind of pass is being performed
by vm_pageout_scan() local to vm_pageout_worker().  There is no reason
to store the pass in the NUMA domain structure.

Reviewed by:	kib
MFC after:	3 weeks
2016-10-05 17:32:06 +00:00
Mark Johnston
dbbaf04f1e Remove support for idle page zeroing.
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by:	alc, imp, kib
Differential Revision:	https://reviews.freebsd.org/D7714
2016-09-03 20:38:13 +00:00
Mark Johnston
915d1b71cd Restore swap pager readahead after r292373.
The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.
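The range extension can be sketched over a per-index flag array. This is a hypothetical helper, not swap_pager_haspage(): `readable[i]` is assumed to mean "not resident and present in swap", and the search stops at the first gap or after `max` pages in each direction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Extend the requested range [first, last] backward and forward while
 * readable[] stays set, up to max pages each way.  The results are the
 * readbehind (*before) and readahead (*after) counts.
 */
static void
contig_range(const bool *readable, size_t n, size_t first, size_t last,
    size_t max, size_t *before, size_t *after)
{
	size_t b = 0, a = 0;

	assert(first <= last && last < n);
	while (b < max && first > b && readable[first - b - 1])
		b++;
	while (a < max && last + a + 1 < n && readable[last + a + 1])
		a++;
	*before = b;
	*after = a;
}
```

For flags {1,0,1,1,1,1,1,0} and a fault at index 4, the sketch reports 2 pages of readbehind and 2 of readahead, so one I/O can bring in indices 2 through 6.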

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7677
2016-08-30 05:56:21 +00:00
Mark Johnston
842ee21e20 Strengthen assertions about the busy state of newly-allocated pages.
Reviewed by:	alc
MFC after:	1 week
2016-08-13 19:49:32 +00:00
Mark Johnston
897d0c6617 Use vm_page_undirty() instead of manually setting a page field.
Reviewed by:	alc
MFC after:	3 days
2016-07-29 21:05:37 +00:00
Mark Johnston
efe1ff4cf0 Update a comment in vm_page_advise() to match behaviour after r290529.
Reviewed by:	alc
MFC after:	3 days
2016-07-23 21:02:36 +00:00
Colin Percival
34caa842a4 Autotune the number of pages set aside for UMA startup based on the number
of CPUs present.  On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory:  UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg.  If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64).  Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger.  It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.
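The boot_pages computation described above reduces to a small function. The two constants come from the commit message; the function itself is an illustrative sketch and omits the vm.boot_pages override.

```c
#include <assert.h>

#define UMA_BOOT_PAGES		64	/* default, per the commit message */
#define UMA_BOOT_PAGES_ZONES	32	/* per the commit message */

/*
 * Keep the fixed default when a zone structure fits in one page;
 * otherwise reserve room for UMA_BOOT_PAGES_ZONES zones of
 * pages_per_zone pages each.
 */
static int
boot_pages(int pages_per_zone)
{
	if (pages_per_zone <= 1)
		return (UMA_BOOT_PAGES);
	return (UMA_BOOT_PAGES_ZONES * pages_per_zone);
}
```

Two pages per zone gives 32 * 2 = 64, the same as the default, which is why systems whose zone structure fits in 2 or fewer pages see no change. Each further page per zone adds 32 pages, i.e. 128 kB with 4 kB pages, matching the commit's estimate.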

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on:	EC2 x1.32xlarge
Relnotes:	FreeBSD can now boot on 92+ CPU systems without requiring
		vm.boot_pages to be manually adjusted.
Reviewed by:	jeff, alc, adrian
Approved by:	re (kib)
2016-07-07 18:37:12 +00:00
Konstantin Belousov
35e8002c58 In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no
waiters exist, same as for vm_page_xunbusy().  If previous value of
busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is
not needed.

Move common code from vm_page_xunbusy_maybelocked() and
vm_page_xunbusy_hard() to vm_page_xunbusy_locked().
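The fast path amounts to a single compare-and-swap from VPB_SINGLE_EXCLUSIVER to VPB_UNBUSIED. A C11 sketch; the bit values are modeled on vm_page.h of that era and are assumptions, and the wakeup counter stands in for the kernel's wakeup() call.

```c
#include <assert.h>
#include <stdatomic.h>

/* Encodings (assumed, modeled on vm_page.h). */
#define VPB_BIT_SHARED		0x01u
#define VPB_BIT_EXCLUSIVE	0x02u
#define VPB_BIT_WAITERS		0x04u
#define VPB_SINGLE_EXCLUSIVER	VPB_BIT_EXCLUSIVE
#define VPB_UNBUSIED		VPB_BIT_SHARED	/* zero sharers */

static unsigned wakeups;	/* counts slow-path wakeups in this model */

/*
 * Release an exclusive busy.  If the word still equals
 * VPB_SINGLE_EXCLUSIVER, no waiter set the waiters bit, so one CAS
 * suffices and no wakeup is needed.
 */
static void
xunbusy(atomic_uint *busy_lock)
{
	unsigned expected = VPB_SINGLE_EXCLUSIVER;

	if (atomic_compare_exchange_strong(busy_lock, &expected,
	    VPB_UNBUSIED))
		return;
	/* Slow path: waiters are recorded; unbusy and wake them. */
	assert((expected & VPB_BIT_WAITERS) != 0);
	atomic_store(busy_lock, VPB_UNBUSIED);
	wakeups++;
}
```

In the common uncontended case, the sketch performs no locking and no wakeup at all.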

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-23 08:28:13 +00:00
Mark Johnston
0a1dc6e23c Reset the page busy lock state after failing to insert into the object.
Freeing a shared-busy page is not permitted.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D6670
2016-06-02 17:11:24 +00:00
Mark Johnston
e705296958 Don't preserve the page's object linkage in vm_page_insert_after().
Per the KASSERT at the beginning of the function, we expect that the page
does not belong to any object, so its object and pindex fields are
meaningless. Reset them in the rare case that vm_radix_insert() fails.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D6669
2016-06-02 16:58:47 +00:00
Konstantin Belousov
e5f0191f20 If the fast path unbusy in vm_page_replace() fails, slow path needs to
acquire the page lock, which recurses.  Avoid the recursion by reusing
the code from vm_page_remove() in a new helper
vm_page_xunbusy_maybelocked().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2016-06-01 20:39:00 +00:00
Alan Cox
56ce06907c The flag "vm_pages_needed" has long served two distinct purposes: (1) to
indicate that threads are waiting for free pages to become available and
(2) to indicate whether a wakeup call has been sent to the page daemon.
The trouble is that a single flag cannot really serve both purposes, because
we have two distinct targets for when to wake up threads waiting for free
pages versus when the page daemon has completed its work.  In particular,
the flag will be cleared by vm_page_free() before the page daemon has met
its target, and this can lead to the OOM killer being invoked prematurely.
To address this problem, a new flag "vm_pageout_wanted" is introduced.

Discussed with:	jeff
Reviewed by:	kib, markj
Tested by:	markj
Sponsored by:	EMC / Isilon Storage Division
2016-05-27 19:15:45 +00:00
Konstantin Belousov
0e38422096 In vm_page_cache(), only drop the vnode after radix insert failure
for empty page cache when the object type is OBJT_VNODE.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-24 19:20:30 +00:00
Konstantin Belousov
30a8a5f7a6 In vm_page_alloc_contig(), on vm_page_insert() failure, mark each
freed page as VPO_UNMANAGED.  Otherwise vm_page_free_toq() insists on
owning the page lock.

Previously, VPO_UNMANAGED was only set up to the last processed page.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-24 10:21:39 +00:00
Conrad Meyer
5a2e650a36 vm/vm_page.h: Fix trivial '-Wpointer-sign' warning
pq_vcnt, as a count of real things, has no business being negative.  It is only
ever initialized by a u_int counter.

The warning came from the atomic_add_int() in vm_pagequeue_cnt_add().

Rectify the warning by changing the variable to u_int.  No functional change.

Suggested by:	Clang 3.3
Sponsored by:	EMC / Isilon Storage Division
2016-05-19 17:54:14 +00:00
John Baldwin
0ef149024f Don't require write locks on the VM object for vm_page_prev/next.
Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-04-29 17:35:28 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
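For reference, a sketch of the two macros using the definitions found in sys/param.h. They are only valid when y is a power of two, which is what makes the mask trick work.

```c
#include <assert.h>

/* Power-of-two rounding; y must be a power of two. */
#define rounddown2(x, y)	((x) & (~((y) - 1)))
#define roundup2(x, y)		(((x) + ((y) - 1)) & (~((y) - 1)))
```

A typical conversion replaces an open-coded mask such as "addr & ~(PAGE_SIZE - 1)" with "rounddown2(addr, PAGE_SIZE)", which states the intent directly.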
2016-04-21 19:57:40 +00:00
Gleb Smirnoff
b28cc462ad Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Alan Cox
c869e67208 Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
   memory allocator's free page lists.  This replaces the long-standing
   approach of scanning the inactive and active queues, converting clean
   pages into PG_CACHED pages and laundering dirty pages.  In contrast, the
   new mechanism does not use PG_CACHED pages nor does it trigger a large
   number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
   free pages in the physical memory allocator's free page lists that are
   covered by the direct map.  Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
   free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places.  For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases).  Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by:	kib, markj
Glanced at:	jhb
Discussed with:	jeff
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4444
2015-12-19 18:42:50 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strengthening the promises on the busy state of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Conrad Meyer
5e09bdc821 vm_page_replace: remove redundant radix lookup
Remove redundant lookup of the old page from vm_page_replace.  Verification
that the old page exists is already done by vm_radix_replace.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Follow-up to:	https://reviews.freebsd.org/D4326
Differential Revision:	https://reviews.freebsd.org/D4471
2015-12-10 22:57:27 +00:00
Mark Johnston
7e78597f04 Ensure that deactivated pages that are not expected to be reused are
reclaimed in FIFO order by the pagedaemon.  Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.
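The commit message does not describe the mechanism, so this is only one way to get FIFO order while still placing such pages ahead of ordinary inactive pages: insert them just before a marker that stays at the front of the queue. The list code is a hypothetical stand-in for the kernel's queue macros.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int id;
	struct node *prev, *next;
};

/* Circular doubly linked list with a sentinel; scanning starts at head. */
static void
list_init(struct node *sentinel)
{
	sentinel->prev = sentinel->next = sentinel;
}

/* Link n immediately before pos. */
static void
insert_before(struct node *pos, struct node *n)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

/* Record the ids in scan order, skipping the marker. */
static int
scan(const struct node *q, const struct node *marker, int *out, int max)
{
	int n = 0;

	for (const struct node *p = q->next; p != q && n < max; p = p->next)
		if (p != marker)
			out[n++] = p->id;
	return (n);
}
```

Inserting at the very head instead would reverse the no-reuse pages (3, 2, 1), which is the LIFO order the commit removes.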

Reviewed by:	alc
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-11-08 01:36:18 +00:00
Alan Cox
bc7275964c Reduce the scope of a variable to the only file where it is used.
2015-10-03 19:27:52 +00:00
Mark Johnston
3138cd3670 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
Alan Cox
15aaea7892 Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.
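A sketch of the revised contract, assuming a simplified page that carries only a wire count and a queue index. `page_unwire()` and the `PQ_INACTIVE`/`PQ_ACTIVE` values are hypothetical, not the kernel's definitions. The function reports whether the count reached zero, and `PQ_NONE` suppresses the enqueue so that a caller about to free the page skips the wasted work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue indices; the real values differ. */
enum { PQ_NONE = -1, PQ_INACTIVE = 0, PQ_ACTIVE = 1 };

struct page {
	unsigned int wire_count;
	int queue;		/* PQ_NONE when not on a paging queue */
};

/*
 * Drop one wiring.  Returns true iff the wire count transitioned to zero.
 * Only in that case, and only if queue is not PQ_NONE, is the page placed
 * on the requested queue.
 */
bool
page_unwire(struct page *m, int queue)
{
	assert(m->wire_count > 0);
	if (--m->wire_count > 0)
		return (false);
	if (queue != PQ_NONE)
		m->queue = queue;
	return (true);
}
```

A caller in the style of vfs_vmio_release() could then write `if (page_unwire(m, PQ_NONE)) free_the_page(m);`, where `free_the_page()` is likewise hypothetical.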

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 18:16:52 +00:00
Alan Cox
c9af644e5c Eliminate (many) unnecessary calls to pmap_remove_all(). Pages from objects
with a reference count of zero can't possibly be mapped, so there is never a
need for vm_page_set_invalid() to call pmap_remove_all() on them.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-09-17 22:28:38 +00:00
Mark Johnston
d73ce4c698 Remove the v_cache_min and v_cache_max sysctls. They are unused and have
no effect.

Reviewed by:	alc
Sponsored by:	EMC / Isilon Storage Division
2015-09-11 03:00:20 +00:00
Mark Johnston
c25fabea97 Remove weighted page handling from vm_page_advise().
This was added in r51337 as part of the implementation of
madvise(MADV_DONTNEED).  Its objective was to ensure that the page daemon
would eventually reclaim other unreferenced pages (i.e., unreferenced pages
not touched by madvise()) from the active queue.

Now that the pagedaemon performs steady scanning of the active page queue,
this weighted handling is unnecessary.  Instead, always "cache" clean pages
by moving them to the head of the inactive page queue.  This simplifies the
implementation of vm_page_advise() and eliminates the fragmentation that
resulted from the distribution of pages among multiple queues.

Suggested by:	alc
Reviewed by:	alc
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3401
2015-08-28 00:44:17 +00:00
Andrew Turner
52afd687c3 Add the kernel support for minidumps on arm64.
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3318
2015-08-20 12:49:56 +00:00
Alan Cox
966272ca33 Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHED pages.
Differential Revision:	https://reviews.freebsd.org/D2712
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-08 04:59:32 +00:00
Alan Cox
c4e49ba477 Document vm_page_alloc_contig()'s support for the VM_ALLOC_NODUMP option.
MFC after:	3 days
2015-05-30 23:37:47 +00:00
Konstantin Belousov
2c20bd8b99 Fix grammar in the comment and record the correct commit message for
r283162.

Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq()
with a page that does not completely satisfy the vm_page_free() assertions.
The page is not owned by the object, since insertion failed.  But
besides resetting m->object to NULL, we should also set the VPO_UNMANAGED
flag for consistency.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:15:56 +00:00
Konstantin Belousov
da47499040 Remove the write-only variable phent. We currently do not check the
size of the program header's entries.

Reported by:	adrian (by using gcc 4.9)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:03:22 +00:00
John Baldwin
ed95805e90 Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains.  Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead.  Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
  components for PV kernels.  These components are now standard as they
  are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
  etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision:	https://reviews.freebsd.org/D2362
Reviewed by:	royger
Tested by:	royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes:	yes
2015-04-30 15:48:48 +00:00
Scott Long
affc4a4bff Improve support for blacklisting bad memory locations. The user can supply
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf.  The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list.  This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system.  Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character.  The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.
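The message does not show the parser, so the following is only a userland sketch of how such input could be normalized: entries are parsed (hex with a 0x prefix, decimal otherwise), sorted and deduplicated once, after which each membership test is a binary search. This is one way to tolerate unsorted input with duplicates and is not necessarily how the kernel implements its single scan; `parse_blacklist()` and `is_blacklisted()` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

static int
cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return ((x > y) - (x < y));
}

/*
 * Parse addresses separated by whitespace (including newlines) or commas.
 * Entries with a 0x/0X prefix are read as hex, all others as decimal.
 * Parsing stops at the first non-numeric entry.  The result is sorted
 * and deduplicated in place; the number of unique entries is returned.
 */
size_t
parse_blacklist(const char *s, uint64_t *out, size_t max)
{
	size_t n = 0, i, u;
	char *end;

	assert(s != NULL && (out != NULL || max == 0));
	while (*s != '\0' && n < max) {
		if (isspace((unsigned char)*s) || *s == ',') {
			s++;
			continue;
		}
		int base = (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) ? 16 : 10;
		uint64_t v = strtoull(s, &end, base);
		if (end == s)
			break;
		out[n++] = v;
		s = end;
	}
	if (n > 1)
		qsort(out, n, sizeof(out[0]), cmp_u64);
	/* Drop duplicates, which are now adjacent. */
	for (i = 0, u = 0; i < n; i++)
		if (u == 0 || out[i] != out[u - 1])
			out[u++] = out[i];
	return (u);
}

/* With the list sorted once, each lookup is a binary search. */
int
is_blacklisted(const uint64_t *list, size_t n, uint64_t pa)
{
	return (bsearch(&pa, list, n, sizeof(list[0]), cmp_u64) != NULL);
}
```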

Reviewed by:	imp, emax
Obtained from:	Netflix, Inc.
MFC after:	3 days
2015-04-29 15:57:14 +00:00
Rui Paulo
b575067a21 Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH.
Requested by:	julian
2015-03-26 05:20:18 +00:00
Rui Paulo
57e5a8b184 Use TUNABLE_INT_FETCH for boot_pages.
vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialised.  We manually fetch the
variable from the environment to work around this problem.

Tested by:	Keith White kwhite at uottawa.ca
MFC after:	1 week
2015-03-24 20:09:55 +00:00
Rui Paulo
b0bce0aef2 Remove whitespace. 2015-03-24 20:07:27 +00:00
Gleb Smirnoff
e3ed82bcf7 Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and
allows the function to fail.

Reviewed by:	kib, alc
Sponsored by:	Nginx, Inc.
2014-12-22 09:02:21 +00:00
Gleb Smirnoff
6ee80f259c Do not clear flag that vm_page_alloc() doesn't support.
Submitted by:	kib
2014-12-22 09:00:47 +00:00
Alan Cox
271f0f1219 Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default
on i386 PAE.  Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings.  To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().

Discussed with:	Svatopluk Kraus
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-11-15 23:40:44 +00:00
Alan Cox
5e929009d2 Eliminate a stale, i386-specific comment. 2014-11-04 18:52:59 +00:00
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the reusability of kernel code
in userland, rename the in-kernel getenv()/setenv() to kern_getenv()/kern_setenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
John Baldwin
1a83a822d2 Fix a typo. 2014-08-29 21:20:36 +00:00
Konstantin Belousov
afb69e6b3e Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of
pmap_enter(PMAP_ENTER_NOSLEEP).  The PGA_WRITEABLE flag can be set
when either the page is busied, or the owner object is locked.

Update comments, move all assertions about page state when
PGA_WRITEABLE flag is set, into new helper
vm_page_assert_pga_writeable().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-08-09 05:00:34 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types, both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added for
the case where a tunable sysctl has a custom initialisation function;
it allows the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be factored out into a
common macro, since some device drivers use it. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, because malloc()/free() are
not yet available when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Attilio Rao
3ae10f7477 - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue into which pages that are being unwired should be enqueued.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding pages to and removing pages from them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter is today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  However, __FreeBSD_version will
only be bumped when the full cache of free pages is evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
Alan Cox
dd05fa1945 Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff, the struct pcpu cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Alan Cox
793d14076a In an effort to diagnose possible corruption of struct vm_page on some
sparc64 machines, make the page queue assert in vm_page_dequeue() more
precise.  While I'm here, switch the page lock assert to the newer style.
2014-01-24 19:08:42 +00:00
Alan Cox
000fb817d8 Since the introduction of the popmap to reservations in r259999, there is
no longer any need for the page's PG_CACHED and PG_FREE flags to be set and
cleared while the free page queues lock is held.  Thus, vm_page_alloc(),
vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after
the free page queues lock is released to clear the page's flags.  Moreover,
the PG_FREE flag can be retired.  Now that the reservation system no longer
uses it, its only uses are in a few assertions.  Eliminating these
assertions is no real loss.  Other assertions catch the same types of
misbehavior, like doubly freeing a page (see r260032) or dirtying a free
page (free pages are invalid and only valid pages can be dirtied).

Eliminate an unneeded variable from vm_page_alloc_contig().

Sponsored by:	EMC / Isilon Storage Division
2013-12-31 18:25:15 +00:00
Alan Cox
703b304f33 Eliminate a redundant parameter to vm_radix_replace().
Improve the wording of the comment describing vm_radix_replace().

Reviewed by:	attilio
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2013-12-08 20:07:02 +00:00
Konstantin Belousov
9eab548476 PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store a pointer to the slab. Remove it.

Reviewed by:    alc
Sponsored by:   The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-17 07:35:26 +00:00
Konstantin Belousov
3846a82284 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over a
POSIX shared memory file descriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
Konstantin Belousov
196beb5359 If the last page of the file is partially full and its whole valid
portion is invalidated, invalidate the whole page.  Otherwise, a
partially valid page appears on a page queue, which is wrong.  This
could only happen for the last page, because only then could the buffer
that triggered the invalidation fail to cover the whole page.

Reported and tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-14 10:11:38 +00:00
Konstantin Belousov
7a4b2bc56c vm_page_trysbusy() should not fail when the shared busy counter or the
VPB_BIT_WAITERS flag changed between the read of busy_lock and the
cas.  vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and does not mean that the current page state
prevents sbusying.

Retry the operation inside vm_page_trysbusy() if the cas fails; only
return a failure when VPB_BIT_SHARED is clear.
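A minimal sketch of the retry loop using C11 atomics. The bit layout (`BUSY_BIT_SHARED`, `BUSY_ONE_SHARER`, and so on) and `page_trysbusy()` are assumptions modeled loosely on the names in the message, not the kernel's definitions. A failed compare-and-swap reloads the current value and tries again; only a clear shared bit is treated as a real failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bit layout, for illustration only. */
#define BUSY_BIT_SHARED		0x01u
#define BUSY_BIT_EXCLUSIVE	0x02u
#define BUSY_BIT_WAITERS	0x04u
#define BUSY_SHARERS_SHIFT	3
#define BUSY_ONE_SHARER		(1u << BUSY_SHARERS_SHIFT)
#define BUSY_UNBUSIED		BUSY_BIT_SHARED	/* shared mode, zero sharers */

struct page_busy {
	_Atomic unsigned int busy_lock;
};

/*
 * Try to acquire a shared busy.  A failed compare-and-swap only means that
 * the sharer count or the waiters bit changed in the meantime, so retry
 * with the freshly loaded value; fail only when the shared bit is clear.
 */
bool
page_trysbusy(struct page_busy *p)
{
	unsigned int x = atomic_load(&p->busy_lock);

	for (;;) {
		if ((x & BUSY_BIT_SHARED) == 0)
			return (false);
		/* On failure, x is updated with the current value. */
		if (atomic_compare_exchange_weak(&p->busy_lock, &x,
		    x + BUSY_ONE_SHARER))
			return (true);
	}
}
```

The weak form of compare-and-swap is acceptable here because a spurious failure simply takes another trip around the loop.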

Reported and tested by:	pho
Reviewed by:	attilio
Sponsored by:	The FreeBSD Foundation
2013-09-05 12:54:40 +00:00
Alan Cox
51321f7c31 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().
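A toy model, not pmap code, of the "more subtle flaw" described above: a single page table in which two virtual addresses map the same physical page. Clearing the reference bit per page affects every mapping of that page, while a range operation touches only the entries in [sva, eva). The types and function names are hypothetical; in the real system the other mapping may live in a different pmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPTE 8

/* Toy page table entry: the physical page it maps and its referenced bit. */
struct pte {
	int page;		/* -1 if the entry is invalid */
	bool referenced;
};

/* Per-page style: clears the bit on every mapping of the page, wherever it is. */
void
clear_reference_all_mappings(struct pte *pt, int page)
{
	for (size_t va = 0; va < NPTE; va++)
		if (pt[va].page == page)
			pt[va].referenced = false;
}

/* Range style: one pass over [sva, eva), touching only mappings in the range. */
void
advise_range(struct pte *pt, size_t sva, size_t eva)
{
	assert(sva <= eva);
	for (size_t va = sva; va < eva && va < NPTE; va++)
		if (pt[va].page != -1)
			pt[va].referenced = false;
}
```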

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
Gleb Smirnoff
133dae887b Remove comment that is no longer relevant since r254182. 2013-08-26 14:14:25 +00:00
Alan Cox
776cad90ff Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop().  Detect this scenario so that the vnode's hold count is
correctly maintained.  Otherwise, we panic.

Reported by:	scottl
Tested by:	pho
Discussed with:	attilio, jeff, kib
2013-08-23 17:27:12 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Alan Cox
28a288cbaa Addendum to r254141: Allow recursion on the free pages queues lock in
vm_page_alloc_freelist().

Reported and tested by:	sbruno
Sponsored by:	EMC / Isilon Storage Division
2013-08-21 15:31:43 +00:00
Attilio Rao
a834cbaec8 On the recovery path for vm_page_alloc(), if a page had been requested
wired, unwind the wiring bits; otherwise we can end up freeing a
page that is considered wired.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc
2013-08-15 11:01:25 +00:00
Jeff Roberson
d9e232109f Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

 - Change v_free_target to include the quantity previously represented by
   v_cache_min so we don't need to add them together everywhere we use them.
 - Add a pageout_wakeup_thresh that sets the free page count trigger for
   waking the page daemon.  Set this 10% above v_free_min so we wake up before
   any phase transitions in vm users.
 - Adjust down v_free_target now that we're willing to accept more pagedaemon
   wakeups.  This means we process fewer pages in one iteration as well,
   leading to shorter lock hold times and less overall disruption.
 - Eliminate vm_pageout_page_stats().  This was a minor variation on the
   PQ_ACTIVE segment of the normal pageout daemon.  Instead we now process
   1 / vm_pageout_update_period of the active pages every second.  This causes us to visit
   the whole active list every 60 seconds.  Previously we would only maintain
   the active LRU when we were short on pages which would mean it could be
   woefully out of date.
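The two numeric rules in the list above can be written out as a small worked sketch, assuming integer arithmetic; the function names are hypothetical and the kernel's exact expressions may differ. With v_free_min = 1000 the wakeup threshold is 1100 free pages, and with 60,000 active pages and a 60-second period about 1,000 active pages are visited per second.

```c
#include <assert.h>

/* Free-page count that triggers a page daemon wakeup: v_free_min plus 10%. */
unsigned int
pageout_wakeup_thresh(unsigned int v_free_min)
{
	return (v_free_min + v_free_min / 10);
}

/*
 * Active pages to visit per one-second pass so that the whole active
 * queue is covered within 'period' seconds (rounded up).
 */
unsigned int
active_pages_per_second(unsigned int active_count, unsigned int period)
{
	assert(period > 0);
	return ((active_count + period - 1) / period);
}
```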

Reviewed by:	alc (slight variant of this)
Discussed with:	alc, kib, jhb
Sponsored by:	EMC / Isilon Storage Division
2013-08-13 21:56:16 +00:00
Attilio Rao
6006884122 Correct the recovery logic in vm_page_alloc_contig():
what is really needed in this code snippet is that all the pages that
are already fully inserted get fully freed, while for the others the
object removal itself can be skipped, hence the object can be set to
NULL.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc, kib
Reviewed by:	alc
2013-08-11 21:15:04 +00:00
Konstantin Belousov
c325e866f4 Different consumers of struct vm_page abuse the pageq member to keep
additional information when the page is guaranteed not to belong to a
paging queue.  Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked.  See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members.  Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().
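A sketch of the idea, assuming a much simplified page structure: the linkage field becomes a union whose members are typed for each consumer, so a private free list can use its own `next` pointer instead of casting the queue linkage. The member names (`plinks`, `q`, `s`, `uma`) and the free-list helpers are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified page with an explicitly typed linkage union. */
struct page {
	union {
		struct {		/* while on a paging queue */
			struct page *next;
			struct page *prev;
		} q;
		struct {		/* while on a private free list */
			struct page *next;
		} s;
		struct {		/* allocator bookkeeping */
			void *slab;
		} uma;
	} plinks;
	int id;
};

/* A private free list built on the 's' member rather than on a cast of 'q'. */
struct freelist {
	struct page *head;
};

void
fl_push(struct freelist *fl, struct page *m)
{
	assert(m != NULL);
	m->plinks.s.next = fl->head;
	fl->head = m;
}

struct page *
fl_pop(struct freelist *fl)
{
	struct page *m = fl->head;

	if (m != NULL)
		fl->head = m->plinks.s.next;
	return (m);
}
```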

Requested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-10 17:36:42 +00:00
John Baldwin
cdc00bf7d2 Revert the addition of VPO_BUSY and instead update vm_page_replace() to
properly unbusy the page.

Submitted by:	alc
2013-08-09 21:14:55 +00:00
Attilio Rao
e946b94934 On all architectures, avoid preallocating the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid preallocating
the KVA for such nodes.

In order to do so, allow the operations derived from vm_radix_insert()
to fail, and handle all of the resulting failures.

Within vm_radix, introduce a new function called vm_radix_replace(),
which can replace an already present leaf node with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recurse,
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the two concepts into a real, minimal sxlock where the shared
acquisition represents the soft busy and the exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in either read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the owning thread cannot make any
assumption about the busy state unless it is also busying the page.

Also:
- Add a new flag to directly share-busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavily changed, which is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00