Commit Graph

606 Commits

Author SHA1 Message Date
markj
c1d51518dd Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
markj
37d36b67f3 Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
markj
d241ba38fa Dequeue wired pages lazily.
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by:	jeff
Discussed with:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D11943
2018-02-07 16:57:10 +00:00
jeff
e67ec0d694 Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
kib
e676dae295 On munlock(), unwire correct page.
It is possible, for complex fork()/collapse situations, to have
sibling address spaces to partially share shadow chains. If one
sibling performs wiring, it can happen that a transient page, invalid
and busy, is installed into a shadow object which is visible to other
sibling for the duration of vm_fault_hold().  When the backing object
contains the valid page, and the wiring is performed on read-only
entry, the transient page is eventually removed.

But the sibling which observed the transient page might perform the
unwire, executing vm_object_unwire().  There, the first page found in
the shadow chain is considered as the page that was wired for the
mapping.  It is really the page below it which is wired.  So we unwire
the wrong page, either triggering the asserts of breaking the page'
wire counter.

As the fix, wait for the busy state to finish if we find such page
during unwire, and restart the shadow chain walk after the sleep.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14184
2018-02-05 12:49:20 +00:00
jeff
94c7af8ca2 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
alc
5037be343c Make the vm object bypass and collapse counters per CPU.
Requested by:	mjg
Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13611
2017-12-25 19:36:04 +00:00
pfg
155122ce53 SPDX: Consider code from Carnegie-Mellon University.
Interesting cases, most likely from CMU Mach sources.
2017-11-30 15:48:35 +00:00
jeff
990ca74cdc Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.
The arena argument to kmem_*() is now only used in an assert.  A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.  When
the soft limit is exceeded we periodically wakeup the UMA reclaim thread to
attempt to shrink KVA.  On 32bit architectures this should behave much more
gracefully as we exhaust KVA.  On 64bit the limits are likely never hit.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13187
2017-11-28 23:40:54 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
jeff
3c355d849c Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
alc
6dd81762df Batch atomic updates to the number of active, inactive, and laundry
pages by vm_object_terminate_pages().  For example, for a "buildworld"
workload, this batching reduces vm_object_terminate_pages()'s average
execution time by 12%.  (The total savings were about 11.7 billion
processor cycles.)

Reviewed by:	kib
MFC after:	1 week
2017-10-19 04:13:47 +00:00
alc
7ad59282da Optimize vm_object_page_remove() by eliminating pointless calls to
pmap_remove_all().  If the object to which a page belongs has no
references, then that page cannot possibly be mapped.

Reviewed by:	kib
MFC after:	1 week
2017-09-28 17:55:41 +00:00
kib
23d65de60e For unlinked files, do not msync(2) or sync on the vnode deactivation.
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by:	tjil
PR:	222356
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12411
2017-09-19 16:46:37 +00:00
kib
4c4cbd798b Batch freeing of the pages in vm_object_page_remove() under the same
free queue mutex lock owning session, same as it was done for the
object termination in r323561.

Reported and tested by:	mjg
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-15 16:07:09 +00:00
kib
9c0adbf36e Do not relock free queue mutex for each page, free whole terminating
object' page queue under the single mutex lock.

First, all pages on the queue are prepared for free by calls to
vm_page_free_prep(), and pages which should not be returned to the
physical allocator (e.g. wired or fictitious) are simply removed from
the queue.  On the second pass, vm_page_free_phys_pglist() inserts all
pages from the queue without relocking the mutex.

The change improves the object termination, e.g. on the process exit
where large anonymous memory objects otherwise cause relocks the free
queue mutex for each page.  More, if several such processes are
exiting or execing in parallel, the mutex was highly contended on
the address space demolition.

Diagnosed and tested by:	mjg (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:22:07 +00:00
kib
d27f2b11c2 Add a vm_page_change_lock() helper, the common code to not relock page
lock if both old and new pages use the same underlying lock.  Convert
existing places to use the helper instead of inlining it.  Use the
optimization in vm_object_page_remove().

Suggested and reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-09 17:35:19 +00:00
kib
5ed2d3f017 Replace global swhash in swap pager with per-object trie to track swap
blocks assigned to the object pages.

- The global swhash_mtx is removed, trie is synchronized by the
  corresponding object lock.
- The swp_pager_meta_free_all() function used during object
  termination is optimized by only looking at the trie instead of
  having to search whole hash for the swap blocks owned by the object.
- On swap_pager_swapoff(), instead of iterating over the swhash,
  global object list have to be inspected. There, we have to ensure
  that we do see valid trie content if we see that the object type is
  swap.
Sizing of the swblk zone is same as for swblock zone, each swblk maps
SWAP_META_PAGES pages.

Proposed by:	alc
Reviewed by:	alc, markj (previous version)
Tested by:	alc, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D11435
2017-08-25 23:13:21 +00:00
br
bc66f23e16 Add OBJ_PG_DTOR flag to VM object.
Setting this flag allows us to skip pages removal from VM object queue
during object termination and to leave that for cdev_pg_dtor function.

Move pages removal code to separate function vm_object_terminate_pages()
as comments does not survive indentation.

This will be required for Intel SGX support where we will have to remove
pages from VM object manually.

Reviewed by:	kib, alc
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11688
2017-08-16 08:49:11 +00:00
kib
e38ce8ac29 Do not allocate struct kinfo_vmobject on stack.
Its size is 1184 bytes.

Noted by:	eugen
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-22 13:33:06 +00:00
kib
f16b72a2cc Add pctrie_init() and vm_radix_init() to initialize generic pctrie and
vm_radix trie.

Existing vm_radix_init() function is renamed to vm_radix_zinit().
Inlines moved out of the _ headers.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11661
2017-07-19 20:52:47 +00:00
kib
e75ba1d5c4 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
glebius
21ead51d79 - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
alc
89e63829ad Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by:	kib, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D10011
2017-03-15 17:43:45 +00:00
kib
8cf2af841c Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
markj
dfb4f0694a Avoid page lookups in the top-level object in vm_object_madvise().
We can iterate over consecutive resident pages in the top-level object
using the object's page list rather than by performing lookups in the
object radix tree. This extends one of the optimizations in r312208 to the
case where a shadow chain is present.

Suggested by:	alc
Reviewed by:	alc, kib (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9282
2017-01-30 18:51:43 +00:00
markj
6543297f0e Avoid unnecessary page lookups in vm_object_madvise().
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9098
2017-01-15 03:50:08 +00:00
kib
e2725e89e1 Improve vm_object_scan_all_shadowed() to also check swap backing objects.
As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index.  Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.

Testing shows that number of new succesful scans, enabled by this
addition, is small but non-zero.  When worked out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-18 20:56:14 +00:00
alc
023ed8da8d Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
alc
93bd5b27d2 During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was
called to allocate a new page of radix trie nodes, there could be a call to
vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress
vm_radix_insert().  With the removal of PG_CACHED pages, we can simplify
vm_radix_insert() and vm_radix_remove() by removing the flags on the root of
the trie that were used to detect this case and the code for restarting
vm_radix_insert() when it happened.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8664
2016-12-01 17:26:37 +00:00
alc
2fa3607305 Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
alc
5fd3767e0e Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue.  A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them.  The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time.  In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low.  Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system.  Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages.  This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s).  A request is raised by setting the variable
vm_laundry_request and waking the laundry thread.  We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering").  When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering.  If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s.  Otherwise, the laundry thread goes back
to sleep without doing any work.  When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target.  In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced.  In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache().  The new meaning
of vm_cnt.v_reactivated now better reflects its name.  It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with:	markj
Reviewed by:	kib
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
kib
af14bca641 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
trasz
255ed885fa Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
kib
c859b3f77d In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
kib
ead782ed2d Do not leak the vm object lock when swap reservation failed, in
vm_object_coalesce().

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-29 15:46:19 +00:00
kib
f11f5ba812 Prevent parallel object collapses. Both vm_object_collapse_scan() and
swap_pager_copy() might unlock the object, which allows the parallel
collapse to execute.  Besides destroying the object, it also might
move the reference from parent to the backing object, firing the
assertion ref_count == 1.

Collapses are prevented by bumping paging_in_progress counters on both
the object and its backing object.

Reported by:	cem
Tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D6085
2016-05-26 16:59:29 +00:00
kib
b02e158812 Style changes to some most outrageous violations in vm_object_collapse().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-26 16:51:38 +00:00
kib
8da898f26c Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held.  The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths.  Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive).  Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot.  When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
   pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
   the lifetime of the shared mutex associated with a vnode' page.

Reviewed by:	jilles (previous version, supposedly the objection was fixed)
Discussed with:	brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-05-17 09:56:22 +00:00
pfg
8327dbfe4d sys/vm: minor spelling fixes in comments.
No functional change.
2016-05-02 20:16:29 +00:00
kib
e76eb4255b Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
glebius
63cd1c131a A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
cem
cdfc8c63b9 Pull vm_object_scan_all_shadowed out of vm_object_backing_scan
These two functions were largely unrelated, they just used the same same
loop logic to walk through a backing object's memq.  Pull out the
all_shadowed test as its own function and eliminate
OBSC_TEST_ALL_SHADOWED.  Rename vm_object_backing_scan to
vm_object_collapse_scan.

No functional change.

Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4335
2015-12-03 17:21:10 +00:00
kib
9a3339cff9 r221714 fixed the situation when the collapse scan improperly handled
invalid (busy) page supposedly inserted by the vm_fault(), in the
OBSC_COLLAPSE_NOWAIT case.  As a continuation to r221714, fix a case
when invalid page is found by the object scan in OBSC_COLLAPSE_WAIT
case as well.  But, since this is waitable scan, we should wait for
the termination of the busy state and restart from the beginning of
the backing object' page queue. [*]

Do not free the shadow page swap space when the parent page is
invalid, otherwise this action potentially corrupts user data.

Combine all instances of the collapse scan sleep code fragments into
the new helper vm_object_backing_scan_wait().

Improve style compliance and comments.  Change the return type of
vm_object_backing_scan() to bool.

Initial submission by:	cem, https://reviews.freebsd.org/D4103 [*]
Reviewed by:	alc, cem
Tested by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D4146
2015-12-01 09:06:09 +00:00
markj
6348241c12 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
kib
3df958c236 Revert r173708's modifications to vm_object_page_remove().
Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF.  Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.

The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.

The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire.  This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.

Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundand, we do not remove wired pages.

Reported by:	trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-25 18:29:06 +00:00
glebius
519f1ccd36 Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
vangyzen
597cee37df Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
jhb
4a4be98eae Export a list of VM objects in the system via a sysctl. The list can be
examined via 'vmstat -o'.  It can be used to determine which files are
using physical pages of memory and how much each is using.

Differential Revision:	https://reviews.freebsd.org/D2277
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc. (forward porting to HEAD/10)
2015-05-27 18:11:05 +00:00