Commit Graph

561 Commits

Author SHA1 Message Date
markj
6348241c12 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
kib
3df958c236 Revert r173708's modifications to vm_object_page_remove().
Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF.  Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.

The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.

The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire.  This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.

Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundand, we do not remove wired pages.

Reported by:	trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-25 18:29:06 +00:00
glebius
519f1ccd36 Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
vangyzen
597cee37df Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
jhb
4a4be98eae Export a list of VM objects in the system via a sysctl. The list can be
examined via 'vmstat -o'.  It can be used to determine which files are
using physical pages of memory and how much each is using.

Differential Revision:	https://reviews.freebsd.org/D2277
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc. (forward porting to HEAD/10)
2015-05-27 18:11:05 +00:00
jhb
0bf260e595 Place VM objects on the object list when created and never remove them.
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.

Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero.  This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)

Differential Revision:	https://reviews.freebsd.org/D2423
Reviewed by:	alc, kib (earlier version)
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc.
2015-05-08 19:43:37 +00:00
alc
56c9a1b2f6 Correct a typo in vm_object_backing_scan() that originated in r254141.
Specifically, change a lock acquire into a lock release.

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2015-03-07 04:18:40 +00:00
alc
2ab42594ef Use RW_NEW rather than calling bzero(). 2015-03-01 05:18:02 +00:00
kib
19abfd4698 Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync.  Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied.  Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-28 10:37:23 +00:00
kib
f690968933 When the last reference on the vnode' vm object is dropped, read the
vp->v_vflag without taking vnode lock and without bypass.  We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle.  Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping.  In this case,
the execve() code manages the VV_TEXT flag on its own already.

It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.

Reported by:	glebius, scottl
Tested by:	glebius, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-05 15:02:30 +00:00
alc
2b42c1ddc8 When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap.  If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time.  Moreover, by operating on
a range, it is superpage friendly.  It doesn't waste time performing
unnecessary demotions.

Reported by:	markj
Reviewed by:	kib
Tested by:	pho, jmg (arm)
Sponsored by:	EMC / Isilon Storage Division
2014-07-26 18:10:18 +00:00
kib
b50b5e1d49 Correct assertion. The shadowing object cannot be tmpfs vm object,
and tmpfs object cannot shadow.  In other words, tmpfs vm object is
always at the bottom of the shadow chain.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-24 10:25:42 +00:00
kib
8664d64bc3 The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs
vnode for the tmpfs node owning this object.  The flag is currently
used for two purposes.  First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems.  Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.

The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.

Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:30:37 +00:00
bdrewery
6fcf6199a4 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
kib
e4111a6b71 Do not vdrop() the tmpfs vnode until it is unlocked. The hold
reference might be the last, and then vdrop() would free the vnode.

Reported and tested by:	bdrewery
MFC after:	1 week
2014-03-12 15:13:57 +00:00
attilio
6a31e25bb9 Fix-up r254141: in the process of making a failing vm_page_rename()
a call of pager_swap_freespace() was moved around, now leading to freeing
the incorrect page because of the pindex changes after vm_page_rename().

Get back to use the correct pindex when destroying the swap space.

Sponsored by:	EMC / Isilon storage division
Reported by:	avg
Tested by:	pho
MFC after:	7 days
2014-02-14 03:34:12 +00:00
glebius
b1650f2d1e Fix function name in KASSERT().
Submitted by:	hiren
2014-02-12 20:11:20 +00:00
kib
79db892faa Do not coalesce if the swap object belongs to tmpfs vnode. The
coalesce would extend the object to keep pages for the anonymous
mapping created by the process.  The pages has no relations to the
tmpfs file content which could be written into the corresponding
range, causing anonymous mapping and file content aliasing and
subsequent corruption.

Another lesser problem created by coalescing is over-accounting on the
tmpfs node destruction, since the object size is substracted from the
total count of the pages owned by the tmpfs mount.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-05 06:18:50 +00:00
kib
56cc686058 Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-08 17:51:22 +00:00
kib
ba12eedccd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
attilio
e9f37cac74 On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
attilio
16c7563cf4 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
jeff
de4ecca213 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
kib
da0e8446db Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE).  The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.

Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:47:26 +00:00
jeff
e725dd5c1e - Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
   of usenix 2001.  The NetBSD version was heavily refactored for bugs
   and simplicity.
 - Use this resource allocator to allocate the buffer and transient maps.
   Buffer cache defrags are reduced by 25% when used by filesystems with
   mixed block sizes.  Ultimately this may permit dynamic buffer cache
   sizing on low KVA machines.

Discussed with:	alc, kib, attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-06-28 03:51:20 +00:00
alc
53ffec3a56 Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by:	EMC / Isilon Storage Division
2013-06-10 01:48:21 +00:00
attilio
7816c7bc23 In vm_object_split(), busy and consequently unbusy the pages only when
swap_pager_copy() is invoked, otherwise there is no reason to do so.
This will eliminate the necessity to busy pages most of the times.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-04 22:47:01 +00:00
kib
0c381861b0 After the object lock was dropped, the object' reference count could
change.  Retest the ref_count and return from the function to not
execute the further code which assumes that ref_count == 1 if it is
not.  Also, do not leak vnode lock if other thread cleared OBJ_TMPFS
flag meantime.

Reported by:	bdrewery
Tested by:	bdrewery, pho
Sponsored by:	The FreeBSD Foundation
2013-05-30 20:00:19 +00:00
kib
b77a98bb0b Remove the capitalization in the assertion message. Print the address
of the object to get useful information from optimizated kernels dump.
2013-05-30 19:53:31 +00:00
kib
2f2c1edec8 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
kib
d4d37d6d88 Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode'
v_object of non OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:25:09 +00:00
alc
2c9c761886 Introduce vm_radix_is_empty(), and use it in place of
vm_object_cache_is_empty() where the caller is aware of the page cache's
implementation as a radix trie.

Sponsored by:	EMC / Isilon Storage Division
2013-03-10 17:30:57 +00:00
attilio
76954ad68a Merge from vmcontention. 2013-03-09 03:19:53 +00:00
attilio
16a80466e5 MFC 2013-03-09 02:51:51 +00:00
attilio
72f7f3e528 Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
attilio
754f3790b8 Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object.  This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	alc (vm_radix based version)
Tested by:	flo, pho, jhb, davide
2013-03-09 02:05:29 +00:00
attilio
bf1dc90446 MFC 2013-03-08 00:03:07 +00:00
attilio
1be810ec73 MFC 2013-03-04 13:14:59 +00:00
attilio
e5bdd2f06e Merge from vmcontention:
As vm objects are type-stable there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate() but this could be done in the init UMA zone
handler.

The destructor UMA zone handler, will further check if the condition is
retained at every destruction and catch for bugs.

Sponsored by:	EMC / Isilon storage division
Submitted by:	alc
2013-03-04 13:10:59 +00:00
attilio
709ad55889 Evaluations on the likelyhood of empty object cache cannot be made in
general way but must be evaluated case by case.
Embedd the decision in the caller themselves rather than in a
general purpose KPI.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc
Reviewed by:	alc
2013-03-04 12:33:40 +00:00
attilio
60e39c95b8 Remove the boot-time cache support and rely on UMA boot-time slab cache
for allocating the nodes before to have the possibility to carve
directly from the UMA subsystem.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-03-04 00:07:23 +00:00
alc
f661d6e522 We don't need to reinitialize the root of the page cache trie on every
vm object allocation.  We can, instead, rely on the type stability of
the vm object zone.  (Note that we already assert that the page cache
trie is empty in the vm object zone destructor.)

Sponsored by:	EMC / Isilon Storage Division
2013-03-03 20:37:27 +00:00
attilio
a345907061 Merge from vmcontention 2013-03-03 01:10:49 +00:00
attilio
c53a782d3a MFC 2013-03-03 01:06:24 +00:00
alc
2322e91e7c Revert white space change in the previous commit.
Requested by:	attilio
2013-03-02 18:27:51 +00:00
alc
c5b028cc14 Assert that the trie is empty when a vm object is destroyed.
Since vm objects are allocated from type-stable memory, we don't need to
initialize the trie's root in _vm_object_allocate() on every vm object
allocation.  We can instead do it once in vm_object_zinit().

We don't need to call vm_radix_reclaim_allnodes() in vm_object_terminate()
unless the resident page count is non-zero.

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-03-02 18:18:30 +00:00
alc
90d4aeb975 The value held by the vm object's field pg_color is only considered
valid if the flag OBJ_COLORED is set.  Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.

Sponsored by:	EMC / Isilon Storage Division
2013-03-02 18:07:29 +00:00
attilio
c74a3afc6a Fix compiling. 2013-02-26 23:54:17 +00:00
attilio
74f58faa15 MFC 2013-02-26 23:52:23 +00:00
attilio
cbe7c0e167 MFC 2013-02-26 23:43:28 +00:00