Commit Graph

159 Commits

Author SHA1 Message Date
Konstantin Belousov
d3b9828d0d The vmtotal sysctl handler marks active vm objects to calculate
statistics.  Marking is done by setting the OBJ_ACTIVE flag.  The
flags change is locked, but the problem is that many parts of system
assume that vm object initialization ensures that no other code could
change the object, and thus performed lockless.  The end result is
corrupted flags in vm objects, most visible is spurious OBJ_DEAD flag,
causing random hangs.

Avoid the active object marking, instead provide equally inexact but
immutable is_object_alive() definition for the object mapped state.

Avoid iterating over the processes mappings altogether by using
arguably improved definition of the paging thread as one which sleeps
on the v_free_count.

PR:	204764
Diagnosed by:	pho
Tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2016-06-21 17:49:33 +00:00
Konstantin Belousov
2a339d9e3d Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held.  The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths.  Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive).  Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot.  When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
   pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
   the lifetime of the shared mutex associated with a vnode' page.

Reviewed by:	jilles (previous version, supposedly the objection was fixed)
Discussed with:	brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-05-17 09:56:22 +00:00
Konstantin Belousov
1bdbd70599 Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Mark Johnston
3138cd3670 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
Konstantin Belousov
6195b24a79 Revert r173708's modifications to vm_object_page_remove().
Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF.  Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.

The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.

The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire.  This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.

Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundand, we do not remove wired pages.

Reported by:	trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-25 18:29:06 +00:00
Eric van Gyzen
63e4c6cdf9 Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
Alan Cox
3d653db063 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
Konstantin Belousov
f40cb1c645 Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync.  Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied.  Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-28 10:37:23 +00:00
Alan Cox
81a065058c Update a stale comment. 2014-09-11 03:16:57 +00:00
Konstantin Belousov
faaf544760 Add wrappers to assert that vm object is unlocked and for try upgrade.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-06 19:30:35 +00:00
Alan Cox
0346250941 When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap.  If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time.  Moreover, by operating on
a range, it is superpage friendly.  It doesn't waste time performing
unnecessary demotions.

Reported by:	markj
Reviewed by:	kib
Tested by:	pho, jmg (arm)
Sponsored by:	EMC / Isilon Storage Division
2014-07-26 18:10:18 +00:00
Konstantin Belousov
f08f7dca40 The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs
vnode for the tmpfs node owning this object.  The flag is currently
used for two purposes.  First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems.  Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.

The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.

Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:30:37 +00:00
Attilio Rao
e946b94934 On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
Konstantin Belousov
ebf5d94e82 Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE).  The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.

Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:47:26 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
6f2af3fcf3 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
Attilio Rao
774d251d99 Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
  resident/cached pages collections as the per-vm_page_t splay iterators
  are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations.  In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone.  This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves.  However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan.  It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by:	EMC / Isilon storage division
In collaboration with:	alc, jeff
Tested by:	flo, pho, jhb, davide
Tested by:	ian (arm)
Tested by:	andreast (powerpc)
2013-03-18 00:25:02 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Attilio Rao
c934116100 Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object.  This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	alc (vm_radix based version)
Tested by:	flo, pho, jhb, davide
2013-03-09 02:05:29 +00:00
Attilio Rao
dc1558d1cd Merge from vmobj-rwlock:
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-27 18:12:13 +00:00
Attilio Rao
a4915c21d9 Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva().  The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ.  More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
  allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
  serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
  by uma_small_alloc() rather than relying on the KVA / offset
  combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed.  This function is replaced by
   direct calls to mtx_init() as there is no need to export it anymore
   and the calls aren't either homogeneous anymore: there are now small
   differences between arguments passed to mtx_init().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (which also offered almost all the comments)
Tested by:	pho, jhb, davide
2013-02-26 23:35:27 +00:00
Attilio Rao
0dde287b20 Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-26 17:22:08 +00:00
Kenneth D. Merry
43ab9660c5 Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.

This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.

The object handle is cast to a struct cdev *, and passed into
dev_rel().

That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device.  The loser of the race
deallocates its object at the end of the function.

The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address.  This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.

The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor().  That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel().  dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.

The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.

vm_object.h:	Add a struct cdev pointer to the VM object
		structure.

device_pager.c:	In cdev_pager_allocate(), populate the new cdev
		pointer.

		In dev_pager_dealloc(), use the new cdev pointer
		when calling the object's cdev_pg_dtor() routine.

Reviewed by:	kib
Sponsored by:	Spectra Logic Corporation
MFC after:	1 week
2013-01-09 16:48:38 +00:00
Alan Cox
2863482058 In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages.  In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by:	kib
2012-12-09 00:32:38 +00:00
Attilio Rao
ffdd0c7db3 - Add a comment explaining the locking of the cached pages pool held
by vm_objects.
- Add flags for the per-object lock and free pages queue mutex lock.
  Use the newly added flags to mark the cache root within the vm_object
  structure.

Please note that other vm_object members should be marked with correct
locking but they are left for other commits.

In collabouration with:	alc

MFC after:	3 days3 days3 days
2012-06-22 18:34:11 +00:00
John Baldwin
92a5994685 Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB.  Specifically, the inlined version of 'ptoa' of the the 'int'
count of pages overflowed on 64-bit platforms.  While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
2012-03-19 18:47:34 +00:00
Konstantin Belousov
126d60823a In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propogate write error from the pager back to the callers of
vm_pageout_flush().  Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR:	kern/165927
Reviewed by:	alc
MFC after:	2 weeks
2012-03-17 23:00:32 +00:00
Konstantin Belousov
84110e7e0b Account the writeable shared mappings backed by file in the vnode
v_writecount.  Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry.  During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBUSY, as it
should be.

Noted by:	tegge
Reviewed by:	tegge (long time ago, early version), alc
Tested by:	pho
MFC after:	3 weeks
2012-02-23 21:07:16 +00:00
Konstantin Belousov
5dda2db9c8 Change the type of the paging_in_progress refcounter from u_short to
u_int. With the auto-sized buffer cache on the modern machines, UFS
metadata can generate more the 65535 pages belonging to the buffers
undergoing i/o, overflowing the counter.

Reported and tested by:	jimharris
Reviewed by:	alc
MFC after:	1 week
2012-01-10 18:05:44 +00:00
Konstantin Belousov
286790a7dd Update the device pager interface, while keeping the compatibility
layer for old KPI and KBI.  New interface should be used together with
d_mmap_single cdevsw method.

Device pager can be allocated with the cdev_pager_allocate(9)
function, which takes struct cdev_pager_ops, containing
constructor/destructor and page fault handler methods supplied by
driver.

Constructor and destructor, called at the pager allocation and
deallocation time, allow the driver to handle per-object private data.

The pager handler is called to handle page fault on the vm map entry
backed by the driver pager. Driver shall return either the vm_page_t
which should be mapped, or error code (which does not cause kernel
panic anymore). The page handler interface has a placeholder to
specify the access mode causing the fault, but currently PROT_READ is
always passed there.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
MFC after:	1 month
2011-11-15 14:40:00 +00:00
John Baldwin
936c09ac0f Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
Alan Cox
6bbee8e28a Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
Alan Cox
17f3095d1a Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean().  They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by:	kib
MFC after:	3 weeks
2011-02-05 21:21:27 +00:00
Konstantin Belousov
50cfe7fa50 Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

 Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by:	alc (as a part of the bigger patch)
MFC after:	1 month (after r216799 is merged)
2010-12-29 22:26:49 +00:00
Alan Cox
4de2261903 Move vm_object_print()'s prototype to the expected place. 2010-12-27 07:12:22 +00:00
Edward Tomasz Napierala
ef694c1ac4 Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object".  This is required to make it possible to account
for per-jail swap usage.

Reviewed by:	kib@
Tested by:	pho@
Sponsored by:	FreeBSD Foundation
2010-12-02 17:37:16 +00:00
Konstantin Belousov
49e3050e6c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
John Baldwin
013818111a Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses.  The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-24 13:50:29 +00:00
Alan Cox
3153e878dd Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t.  The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes.  Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures.  The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map.  The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

  kmem_alloc_contig() can now be used to allocate kernel memory with
  non-default memory attributes on amd64 and i386.

  vm_page_alloc() and the device pager will set the memory attributes
  for the real or fictitious page according to the object's default
  memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386.  In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by:	re (kib)
2009-07-12 23:31:20 +00:00
Konstantin Belousov
3364c323e6 Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with:	pho
Reviewed by:	alc
Approved by:	re (kensmith)
2009-06-23 20:45:22 +00:00
Alan Cox
387aabc513 Long, long ago in r27464 special case code for mapping device-backed
memory with 4MB pages was added to pmap_object_init_pt().  This code
assumes that the pages of a OBJT_DEVICE object are always physically
contiguous.  Unfortunately, this is not always the case.  For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
a OBJT_DEVICE object that violates this assumption.  Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous.  This
revision also changes some inconsistent if not buggy behavior.  For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid.  However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page.  The amd64 version has a bug of
my own creation.  It potentially busies the wrong page and always an
insufficent number of pages if it blocks allocating a page table page.

To my knowledge, there have been no reports of these bugs, hence,
their persistance.  I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms.  However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping devices pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity).  These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)

Discussed with:	jhb@
2009-06-14 19:51:43 +00:00
Alan Cox
7b54b1a9f5 Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by:	kib
2009-02-08 22:17:24 +00:00
Stephan Uphoff
2ac78f0e1a Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this  change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
2008-05-20 19:05:43 +00:00
Alan Cox
3df92083af Add a list of reservations to the vm object structure.
Recycle the vm object's "pg_color" field to represent the color of the
first virtual page address at which the object is mapped instead of the
color of the object's first physical page.  Since an object may not be
mapped, introduce a flag "OBJ_COLORED" that indicates whether "pg_color"
is valid.
2007-12-27 17:56:35 +00:00
Alan Cox
7bfda801a8 Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq.  Instead, they are kept in a separate per-object
splay tree of cached pages.  However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock.  Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE).  The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held.  Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case.  Cached pages
are reclaimed far, far more often than they are reactivated.  Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated.  Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page.  Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
   Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
Alan Cox
af51d7bf57 Eliminate OBJ_WRITEABLE. It hasn't been used in a long time. 2006-07-21 06:40:29 +00:00
Alan Cox
02dd83311a Make vm_object_vndeallocate() static. The external calls to it were
eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.
2006-01-22 23:56:20 +00:00
Jeff Roberson
ed4fe4f4f5 - Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the
underlying vnode requires Giant.
 - In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
   set.
 - In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.
2005-05-03 11:11:26 +00:00
John Baldwin
98df9218da - Change the vm_mmap() function to accept an objtype_t parameter specifying
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
  vnodes and anonymous memory.  Note that mmaping a cdev directly does not
  currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
  cdev the ioctl is acting on rather than trying to find a suitable vnode
  to map from.

Reviewed by:	alc, arch@
2005-04-01 20:00:11 +00:00