intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.
A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.
The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.
The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.
Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.
The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.
Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode' page.
Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation
breaking the ABI. Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.
Reviewed by: vangyzen (previous version)
Discussed with: deischen, emaste, jhb, rwatson,
Martin Simmons <martin@lispworks.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.
This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726
Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF. Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.
The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.
The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire. This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.
Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundand, we do not remove wired pages.
Reported by: trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.
This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).
Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)
Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.
The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied. Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.
Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
underlying physical pages are mapped by the pmap. If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.
To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.
At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time. Moreover, by operating on
a range, it is superpage friendly. It doesn't waste time performing
unnecessary demotions.
Reported by: markj
Reviewed by: kib
Tested by: pho, jmg (arm)
Sponsored by: EMC / Isilon Storage Division
vnode for the tmpfs node owning this object. The flag is currently
used for two purposes. First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems. Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.
The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.
Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.
In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.
vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl
msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.
Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
mostl of the times only with readlocks.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
vnode v_object to avoid double-buffering. Use the same object both as
the backing store for tmpfs node and as the v_object.
Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.
VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.
Reviewed by: alc
Tested by: pho, bf
MFC after: 1 month
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide
general way but must be evaluated case by case.
Embedd the decision in the caller themselves rather than in a
general purpose KPI.
Sponsored by: EMC / Isilon storage division
Reported by: alc
Reviewed by: alc
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva(). The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.
The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
more modern uma_zone_reserve_kva(). The difference is that it doesn't
rely anymore on an obj to allocate pages and the slab allocator doesn't
use any more any specific locking but atomic operations to complete
the operation.
Where possible, the uma_small_alloc() is instead used and the uk_kva
member becomes unused.
The subsequent cleanups also brings along the removal of
VM_OBJECT_LOCK_INIT() macro which is not used anymore as the code
can be easilly cleaned up to perform a single mtx_init(), private
to vm_object.c.
For the same reason, _vm_object_allocate() becomes private as well.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
files require including directly rwlock.h
in devfs if a particular race condition is hit in the device pager
code.
This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.
The object handle is cast to a struct cdev *, and passed into
dev_rel().
That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device. The loser of the race
deallocates its object at the end of the function.
The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address. This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.
The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor(). That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel(). dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.
The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.
vm_object.h: Add a struct cdev pointer to the VM object
structure.
device_pager.c: In cdev_pager_allocate(), populate the new cdev
pointer.
In dev_pager_dealloc(), use the new cdev pointer
when calling the object's cdev_pg_dtor() routine.
Reviewed by: kib
Sponsored by: Spectra Logic Corporation
MFC after: 1 week