pointer: When vm_object_deallocate() sleeps because of a non-zero
paging in progress count on either object or object's shadow,
vm_object_deallocate() must ensure that object is still the shadow's
backing object when it reawakens. In fact, object may have been
deallocated while vm_object_deallocate() slept. If so, reacquiring
the lock on object can lead to a deadlock.
Submitted by: ups@
MFC after: 3 weeks
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.
object that requires Giant in vm_object_deallocate(). This is somewhat
hairy in that if we can't obtain Giant directly, we have to drop the
object lock, then lock Giant, then relock the object lock and verify that
we still need Giant. If we don't (because the object changed to OBJT_DEAD
for example), then we drop Giant before continuing.
Reviewed by: alc
Tested by: kris
the resident page count matches the object size. We know it fully backs
its parent in this case.
Reviewed by: acl, tegge
Sponsored by: Isilon Systems, Inc.
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
vm_object_backing_scan() was not written to handle. Specifically, a wired
page within a backing object that is shadowed by a page within the shadow
object. Handle this state by removing the wired page from the backing
object. The wired page will be freed by socow_iodone().
Stop masking errors: If a page is being freed by vm_object_backing_scan(),
assert that it is no longer mapped rather than quietly destroying any
mappings.
Tested by: Harald Schmalzbauer
due to the vm object being locked.
When a process writes large amounts of data to a file, the vm object associated
with that file can contain most of the physical pages on the machine. If the
process is preempted while holding the lock on the vm object, pagedaemon would
be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE
to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to
other vm objects.
Temporarily unlock the page queues lock while locking vm objects to avoid lock
order violation. Detect and handle relevant page queue changes.
This change depends on both the lock portion of struct vm_object and normal
struct vm_page being type stable.
Reviewed by: alc
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.
queues lock in vm_object_backing_scan(). Updates to the page's PG_BUSY
flag and busy field are synchronized by the containing object's lock.
Testing the page's hold_count and wire_count in vm_object_backing_scan()'s
OBSC_COLLAPSE_NOWAIT case is unnecessary. There is no reason why the held
or wired pages cannot be migrated to the shadow object.
Reviewed by: tegge
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.
Sponsored By: Isilon Systems, Inc.
and BBO is BO's backing object. Now, suppose that O and BO are being
collapsed. Furthermore, suppose that BO has been marked dead
(OBJ_DEAD) by vm_object_backing_scan() and that either
vm_object_backing_scan() has been forced to sleep due to encountering
a busy page or vm_object_collapse() has been forced to sleep due to
memory allocation in the swap pager. If vm_object_deallocate() is
then called on BBO and BO is BBO's only shadow object,
vm_object_deallocate() will collapse BO and BBO. In doing so, it adds
a necessary temporary reference to BO. If this collapse also sleeps
and the prior collapse resumes first, the temporary reference will
cause vm_object_collapse to panic with the message "backing_object %p
was somehow re-referenced during collapse!"
Resolve this race by changing vm_object_deallocate() such that it
doesn't collapse BO and BBO if BO is marked dead. Once O and BO are
collapsed, vm_object_collapse() will attempt to collapse O and BBO.
So, vm_object_deallocate() on BBO need do nothing.
Reported by: Peter Holm on 20050107
URL: http://www.holm.cc/stress/log/cons102.html
In collaboration with: tegge@
Candidate for RELENG_4 and RELENG_5
MFC after: 2 weeks
I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.
If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.
Discussed with: rwatson
because this call is only needed to wake threads that slept when they
discovered a dead object connected to a vnode. To eliminate unnecessary
calls to wakeup() by vnode_pager_dealloc(), introduce a new flag,
OBJ_DISCONNECTWNT.
Reviewed by: tegge@
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.
This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.
Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
vm/vm_map.c revision 1.36): When descending a chain of backing objects,
both use the wrong object's backing offset. Consequently, both may
operate on the wrong pages.
Quoting Matt, "This could be responsible for all of the sporatic madvise
oddness that has been reported over the years."
Reviewed by: Matt Dillon
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().
The following changes serve to eliminate the following lock-order
reversal reported by witness:
1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931
There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:
- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)
vm objects shadowing source in vm_object_shadow(). This closes a race where
vm_object_collapse() could be called with a partially uninitialized object
argument causing symptoms that looked like hardware problems, e.g. signal 6,
10, 11 or a /bin/sh busy-waiting for a nonexistant child process.
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.
Reviewed by: tegge
destination objects are locked on entry and exit. Add comments to
the callers noting that the locks can be released by swap_pager_copy().
- Remove several instances of GIANT_REQUIRED.