and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.
Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.
Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)
portion is invalidated, invalidate the whole page. Otherwise,
partially valid page appears on a page queue, which is wrong. This
could only happen for the last page, because only then buffer which
triggered invalidation could not cover the whole page.
Reported and tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)
MFC after: 2 weeks
VPB_BIT_WAITERS flag were changed between reading of busy_lock and the
cas. The vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and do not mean that the current page state
prevents sbusying.
Retry the operation inside vm_page_trysbusy() if cas failed, only
return a failure when VPB_BIT_SHARED is cleared.
Reported and tested by: pho
Reviewed by: attilio
Sponsored by: The FreeBSD Foundation
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).
Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.
Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().
Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop(). Detect this scenario so that the vnode's hold count is
correctly maintained. Otherwise, we panic.
Reported by: scottl
Tested by: pho
Discussed with: attilio, jeff, kib
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
wired, unwind back the wiring bits otherwise we can end up freeing a
page that is considered wired.
Sponsored by: EMC / Isilon storage division
Reported by: alc
maintaining better LRU of active pages.
- Change v_free_target to include the quantity previously represented by
v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for
waking the page daemon. Set this 10% above v_free_min so we wakeup before
any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon
wakeups. This means we process fewer pages in one iteration as well,
leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the
PQ_ACTIVE segment of the normal pageout daemon. Instead we now process
1 / vm_pageout_update_period pages every second. This causes us to visit
the whole active list every 60 seconds. Previously we would only maintain
the active LRU when we were short on pages which would mean it could be
woefully out of date.
Reviewed by: alc (slight variant of this)
Discussed with: alc, kib, jhb
Sponsored by: EMC / Isilon Storage Division
what is really needed on this code snipped is that all the pages that
are already fully inserted gets fully freed, while for the others the
object removal itself might be skipped, hence the object might be set to
NULL.
Sponsored by: EMC / Isilon storage division
Reported by: alc, kib
Reviewed by: alc
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.
Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.
Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().
Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.
In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.
vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
not busy, since its only caller brelse() can legitimately call it on
busy page. This happens for VOP_PUTPAGES() on filesystems that use
buffers and which VOP_WRITE() method marked the buffer containing page
as non-cacheable.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
reset by pmap_page_init() right after being initialized in vm_page_initfake().
The statement above is with reference to the amd64 implementation of
pmap_page_init().
Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing
the 'memattr'.
Reviewed by: kib
MFC after: 2 weeks
clearing the page's PGA_REFERENCED flag. Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag. So, in fact, this revision only changes some
comments and an assertion. Nonetheless, it will enable later changes to
object locking in the pageout code.
Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)
Sponsored by: EMC / Isilon Storage Division
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
mostl of the times only with readlocks.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once. This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.
Sponsored by: EMC / Isilon Storage Division
indices. Consequentially, vm_page_insert() should use
vm_radix_lookup_le() instead of vm_radix_lookup_ge(). Here's why. In
the expected case, vm_radix_lookup_le() will quickly find a page less
than the specified key at the same radix node. In contrast,
vm_radix_lookup_ge() is expected to return NULL, but to do that it must
examine every slot in the radix tree that is greater than the key.
Prior to this change, the average cost of a vm_page_insert() call on my
test machine was 992 cycles. After this change, the average cost is only
532 cycles, a reduction of 46%.
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
"index". The content of a radix tree leaf, or at least its "key", is not
opaque to the other radix tree operations. Specifically, they know how to
extract the "key" from a leaf. So, eliminating the parameter "index" isn't
breaking the abstraction. Moreover, eliminating the parameter "index"
effectively prevents the caller from passing an inconsistent "index" and
leaf to vm_radix_insert().
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
page into its new object until the page's pindex has been updated.
Otherwise, one code path within vm_radix_insert() may use the wrong
pindex value.
Sponsored by: EMC / Isilon Storage Division
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide
general way but must be evaluated case by case.
Embedd the decision in the caller themselves rather than in a
general purpose KPI.
Sponsored by: EMC / Isilon storage division
Reported by: alc
Reviewed by: alc
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho