of CPUs present. On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.
The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).
This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64). Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger. It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.
The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.
Tested on: EC2 x1.32xlarge
Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring
vm.boot_pages to be manually adjusted.
Reviewed by: jeff, alc, adrian
Approved by: re (kib)
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
objects.
Assert that there is no new waiters for the already terminated objects.
Old waiters should have been notified by the termination calling
vnode_pager_dealloc() (old/new are with regard of the lock acquisition
interval).
Only clear the vp->v_object for the case of already terminated object,
since other branches call vnode_pager_dealloc(), which should clear
the pointer. Assert this.
Tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
sleepable scan, iteration over the shadow chain looking for a page
could find an OBJ_DEAD object. Such state of the mapping is only
transient, the dead object will be terminated and removed from the
chain shortly. We must not return KERN_PROTECTION_FAILURE unless the
object type is changed to OBJT_DEAD in the chain, indicating that
paging on this address is really impossible. Returning
KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered
to processes, or kernel accesses to UVA spuriously failing with
EFAULT.
If the object with OBJ_DEAD flag is found, only return
KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD.
Otherwise, sleep a tick and retry the fault handling.
Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by
waiting until the paging on this object is finished. But to do so, we
need to reference the dead object, while vm_object_collapse() insists
on owning the final reference on the collapsed object. This could be
fixed by e.g. changing the assert to shared reference release between
vm_fault() and vm_object_collapse(), but it seems to be too much
complications for rare boundary condition.
PR: 204426
Tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
X-Differential revision: https://reviews.freebsd.org/D6085
MFC after: 2 weeks
Approved by: re (gjb)
waiters exist, same as for vm_page_xunbusy(). If previous value of
busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is
not needed.
Move common code from vm_page_xunbusy_maybelocked() and
vm_page_xunbusy_hard() to vm_page_xunbusy_locked().
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
There is an order between covered vnode lock and allproc_lock, which
is established by calling mountcheckdirs() while owning the covered
vnode lock. mountcheckdirs() iterates over the processes, protected by
allproc_lock. This order is needed and seems to be not avoidable.
On the other hand, various VM daemons also need to iterate over all
processes, and they lock and unlock user maps. Since unlock of the
user map may trigger processing of the deferred map entries, it causes
vnode locking to occur. Or, when vmspace is freed, dropping references
on the vnode-backed object also lock vnodes. We get reverted order
comparing with the mount/unmount order.
For VM daemons, there is no need to own allproc_lock while we operate
on vmspaces. If the process is held, it serves as the marker for
allproc list, which allows to continue the iteration.
Add _PHOLD_LITE() macro, similar to _PHOLD(), but not causing swap-in
of the kernel stacks. It is used instead of _PHOLD() in vm code,
since e.g. calling faultin() in OOM conditions only exaggerates the
problem.
Modernize comment describing PHOLD.
Reported by: lists@yamagi.org
Tested by: pho (previous version)
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 week
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6679
statistics. Marking is done by setting the OBJ_ACTIVE flag. The
flags change is locked, but the problem is that many parts of system
assume that vm object initialization ensures that no other code could
change the object, and thus performed lockless. The end result is
corrupted flags in vm objects, most visible is spurious OBJ_DEAD flag,
causing random hangs.
Avoid the active object marking, instead provide equally inexact but
immutable is_object_alive() definition for the object mapped state.
Avoid iterating over the processes mappings altogether by using
arguably improved definition of the paging thread as one which sleeps
on the v_free_count.
PR: 204764
Diagnosed by: pho
Tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)
Right now, all modifications of the list are locked by sw_alloc_mtx.
But initial lookup of the object by the handle in swap_pager_alloc()
is not protected by sw_alloc_mtx, which means that
vm_pager_object_lookup() could follow freed pointer.
Create a new named swap object with the OBJT_SWAP type, instead
of OBJT_DEFAULT. With this change, swp_pager_meta_build() never need
to upgrade named OBJT_DEFAULT to OBJT_SWAP (in the other place, we do
not forbid for client code to create named OBJT_DEFAULT objects at
all).
That change allows to remove sw_alloc_mtx and make the list locked by
sw_alloc_sx lock. Update swap_pager_copy() to new locking mode.
Create helper swap_pager_alloc_init() to consolidate named and
anonymous swap objects creation, while a caller ensures that the
neccesary locks are held around the helper.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (hrs)
Per the KASSERT at the beginning of the function, we expect that the page
does not belong to any object, so its object and pindex fields are
meaningless. Reset them in the rare case that vm_radix_insert() fails.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D6669
With r284861, UMA zones use the trash ctor and dtor by default. This is
incompatible with memguard, which frees the backing page when the item
is freed. Modify the UMA debug functions to be no-ops if the item was
allocated from memguard. This also fixes constructors such as
mb_ctor_pack(), which invokes the trash ctor in addition to performing
some initialization.
Reviewed by: glebius
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D6562
acquire the page lock, which recurses. Avoid the recursion by reusing
the code from vm_page_remove() in a new helper
vm_page_xunbusy_maybelocked().
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
indicate that threads are waiting for free pages to become available and
(2) to indicate whether a wakeup call has been sent to the page daemon.
The trouble is that a single flag cannot really serve both purposes, because
we have two distinct targets for when to wakeup threads waiting for free
pages versus when the page daemon has completed its work. In particular,
the flag will be cleared by vm_page_free() before the page daemon has met
its target, and this can lead to the OOM killer being invoked prematurely.
To address this problem, a new flag "vm_pageout_wanted" is introduced.
Discussed with: jeff
Reviewed by: kib, markj
Tested by: markj
Sponsored by: EMC / Isilon Storage Division
optimized copy-on-write faults. This has two advantages: (1) one less radix
tree operation is performed and (2) vm_page_replace_checked() cannot fail,
making the code simpler.
Submitted by: Ryan Libby
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4478
swap_pager_copy() might unlock the object, which allows the parallel
collapse to execute. Besides destroying the object, it also might
move the reference from parent to the backing object, firing the
assertion ref_count == 1.
Collapses are prevented by bumping paging_in_progress counters on both
the object and its backing object.
Reported by: cem
Tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D6085
for empty page cache when the object type if OBJT_VNODE.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
freed page as VPO_UNMANAGED. Otherwise vm_pge_free_toq() insists on
owning the page lock.
Previously, VPO_UNMANAGED was only set up to the last processed page.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Existing issue of not protecting pager_object_list iteration in
vm_pager_object_lookup() by sw_alloc_mtx is not affected by Giant
removal.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
is actually the opposite of that stated in the comment.
Remove an unnecessary assignment. Use an assertion to document the fact
that no assignment is needed.
Rewrite another comment to clarify that the page is not completely valid.
Reviewed by: kib
the current offset (spelled: "fs.pindex") until it is known whether a
backing object exists. In fact, if not for the fact that the backing
object offset is zero when there is no backing object, this update would
produce a broken offset.
Reviewed by: kib
Add a pair of bus methods that can be used to "map" resources for direct
CPU access using bus_space(9). bus_map_resource() creates a mapping and
bus_unmap_resource() releases a previously created mapping. Mappings are
described by 'struct resource_map' object. Pointers to these objects can
be passed as the first argument to the bus_space wrapper API used for bus
resources.
Drivers that wish to map all of a resource using default settings
(for example, using uncacheable memory attributes) do not need to change.
However, drivers that wish to use non-default settings can now do so
without jumping through hoops.
First, an RF_UNMAPPED flag is added to request that a resource is not
implicitly mapped with the default settings when it is activated. This
permits other activation steps (such as enabling I/O or memory decoding
in a device's PCI command register) to be taken without creating a
mapping. Right now the AGP drivers don't set RF_ACTIVE to avoid using
up a large amount of KVA to map the AGP aperture on 32-bit platforms.
Once RF_UNMAPPED is supported on all platforms that support AGP this
can be changed to using RF_UNMAPPED with RF_ACTIVE instead.
Second, bus_map_resource accepts an optional structure that defines
additional settings for a given mapping.
For example, a driver can now request to map only a subset of a resource
instead of the entire range. The AGP driver could also use this to only
map the first page of the aperture (IIRC, it calls pmap_mapdev() directly
to map the first page currently). I will also eventually change the
PCI-PCI bridge driver to request mappings of the subset of the I/O window
resource on its parent side to create mappings for child devices rather
than passing child resources directly up to nexus to be mapped. This
also permits bridges that do address translation to request suitable
mappings from a resource on the "upper" side of the bus when mapping
resources on the "lower" side of the bus.
Another attribute that can be specified is an alternate memory attribute
for memory-mapped resources. This can be used to request a
Write-Combining mapping of a PCI BAR in an MI fashion. (Currently the
drivers that do this call pmap_change_attr() directly for x86 only.)
Note that this commit only adds the MI framework. Each platform needs
to add support for handling RF_UNMAPPED and thew new
bus_map/unmap_resource methods. Generally speaking, any drivers that
are calling rman_set_bustag() and rman_set_bushandle() need to be
updated.
Discussed on: arch
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D5237
cleanup consists of fixes to comments. However, there is one change to
code: Remove special-case handling of errors involving the kernel map.
We do not perform I/O on the kernel map, so there is no need for this
special case.
Reviewed by: kib (an earlier version)
pq_vcnt, as a count of real things, has no business being negative. It is only
ever initialized by a u_int counter.
The warning came from the atomic_add_int() in vm_pagequeue_cnt_add().
Rectify the warning by changing the variable to u_int. No functional change.
Suggested by: Clang 3.3
Sponsored by: EMC / Isilon Storage Division
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.
A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.
The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.
The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.
Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.
The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.
Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode' page.
Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Avoid logging inconsistency for the /dev/mem device at all. The driver
leaves memattr intact, and the corrective action in the device pager
handles it right.
In the logged warning, name the driver we blame, and show memory
attributes values.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D6149
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().
MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782
for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
breaking the ABI. Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.
Reviewed by: vangyzen (previous version)
Discussed with: deischen, emaste, jhb, rwatson,
Martin Simmons <martin@lispworks.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
- Pull the vmspace logic out into helper functions and reduce duplication.
Operations on the vmspace are all isolated to vm_map.c, but it now exports
a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
perform cleanup after the loop end. This reduces a lot of indentation and
allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4990
than ascending order in vm_phys_alloc_contig() so that, for example, a
sequence of contigmalloc(low=0, high=4GB) calls doesn't exhaust the supply
of low physical memory resulting in a later contigmalloc(low=0, high=1MB)
failure.
Reported by: cy
Tested by: cy
Sponsored by: EMC / Isilon Storage Division
more than once when doing round-robin.
This lead to a panic because the iterator was trying the same domain
twice and not trying one of the other domains.
Reported by: pho
Tested by: pho
exhausted.
It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.
While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.
This patch makes two changes:
a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.
b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().
Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
address and use this mechanism when:
1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
memory allocator's free page lists. This replaces the long-standing
approach of scanning the inactive and inactive queues, converting clean
pages into PG_CACHED pages and laundering dirty pages. In contrast, the
new mechanism does not use PG_CACHED pages nor does it trigger a large
number of I/O operations.
2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
free pages in the physical memory allocator's free page lists that are
covered by the direct map. Tested by: adrian
3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
free pages in the physical memory allocator's free page lists.
In the coming months, I expect that this new mechanism will be applied in
other places. For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.
Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.
Reviewed by: kib, markj
Glanced at: jhb
Discussed with: jeff
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4444
It turns out the callers of vm_page_replace know exactly which page they are
replacing and would like to assert about it. Change those from hard panics to
KASSERTs, and provide them with a wrapper so they don't have to deal with
warnings from an INVARIANTS-dependent dead store of the return value of
vm_page_replace.
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib (earlier version)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4497
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.
Suggested by: kib
Reported by: pho
Remove redundant lookup of the old page from vm_page_replace. Verification
that the old page exists is already done by vm_radix_replace.
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Follow-up to: https://reviews.freebsd.org/D4326
Differential Revision: https://reviews.freebsd.org/D4471
On vm_page_rename failure, fix a missing object unlock and a double free of
a page.
First remove the old page, then rename into other page into first_object,
then free the old page. This avoids the problem on rename failure. This is
a little ugly but seems to be the most straightforward solution.
Tested with:
$ sysctl debug.fail_point.uma_zalloc_arg="1%return"
$ kyua test -k /usr/tests/sys/Kyuafile
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: kib
Seen by: alc
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4326
These two functions were largely unrelated, they just used the same same
loop logic to walk through a backing object's memq. Pull out the
all_shadowed test as its own function and eliminate
OBSC_TEST_ALL_SHADOWED. Rename vm_object_backing_scan to
vm_object_collapse_scan.
No functional change.
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4335
invalid (busy) page supposedly inserted by the vm_fault(), in the
OBSC_COLLAPSE_NOWAIT case. As a continuation to r221714, fix a case
when invalid page is found by the object scan in OBSC_COLLAPSE_WAIT
case as well. But, since this is waitable scan, we should wait for
the termination of the busy state and restart from the beginning of
the backing object' page queue. [*]
Do not free the shadow page swap space when the parent page is
invalid, otherwise this action potentially corrupts user data.
Combine all instances of the collapse scan sleep code fragments into
the new helper vm_object_backing_scan_wait().
Improve style compliance and comments. Change the return type of
vm_object_backing_scan() to bool.
Initial submission by: cem, https://reviews.freebsd.org/D4103 [*]
Reviewed by: alc, cem
Tested by: cem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D4146
Systematically use ANSI C functions definitions.
Correct type of the flags argument to the dev_pager_putpages() function.
Use vm_pager_free_nonreq().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
active queue scan initiated write.
Re-trying from the inactive queue when doing active scan makes the
loop never end if number of domains is greater than 1 and inactive or
active scan cannot reach the target.
Reported and tested by: Andrew Gallatin <gallatin@netflix.com>
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
case that the reservation contained "low", the starting position in the
popmap for the free page search was incorrectly calculated. The most
likely (and visible) symptom of this error was the assertion failure,
"vm_reserv_reclaim_contig: pa is too low".
The r289895 revision did not accounted for the block containing the
requested page, when calculating the run of pages. Include the pages
before/after the requested page, that fit into the reqblock, into the
calculation.
Noted by: glebius
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
critical section.
uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zalloc_free().
The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistenly
applied.
This change clarifies the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.
PR: 204633
Differential Revision: https://reviews.freebsd.org/D4197
Reviewed by: markj
Approved by: gnn (mentor)
Sponsored by: Juniper Networks
checks for the swap space consumption plus checks that the amount of
the free pages exceeds some limit, in case pagedeamon did not coped
with the page shortage in one of the late passes. This is wrong
because it does not account for the presence of the reclamaible pages
in the queues which are not selectable for reclaim immediately. E.g.,
on the swap-less systems, large active queue easily triggered OOM.
Instead, only raise OOM when pagedaemon is unable to produce a free
page in several back-to-back passes. Track the failed passes per
pagedaemon thread.
The number of passes to trigger OOM was selected empirically and
tested both on small (32M-64M i386 VM) and large (32G amd64)
configurations. If the specifics of the load require tuning, sysctl
vm.pageout_oom_seq sets the number of back-to-back passes which must
fail before OOM is raised. Each pass takes 1/2 of seconds. Less the
value, more sensible the pagedaemon is to the page shortage.
In future, some heuristic to calculate the value of the tunable might
be designed based on the system configuration and load. But before it
can be done, the i/o system must be fixed to reliably time-out
pagedaemon writes, even if waiting for the memory to proceed. Then,
code can account for the in-flight page-outs and postpone OOM until
all of them finished, which should reduce the need in tuning. Right
now, ignoring the in-flight writes and the counter allows to break
deadlocks due to write path doing sleepable memory allocations.
Reported by: Dmitry Sivachenko, bde, many others
Tested by: pho, bde, tuexen (arm)
Reviewed by: alc
Discussed with: bde, imp
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Residency count track the number of pte entries installed into the
current pmap, which does not reflect the consumption of the physical
memory by the address map. Due to several mechanisms like pv entries
reclamation, copy on write etc. the resident pte entries count may be
much less than the amount of physical memory kept by the process.
Provide the OOM-specific vm_pageout_oom_pagecount() function which
estimates the amount of reclamaible memory which could be stolen if
the process is killed.
Reported and tested by: pho
Reviewed by: alc
Comments text by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
other actions, swaps out kernel stacks of the processes. On the other
hand, currentl OOM logic which selects a process to kill in the
critical condition, skips process with swapped-out thread. Under some
loads, this results in the big(gest) process being ignored by OOM.
Do not skip a process which has inhibited thread due to the swap-out,
in the OOM selection loop. Note that killing such process requires
the thread stack page-in, but sometimes this is the only way to
recover.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.
One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.
A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.
The 'pcb_size' variable is used to index into the stoppcbs[] array.
The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.
While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved. Annotations for
other platforms will be added in the future.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3773
reclaimed in FIFO order by the pagedaemon. Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
pager. It is enough to execute VOP_BMAP() once to obtain both the
disk block address for the requested page, and the before/after limits
for the contiguous run. The clipping of the vm_page_t array passed to
the vnode_pager_generic_getpages() and the disk address for the first
page in the clipped array can be deduced from the call results.
While there, remove some noise (like if (1) {...}) and adjust nearby
code.
Reviewed by: alc
Discussed with: glebius
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
in vm_pageout_fallback_object_lock() and vm_pageout_page_lock(). The
check for the m->queue == queue assumes that the page does belong to a
queue.
Modify the 'unchanged' calculation bu dereferencing the marker tailq
pointers, which is known to belong to the queue. Since for a page m
linked to the queue, m->queue must be equal to the queue index, assert
this instead of checking.
In collaboration with: alc
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 2 weeks
8x performance improvement in a micro benchmark on a 4 socket machine.
- Get buffer headers from a per-cpu uma cache that sits in from of the
free queue.
- Use a per-cpu quantum cache in vmem to eliminate contention for kva.
- Use multiple clean queues according to buffer cache size to eliminate
clean queue lock contention.
- Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
from blocking or doing direct recycling.
- Close some bufspace allocation races that could lead to endless
recycling.
- Further the transition to a more modern style of small functions grouped
by prefix in order to improve growing complexity.
Sponsored by: EMC / Isilon
Reviewed by: kib
Tested by: pho
the kernel or kmem object can't be paged out. Since they can't be paged
out, they are never enqueued in a paging queue. Nonetheless, passing
PQ_INACTIVE to vm_page_unwire() in kmem_unback() creates the appearance
that these pages are being enqueued in the inactive queue. As of r288122,
we can avoid giving this false impression by passing PQ_NONE.
Submitted by: kmacy
Differential Revision: https://reviews.freebsd.org/D1674
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.
This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726
arena in r254025 introduced a bug in the case when an allocation is only
partially successful. Specifically, the vm object lock was not being
acquired before freeing the allocated pages. To address this bug, replace
the existing code by a call to kmem_unback().
Change the type of a variable in kmem_alloc_attr() so that an allocation
of two or more gigabytes won't fail.
Replace the error handling code in kmem_back() by a call to kmem_unback().
Reviewed by: kib (an earlier version)
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
by noobj_alloc() don't belong to a vm object, they can't be paged out.
Since they can't be paged out, they are never enqueued in a paging queue.
Nonetheless, passing PQ_INACTIVE to vm_page_unwire() creates the appearance
that these pages are being enqueued in the inactive queue. As of r288122,
we can avoid giving this false impression by passing PQ_NONE.
Submitted by: kmacy
Differential Revision: https://reviews.freebsd.org/D1674
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.
Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.
(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
should not assume that vm_pages_needed will remain set while it sleeps.
Other threads can clear vm_pages_needed by performing a sufficient
number of vm_page_free() calls, e.g., process termination. The effect
of this error was that vm_pageout_worker() would free and/or launder
pages when, in fact, there was no shortage of free pages.
Rewrite a nearby comment to describe all of the possible cases and not
just the most common case. The problem being that the comment made
the most common case seem like the only case.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
with a reference count of zero can't possibly be mapped, so there is never a
need for vm_page_set_invalid() to call pmap_remove_all() on them.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
mapped address without valid pte installed, when parallel wiring of
the entry happen. The entry must be copy on write. If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear. After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check. Note that this is race, if the address is accessed after
the wiring is done, the entry does not fault at all.
There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it. Initially this
was done in r24666, since that kernel did not supported proper
copy-on-write for wired text, which was fixed in r199869. The r251901
revision re-introduced the r24666 fix for the current VM.
Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW. In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared. Put the
assert stating the invariant, instead of returning the error.
Reported and debugging help by: peter
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
locking and doesn't sleep. Flag the consumer we create as such. In
addition, decrement the in flight index when we have an out of memory
error after having incremented it previously. This would have
prevented swapoff from working if the swap pager ever hit a resource
shortage trying to swap out something (the swap in path always waits
for a bio, so won't have this issue). Simplify the close logic by
abandoning the use of private and initializing the index to 1 and
dropping that reference when we previously set private.
Also, set sw_id only while sw_dev_mtx is held. This should only affect
swapping to a vnode, as opposed to a geom whose close always sets it to
NULL with sw_dev_mtx held.
Differential Review: https://reviews.freebsd.org/D3547
so that there is only one place where pages are freed and only one place
where pages are moved to the tail of the queue.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
pages will have left the inactive queue before the page daemon performs
its next scan. Also, ignore references to pages from terminated objects.
This allows the clean pages to be freed a little sooner.
Move some comments to their proper place, i.e., next to the code that
they describe, and update other nearby comments.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Objects obtained from such zones are supposed to retain type stability,
which was violated by aforementioned trashing.
This is a follow-up to r284861.
Discussed with: kib