Remove OBJT_SWAP_TMPFS. Move tmpfs-specific swap pager bits into
tmpfs_subr.c.
There is no longer any code to directly support tmpfs in sys/vm, most
tmpfs knowledge is shared by non-anon swap object type implementation.
The tmpfs-specific methods are provided by registered tmpfs pager, which
inherits from the swap pager.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
Pager is allowed to inherit part of its implementation from the existing
pager, which is done by copying non-NULL virtual method slots.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
Mostly in cases where OBJ_SWAP flag works as well, or by reversing the
condition so that object types can be listed.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
This avoids the need to know all existing object types in advance, by the
cost of loosing the assert that unknown object type is handled in a sane
manner.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.
Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.
This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Put each type into dedicated line, which makes addition of new
types cleaner.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Allow vp_heldp argument to be NULL, in which case the returned vnode
is not held for tmpfs swap objects.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Makes the code in vm_object collapse/page_remove cleaner
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
This eliminates the staircase of conditions in vm_map_entry_set_vnode_text().
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
specialized for swap and vnode pagers, and used to implement
vm_object_set_writeable_dirty().
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Fill lines with the function definitions.
Use local var to shorten repeated extra-long expressions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA. This value is "alloc". Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map. Do so using the correct size.
Reported by: kp
Tested by: kp
Sponsored by: The FreeBSD Foundation
When estimating working set size, measure only allocation batches, not free
batches. Allocation and free patterns can be very different. For example,
ZFS on vm_lowmem event can free to UMA few gigabytes of memory in one call,
but it does not mean it will request the same amount back that fast too, in
fact it won't.
Update working set size on every reclamation call, shrinking caches faster
under pressure. Lack of this caused repeating vm_lowmem events squeezing
more and more memory out of real consumers only to make it stuck in UMA
caches. I saw ZFS drop ARC size in half before previous algorithm after
periodic WSS update decided to reclaim UMA caches.
Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom track longterm minimal cache size watermark, freeing some unused
items every UMA_TIMEOUT after first 15 minutes without cache misses. Freed
memory can get better use by other consumers. For example, ZFS won't grow
its ARC unless it see free memory, since it does not know it is not really
used. And even if memory is not really needed, periodic free during
inactivity periods should reduce its fragmentation.
Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object. This allows userspace
to deconstruct the shadow chain. Right now the handle is the address
of the object in KVA, but this is not guaranteed.
For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field. I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.
For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29771
Make it possible to reclaim items from a specific NUMA domain.
- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.
Currently the new KPIs are unused, so there should be no functional
change.
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685
Note that the per-domain variant does not shrink the target bucket size.
No functional change intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Memory allocated with kmem_* is unmapped upon free, so KASAN doesn't
provide a lot of benefit, but since allocations are always a multiple of
the page size we can create a redzone when the allocation request size
is not a multiple of the page size.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29458
We allocate kernel stacks using a UMA cache zone. Cache zones have
KASAN disabled by default, but in this case it makes sense to enable it.
Reviewed by: andrew
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29457
- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
zone should not be sanitized. This is applied implicitly for NOFREE
and cache zones.
- Add KASAN call backs which get invoked:
1) when a slab is imported into a keg
2) when an item is allocated from a zone
3) when an item is freed to a zone
4) when a slab is freed back to the VM
In state transitions 1 and 3, memory is poisoned so that accesses will
trigger a panic. In state transitions 2 and 4, memory is marked
valid.
- Disable trashing if KASAN is enabled. It just adds extra CPU overhead
to catch problems that are detected by KASAN.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29456
--Eliminate a big ifdef that encompassed all currently-supported
architectures except mips and powerpc32. This applied to the case
in which we've allocated a superpage but the pager-populated range
is insufficient for a superpage mapping. For platforms that don't
support superpages the check should be inexpensive as we shouldn't
get a superpage in the first place. Make the normal-page fallback
logic identical for all platforms and provide a simple implementation
of pmap_ps_enabled() for MIPS and Book-E/AIM32 powerpc.
--Apply the logic for handling pmap_enter() failure if a superpage
mapping can't be supported due to additional protection policy.
Use KERN_PROTECTION_FAILURE instead of KERN_FAILURE for this case,
and note Intel PKU on amd64 as the first example of such protection
policy.
Reviewed by: kib, markj, bdragon
Differential Revision: https://reviews.freebsd.org/D29439
pmap_enter(PMAP_ENTER_LARGEPAGE) may return KERN_PROTECTION_FAILURE due to
PKRU inconsistency. Handle it in the call place from vm_fault_populate(),
and in places which decode errors from vm_fault_populate()/
vm_fault_allocate().
Reviewed by: jah, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29442
We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.
These make cleanup code simpler.
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29189
The per-domain partpop queue is locked by the combination of the
per-domain lock and individual reservation mutexes.
vm_reserv_reclaim_contig() scans the queue looking for partially
populated reservations that can be reclaimed in order to satisfy the
caller's allocation.
During the scan, we drop the per-domain lock. At this point, the rvn
pointer may be invalidated. Take care to load rvn after re-acquiring
the per-domain lock.
While here, simplify the condition used to check whether a reservation
was dequeued while the per-domain lock was dropped.
Reviewed by: alc, kib
Reported by: gallatin
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29203
When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit. Otherwise,
it will not be visible to vm_phys_alloc_contig() as it is currently
implemented. This is a problem for allocation requests that are not a
power of 2 in size, as with 9KB jumbo mbuf clusters.
Reported by: alc
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28924
This KASSERT is overzealous because of the following race condition:
1) A managed page which is currently in PQ_LAUNDRY is freed.
vm_page_free_prep calls vm_page_dequeue_deferred()
The page state is:
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
2) The laundry worker comes around and pick up the page and calls
vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the
queue. We do a vm_page_astate_load and get
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
as per above.
3) The laundry worker is pre-empted and another thread allocates our page
from the free pool. For example vm_page_alloc_domain_after calls
vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for
an OBJT_UNMANAGED object.
The page state is:
PQ_NONE, 0 - VPO_UNMANAGED
4) The laundry worker resumes, and processes vm_pageout_defer based on the
stale astate which leads to a call to vm_page_pqbatch_submit, which will
trip on the KASSERT.
Submitted by: mlaier
Reviewed by: markj, rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28563
Otherwise, on a powerpc64 NUMA system with hashed page tables, the
first-level superpage reservation size is large enough that the value of
the kernel KVA arena import quantum, KVA_NUMA_IMPORT_QUANTUM, is
negative and gets sign-extended when passed to vmem_set_import(). This
results in a boot-time hang on such platforms.
Reported by: bdragon
MFC after: 3 days
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.
Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.
Reported by: GENERIC-KCSAN
Test Plan: KCSAN no longer complains, kernel still runs fine.
Reviewed By: markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.
In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions. Abstracting this check reduces the diff relative
to FreeBSD. It is perhaps slightly more readable as well.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D28710
This flag indicates that the page should be enqueued near the head of
the inactive queue, skipping the LRU queue. It is used when unwiring
pages from the buffer cache following direct I/O or after I/O when
POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when
sendfile(SF_NOCACHE) completes. For the direct I/O and sendfile cases
we only enqueue the page if we decide not to free it, typically because
it's mapped.
Pass "noreuse" through to vm_page_release_toq() so that we actually
honour the desired LRU policy for these scenarios.
Reported by: bdrewery
Reviewed by: alc, kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28555
If a M_WAITOK contig alloc fails, the VM subsystem will try to
reclaim contiguous memory twice before actually failing the
request. On a system with 64GB of RAM I've observed this take
400-500ms before it finally gives up, and I believe that this
will only be worse on systems with even more memory.
In certain contexts this delay is extremely harmful, so add a flag
that will skip reclaim for allocation requests to allow those
paths to opt-out of doing an expensive reclaim.
Sponsored by: Dell Inc
Differential Revision: https://reviews.freebsd.org/D28422
Reviewed by: markj, kib
Replace all uses of kern_mmap with kern_mmap_req move the old kern_mmap.
Reand rename kern_mmap_req to kern_mmap .
The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.
Obtained from: CheriBSD
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28292
This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome. E.g. you can get an error that is not
possible if operation is performed atomic.
Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.
Reviewed by: brooks, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28117
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
startup_alloc() uses pmap_map() to map slabs used for bootstrapping the
VM. pmap_map() may ignore the hint address and simply return a range
from the direct map. In this case we must not unmap the range in
startup_free().
UMA uses bootstart and bootmem to track the range of KVA into which
slabs are mapped if the direct map is not used. Unmap a startup slab
only if it was mapped into that range.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27885
Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.
Submitted by: jeff
Reviewed by: cem, kib, markj (all previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22703