The precedence of the '&' operator is less than of '+'. Added braces
do change the order of evaluation into the natural one, in my opinion.
On the other hand, the value of the expression should not change since
all elements should have page-aligned values.
This fixes a gcc warning reported.
Reported by: adrian
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With helper page daemon threads, enabled by default in r364786, we
divide the inactive target by the number of threads, rounding down, and
sum the total number of pages freed by the threads. This sum is
compared with the original target, but by rounding down we might lose
pages, causing the page daemon control loop to conclude that inactive
queue scanning isn't keeping up with demand for free pages. Typically
this results in excessive swapping.
Fix the problem by accounting for the error in the main pagedaemon
thread's target. Note that by default the problem will manifest only in
systems with >16 CPUs in a NUMA domain.
Reviewed by: cem
Discussed with: dougm
Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26610
uma_zalloc_domain() allocates from the requested domain instead of
following a first-touch policy (the default for most zones). Currently
it is only used by malloc_domainset(), and consumers free returned items
with free(9) since r363834.
Previously uma_zalloc_domain() worked by always going to the keg for an
item. As a result, the use of UMA zone caches was unbalanced: we free
items to the caches, but always allocate from the keg, skipping the
caches.
Make some effort to allocate from the UMA caches when performing a
cross-domain allocation. This avoids blowing up the caches when
something is performing many transient allocations with
malloc_domainset().
Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26427
When SMR was introduced, zone_put_bucket() was changed to always place
full buckets at the end of the queue. However, it is generally
preferable to use recently used buckets since their items are more
likely to be resident in cache. So, for buckets that have no constraint
on item reuse, use a last-in-first-out ordering as we did before.
Reviewed by: rlibby
Tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26426
Currently we allocate and map zero-filled anonymous pages when dumping
core. This can result in lots of needless disk I/O and page
allocations. This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.
Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page. Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file. This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.
The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.
PR: 249067
Reviewed by: kib
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26590
On an Ampere Altra system, the physical memory is populated
sparsely within the physical address space, with only about 0.4%
of physical addresses backed by RAM in the range [0, last_pa].
This is causing the vm_reserv_array to be over-sized by a few
orders of magnitude, wasting roughly 5 GiB on a system with
256 GiB of RAM.
The sparse allocation of vm_reserv_array is controlled by defining
VM_PHYSSEG_SPARSE, with the dense allocation still remaining for
platforms with VM_PHYSSEG_DENSE.
Reviewed by: markj, alc, kib
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26130
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).
Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.
Reviewed by: jhb
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26131
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitons in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.
Reviewed by: kib, markj
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26129
vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.
Reported by: Coverity
Reviewed by: kib markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26214
Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit. However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit.
The purpose of the limit is to provide a seatbelt, not to impose some
policy on the use of wired memory. Thus, increase the default limit to
allow reasonable VM configurations to work without tuning.
Reviewed by: kib
Discussed with: dougm
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26424
Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.
Only amd64 for now, both 2M and 1G superpages can be requested, the
later requires CPU feature.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
The entries and their clip boundaries must be aligned on supported
superpages sizes from pagesizes[]. vm_map operations return Mach
error KERN_INVALID_ARGUMENT, which is usually translated to EINVAL, if
it would require clip not at the boundary.
In other words, entries force preserving virtual addresses superpage
properties.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
The flag requests entry of non-managed superpage mapping of size
pagesizes[psind] into the page table.
Pmap supports fake wiring of the largepage mappings. Only attributes
of the largepage mapping can be changed by calling pmap_enter(9) over
existing mapping, physical address of the page must be unchanged.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
Add support for user-supplied callbacks into phys pager operations,
providing custom getpages(), haspage(), and populate() methods
implementations. Pager stores user data ptr/val in the object to
provide context.
Add phys_pager_allocate() helper that takes user ops table as one of
the arguments.
Current code for these methods is moved to the 'default' ops table,
assigned automatically when vm_pager_alloc() is used.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while
vm_object_split() is sleeping. In this case some pages in new_object
may be left unbusied, but vm_object_split() attempts to unbusy all of
them.
Track the beginning of the busied range. Add an assertion to verify
that pages are not re-added to the source object while sleeping.
Reported by: Olympios Petrakis <olympios.petrakis@netapp.com>
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26223
We were needlessly acquiring the object lock to call
vm_page_grab_pages() even when all of the requested pages were looked up
locklessly. Fix that, stop testing for count == 0 in
vm_page_grab_pages(), and add assertions to help catch this kind of
mistake.
Reported by: cem
Reviewed by: alc, cem, dougm, jeff
Differential Revision: https://reviews.freebsd.org/D26304
Currently we use a single bit to indicate whether the virtual page is
part of a superpage. To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.
The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl. MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.
For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.
Reviewed by: alc, kib
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26238
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.
ksize() function is used by drm-kmod.
Reviewed by: hselasky, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26215
Coverity flagged the scaling by sizeof(uzd). That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.
Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.
Reported by: Coverity
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26213
This helps minimize internal fragmentation that occurs when 2MB imports
are interleaved across NUMA domains. Virtually all KVA allocations on
direct map platforms consume more than one page, so the fragmentation
manifests as runs of 511 4KB page mappings in the kernel.
Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26050
Autoscale vm_pageout worker threads from r364129 with CPU count. The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable.
There will never be less than 1 thread per populated NUMA domain, and
the previous arbitrary upper limit (at most ncpus/2 threads per NUMA
domain) is preserved.
Care is taken to gracefully handle asymmetric NUMA nodes, such as empty
node systems (e.g., AMD 2990WX) and systems with nodes of varying size
(e.g., some larger >20 core Intel Haswell/Broadwell Xeon).
Reviewed by: kib, markj
Sponsored by: Isilon
Differential Revision: https://reviews.freebsd.org/D26152
For such pages ref_count is effectively a consumer-managed field, but
there is no harm in calling vm_page_wire() on them.
vm_page_unwire_noq() handles them as well. Relax the vm_page_wire()
assertions to permit this case which is triggered by some out-of-tree
code. [1]
Also guard a conditional assertion with INVARIANTS. Otherwise the
conditions are evaluated even though the result is unused. [2]
Reported by: bz, cem [1], kib [2]
Reviewed by: dougm, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26173
The primary benefit is maintaining a completely shared
code base with the community allowing FreeBSD to receive
new features sooner and with less effort.
I would advise against doing 'zpool upgrade'
or creating indispensable pools using new
features until this change has had a month+
to soak.
Work on merging FreeBSD support in to what was
at the time "ZFS on Linux" began in August 2018.
I first publicly proposed transitioning FreeBSD
to (new) OpenZFS on December 18th, 2018. FreeBSD
support in OpenZFS was finally completed in December
2019. A CFT for downstreaming OpenZFS support in
to FreeBSD was first issued on July 8th. All issues
that were reported have been addressed or, for
a couple of less critical matters there are
pull requests in progress with OpenZFS. iXsystems
has tested and dogfooded extensively internally.
The TrueNAS 12 release is based on OpenZFS with
some additional features that have not yet made
it upstream.
Improvements include:
project quotas, encrypted datasets,
allocation classes, vectorized raidz,
vectorized checksums, various command line
improvements, zstd compression.
Thanks to those who have helped along the way:
Ryan Moeller, Allan Jude, Zack Welch, and many
others.
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25872
The zone limit mechanism was recently reworked, and
allocation failures due to limits being exceeded
were inadvertently no longer being recorded. This
would lead to, for example, mbuf allocation failures
not being indicated in netstat -m or vmstat -z
Reviewed by: markj
Sponsored by: Netflix
Some of the resulting fallout in CAM does not appear straightforward to
fix, so simply revert the commit for now in the absence of a better
solution.
Discussed with: mjg
Reported by: dhw
non-sleepable context. Previously only _sleep() would panic.
This will catch misuse of M_WAITOK at development stage rather
than at stress load stage.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D26027
Today, the zone is only used to allocate a trio of kernel maps: the
kernel map itself, and the exec and pipe submaps. Maps for user
processes are dynamically allocated but are embedded in the vmspace
structure, which is allocated from its own zone. Make the
aforementioned kernel maps statically allocated and get rid of the zone.
While here, remove a stale comment above vmspace_alloc() and change the
names of locks initialized in vm_map_init() to match vmspace_zinit().
Reported by: alc
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26052
The vm objects are type-stable, and can be accessed even after the
last reference is dropped, or in case of vnode objects, after vgone()
destroyed it as well.
Stop asserting that pip == 0 after vm_object_terminate() waited for
existing owners to drop it, we only want to drain them before setting
OBJ_DEAD flag. Also stop asserting pip == 0 in object destructor.
Update comments explaining the interaction between paging_in_progress
and termination.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968
This will be used later, where it matters on 32bit arches.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968
In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing. Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.
To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon. It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.
The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.
I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems. For now I would like to commit this infrastructure with only a
single thread enabled.
The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.
Submitted by: jeff (earlier version)
Discussed with: markj
Tested by: pho
Sponsored by: probably Netflix (based on contemporary commits)
Differential Revision: https://reviews.freebsd.org/D21629
The global "bucketdisable" flag indicates that we are in a low memory
situation and should avoid allocating buckets. However, in the
allocation path we were checking it before the full bucket cache and
bailing even if the cache is non-empty. Defer the check so that we have
a shot at allocating from the cache.
This came up because M_NOWAIT allocations from the buf trie node zone
must always succeed. In one scenario, all of the preallocated trie
nodes were in the bucket list, and a new slab allocation could not
succeed due to a memory shortage. The short-circuiting caused an
allocation failure which triggered a panic.
Reported by: pho
Reviewed by: cem
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25980
In the most common case (fork+execve) this doesn't matter, but further
attempts to apply entropy would fail in (e.g.) a pre-fork server.
Reported by: Alfredo Mazzinghi
Reviewed by: kib, markj
Obtained from: CheriBSD
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D25966