4571 Commits

Author SHA1 Message Date
Mark Johnston
f09cbea31a uma: Respect uk_reserve in keg_drain()
When a reserve of free items is configured for a zone, the reserve must
not be reclaimed under memory pressure.  Modify keg_drain() to simply
respect the reserved pool.

While here remove an always-false uk_freef == NULL check (kegs that
shouldn't be drained should set _NOFREE instead), and make sure that the
keg_drain() KTR statement does not reference an uninitialized variable.

Reviewed by:	alc, rlibby
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26772
2020-10-19 16:57:40 +00:00
Mark Johnston
1b2dcc8c54 uma: Avoid depleting keg reserves when filling a bucket
zone_import() fetches a free or partially free slab from the keg and
then uses its items to populate an array, typically filling a bucket.
If a single allocation causes the keg to drop below its minimum reserve,
the inner loop ends.  However, if the bucket is still not full and
M_USE_RESERVE is specified, the outer loop will continue to fetch items
from the keg.

If M_USE_RESERVE is specified and the number of free items is below the
reserved limit, we should return only a single item.  Otherwise, if the
bucket size is larger than the reserve, all of the reserved items may
end up in a single per-CPU bucket, invisible to other CPUs.

Reviewed by:	rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26771
2020-10-19 16:55:03 +00:00
Konstantin Belousov
6f3b523c9a Avoid dump_avail[] redefinition.
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h.  This fixes default gcc build for mips.

Reviewed by:	alc, scottph
Tested by:	kevans (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26741
2020-10-14 22:51:40 +00:00
Bryan Drewery
c2c6fb90e0 Use unlocked page lookup for inmem() to avoid object lock contention
Reviewed By:	kib, markj
Submitted by:	mlaier
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D26653
2020-10-09 23:49:42 +00:00
Konstantin Belousov
42f96162c3 vm_page_dump_index_to_pa(): Add braces to the expression involving + and &.
The precedence of the '&' operator is lower than that of '+'.  Added braces
do change the order of evaluation into the natural one, in my opinion.
On the other hand, the value of the expression should not change since
all elements should have page-aligned values.

This fixes a gcc warning reported.

Reported by:	adrian
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-08 22:46:15 +00:00
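The precedence point in 42f96162c3 can be illustrated with a small user-space sketch. The function names and the 4 KiB page size are hypothetical and are not taken from vm_page_dump_index_to_pa(). In C, `a + b & m` parses as `(a + b) & m`; the two functions below spell out both groupings so that they can be compared.

```c
#include <stdint.h>

/* Hypothetical page-size constants, for illustration only. */
#define EX_PAGE_SIZE	4096u
#define EX_PAGE_MASK	((uint64_t)EX_PAGE_SIZE - 1)

/* How C parses "base + off & ~mask": the sum is formed first. */
static uint64_t
ex_sum_then_mask(uint64_t base, uint64_t off)
{
	return ((base + off) & ~EX_PAGE_MASK);
}

/* The other grouping: only the offset is masked, then added. */
static uint64_t
ex_mask_then_sum(uint64_t base, uint64_t off)
{
	return (base + (off & ~EX_PAGE_MASK));
}
```

When both operands are page-aligned, as the commit message expects, the two groupings agree, so adding the explicit grouping only silences the warning. They differ once the base is not aligned.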
Mark Johnston
2913cc4637 vm_pageout: Avoid rounding down the inactive scan target
With helper page daemon threads, enabled by default in r364786, we
divide the inactive target by the number of threads, rounding down, and
sum the total number of pages freed by the threads.  This sum is
compared with the original target, but by rounding down we might lose
pages, causing the page daemon control loop to conclude that inactive
queue scanning isn't keeping up with demand for free pages.  Typically
this results in excessive swapping.

Fix the problem by accounting for the error in the main pagedaemon
thread's target.  Note that by default the problem will manifest only in
systems with >16 CPUs in a NUMA domain.

Reviewed by:	cem
Discussed with:	dougm
Reported and tested by:	dhw, glebius
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26610
2020-10-02 19:16:06 +00:00
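The rounding problem described in 2913cc4637 comes down to integer division. The following is a minimal sketch with hypothetical helper names, not the vm_pageout code itself: each helper thread gets the target divided by the thread count, rounded down, and the main thread additionally absorbs the remainder so that the per-thread targets sum to the original target.

```c
/* Share given to each helper thread: the target split evenly, rounded down. */
static int
ex_helper_share(int target, int nthreads)
{
	return (target / nthreads);
}

/* The main thread takes its share plus whatever the division dropped. */
static int
ex_main_share(int target, int nthreads)
{
	return (target / nthreads + target % nthreads);
}

/* Sum of all shares: one main thread plus (nthreads - 1) helpers. */
static int
ex_total_share(int target, int nthreads)
{
	return (ex_main_share(target, nthreads) +
	    (nthreads - 1) * ex_helper_share(target, nthreads));
}
```

With a target of 100 pages and 3 threads, the helpers each scan for 33 pages and the main thread for 34, so the total matches the target instead of falling one page short.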
Mark Johnston
06d8bdcbf7 uma: Use the bucket cache for cross-domain allocations
uma_zalloc_domain() allocates from the requested domain instead of
following a first-touch policy (the default for most zones).  Currently
it is only used by malloc_domainset(), and consumers free returned items
with free(9) since r363834.

Previously uma_zalloc_domain() worked by always going to the keg for an
item.  As a result, the use of UMA zone caches was unbalanced: we free
items to the caches, but always allocate from the keg, skipping the
caches.

Make some effort to allocate from the UMA caches when performing a
cross-domain allocation.  This avoids blowing up the caches when
something is performing many transient allocations with
malloc_domainset().

Reported and tested by:	dhw, glebius
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26427
2020-10-02 19:04:29 +00:00
Mark Johnston
5afdf5c1ca uma: Use LIFO for non-SMR bucket caches
When SMR was introduced, zone_put_bucket() was changed to always place
full buckets at the end of the queue.  However, it is generally
preferable to use recently used buckets since their items are more
likely to be resident in cache.  So, for buckets that have no constraint
on item reuse, use a last-in-first-out ordering as we did before.

Reviewed by:	rlibby
Tested by:	dhw, glebius
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26426
2020-10-02 19:04:09 +00:00
Mark Johnston
952c8964ba uma: Remove newlines from panic messages
Sponsored by:	The FreeBSD Foundation
2020-10-02 19:03:42 +00:00
Mark Johnston
f31695cc64 Implement sparse core dumps
Currently we allocate and map zero-filled anonymous pages when dumping
core.  This can result in lots of needless disk I/O and page
allocations.  This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.

Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page.  Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file.  This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.

The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.

PR:		249067
Reviewed by:	kib
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26590
2020-10-02 17:50:22 +00:00
Mark Johnston
114484b7ec Flag vm_reserv and vm_phys sysctls as MPSAFE.
Nothing in these subsystems relies on Giant.

MFC after:	1 week
2020-09-23 19:36:07 +00:00
Mark Johnston
78257765f2 Add a vmparam.h constant indicating pmap support for large pages.
Enable SHM_LARGEPAGE support on arm64.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26467
2020-09-23 19:34:21 +00:00
D Scott Phillips
de03184698 arm64/pmap: Sparsify pv_table
Reviewed by:	markj, kib
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26132
2020-09-21 22:23:57 +00:00
D Scott Phillips
7988971a99 vm_reserv: Sparsify the vm_reserv_array when VM_PHYSSEG_SPARSE
On an Ampere Altra system, the physical memory is populated
sparsely within the physical address space, with only about 0.4%
of physical addresses backed by RAM in the range [0, last_pa].

This is causing the vm_reserv_array to be over-sized by a few
orders of magnitude, wasting roughly 5 GiB on a system with
256 GiB of RAM.

The sparse allocation of vm_reserv_array is controlled by defining
VM_PHYSSEG_SPARSE, with the dense allocation still remaining for
platforms with VM_PHYSSEG_DENSE.

Reviewed by:	markj, alc, kib
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26130
2020-09-21 22:22:53 +00:00
D Scott Phillips
00e6614750 Sparsify the vm_page_dump bitmap
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 MiB to > 2 GiB
(and overflowing `int` for the size).

Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.

Reviewed by:	jhb
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26131
2020-09-21 22:21:59 +00:00
D Scott Phillips
ab041f713a Move vm_page_dump bitset array definition to MI code
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitions in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.

Reviewed by:	kib, markj
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26129
2020-09-21 22:20:37 +00:00
Eric van Gyzen
f9cc8410e1 vm_ooffset_t is now unsigned
vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.

Reported by:	Coverity
Reviewed by:	kib markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26214
2020-09-18 16:48:08 +00:00
Mark Johnston
97458520cc Increase the default vm.max_user_wired value.
Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit.  However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit.

The purpose of the limit is to provide a seatbelt, not to impose some
policy on the use of wired memory.  Thus, increase the default limit to
allow reasonable VM configurations to work without tuning.

Reviewed by:	kib
Discussed with:	dougm
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26424
2020-09-17 16:49:28 +00:00
Konstantin Belousov
d301b3580f Support for userspace non-transparent superpages (largepages).
Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.

Only amd64 for now, both 2M and 1G superpages can be requested, the
latter requires CPU feature.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:12:51 +00:00
Konstantin Belousov
e2e80fb3de vm_map: Add a map entry kind that can only be clipped at specific boundary.
The entries and their clip boundaries must be aligned on supported
superpages sizes from pagesizes[].  vm_map operations return Mach
error KERN_INVALID_ARGUMENT, which is usually translated to EINVAL, if
it would require clip not at the boundary.

In other words, entries force preserving virtual addresses superpage
properties.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:02:30 +00:00
Konstantin Belousov
6cadbcd203 Add pmap_enter(9) PMAP_ENTER_LARGEPAGE flag and implement it on amd64.
The flag requests entry of non-managed superpage mapping of size
pagesizes[psind] into the page table.

Pmap supports fake wiring of the largepage mappings.  Only attributes
of the largepage mapping can be changed by calling pmap_enter(9) over
existing mapping, physical address of the page must be unchanged.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:50:24 +00:00
Konstantin Belousov
7a9f2da33c Add vm_map_find_aligned(9).
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:44:59 +00:00
Konstantin Belousov
60cd9c95c5 Move MAP_32BIT_MAX_ADDR definition to sys/mman.h.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:39:06 +00:00
Konstantin Belousov
e8f77c204b Prepare to handle non-trivial errors from vm_map_delete().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:34:31 +00:00
Konstantin Belousov
a720b31c2a Allow consumer to customize physical pager.
Add support for user-supplied callbacks into phys pager operations,
providing custom getpages(), haspage(), and populate() methods
implementations.  Pager stores user data ptr/val in the object to
provide context.

Add phys_pager_allocate() helper that takes user ops table as one of
the arguments.

Current code for these methods is moved to the 'default' ops table,
assigned automatically when vm_pager_alloc() is used.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 00:00:43 +00:00
Konstantin Belousov
67a659d282 Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:48:19 +00:00
Konstantin Belousov
89d2fb14d5 Add interruptible variant of vm_wait(9), vm_wait_intr(9).
Also add msleep flags argument to vm_wait_doms(9).

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:28:09 +00:00
Mark Johnston
aec9e7d8b0 vm_object_split(): Handle orig_object type changes.
orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while
vm_object_split() is sleeping.  In this case some pages in new_object
may be left unbusied, but vm_object_split() attempts to unbusy all of
them.

Track the beginning of the busied range.  Add an assertion to verify
that pages are not re-added to the source object while sleeping.

Reported by:	Olympios Petrakis <olympios.petrakis@netapp.com>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26223
2020-09-07 23:28:33 +00:00
Mark Johnston
a2d704d19f Avoid unnecessary object locking in vm_page_grab_pages_unlocked().
We were needlessly acquiring the object lock to call
vm_page_grab_pages() even when all of the requested pages were looked up
locklessly.  Fix that, stop testing for count == 0 in
vm_page_grab_pages(), and add assertions to help catch this kind of
mistake.

Reported by:	cem
Reviewed by:	alc, cem, dougm, jeff
Differential Revision:	https://reviews.freebsd.org/D26304
2020-09-02 19:59:25 +00:00
Mark Johnston
847ab36bf2 Include the psind in data returned by mincore(2).
Currently we use a single bit to indicate whether the virtual page is
part of a superpage.  To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.

The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl.  MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.

For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26238
2020-09-02 18:16:43 +00:00
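The encoding described in 847ab36bf2 can be sketched as follows. The EX_ names and the exact bit positions are assumptions made for illustration. They are chosen so that a two-bit field holds the page size index and so that index 1 equals the old single-bit flag, as the commit message states, but the authoritative definitions are the ones in sys/mman.h.

```c
/* Assumed layout: a two-bit page-size-index field at bits 5 and 6. */
#define EX_MINCORE_INCORE	0x01
#define EX_MINCORE_SUPER_OLD	0x20	/* the former single-bit flag */
#define EX_MINCORE_SUPER	0x60	/* now a mask over the psind field */
#define EX_MINCORE_PSIND(i)	(((i) << 5) & EX_MINCORE_SUPER)

/* Recover the page size index from one mincore(2)-style status byte. */
static int
ex_mincore_psind(unsigned char status)
{
	return ((status & EX_MINCORE_SUPER) >> 5);
}
```

Two bits can encode indices 0 through 3, which is why the scheme permits a pagesizes array of at most four entries.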
Mateusz Guzik
c3aa3bf97c vm: clean up empty lines in .c and .h files
2020-09-01 21:20:45 +00:00
Vladimir Kondratyev
5d4bf0578f LinuxKPI: Implement ksize() function.
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.

ksize() function is used by drm-kmod.

Reviewed by:	hselasky, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26215
2020-08-29 19:26:31 +00:00
Eric van Gyzen
609de97e04 vm_pageout_scan_active: ensure ps_delta is initialized
Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26212
2020-08-28 19:59:02 +00:00
Eric van Gyzen
a2e194654f memstat_kvm_uma: fix reading of uma_zone_domain structures
Coverity flagged the scaling by sizeof(uzd).  That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.

Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.

Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26213
2020-08-28 19:50:40 +00:00
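The scaling mistake that Coverity flagged in a2e194654f is a general C pitfall. The sketch below uses a hypothetical structure and table, not the libmemstat code: adding an index to a typed pointer already advances by whole elements, so multiplying the index by the element size as well scales the offset twice.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-domain structure. */
struct ex_zone_domain {
	long ex_nitems;
	long ex_imax;
};

static struct ex_zone_domain ex_table[8];

/* Byte offset actually produced by typed pointer arithmetic (i <= 8). */
static size_t
ex_offset_measured(size_t i)
{
	return ((size_t)((const char *)(ex_table + i) -
	    (const char *)ex_table));
}

/* Byte offset of element i: the compiler scales by the element size. */
static size_t
ex_offset_expected(size_t i)
{
	return (i * sizeof(struct ex_zone_domain));
}

/* Byte offset if the index is also scaled by hand before the addition. */
static size_t
ex_offset_double_scaled(size_t i)
{
	return ((i * sizeof(struct ex_zone_domain)) *
	    sizeof(struct ex_zone_domain));
}
```

For any index above zero the double-scaled offset lands well past the intended element, which is the kind of error the commit removes.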
Mark Johnston
aea9103e06 Use a large kmem arena import size on NUMA systems.
This helps minimize internal fragmentation that occurs when 2MB imports
are interleaved across NUMA domains.  Virtually all KVA allocations on
direct map platforms consume more than one page, so the fragmentation
manifests as runs of 511 4KB page mappings in the kernel.

Reviewed by:	alc, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26050
2020-08-26 14:31:48 +00:00
Conrad Meyer
74f5530d7a vm_pageout: Scale worker threads with CPUs
Autoscale vm_pageout worker threads from r364129 with CPU count.  The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable.

There will never be less than 1 thread per populated NUMA domain, and
the previous arbitrary upper limit (at most ncpus/2 threads per NUMA
domain) is preserved.

Care is taken to gracefully handle asymmetric NUMA nodes, such as empty
node systems (e.g., AMD 2990WX) and systems with nodes of varying size
(e.g., some larger >20 core Intel Haswell/Broadwell Xeon).

Reviewed by:	kib, markj
Sponsored by:	Isilon
Differential Revision:	https://reviews.freebsd.org/D26152
2020-08-25 21:36:56 +00:00
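The scaling rule in 74f5530d7a can be sketched as a clamp. This is an assumption-laden illustration, not the committed code: the function name is hypothetical, and rounding the CPU count up is a guess that happens to agree with the later remark in 2913cc4637 that the multi-thread case needs more than 16 CPUs in a domain.

```c
/*
 * Hypothetical sketch: one worker per cpus_per_thread CPUs (rounded up,
 * by assumption), at most half the domain's CPUs, and never fewer than one.
 */
static int
ex_pageout_threads(int domain_cpus, int cpus_per_thread)
{
	int cap, n;

	if (domain_cpus < 0)
		domain_cpus = 0;
	if (cpus_per_thread < 1)
		cpus_per_thread = 1;
	n = (domain_cpus + cpus_per_thread - 1) / cpus_per_thread;
	cap = domain_cpus / 2;
	if (n > cap)
		n = cap;
	if (n < 1)
		n = 1;
	return (n);
}
```

Applying the lower bound last matters on a single-CPU system, where the ncpus/2 cap evaluates to zero.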
Mark Johnston
411096d034 Permit vm_page_wire() to be called on pages not belonging to an object.
For such pages ref_count is effectively a consumer-managed field, but
there is no harm in calling vm_page_wire() on them.
vm_page_unwire_noq() handles them as well.  Relax the vm_page_wire()
assertions to permit this case which is triggered by some out-of-tree
code. [1]

Also guard a conditional assertion with INVARIANTS.  Otherwise the
conditions are evaluated even though the result is unused. [2]

Reported by:	bz, cem [1], kib [2]
Reviewed by:	dougm, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26173
2020-08-25 13:45:06 +00:00
Matt Macy
9e5787d228 Merge OpenZFS support in to HEAD.
The primary benefit is maintaining a completely shared
code base with the community allowing FreeBSD to receive
new features sooner and with less effort.

I would advise against doing 'zpool upgrade'
or creating indispensable pools using new
features until this change has had a month+
to soak.

Work on merging FreeBSD support in to what was
at the time "ZFS on Linux" began in August 2018.
I first publicly proposed transitioning FreeBSD
to (new) OpenZFS on December 18th, 2018. FreeBSD
support in OpenZFS was finally completed in December
2019. A CFT for downstreaming OpenZFS support in
to FreeBSD was first issued on July 8th. All issues
that were reported have been addressed or, for
a couple of less critical matters there are
pull requests in progress with OpenZFS. iXsystems
has tested and dogfooded extensively internally.
The TrueNAS 12 release is based on OpenZFS with
some additional features that have not yet made
it upstream.

Improvements include:
  project quotas, encrypted datasets,
  allocation classes, vectorized raidz,
  vectorized checksums, various command line
  improvements, zstd compression.

Thanks to those who have helped along the way:
Ryan Moeller, Allan Jude, Zack Welch, and many
others.

Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D25872
2020-08-25 02:21:27 +00:00
Mateusz Guzik
feabaaf995 cache: drop the always curthread argument from reverse lookup routines
Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by:	pho
2020-08-24 08:57:02 +00:00
Andrew Gallatin
791dda877f uma: record allocation failures due to zone limits
The zone limit mechanism was recently reworked, and
allocation failures due to limits being exceeded
were inadvertently no longer being recorded. This
would lead to, for example, mbuf allocation failures
not being indicated in netstat -m or vmstat -z.

Reviewed by:	markj
Sponsored by:	Netflix
2020-08-21 18:31:57 +00:00
Mateusz Guzik
7ad2a82da2 vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error
Most consumers pass NULL.
2020-08-19 02:51:17 +00:00
Mark Johnston
b21b022a81 Revert r364310.
Some of the resulting fallout in CAM does not appear straightforward to
fix, so simply revert the commit for now in the absence of a better
solution.

Discussed with:	mjg
Reported by:	dhw
2020-08-18 14:09:49 +00:00
Gleb Smirnoff
1921bb7b68 With INVARIANTS panic immediately if M_WAITOK is requested in a
non-sleepable context.  Previously only _sleep() would panic.
This will catch misuse of M_WAITOK at development stage rather
than at stress load stage.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D26027
2020-08-17 15:37:08 +00:00
Mark Johnston
7efe14cb99 Commit a missing piece of r364302.
This had failed to apply due to a merge conflict.

Reported by:	Jenkins
MFC with:	r364302
2020-08-17 14:06:51 +00:00
Mark Johnston
7dd979dfef Remove the VM map zone.
Today, the zone is only used to allocate a trio of kernel maps: the
kernel map itself, and the exec and pipe submaps.  Maps for user
processes are dynamically allocated but are embedded in the vmspace
structure, which is allocated from its own zone.  Make the
aforementioned kernel maps statically allocated and get rid of the zone.

While here, remove a stale comment above vmspace_alloc() and change the
names of locks initialized in vm_map_init() to match vmspace_zinit().

Reported by:	alc
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26052
2020-08-17 13:02:01 +00:00
Konstantin Belousov
ffae7ea935 vm_object: allow paging_in_progress to be acquired after object termination.
The vm objects are type-stable, and can be accessed even after the
last reference is dropped, or in case of vnode objects, after vgone()
destroyed it as well.

Stop asserting that pip == 0 after vm_object_terminate() waited for
existing owners to drop it, we only want to drain them before setting
OBJ_DEAD flag.  Also stop asserting pip == 0 in object destructor.

Update comments explaining the interaction between paging_in_progress
and termination.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25968
2020-08-16 20:57:02 +00:00
Konstantin Belousov
419e5698a0 Atomically update vm_object vnp_size, where atomic is available.
This will be used later, where it matters on 32bit arches.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25968
2020-08-16 20:52:24 +00:00
Mateusz Guzik
a92a971bbb vfs: remove the thread argument from vget
It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)
2020-08-16 17:18:54 +00:00
Conrad Meyer
ea7b737a6f vm_pageout: Correct threshold calculation on single-CPU systems
Reported by:	Michael Butler
X-MFC-With:	r364129
2020-08-14 18:48:48 +00:00
Conrad Meyer
b7883452d4 Back out unrelated change
Reported by:	kib, markj
X-MFC-With:	r364129
2020-08-12 00:21:30 +00:00
Conrad Meyer
0292c54bdb Add support for multithreading the inactive queue pageout within a domain.
In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing.  Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.

To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon.  It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.

The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.

I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems.  For now I would like to commit this infrastructure with only a
single thread enabled.

The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.

Submitted by:	jeff (earlier version)
Discussed with:	markj
Tested by:	pho
Sponsored by:	probably Netflix (based on contemporary commits)
Differential Revision:	https://reviews.freebsd.org/D21629
2020-08-11 20:37:45 +00:00
Mark Johnston
af32cefd7c Check the UMA zone's full bucket cache before short-circuiting an alloc.
The global "bucketdisable" flag indicates that we are in a low memory
situation and should avoid allocating buckets.  However, in the
allocation path we were checking it before the full bucket cache and
bailing even if the cache is non-empty.  Defer the check so that we have
a shot at allocating from the cache.

This came up because M_NOWAIT allocations from the buf trie node zone
must always succeed.  In one scenario, all of the preallocated trie
nodes were in the bucket list, and a new slab allocation could not
succeed due to a memory shortage.  The short-circuiting caused an
allocation failure which triggered a panic.

Reported by:	pho
Reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25980
2020-08-10 20:34:45 +00:00
Brooks Davis
9f9cc3f989 Preserve ASLR vm_map flags across fork
In the most common case (fork+execve) this doesn't matter, but further
attempts to apply entropy would fail in (e.g.) a pre-fork server.

Reported by:	Alfredo Mazzinghi
Reviewed by:	kib, markj
Obtained from:	CheriBSD
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D25966
2020-08-06 16:20:20 +00:00
Mark Johnston
efec381dd1 Remove most lingering references to the page lock in comments.
Finish updating comments to reflect new locking protocols introduced
over the past year.  In particular, vm_page_lock is now effectively
unused.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25868
2020-08-04 14:59:43 +00:00
Mark Johnston
96ad26eefb Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
Mark Johnston
958d8f527c Remove the volatile qualifier from busy_lock.
Use atomic(9) to load the lock state.  Some places were doing this
already, so it was inconsistent.  In initialization code, the lock state
is still initialized with plain stores.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25861
2020-07-29 19:38:49 +00:00
Mark Johnston
f72e5be58a vm_page_xbusy_claim(): Use atomics to update busy lock state.
vm_page_xbusy_claim() could clobber the waiter bit.  For its original
use, kernel memory pages, this was not a problem since nothing would
ever block on the busy lock for such pages.  r363607 introduced a new
use where this could in principle be a problem.

Fix the problem by using atomic_cmpset to update the lock owner.  Since
this macro is defined only for INVARIANTS kernels the extra overhead
doesn't seem prohibitive.

Reported by:	vangyzen
Reviewed by:	alc, kib, vangyzen
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25859
2020-07-28 19:50:39 +00:00
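The waiter-bit hazard fixed in f72e5be58a can be shown with C11 atomics in user space. Everything here is hypothetical: the lock word layout, the EX_ names and the demo helpers are invented for illustration, and the kernel uses atomic(9) rather than <stdatomic.h>. The point is only that a plain store of the new owner discards a concurrently set flag bit, whereas a compare-and-swap loop carries it over.

```c
#include <stdatomic.h>
#include <stdint.h>

#define EX_WAITERS	0x1u	/* hypothetical "someone is sleeping" bit */

/* Plain store: whatever flag bits were set are overwritten. */
static void
ex_claim_plain(_Atomic uint32_t *lock, uint32_t owner)
{
	atomic_store(lock, owner);
}

/* Compare-and-swap loop: install the owner, preserving the waiter bit. */
static void
ex_claim_cas(_Atomic uint32_t *lock, uint32_t owner)
{
	uint32_t old;

	old = atomic_load(lock);
	while (!atomic_compare_exchange_weak(lock, &old,
	    owner | (old & EX_WAITERS)))
		continue;	/* 'old' was reloaded; retry */
}

/* Demo helpers: start with owner 0x10 plus a waiter, claim as 0x20. */
static uint32_t
ex_demo_plain(void)
{
	_Atomic uint32_t lock = 0x10u | EX_WAITERS;

	ex_claim_plain(&lock, 0x20u);
	return (atomic_load(&lock));
}

static uint32_t
ex_demo_cas(void)
{
	_Atomic uint32_t lock = 0x10u | EX_WAITERS;

	ex_claim_cas(&lock, 0x20u);
	return (atomic_load(&lock));
}
```

After the plain store the waiter bit is gone, so a sleeping thread would never be woken. The compare-and-swap version leaves it set.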
Mark Johnston
782ebde52e vm_page_free_invalid(): Relax the xbusy assertion.
vm_page_assert_xbusied() asserts that the busying thread is the current
thread.  For some uses of vm_page_free_invalid() (e.g., error handling
in vnode_pager_generic_getpages_done()), this condition might not hold.

Reported by:	Jenkins via trasz
Reviewed by:	chs, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25828
2020-07-27 14:25:10 +00:00
Doug Moore
00fd73d2da Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64.  Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX.  Call them both BLIST_RADIX.

Make improvements to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by:	jmallett
Reviewed by:	alc
Discussed with:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25736
2020-07-25 18:29:10 +00:00
Mateusz Guzik
ee74412269 vm: fix swap reservation leak and clean up surrounding code
The code did not subtract from the global counter if per-uid reservation
failed.

Cleanup highlights:
- load overcommit once
- move per-uid manipulation to dedicated routines
- don't fetch wire count if requested size is below the limit
- convert return type from int to bool
- ifdef the routines with _KERNEL to keep vm.h compilable by userspace

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D25787
2020-07-24 13:23:32 +00:00
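The leak fixed in ee74412269 follows a common two-level accounting pattern. The sketch below is a simplified user-space model with hypothetical names and limits, not the swap reservation code: the global counter is charged first, and if the per-uid check then fails, the global charge has to be undone before returning.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the global and per-uid state. */
struct ex_swap_acct {
	long global_reserved;
	long global_limit;
	long uid_reserved;
	long uid_limit;
};

/* Reserve n units; on per-uid failure optionally roll back the global charge. */
static bool
ex_swap_reserve(struct ex_swap_acct *a, long n, bool undo_on_failure)
{
	if (a->global_reserved + n > a->global_limit)
		return (false);
	a->global_reserved += n;
	if (a->uid_reserved + n > a->uid_limit) {
		if (undo_on_failure)
			a->global_reserved -= n;	/* the fix */
		return (false);
	}
	a->uid_reserved += n;
	return (true);
}

/* Global counter left behind after one failed per-uid reservation. */
static long
ex_demo_leftover(bool undo_on_failure)
{
	struct ex_swap_acct a = { 0, 100, 0, 10 };

	(void)ex_swap_reserve(&a, 20, undo_on_failure);
	return (a.global_reserved);
}
```

Without the rollback, every failed per-uid reservation permanently inflates the global counter by the requested amount.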
Mateusz Guzik
126a2470b9 vm: annotate swap_reserved with __exclusive_cache_line
The counter keeps being updated all the time and variables read afterwards
share the cacheline. Note this still fundamentally does not scale and needs
to be replaced, in the meantime gets a bandaid.

brk1_processes -t 52 ops/s:
before: 8598298
after:  9098080
2020-07-23 08:42:16 +00:00
Chuck Silvers
1bd12a3bb2 Fix vnode_pager handling of read ahead/behind pages when a disk read fails.
Rather than marking the read ahead/behind pages valid even though they were
not initialized, free them using the new function vm_page_free_invalid().

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:10:35 +00:00
Chuck Silvers
4dfa06e114 Add a new function vm_page_free_invalid() for freeing invalid pages
that might be wired.  If the page is wired then it cannot be freed now,
but the thread that eventually unwires it will free it at that point.

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:09:36 +00:00
Chuck Silvers
c3dbadc1fd Revert my change from r361855 in favor of a better fix.
Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:08:01 +00:00
Mark Johnston
a7752896f0 Add vm_map_valid_range_KBI().
This is required for standalone module builds.

Reported by:	hselasky
Reviewed by:	dougm, hselasky, kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D25650
2020-07-13 16:39:27 +00:00
Scott Long
ffc568ba8b Revert r362998, r326999 while a better compatibility strategy is devised. 2020-07-09 22:38:36 +00:00
Scott Long
b302c2e5c9 Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after:	1 week
2020-07-07 20:33:11 +00:00
Conrad Meyer
8a64110e43 vm: Add missing WITNESS warnings for M_WAITOK allocation
vm_map_clip_{end,start} and lookup_clip_start allocate memory M_WAITOK
for !system_map vm_maps.  Add WITNESS warning annotation for !system_map
callers who may be holding non-sleepable locks.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25283
2020-06-29 16:54:00 +00:00
Mark Johnston
8c277118d8 Fix UMA's first-touch policy on systems with empty domains.
Suppose a thread is running on a CPU in a NUMA domain with no physical
RAM.  When an item is freed to a first-touch zone, it ends up in the
cross-domain bucket.  When the bucket is full, it gets placed in another
domain's bucket queue.  However, when allocating an item, UMA will
always go to the keg upon a per-CPU cache miss because the empty
domain's bucket queue will always be empty.  This means that a non-empty
domain's bucket queues can grow very rapidly on such systems.  For
example, it can easily cause mbuf allocation failures when the zone
limit is reached.

Change cache_alloc() to follow a round-robin policy when running on an
empty domain.

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25355
2020-06-28 21:35:04 +00:00
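The round-robin fallback described in 8c277118d8 above might look roughly like the following userspace sketch. The domain count, the sk_domain_empty[] table and pick_domain_sketch() are assumptions made for illustration; the real cache_alloc() logic works on UMA's own structures.

```c
#include <assert.h>
#include <stdbool.h>

#define	SK_NDOMAINS	4

/* Hypothetical table: true if the domain has no physical RAM. */
bool sk_domain_empty[SK_NDOMAINS];
/* Round-robin cursor shared by all callers in this sketch. */
unsigned sk_rr_next;

/*
 * Prefer the local domain.  If it has no memory, rotate through the
 * other domains so that no single populated domain absorbs every
 * allocation.  Returns -1 if every domain is empty.
 */
int
pick_domain_sketch(int local)
{
	unsigned d, i;

	assert(local >= 0 && local < SK_NDOMAINS);
	if (!sk_domain_empty[local])
		return (local);
	for (i = 0; i < SK_NDOMAINS; i++) {
		d = sk_rr_next++ % SK_NDOMAINS;
		if (!sk_domain_empty[d])
			return ((int)d);
	}
	return (-1);
}
```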
Konstantin Belousov
ee06cffcd2 vm_page_free_prep(): correct description of the required page and object state.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25482
2020-06-27 02:31:39 +00:00
Mark Johnston
84242cf68a Call swap_pager_freespace() from vm_object_page_remove().
All vm_object_page_remove() callers, except
linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when
removing a range of pages from an object.  The LinuxKPI case appears to
be an unintentional omission that could result in leaked swap blocks, so
unconditionally free swap space in vm_object_page_remove() to protect
against similar bugs in the future.

Reviewed by:	alc, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25329
2020-06-25 15:21:21 +00:00
Jeff Roberson
c8b0a88b8d Clarify some language. Favor primary where both master and primary were
used in conjunction with secondary.
2020-06-20 20:21:04 +00:00
Edward Tomasz Napierala
52c81be11a Add linux_madvise(2) instead of having Linux apps call the native
FreeBSD madvise(2) directly.  While some of the flag values match,
most don't.

PR:		kern/230160
Reported by:	markj
Reviewed by:	markj
Discussed with:	brooks, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25272
2020-06-20 18:29:22 +00:00
Mark Johnston
cdd02f43b9 Revert r362360.
This commit was simply wrong since two different objects are locked.

Reported by:	lwhsu, pho
Pointy hat:	markj
2020-06-19 11:04:49 +00:00
Mark Johnston
f034074034 Restore a check unintentionally dropped in r362361.
MFC with:	r362361
2020-06-19 04:18:20 +00:00
Mark Johnston
0f1e6ec591 Add a helper function for validating VA ranges.
Functions which take untrusted user ranges must validate against the
bounds of the map, and also check for wraparound.  Instead of having the
same logic duplicated in a number of places, add a function to check.

Reviewed by:	dougm, kib
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D25328
2020-06-19 03:32:04 +00:00
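A range check of the kind described in 0f1e6ec591 above can be sketched in a few lines. struct map_bounds and range_valid_sketch() are hypothetical names; the sketch only assumes that a map has a lowest and a highest valid address and that the caller passes a half-open range [start, end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical map bounds; a real map keeps these in its own header. */
struct map_bounds {
	uintptr_t min;	/* lowest valid address */
	uintptr_t max;	/* one past the highest valid address */
};

/*
 * Return true if [start, end) lies within the map and does not wrap.
 * A caller that computes end = start + len gets end < start when the
 * addition overflows, and the first comparison rejects that case.
 */
bool
range_valid_sketch(const struct map_bounds *mb, uintptr_t start,
    uintptr_t end)
{
	assert(mb->min <= mb->max);
	return (end >= start && start >= mb->min && end <= mb->max);
}
```

Keeping all three comparisons in one helper means each caller that takes an untrusted range gets the wraparound check for free.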
Mark Johnston
61b006887e Fix a double object unlock in vm_object_backing_collapse_wait().
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25327
2020-06-19 03:31:46 +00:00
Conrad Meyer
a116b5d3e4 vm: Drop vm_map_clip_{start,end} macro wrappers
No functional change.

Reviewed by:	dougm, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25282
2020-06-16 22:53:56 +00:00
Eric van Gyzen
8cc8c5864a Honor db_pager_quit in some vm_object ddb commands
These can be rather verbose.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2020-06-12 21:53:08 +00:00
Mateusz Guzik
7ce3a31286 vm: rework swap_pager_status to execute in constant time
The lock-protected iteration is trivially avoidable.

This removes a serialisation point from Linux binaries (which end up calling
here from the sysinfo syscall).
2020-06-09 14:16:18 +00:00
Chuck Silvers
bd7d64f548 Don't mark pages as valid if reading the contents from disk fails.
Instead, just skip marking pages valid if the read fails.  Future
attempts to access such pages will notice that they are not marked valid
and try to read them from disk again.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25138
2020-06-06 00:47:59 +00:00
Ed Maste
4d13f78444 Correct terminology in vm.imply_prot_max sysctl description
As with r361769 (man page), PROT_* are properly called protections, not
permissions.

MFC after:	1 week
MFC with:	r361769
Sponsored by:	The FreeBSD Foundation
2020-06-04 01:49:29 +00:00
Mateusz Guzik
1c58c09f5a uma: hide item_domain under ifdef NUMA
Fixes build warnings on mips.
2020-05-29 08:30:35 +00:00
Mark Johnston
81302f1d77 Fix boot on systems where NUMA domain 0 is unpopulated.
- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
  ensure that segments registered during hammer_time() are placed in the
  right domain.  Otherwise, since the SRAT is not parsed at that point,
  we just add them to domain 0, which may be incorrect and results in a
  domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
  If domain 0 is unpopulated, the allocation will simply fail, resulting
  in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
  affinity table, and change vm_phys_early_alloc() to handle wildcard
  domains.  This is necessary on amd64, where the page array is dense
  and pmap_page_array_startup() may allocate page table pages for
  non-existent page frames.

Reported and tested by:	Rafael Kitover <rkitover@gmail.com>
Reviewed by:	cem (earlier version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25001
2020-05-28 19:41:00 +00:00
Konstantin Belousov
fe0dcc402f Simplify the condition to enable superpage mappings in vm_fault_soft_fast().
The list of arches there matches the list of arches where the
default VM_NRESERVLEVEL > 0.  Before sparc64 removal, that was the
only arch that defined VM_NRESERVLEVEL > 0 to help with cache coloring,
but did not implement superpages.  Now it can be simplified.

Submitted by:	alc
Reviewed by:	markj
2020-05-27 21:44:26 +00:00
Justin Hibbits
d4ed51f329 Properly sort ifdef archs in vm_fault_soft_fast superpage guards.
Sort broken in r360887.
2020-05-27 01:35:46 +00:00
Mark Johnston
dc2b320563 Allocate UMA per-CPU counters earlier.
Otherwise anything counted before SI_SUB_VM_CONF is discarded.  However,
it is useful to be able to see stats from allocations done early during
boot.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24756
2020-05-14 16:06:54 +00:00
Kyle Evans
c79cee7136 kernel: provide panicky version of __unreachable
__builtin_unreachable doesn't raise any compile-time warnings/errors on its
own, so problems with its usage can't be easily detected. While it would be
nice for this situation to change and compilers to at least add a warning
for trivial cases where local state means the instruction can't be reached,
this isn't the case at the moment and likely will not happen.

This commit adds an __assert_unreachable, whose intent is incredibly clear:
it asserts that this instruction is unreachable. On INVARIANTS builds, it's
a panic(), and on non-INVARIANTS it expands to  __unreachable().

Existing users of __unreachable() are converted to __assert_unreachable,
to improve debuggability if this assumption is violated.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D23793
2020-05-13 18:07:37 +00:00
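The idea behind c79cee7136 above can be shown with a userspace analogue. This is a sketch only: DEBUG_SKETCH stands in for INVARIANTS, abort() stands in for panic(), and ASSERT_UNREACHABLE_SKETCH() and access_name() are hypothetical names. __builtin_unreachable() is the GCC/Clang builtin mentioned in the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * With DEBUG_SKETCH defined, reaching the marker prints a message and
 * aborts.  Otherwise the marker is only an optimizer hint.
 */
#ifdef DEBUG_SKETCH
#define	ASSERT_UNREACHABLE_SKETCH() do {				\
	fprintf(stderr, "%s:%d: unreachable code reached\n",		\
	    __FILE__, __LINE__);					\
	abort();							\
} while (0)
#else
#define	ASSERT_UNREACHABLE_SKETCH()	__builtin_unreachable()
#endif

enum access { ACC_READ, ACC_WRITE };

/* Every enumerator returns, so the code after the switch is dead. */
const char *
access_name(enum access a)
{
	switch (a) {
	case ACC_READ:
		return ("read");
	case ACC_WRITE:
		return ("write");
	}
	ASSERT_UNREACHABLE_SKETCH();
}
```

In a debug build a corrupted enum value stops the program at a known location instead of running into undefined behavior.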
Justin Hibbits
65bbba25d2 powerpc64: Implement Radix MMU for POWER9 CPUs
Summary:
POWER9 supports two MMU formats: traditional hashed page tables, and Radix
page tables, similar to what's present on most other architectures.  The
PowerISA also specifies a process table -- a table of page table pointers--
which on the POWER9 is only available with the Radix MMU, so we can take
advantage of it with the Radix MMU driver.

Written by Matt Macy.

Differential Revision: https://reviews.freebsd.org/D19516
2020-05-11 02:33:37 +00:00
Mark Johnston
a9ea09e548 Re-check for wirings after busying the page in vm_page_release_locked().
A concurrent unlocked lookup can wire the page after
vm_page_release_locked() releases the last wiring, in which case
vm_page_release_locked() must not free the page.  Once the xbusy lock is
acquired, that, the object lock and the fact that the page is unmapped
ensure that the wire count cannot increase, so re-check for new wirings
after the page is xbusied.

Update the comment above vm_page_wired() to reflect the new
synchronization rules.

Reported by:	glebius
Reviewed by:	alc, jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24592
2020-04-28 13:51:41 +00:00
Mark Johnston
f13fa9df05 Use a single VM object for kernel stacks.
Previously we allocated a separate VM object for each kernel stack.
However, fully constructed kernel stacks are cached by UMA, so there is
no harm in using a single global object for all stacks.  This reduces
memory consumption and makes it easier to define a memory allocation
policy for kernel stack pages, with the aim of reducing physical memory
fragmentation.

Add a global kstack_object, and use the stack KVA address to index into
the object like we do with kernel_object.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24473
2020-04-26 20:08:57 +00:00
Mark Johnston
33655d9546 Factor out the kmem contig page alloc and reclamation code.
kmem_alloc_attr_domain() and kmem_alloc_contig_domain() duplicated each
other's page allocation and reclamation logic.  Place it in a single
function to make it easier to add additional consumers.  No functional
change intended.

Reviewed by:	jeff, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24475
2020-04-21 16:01:44 +00:00
Mark Johnston
303b77029b Minimize conditional compilation for handling of M_EXEC.
This simplifies some planned changes.  No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24474
2020-04-21 15:55:28 +00:00
Mark Johnston
70e68b19a4 Handle trashed queue pointers in vm_page_acquire_unlocked().
vm_page_acquire_unlocked() relies on type-stability of vm_page
structures and assumes that the listq linkage pointers always point to a
vm_page or are NULL.  QUEUE_MACRO_DEBUG_TRASH breaks that assumption, so
add an explicit check for a trashed queue pointer before dereferencing.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24472
2020-04-20 14:45:17 +00:00
Bryan Drewery
adc0388117 Remove dead code leftover from r331018.
Sponsored by:	Dell EMC
2020-03-31 01:12:53 +00:00
Konstantin Belousov
abfdf76791 VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:44:30 +00:00
Konstantin Belousov
a7c55b3e1b ddb show pginfo: print pages reference value in hex.
It is more useful this way after the VPRC_ flags were introduced.

Sponsored by:	The FreeBSD Foundation
2020-03-28 12:21:52 +00:00
Jeff Roberson
d1105e9441 Check for busy or wired in vm_page_relookup(). Some callers will only keep
a page wired and expect it to still be present.

Reported by:	delphij@FreeBSD.org
Reviewed by:	kib
2020-03-11 22:25:45 +00:00
Mark Johnston
54007ce8ae Clean up uma_int.h a bit.
This makes it easier to write libkvm programs that access UMA data
structures.

- Remove a couple of unused slab functions and make others local to
  uma_core.c.  Similarly move SLAB_BITSETS, which affects the layout of
  slab structures, to uma_core.c.
- Stop defining the slab structures under _KERNEL.  There's no real
  reason they can't be visible to userspace like the rest of UMA's
  structures are.
- Group KEG_ASSERT_COLD with other keg macros.
- Convert an assertion about MAXMEMDOM to use _Static_assert.

No functional change intended.

Discussed with:	jeff
Reviewed by:	rlibby
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23980
2020-03-07 15:37:23 +00:00
Mark Johnston
3fba886874 Move SMR pointer type definition and access macros to smr_types.h.
The intent is to provide a header that can be included by other headers
without introducing too much pollution.  smr.h depends on various
headers and will likely grow over time, but is less likely to be
required by system headers.

Rename SMR_TYPE_DECLARE() to SMR_POINTER():
- One might use SMR to protect more than just pointers; it
  could be used for resizeable arrays, for example, so TYPE seems too
  generic.
- It is useful to be able to define anonymous SMR-protected pointer
  types and the _DECLARE suffix makes that look wrong.

Reviewed by:	jeff, mjg, rlibby
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23988
2020-03-07 00:55:46 +00:00
Brooks Davis
3823a5990a Remove an apparently incorrect assertion.
Without this change mips64 fails to boot.

Discussed with:	markj
Sponsored by:	DARPA
2020-03-06 23:31:09 +00:00
Mark Johnston
d869a17e62 Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23978
2020-03-06 19:10:00 +00:00
Brooks Davis
d718de812f Introduce kern_mmap_req().
This presents an extensible interface to the generic mmap(2)
implementation via a struct pointer intended to use a designated
initializer or compound literal.  We take advantage of the mandatory
zeroing of fields not listed in the initializer.

Remove kern_mmap_fpcheck() and use kern_mmap_req().

The motivation for this change is a desire to keep the core
implementation from growing an ever-increasing number of arguments
that must be specified in the correct order for the lowest-level
implementations.  In CheriBSD we have already added two more arguments.

Reviewed by:	kib
Discussed with:	kevans
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D23164
2020-03-04 21:27:12 +00:00
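The calling convention described in d718de812f above relies on two C99 features: designated initializers and compound literals. The sketch below uses a hypothetical struct map_request and hypothetical functions do_map_sketch() and caller_sketch(); the fields of the real request structure are not shown in the commit message and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request structure, for illustration only. */
struct map_request {
	uintptr_t addr;
	size_t	  len;
	int	  prot;
	int	  flags;
	int	  fd;
	int64_t	  pos;
};

/* Takes one pointer instead of a long positional argument list. */
int
do_map_sketch(const struct map_request *req)
{
	assert(req != NULL);
	if (req->len == 0)
		return (-1);
	/* Fields the caller did not name are guaranteed to be zero. */
	return ((req->flags == 0 && req->pos == 0) ? 0 : 1);
}

int
caller_sketch(void)
{
	/* Compound literal with designated initializers. */
	return (do_map_sketch(&(struct map_request){
		.len = 4096,
		.prot = 1,
		.fd = -1,
	}));
}
```

Adding a field to the structure later does not require touching callers that do not care about it, which is the motivation the commit message gives.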
Mark Johnston
1ed42f6fdd Avoid doubly wiring a newly allocated page in vm_page_grab_valid().
This fixes a regression from r358363.

Reported by:	manu, jbeich
Tested by:	jbeich
2020-03-01 22:09:11 +00:00
Mateusz Guzik
7f746c9fcc vm: add debug to uma_zone_set_smr
Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23902
2020-03-01 21:49:16 +00:00
Jeff Roberson
6be21eb778 Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23866
2020-02-28 21:42:48 +00:00
Jeff Roberson
7aaf252c96 Convert a few trivial consumers to the new unlocked grab API.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23847
2020-02-28 20:34:30 +00:00
Jeff Roberson
3f39f80ab3 Support the NOCREAT flag for grab_valid_unlocked.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23865
2020-02-28 20:32:35 +00:00
Jeff Roberson
1a0c234eb2 Simplify vref() code in object_reference. The local temporary is no longer
necessary.  Fix formatting errors.

Reported by:	mjg
Discussed with:	kib
2020-02-28 20:30:53 +00:00
Mark Johnston
c99d0c5801 Add a blocking counter KPI.
refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter.  However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.

This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.

Reviewed by:	jeff, kib, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23723
2020-02-28 16:05:18 +00:00
Jeff Roberson
fe835cbf5f A pair of performance improvements.
Swap buckets on free as well as alloc so that alloc is always the most
cache-hot data.

When selecting a zone domain for the round-robin bucket cache use the
local domain unless there is a severe imbalance.  This does not affinitize
memory, only locks and queues.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23824
2020-02-27 08:23:10 +00:00
Jeff Roberson
c49be4f1c6 Add unlocked grab* function variants that use lockless radix code to
lookup pages.  These variants will fall back to their locked counterparts
if the page is not present.

Discussed with:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23449
2020-02-27 02:37:27 +00:00
Ed Maste
acb8858f05 Return ENOTSUP for mmap/mprotect if prot not subset of prot_max
From POSIX,

[ENOTSUP]
    The implementation does not support the combination of accesses
    requested in the prot argument.

This fits the case that prot contains permissions which are not a subset
of prot_max.

Reviewed by:	brooks, cem
Relnotes:	Yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23843
2020-02-26 20:03:43 +00:00
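The subset test implied by acb8858f05 above is a single mask comparison. The sketch below uses hypothetical SK_PROT_* constants and a hypothetical check_prot_sketch() rather than the real mmap/mprotect code paths. ENOTSUP comes from the POSIX <errno.h>.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical protection bits for this sketch. */
#define	SK_PROT_READ	0x1
#define	SK_PROT_WRITE	0x2
#define	SK_PROT_EXEC	0x4
#define	SK_PROT_ALL	(SK_PROT_READ | SK_PROT_WRITE | SK_PROT_EXEC)

/*
 * Return 0 if every bit requested in prot is also allowed by prot_max,
 * ENOTSUP otherwise.
 */
int
check_prot_sketch(int prot, int prot_max)
{
	assert((prot & ~SK_PROT_ALL) == 0);
	if ((prot & prot_max) != prot)
		return (ENOTSUP);
	return (0);
}
```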
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Doug Moore
36b01270d1 The last argument to swp_pager_getswapspace is always 1. Remove that argument.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23810
2020-02-24 04:01:09 +00:00
Mark Johnston
7ca5539285 Allow swap_pager_putpages() to allocate one block at a time.
The minimum allocation size of 4 blocks is an old policy that came with
the "new" swap pager in r42957.  Since then the blist allocator has
gotten better at reducing fragmentation; for example, with r349777 it
can return a range that spans multiple leaves.  When swap space is close
to being exhausted, the minimum of 4 blocks most likely exacerbates
memory pressure, so reduce it to 1.

Reported by:	alc
Tested by:	pho
Reviewed by:	alc, dougm, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23763
2020-02-23 17:59:51 +00:00
Ryan Libby
eaa17d4291 sys/vm: quiet -Wwrite-strings
Discussed with:	kib
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23796
2020-02-23 03:32:04 +00:00
Mark Johnston
0464f16e91 Constify uma_zcache_create() and uma_zsecond_create()'s "name" argument.
It is already internally handled as a pointer to a const string, in
particular by uma_zcreate().

Fix indentation while here.

MFC after:	1 week
2020-02-22 17:44:28 +00:00
Kyle Evans
cef81f8f01 vm_radix: prefer __builtin_unreachable() to an unreachable panic()
This provides the needed hint to GCC and offers an annotation for readers to
observe that it's in fact impossible to hit this point. We'll get hit with
a -Wswitch error if the enum applicable to the switch above were to get
expanded without the new value(s) being handled.
2020-02-22 16:20:04 +00:00
Jeff Roberson
226dd6db47 Add an atomic-free tick moderated lazy update variant of SMR.
This enables very cheap read sections with free-to-use latencies and memory
overhead similar to epoch.  On a recent AMD platform a read section cost
1ns vs 5ns for the default SMR.  On Xeon the numbers should be more like 1
ns vs 11.  The memory consumption should be proportional to the product
of the free rate and 2*1/hz while normal SMR consumption is proportional
to the product of free rate and maximum read section time.

While here refactor the code to make future additions more
straightforward.

Name the overall technique Global Unbound Sequences (GUS) and adjust some
comments accordingly.  This helps distinguish discussions of the general
technique (SMR) vs this specific implementation (GUS).

Discussed with:	rlibby, markj
2020-02-22 03:44:10 +00:00
Warner Losh
cafbf0c664 Don't convert all lower-layer errors to EIO.
Don't convert all lower layer errors to EIO. Instead, pass the actual error up
the stack. This will allow the upper layers that look for ENXIO to react
properly to that signal from the lower layers and, for UFS, unmount the
filesystem.

Reviewed by: kib@
Differential Revision:  https://reviews.freebsd.org/D23755
2020-02-20 01:33:01 +00:00
Warner Losh
65252dc903 Don't spam the console with an additional, and useless, error message.
There's no need to spam the console with this error message. If there's an I/O
error, the disk/cam driver will report it at the lower levels. If that's an
actual problem, the upper layers will report that.

Reviewed by: kib@
Differential Revision:  https://reviews.freebsd.org/D23756
2020-02-20 00:34:46 +00:00
Jeff Roberson
4b3dac72b3 Silence a gcc warning about no return from a function that handles every
possible enum in a switch statement.  I verified that this emits nothing
as expected on clang.  radix relies on constant propagation to eliminate
any branching from these access routines.

Reported by:	lwhsu/tinderbox
2020-02-19 22:34:22 +00:00
Jeff Roberson
1ddda2eb24 Use SMR to provide a safe unlocked lookup for vm_radix.
The tree is kept correct for readers with store barriers and careful
ordering.  The existing object lock serializes writers.  Consumers
will be introduced in later commits.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23446
2020-02-19 19:58:31 +00:00
Jeff Roberson
c6fd3e23f7 Use per-domain locks for the bucket cache.
This gives much better concurrency when there are a large number of
cores per-domain and multiple domains.  Avoid taking the lock entirely
if it will not be productive.  ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.

While here refactor the zone/domain separation and bucket limits to
simplify callers.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23673
2020-02-19 18:48:46 +00:00
Jeff Roberson
e9ceb9dd11 Don't release xbusy on kmem pages. After lockless page lookup we will not
be able to guarantee that they can be reacquired without blocking.

Reviewed by:	kib
Discussed with:	markj
Differential Revision:	https://reviews.freebsd.org/D23506
2020-02-19 09:10:11 +00:00
Jeff Roberson
6c5f36ff30 Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in
virtual address or physical page allocation need to be marked with this
flag.

Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23712
2020-02-19 08:17:27 +00:00
Mark Johnston
34e2051faf Remove swblk_t.
It was used only to store the bounds of each swap device.  However,
since swblk_t is a signed 32-bit int and daddr_t is a signed 64-bit
int, swp_pager_isondev() may return an invalid result if swap devices
are repeatedly added and removed and sw_end for a device ends up
becoming a negative number.

Note that the removed comment about maximum swap size still applies.

Reviewed by:	jeff, kib
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23666
2020-02-17 15:11:07 +00:00
Mark Johnston
725b4ff001 Fix a swap block allocation race.
putpages' allocation of swap blocks is done under the global sw_dev
lock.  Previously it would drop that lock before inserting the allocated
blocks into the object's trie, creating a window in which swap blocks
are allocated but are not visible to swapoff.  This can cause
swp_pager_strategy() to fail and panic the system.

Fix the problem bluntly, by allocating swap blocks under the object
lock.

Reviewed by:	jeff, kib
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23665
2020-02-17 15:10:41 +00:00
Mark Johnston
c90d075be4 Fix object locking races in swapoff(2).
swap_pager_swapoff_object()'s goal is to allocate pages for all valid
swap blocks belonging to the object, for which there is no resident
page.  If the page corresponding to a block is already resident and
valid, the block can simply be discarded.

The existing implementation tries to minimize the number of I/Os used.
For each cluster of swap blocks, it finds maximal runs of valid swap
blocks not resident in memory, and valid resident pages.  During this
processing, the object lock may be dropped in several places: when
calling getpages, or when blocking on a busy page in
vm_page_grab_pages().  While the lock is dropped, another thread may
free swap blocks, causing getpages to page in stale data.

Fix the problem following a suggestion from Jeff: use getpages'
readahead capability to perform clustering rather than doing it
ourselves.  This simplifies the code a bit without reintroducing the old
behaviour of performing one I/O per page.

Reviewed by:	jeff
Reported by:	dhw, gallatin
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23664
2020-02-17 15:09:40 +00:00
Jeff Roberson
ed581bf68f Add a simple accessor that returns the bytes of memory consumed by a zone. 2020-02-17 01:59:55 +00:00
Jeff Roberson
f212367b42 Refactor _vm_page_busy_sleep to reduce the delta between the various
sleep routines and introduce a variant that supports lockless sleep.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23612
2020-02-17 01:08:00 +00:00
Jeff Roberson
70260874ac UMA has become more particular about zone types. Use the right allocator
calls in uma_zwait().
2020-02-17 01:06:18 +00:00
Jeff Roberson
6d88d784f8 Slightly restructure uma_zalloc* to generate better code from clang and
reduce duplication among zalloc functions.

Reviewed by:	markj
Discussed with:	mjg
Differential Revision:	https://reviews.freebsd.org/D23672
2020-02-16 01:07:19 +00:00
Mateusz Guzik
3379d2f926 vm: use new capsicum helpers 2020-02-15 01:29:07 +00:00
Mateusz Guzik
23ed568caa vm: remove no longer needed atomic_load_ptr casts 2020-02-14 23:16:29 +00:00
Mark Johnston
06ef60525f Fix handling of WAITFAIL in vm_page_grab() and vm_page_grab_pages().
After sleeping through a memory shortage, we must return NULL rather
than retry.

Discussed with:	jeff
Reported by:	pho
Sponsored by:	The FreeBSD Foundation
2020-02-13 23:18:35 +00:00
Mark Johnston
cefc92e1a2 Update the zone-global count of cached items in bucket_cache_reclaim().
This was missed in r351673.  The count is used to enforce cache limits,
which are rarely used.

Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
2020-02-13 23:15:21 +00:00
Jeff Roberson
543117bed8 Fix a case where ub_seq would fail to be set if the cross bucket was
flushed due to memory pressure.

Reviewed by:	markj
Differential Revision:	http://reviews.freebsd.org/D23614
2020-02-13 20:58:51 +00:00
Mateusz Guzik
3acb6572fc Store offset into zpcpu allocations in the per-cpu area.
This shortens zpcpu_get and allows more optimizations.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23570
2020-02-12 11:11:22 +00:00
Mark Johnston
4ab3aee8fb Reduce lock hold time in keg_drain().
Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases.  This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excesses of free memory.

Reviewed by:	jeff, rlibby
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23532
2020-02-11 20:06:33 +00:00
Jonathan T. Looney
3c200db9d2 Modify the vm.panic_on_oom sysctl to take a count of events.
Currently, the vm.panic_on_oom sysctl is a boolean which controls the
behavior of the VM system when it encounters an out-of-memory situation.
If set to 0, the VM system kills the largest process. If set to any other
value, the VM system will initiate a panic.

This change makes the sysctl a count of events. If set to 0, the VM system
kills the largest process. If set to any other value, the VM system will
kill the largest process until it has seen the specified number of
out-of-memory events. Once it reaches the specified number of events, it
will initiate a panic.

This change is helpful in capturing cores when the system is in a perpetual
cycle of out-of-memory events (as opposed to just hitting one or two
sporadic out-of-memory events).

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23601
2020-02-10 18:06:38 +00:00
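One plausible reading of the counting semantics in 3c200db9d2 above is a countdown that fires when it reaches zero. The sketch below is a userspace illustration under that assumption: panic_on_oom_sketch stands in for the sysctl value, and oom_event_sketch() returns the action instead of killing a process or panicking.

```c
#include <assert.h>

enum oom_action { OOM_KILL_LARGEST, OOM_PANIC };

/* Hypothetical stand-in for the sysctl: events left before a panic. */
int panic_on_oom_sketch;

/*
 * Called once per out-of-memory event.  A value of 0 never panics.
 * Any other value is counted down, and the event that brings it to
 * zero triggers the panic.
 */
enum oom_action
oom_event_sketch(void)
{
	assert(panic_on_oom_sketch >= 0);
	if (panic_on_oom_sketch != 0 && --panic_on_oom_sketch == 0)
		return (OOM_PANIC);
	return (OOM_KILL_LARGEST);
}
```

With the value set to 3, the first two events kill the largest process and the third one panics. With the value set to 1, the first event panics, which matches the old boolean behaviour.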
Ryan Libby
bae55c4aec uma: remove UMA_ZFLAG_CACHEONLY flag
UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name.  Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23516
2020-02-06 08:32:25 +00:00
Ryan Libby
33e5a1ea3b uma: multipage chicken switch
Add a switch to allow disabling multipage slabs, in order to facilitate
measuring memory usage and performance effects.  The tunable
vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to
disable.  The name may change soon.

Reviewed by:	markj (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23487
2020-02-04 22:40:45 +00:00
Ryan Libby
27ca37acb7 uma: grow slabs to enforce minimum memory efficiency
Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1
page size + epsilon).  In order to achieve a minimum memory efficiency,
select a slab size with a potentially larger number of pages if it
yields a lower portion of waste.

This may mean using page_alloc instead of uma_small_alloc, which could
be more costly.

Discussed with:	jeff, mckusick
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23239
2020-02-04 22:40:34 +00:00
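The slab sizing trade-off in 27ca37acb7 above can be worked through numerically. The sketch below is not UMA's algorithm: the 4096-byte page, the 8-page cap, the 90% threshold and the function names are all assumptions chosen to make the arithmetic easy to follow.

```c
#include <assert.h>
#include <stddef.h>

#define	SK_PAGE_SIZE	4096u
#define	SK_MAX_PAGES	8u
#define	SK_MIN_EFF_PCT	90u	/* hypothetical efficiency threshold */

/* Percentage of an npages-page slab that is occupied by whole items. */
unsigned
slab_eff_pct(size_t item_size, unsigned npages)
{
	size_t slab = (size_t)npages * SK_PAGE_SIZE;
	size_t items = slab / item_size;

	return ((unsigned)(items * item_size * 100 / slab));
}

/*
 * Return the smallest page count that meets the threshold, or the most
 * efficient page count seen if none does.
 */
unsigned
slab_pick_pages(size_t item_size)
{
	unsigned best = 1, best_eff = 0, eff, n;

	assert(item_size > 0 && item_size <= SK_PAGE_SIZE * SK_MAX_PAGES);
	for (n = 1; n <= SK_MAX_PAGES; n++) {
		eff = slab_eff_pct(item_size, n);
		if (eff >= SK_MIN_EFF_PCT)
			return (n);
		if (eff > best_eff) {
			best_eff = eff;
			best = n;
		}
	}
	return (best);
}
```

For a 2056-byte item (half a page plus 8 bytes), a one-page slab holds a single item and is about 50% efficient, while a five-page slab holds nine items and reaches about 90%.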
Ryan Libby
ec0d828071 uma: add UMA_ZONE_CONTIG, and a default contig_alloc
For now, copy the mbuf allocator.

Reviewed by:	jeff, markj (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23237
2020-02-04 22:40:11 +00:00
Ryan Libby
5ba16cf3d7 uma: pcpu_page_free needs to startup_free pages from startup_alloc
After r357392, it is apparent that we do have some early-boot PCPU
zones.  Make it so we can safely free pages from them if they are
actually used during early boot.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23496
2020-02-04 22:39:58 +00:00
Jeff Roberson
ee9e43f8dd Add an explicit busy state for free pages. This improves behavior with
potential bugs that access freed pages as well as providing a path
towards lockless page lookup.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23444
2020-02-04 20:33:01 +00:00
Jeff Roberson
e84130a0c0 Use literal bucket sizes for smaller buckets rather than the rounding
system.  Small bucket sizes already pack well even if they are an odd
number of words.  This prevents any potential new instances of the
problem fixed in r357463 as well as making the system easier to
understand.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23494
2020-02-04 20:28:06 +00:00
Konstantin Belousov
8d34a3bf7d Enable vm_object_mightbedirty() and vm_object_page_clean() for swap
objects backing tmpfs vnodes data.

The clean scan is limited to only remove write permissions from the
mapped pages of the objects.  This fixes the issue that tmpfs vnode
mtime is not updated from writes to the mmaped area after the initial
page-in.

Noted by:	mjg
Reviewed by:	markj
Discussed with:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23432
2020-02-04 19:03:37 +00:00