Commit Graph

4454 Commits

Author SHA1 Message Date
Mark Johnston
97458520cc Increase the default vm.max_user_wired value.
Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit.  However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit.

The purpose of the limit is to provide a seatbelt, not to impose some
policy on the use of wired memory.  Thus, increase the default limit to
allow reasonable VM configurations to work without tuning.

Reviewed by:	kib
Discussed with:	dougm
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26424
2020-09-17 16:49:28 +00:00
Konstantin Belousov
d301b3580f Support for userspace non-transparent superpages (largepages).
Created with shm_open2(SHM_LARGEPAGE) and then configured with the
FIOSSHMLPGCNF ioctl, largepage POSIX shared memory objects guarantee
that all userspace mappings of them are served by non-managed superpage
mappings.

Only amd64 is supported for now.  Both 2M and 1G superpages can be
requested; the latter requires a CPU feature.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:12:51 +00:00
Konstantin Belousov
e2e80fb3de vm_map: Add a map entry kind that can only be clipped at specific boundary.
The entries and their clip boundaries must be aligned on supported
superpage sizes from pagesizes[].  vm_map operations return the Mach
error KERN_INVALID_ARGUMENT, which is usually translated to EINVAL, if
they would require a clip that is not at the boundary.

In other words, such entries force the superpage properties of their
virtual addresses to be preserved.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:02:30 +00:00
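The clip rule from e2e80fb3de can be illustrated with a small model. This is only a sketch in Python, not the kernel code: the function name clip_entry and the PAGESIZES list (4 KB, 2 MB, 1 GB, as on amd64) are assumptions made for illustration, and the numeric return codes are the Mach values as I recall them from sys/vm/vm_param.h.

```python
KERN_SUCCESS = 0
KERN_INVALID_ARGUMENT = 4  # Mach error code; usually translated to EINVAL

# Illustrative amd64-style pagesizes[]: 4 KB, 2 MB, 1 GB (assumption).
PAGESIZES = [4096, 2 * 1024 * 1024, 1024 * 1024 * 1024]


def clip_entry(start, end, clip_addr, boundary):
    """Model clipping the map entry [start, end) at clip_addr.

    'boundary' is the entry's clip boundary and must be one of PAGESIZES.
    Returns (status, list_of_resulting_ranges).
    """
    if boundary not in PAGESIZES:
        raise ValueError("boundary must be a supported page size")
    if not (start < clip_addr < end):
        # The address is not strictly inside the entry: nothing to clip.
        return KERN_SUCCESS, [(start, end)]
    if clip_addr % boundary != 0:
        # Clipping here would break the superpage alignment of the entry.
        return KERN_INVALID_ARGUMENT, [(start, end)]
    return KERN_SUCCESS, [(start, clip_addr), (clip_addr, end)]
```

With a 2 MB boundary, clipping an 8 MB entry at 2 MB succeeds, while clipping it at 2 MB + 4 KB is rejected and leaves the entry intact.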
Konstantin Belousov
6cadbcd203 Add pmap_enter(9) PMAP_ENTER_LARGEPAGE flag and implement it on amd64.
The flag requests entry of a non-managed superpage mapping of size
pagesizes[psind] into the page table.

Pmap supports fake wiring of the largepage mappings.  Only the attributes
of a largepage mapping can be changed by calling pmap_enter(9) over an
existing mapping; the physical address of the page must be unchanged.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:50:24 +00:00
Konstantin Belousov
7a9f2da33c Add vm_map_find_aligned(9).
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:44:59 +00:00
Konstantin Belousov
60cd9c95c5 Move MAP_32BIT_MAX_ADDR definition to sys/mman.h.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:39:06 +00:00
Konstantin Belousov
e8f77c204b Prepare to handle non-trivial errors from vm_map_delete().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:34:31 +00:00
Konstantin Belousov
a720b31c2a Allow consumer to customize physical pager.
Add support for user-supplied callbacks into phys pager operations,
providing custom getpages(), haspage(), and populate() method
implementations.  The pager stores a user data ptr/val in the object to
provide context.

Add phys_pager_allocate() helper that takes user ops table as one of
the arguments.

Current code for these methods is moved to the 'default' ops table,
assigned automatically when vm_pager_alloc() is used.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 00:00:43 +00:00
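The ops-table arrangement described in a720b31c2a can be pictured roughly as follows. This is a loose Python model, not the kernel interface: every name below (DEFAULT_OPS, pager_alloc_default, pager_alloc_with_ops, the dictionary layout, and the behaviour of the default haspage) is hypothetical and only mirrors the idea of a default table plus a caller-supplied table with private context.

```python
def default_haspage(obj, pindex):
    # Hypothetical default behaviour: any index inside the object is present.
    return pindex < obj["size"]


# The 'default' ops table, used when no custom table is supplied.
DEFAULT_OPS = {"haspage": default_haspage}


def pager_alloc_default(size):
    """Models the generic allocation path: default ops assigned automatically."""
    return {"size": size, "ops": DEFAULT_OPS, "data": None}


def pager_alloc_with_ops(size, ops, data):
    """Models the helper that takes a user ops table and user context."""
    return {"size": size, "ops": ops, "data": data}


def haspage(obj, pindex):
    # Dispatch through whichever ops table the object carries.
    return obj["ops"]["haspage"](obj, pindex)


def even_only_haspage(obj, pindex):
    # Example custom method that consults the user context stored in the object.
    return pindex < obj["size"] and pindex % obj["data"]["stride"] == 0
```

A consumer would build its own table, for example {"haspage": even_only_haspage}, and pass it together with its context to the allocation helper; objects created through the generic path keep the default behaviour.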
Konstantin Belousov
67a659d282 Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:48:19 +00:00
Konstantin Belousov
89d2fb14d5 Add interruptible variant of vm_wait(9), vm_wait_intr(9).
Also add msleep flags argument to vm_wait_doms(9).

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:28:09 +00:00
Mark Johnston
aec9e7d8b0 vm_object_split(): Handle orig_object type changes.
orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while
vm_object_split() is sleeping.  In this case some pages in new_object
may be left unbusied, but vm_object_split() attempts to unbusy all of
them.

Track the beginning of the busied range.  Add an assertion to verify
that pages are not re-added to the source object while sleeping.

Reported by:	Olympios Petrakis <olympios.petrakis@netapp.com>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26223
2020-09-07 23:28:33 +00:00
Mark Johnston
a2d704d19f Avoid unnecessary object locking in vm_page_grab_pages_unlocked().
We were needlessly acquiring the object lock to call
vm_page_grab_pages() even when all of the requested pages were looked up
locklessly.  Fix that, stop testing for count == 0 in
vm_page_grab_pages(), and add assertions to help catch this kind of
mistake.

Reported by:	cem
Reviewed by:	alc, cem, dougm, jeff
Differential Revision:	https://reviews.freebsd.org/D26304
2020-09-02 19:59:25 +00:00
Mark Johnston
847ab36bf2 Include the psind in data returned by mincore(2).
Currently we use a single bit to indicate whether the virtual page is
part of a superpage.  To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.

The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl.  MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.

For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26238
2020-09-02 18:16:43 +00:00
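The encoding introduced by 847ab36bf2 can be modeled in a few lines. This Python sketch assumes that the two page-size bits sit at bit positions 5 and 6 (so the mask is 0x60 and the old single-bit MINCORE_SUPER was 0x20); those positions are my recollection of sys/mman.h and should be checked against the header. PSIND_SHIFT and the function names are illustrative, not part of the API.

```python
MINCORE_INCORE = 0x1     # page is resident
MINCORE_SUPER = 0x60     # assumed two-bit mask for the page-size index
PSIND_SHIFT = 5          # hypothetical name for the shift amount


def mincore_psind(psind):
    """Encode a pagesizes[] index into the mincore(2) status bits."""
    return (psind << PSIND_SHIFT) & MINCORE_SUPER


def psind_from_status(status):
    """Decode the pagesizes[] index from one mincore(2) status byte."""
    return (status & MINCORE_SUPER) >> PSIND_SHIFT


def describe(status, pagesizes):
    """Return the mapping size in bytes for a resident page, else None."""
    if not status & MINCORE_INCORE:
        return None
    return pagesizes[psind_from_status(status)]
```

Two bits can hold the indexes 0 through 3, which is why the scheme covers MAXPAGESIZES values up to 4. Code that only tests status & MINCORE_SUPER for non-zero keeps working, since any superpage index sets at least one of the mask bits.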
Mateusz Guzik
c3aa3bf97c vm: clean up empty lines in .c and .h files
2020-09-01 21:20:45 +00:00
Vladimir Kondratyev
5d4bf0578f LinuxKPI: Implement ksize() function.
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.

ksize() function is used by drm-kmod.

Reviewed by:	hselasky, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26215
2020-08-29 19:26:31 +00:00
Eric van Gyzen
609de97e04 vm_pageout_scan_active: ensure ps_delta is initialized
Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26212
2020-08-28 19:59:02 +00:00
Eric van Gyzen
a2e194654f memstat_kvm_uma: fix reading of uma_zone_domain structures
Coverity flagged the scaling by sizeof(uzd).  That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.

Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.

Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26213
2020-08-28 19:50:40 +00:00
Mark Johnston
aea9103e06 Use a large kmem arena import size on NUMA systems.
This helps minimize internal fragmentation that occurs when 2MB imports
are interleaved across NUMA domains.  Virtually all KVA allocations on
direct map platforms consume more than one page, so the fragmentation
manifests as runs of 511 4KB page mappings in the kernel.

Reviewed by:	alc, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26050
2020-08-26 14:31:48 +00:00
Conrad Meyer
74f5530d7a vm_pageout: Scale worker threads with CPUs
Autoscale vm_pageout worker threads from r364129 with CPU count.  The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable.

There will never be less than 1 thread per populated NUMA domain, and
the previous arbitrary upper limit (at most ncpus/2 threads per NUMA
domain) is preserved.

Care is taken to gracefully handle asymmetric NUMA nodes, such as empty
node systems (e.g., AMD 2990WX) and systems with nodes of varying size
(e.g., some larger >20 core Intel Haswell/Broadwell Xeon).

Reviewed by:	kib, markj
Sponsored by:	Isilon
Differential Revision:	https://reviews.freebsd.org/D26152
2020-08-25 21:36:56 +00:00
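The scaling rule stated in 74f5530d7a (one worker per 16 CPUs by default, at least one thread per populated domain, at most ncpus/2 per domain) can be sketched as below. This is a simplified Python model based only on the commit message: the function name, the round-up choice, and the per-domain treatment are assumptions, and the kernel's actual distribution of threads across asymmetric domains may differ.

```python
def pageout_threads_for_domain(domain_cpus, cpus_per_thread=16):
    """Estimate pageout worker threads for one NUMA domain.

    domain_cpus      -- number of CPUs in the domain (0 for an empty domain)
    cpus_per_thread  -- models the vm.pageout_cpus_per_thread tunable
    """
    if domain_cpus == 0:
        # Assumption: empty domains run no pageout threads at all.
        return 0
    # Round up so that a partially filled group of CPUs still gets a worker.
    wanted = -(-domain_cpus // cpus_per_thread)
    # Preserve the older upper limit of ncpus/2 threads per domain.
    upper = max(1, domain_cpus // 2)
    # Never go below one thread for a populated domain.
    return max(1, min(wanted, upper))
```

Under these assumptions a 64-CPU domain gets 4 workers with the default setting, and a small 8-CPU domain still gets 1.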
Mark Johnston
411096d034 Permit vm_page_wire() to be called on pages not belonging to an object.
For such pages ref_count is effectively a consumer-managed field, but
there is no harm in calling vm_page_wire() on them.
vm_page_unwire_noq() handles them as well.  Relax the vm_page_wire()
assertions to permit this case which is triggered by some out-of-tree
code. [1]

Also guard a conditional assertion with INVARIANTS.  Otherwise the
conditions are evaluated even though the result is unused. [2]

Reported by:	bz, cem [1], kib [2]
Reviewed by:	dougm, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26173
2020-08-25 13:45:06 +00:00
Matt Macy
9e5787d228 Merge OpenZFS support in to HEAD.
The primary benefit is maintaining a completely shared
code base with the community allowing FreeBSD to receive
new features sooner and with less effort.

I would advise against doing 'zpool upgrade'
or creating indispensable pools using new
features until this change has had a month+
to soak.

Work on merging FreeBSD support in to what was
at the time "ZFS on Linux" began in August 2018.
I first publicly proposed transitioning FreeBSD
to (new) OpenZFS on December 18th, 2018. FreeBSD
support in OpenZFS was finally completed in December
2019. A CFT for downstreaming OpenZFS support in
to FreeBSD was first issued on July 8th. All issues
that were reported have been addressed or, for
a couple of less critical matters, there are
pull requests in progress with OpenZFS. iXsystems
has tested and dogfooded extensively internally.
The TrueNAS 12 release is based on OpenZFS with
some additional features that have not yet made
it upstream.

Improvements include:
  project quotas, encrypted datasets,
  allocation classes, vectorized raidz,
  vectorized checksums, various command line
  improvements, zstd compression.

Thanks to those who have helped along the way:
Ryan Moeller, Allan Jude, Zack Welch, and many
others.

Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D25872
2020-08-25 02:21:27 +00:00
Mateusz Guzik
feabaaf995 cache: drop the always curthread argument from reverse lookup routines
Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by:	pho
2020-08-24 08:57:02 +00:00
Andrew Gallatin
791dda877f uma: record allocation failures due to zone limits
The zone limit mechanism was recently reworked, and
allocation failures due to limits being exceeded
were inadvertently no longer being recorded. This
would lead to, for example, mbuf allocation failures
not being indicated in netstat -m or vmstat -z.

Reviewed by:	markj
Sponsored by:	Netflix
2020-08-21 18:31:57 +00:00
Mateusz Guzik
7ad2a82da2 vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error
Most consumers pass NULL.
2020-08-19 02:51:17 +00:00
Mark Johnston
b21b022a81 Revert r364310.
Some of the resulting fallout in CAM does not appear straightforward to
fix, so simply revert the commit for now in the absence of a better
solution.

Discussed with:	mjg
Reported by:	dhw
2020-08-18 14:09:49 +00:00
Gleb Smirnoff
1921bb7b68 With INVARIANTS panic immediately if M_WAITOK is requested in a
non-sleepable context.  Previously only _sleep() would panic.
This will catch misuse of M_WAITOK at development stage rather
than at stress load stage.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D26027
2020-08-17 15:37:08 +00:00
Mark Johnston
7efe14cb99 Commit a missing piece of r364302.
This had failed to apply due to a merge conflict.

Reported by:	Jenkins
MFC with:	r364302
2020-08-17 14:06:51 +00:00
Mark Johnston
7dd979dfef Remove the VM map zone.
Today, the zone is only used to allocate a trio of kernel maps: the
kernel map itself, and the exec and pipe submaps.  Maps for user
processes are dynamically allocated but are embedded in the vmspace
structure, which is allocated from its own zone.  Make the
aforementioned kernel maps statically allocated and get rid of the zone.

While here, remove a stale comment above vmspace_alloc() and change the
names of locks initialized in vm_map_init() to match vmspace_zinit().

Reported by:	alc
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26052
2020-08-17 13:02:01 +00:00
Konstantin Belousov
ffae7ea935 vm_object: allow paging_in_progress to be acquired after object termination.
The vm objects are type-stable, and can be accessed even after the
last reference is dropped, or in case of vnode objects, after vgone()
destroyed it as well.

Stop asserting that pip == 0 after vm_object_terminate() waited for
existing owners to drop it; we only want to drain them before setting
the OBJ_DEAD flag.  Also stop asserting pip == 0 in the object destructor.

Update comments explaining the interaction between paging_in_progress
and termination.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25968
2020-08-16 20:57:02 +00:00
Konstantin Belousov
419e5698a0 Atomically update vm_object vnp_size, where atomic is available.
This will be used later, where it matters on 32bit arches.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25968
2020-08-16 20:52:24 +00:00
Mateusz Guzik
a92a971bbb vfs: remove the thread argument from vget
It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)
2020-08-16 17:18:54 +00:00
Conrad Meyer
ea7b737a6f vm_pageout: Correct threshold calculation on single-CPU systems
Reported by:	Michael Butler
X-MFC-With:	r364129
2020-08-14 18:48:48 +00:00
Conrad Meyer
b7883452d4 Back out unrelated change
Reported by:	kib, markj
X-MFC-With:	r364129
2020-08-12 00:21:30 +00:00
Conrad Meyer
0292c54bdb Add support for multithreading the inactive queue pageout within a domain.
In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing.  Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.

To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon.  It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.

The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.

I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems.  For now I would like to commit this infrastructure with only a
single thread enabled.

The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.

Submitted by:	jeff (earlier version)
Discussed with:	markj
Tested by:	pho
Sponsored by:	probably Netflix (based on contemporary commits)
Differential Revision:	https://reviews.freebsd.org/D21629
2020-08-11 20:37:45 +00:00
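The heuristic in 0292c54bdb (an exponentially decayed pages-per-second average that drives the decision to fan out) might look roughly like this. It is a Python sketch under stated assumptions: the decay factor of 0.5, the scan interval, and both function names are invented for illustration, and the kernel's real formula is not reproduced here.

```python
import math


def update_pps(avg_pps, pages_freed, elapsed_seconds, decay=0.5):
    """Fold one timed pageout pass into the decayed pages-per-second average."""
    sample = pages_freed / elapsed_seconds
    return decay * avg_pps + (1.0 - decay) * sample


def threads_needed(shortfall_pages, avg_pps, interval_seconds, max_threads):
    """Decide how many threads to dispatch for the current shortfall.

    One thread is assumed to free about avg_pps * interval_seconds pages
    per pass; more threads are used only when that is not enough.
    """
    per_thread = avg_pps * interval_seconds
    if per_thread <= 0:
        return 1  # no rate estimate yet: stay single-threaded
    wanted = math.ceil(shortfall_pages / per_thread)
    return max(1, min(wanted, max_threads))
```

The primary page daemon would call something like threads_needed() once per pass, dispatch that many workers, and wait for their results before updating the average again.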
Mark Johnston
af32cefd7c Check the UMA zone's full bucket cache before short-circuiting an alloc.
The global "bucketdisable" flag indicates that we are in a low memory
situation and should avoid allocating buckets.  However, in the
allocation path we were checking it before the full bucket cache and
bailing even if the cache is non-empty.  Defer the check so that we have
a shot at allocating from the cache.

This came up because M_NOWAIT allocations from the buf trie node zone
must always succeed.  In one scenario, all of the preallocated trie
nodes were in the bucket list, and a new slab allocation could not
succeed due to a memory shortage.  The short-circuiting caused an
allocation failure which triggered a panic.

Reported by:	pho
Reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25980
2020-08-10 20:34:45 +00:00
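The reordering described in af32cefd7c can be shown with a toy model of the allocation path. This Python sketch is not UMA code: the function name, the list standing in for the full bucket cache, and the check_cache_first switch (which selects the old or the new ordering) are all assumptions for illustration.

```python
def cache_alloc(full_buckets, bucketdisable, check_cache_first):
    """Try to obtain a bucket of items for an allocation.

    full_buckets       -- list of cached full buckets (each a list of items)
    bucketdisable      -- True when the system is short on memory
    check_cache_first  -- False models the old order, True the fixed order
    Returns a bucket, or None if the allocation must fall back elsewhere.
    """
    if bucketdisable and not check_cache_first:
        # Old behaviour: bail out before even looking at the cache.
        return None
    if full_buckets:
        # A cached full bucket can satisfy the request without new memory.
        return full_buckets.pop()
    if bucketdisable:
        # Fixed behaviour: refuse only once the cache is known to be empty.
        return None
    # Placeholder for allocating and filling a fresh bucket.
    return ["fresh-item"]
```

In the failing scenario from the commit message, the preallocated trie nodes sit in the bucket cache while memory is short: the old order returns nothing, the new order hands out the cached bucket.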
Brooks Davis
9f9cc3f989 Preserve ASLR vm_map flags across fork
In the most common case (fork+execve) this doesn't matter, but further
attempts to apply entropy would fail in (e.g.) a pre-fork server.

Reported by:	Alfredo Mazzinghi
Reviewed by:	kib, markj
Obtained from:	CheriBSD
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D25966
2020-08-06 16:20:20 +00:00
Mark Johnston
efec381dd1 Remove most lingering references to the page lock in comments.
Finish updating comments to reflect new locking protocols introduced
over the past year.  In particular, vm_page_lock is now effectively
unused.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25868
2020-08-04 14:59:43 +00:00
Mark Johnston
96ad26eefb Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
Mark Johnston
958d8f527c Remove the volatile qualifier from busy_lock.
Use atomic(9) to load the lock state.  Some places were doing this
already, so it was inconsistent.  In initialization code, the lock state
is still initialized with plain stores.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25861
2020-07-29 19:38:49 +00:00
Mark Johnston
f72e5be58a vm_page_xbusy_claim(): Use atomics to update busy lock state.
vm_page_xbusy_claim() could clobber the waiter bit.  For its original
use, kernel memory pages, this was not a problem since nothing would
ever block on the busy lock for such pages.  r363607 introduced a new
use where this could in principle be a problem.

Fix the problem by using atomic_cmpset to update the lock owner.  Since
this macro is defined only for INVARIANTS kernels the extra overhead
doesn't seem prohibitive.

Reported by:	vangyzen
Reviewed by:	alc, kib, vangyzen
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25859
2020-07-28 19:50:39 +00:00
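The problem fixed in f72e5be58a, a plain store that clobbers the waiter bit, can be demonstrated with a single-threaded model of the lock word. This Python sketch only imitates the logic: the bit values, the LockWord class, and both claim functions are illustrative assumptions, and Python offers no real atomic compare-and-set here.

```python
EXCLUSIVE = 0x02   # illustrative flag bits in the low part of the lock word
WAITERS = 0x04
FLAGMASK = 0x07


class LockWord:
    """Stand-in for the busy lock word, with a compare-and-set helper."""

    def __init__(self, value):
        self.value = value

    def cmpset(self, old, new):
        # Store 'new' only if the word still holds 'old'.
        if self.value == old:
            self.value = new
            return True
        return False


def claim_plain_store(word, new_owner):
    # Old approach: overwrite the whole word, losing any WAITERS bit.
    word.value = new_owner | EXCLUSIVE


def claim_cmpset(word, new_owner):
    # Fixed approach: replace the owner but carry the flag bits over,
    # retrying if the word changed between the load and the update.
    while True:
        old = word.value
        new = new_owner | (old & FLAGMASK)
        if word.cmpset(old, new):
            return
```

With a waiter recorded on the lock, the plain store silently drops the bit, so the waiter would never be woken; the compare-and-set version changes the owner and keeps the bit.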
Mark Johnston
782ebde52e vm_page_free_invalid(): Relax the xbusy assertion.
vm_page_assert_xbusied() asserts that the busying thread is the current
thread.  For some uses of vm_page_free_invalid() (e.g., error handling
in vnode_pager_generic_getpages_done()), this condition might not hold.

Reported by:	Jenkins via trasz
Reviewed by:	chs, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25828
2020-07-27 14:25:10 +00:00
Doug Moore
00fd73d2da Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64.  Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX.  Call them both BLIST_RADIX.

Make improvements to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by:	jmallett
Reviewed by:	alc
Discussed with:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25736
2020-07-25 18:29:10 +00:00
Mateusz Guzik
ee74412269 vm: fix swap reservation leak and clean up surrounding code
The code did not subtract from the global counter if per-uid reservation
failed.

Cleanup highlights:
- load overcommit once
- move per-uid manipulation to dedicated routines
- don't fetch wire count if requested size is below the limit
- convert return type from int to bool
- ifdef the routines with _KERNEL to keep vm.h compilable by userspace

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D25787
2020-07-24 13:23:32 +00:00
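The leak fixed in ee74412269 comes down to a missing rollback, which the following sketch makes explicit. It is a Python model with invented names (SwapAccounting, reserve, the two limits); the real code uses kernel counters and different limit checks.

```python
class SwapAccounting:
    """Toy model of a global swap reservation plus per-uid reservations."""

    def __init__(self, global_limit, uid_limit):
        self.global_limit = global_limit
        self.uid_limit = uid_limit
        self.reserved = 0      # global counter
        self.per_uid = {}      # uid -> reserved amount

    def reserve(self, uid, size):
        """Reserve 'size' units for 'uid'; return True on success."""
        if self.reserved + size > self.global_limit:
            return False
        self.reserved += size
        if self.per_uid.get(uid, 0) + size > self.uid_limit:
            # The step the old code missed: undo the global reservation
            # when the per-uid reservation fails.
            self.reserved -= size
            return False
        self.per_uid[uid] = self.per_uid.get(uid, 0) + size
        return True
```

Without the rollback, every request rejected by the per-uid limit would permanently inflate the global counter.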
Mateusz Guzik
126a2470b9 vm: annotate swap_reserved with __exclusive_cache_line
The counter is updated all the time, and variables read afterwards
share the cacheline.  Note this still fundamentally does not scale and needs
to be replaced; in the meantime it gets a bandaid.

brk1_processes -t 52 ops/s:
before: 8598298
after:  9098080
2020-07-23 08:42:16 +00:00
Chuck Silvers
1bd12a3bb2 Fix vnode_pager handling of read ahead/behind pages when a disk read fails.
Rather than marking the read ahead/behind pages valid even though they were
not initialized, free them using the new function vm_page_free_invalid().

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:10:35 +00:00
Chuck Silvers
4dfa06e114 Add a new function vm_page_free_invalid() for freeing invalid pages
that might be wired.  If the page is wired then it cannot be freed now,
but the thread that eventually unwires it will free it at that point.

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:09:36 +00:00
Chuck Silvers
c3dbadc1fd Revert my change from r361855 in favor of a better fix.
Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:08:01 +00:00
Mark Johnston
a7752896f0 Add vm_map_valid_range_KBI().
This is required for standalone module builds.

Reported by:	hselasky
Reviewed by:	dougm, hselasky, kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D25650
2020-07-13 16:39:27 +00:00
Scott Long
ffc568ba8b Revert r362998, r326999 while a better compatibility strategy is devised.
2020-07-09 22:38:36 +00:00
Scott Long
b302c2e5c9 Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after:	1 week
2020-07-07 20:33:11 +00:00