freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	97458520cc	Increase the default vm.max_user_wired value. Since r347532 (merged to stable/12) we only count user-wired pages towards the system limit. However, we now also treat pages wired by hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with large amounts of RAM tends to fail due to the low limit. The purpose of the limit is to provide a seatbelt, not to impose some policy on the use of wired memory. Thus, increase the default limit to allow reasonable VM configurations to work without tuning. Reviewed by: kib Discussed with: dougm MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26424	2020-09-17 16:49:28 +00:00
Konstantin Belousov	d301b3580f	Support for userspace non-transparent superpages (largepages). Created with shm_open2(SHM_LARGEPAGE) and then configured with FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee that all userspace mappings of it are served by superpage non-managed mappings. Only amd64 for now, both 2M and 1G superpages can be requested, the latter requires CPU feature. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 22:12:51 +00:00
Konstantin Belousov	e2e80fb3de	vm_map: Add a map entry kind that can only be clipped at specific boundary. The entries and their clip boundaries must be aligned on supported superpages sizes from pagesizes[]. vm_map operations return Mach error KERN_INVALID_ARGUMENT, which is usually translated to EINVAL, if it would require clip not at the boundary. In other words, entries force preserving virtual addresses superpage properties. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 22:02:30 +00:00
Konstantin Belousov	6cadbcd203	Add pmap_enter(9) PMAP_ENTER_LARGEPAGE flag and implement it on amd64. The flag requests entry of non-managed superpage mapping of size pagesizes[psind] into the page table. Pmap supports fake wiring of the largepage mappings. Only attributes of the largepage mapping can be changed by calling pmap_enter(9) over existing mapping, physical address of the page must be unchanged. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:50:24 +00:00
Konstantin Belousov	7a9f2da33c	Add vm_map_find_aligned(9). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:44:59 +00:00
Konstantin Belousov	60cd9c95c5	Move MAP_32BIT_MAX_ADDR definition to sys/mman.h. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:39:06 +00:00
Konstantin Belousov	e8f77c204b	Prepare to handle non-trivial errors from vm_map_delete(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:34:31 +00:00
Konstantin Belousov	a720b31c2a	Allow consumer to customize physical pager. Add support for user-supplied callbacks into phys pager operations, providing custom getpages(), haspage(), and populate() methods implementations. Pager stores user data ptr/val in the object to provide context. Add phys_pager_allocate() helper that takes user ops table as one of the arguments. Current code for these methods is moved to the 'default' ops table, assigned automatically when vm_pager_alloc() is used. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 00:00:43 +00:00
Konstantin Belousov	67a659d282	Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-08 23:48:19 +00:00
Konstantin Belousov	89d2fb14d5	Add interruptible variant of vm_wait(9), vm_wait_intr(9). Also add msleep flags argument to vm_wait_doms(9). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-08 23:28:09 +00:00
Mark Johnston	aec9e7d8b0	vm_object_split(): Handle orig_object type changes. orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while vm_object_split() is sleeping. In this case some pages in new_object may be left unbusied, but vm_object_split() attempts to unbusy all of them. Track the beginning of the busied range. Add an assertion to verify that pages are not re-added to the source object while sleeping. Reported by: Olympios Petrakis <olympios.petrakis@netapp.com> Reviewed by: alc, kib Tested by: pho MFC after: 1 week Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26223	2020-09-07 23:28:33 +00:00
Mark Johnston	a2d704d19f	Avoid unnecessary object locking in vm_page_grab_pages_unlocked(). We were needlessly acquiring the object lock to call vm_page_grab_pages() even when all of the requested pages were looked up locklessly. Fix that, stop testing for count == 0 in vm_page_grab_pages(), and add assertions to help catch this kind of mistake. Reported by: cem Reviewed by: alc, cem, dougm, jeff Differential Revision: https://reviews.freebsd.org/D26304	2020-09-02 19:59:25 +00:00
Mark Johnston	847ab36bf2	Include the psind in data returned by mincore(2). Currently we use a single bit to indicate whether the virtual page is part of a superpage. To support a forthcoming implementation of non-transparent 1GB superpages, it is useful to provide more detailed information about large page sizes. The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind) values, indicating a mapping of size psind, where psind is an index into the pagesizes array returned by getpagesizes(3), which in turn comes from the hw.pagesizes sysctl. MINCORE_PSIND(1) is equal to the old value of MINCORE_SUPER. For now, two bits are used to record the page size, permitting values of MAXPAGESIZES up to 4. Reviewed by: alc, kib Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26238	2020-09-02 18:16:43 +00:00
Mateusz Guzik	c3aa3bf97c	vm: clean up empty lines in .c and .h files	2020-09-01 21:20:45 +00:00
Vladimir Kondratyev	5d4bf0578f	LinuxKPI: Implement ksize() function. In Linux, ksize() gets the actual amount of memory allocated for a given object. This commit adds malloc_usable_size() to FreeBSD KPI which does the same. It also maps LinuxKPI ksize() to newly created function. ksize() function is used by drm-kmod. Reviewed by: hselasky, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D26215	2020-08-29 19:26:31 +00:00
Eric van Gyzen	609de97e04	vm_pageout_scan_active: ensure ps_delta is initialized Reported by: Coverity Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26212	2020-08-28 19:59:02 +00:00
Eric van Gyzen	a2e194654f	memstat_kvm_uma: fix reading of uma_zone_domain structures Coverity flagged the scaling by sizeof(uzd). That is the type of the pointer, so the scaling was already done by pointer arithmetic. However, this was also passing a stack frame pointer to kvm_read, so it was doubly wrong. Move ZDOM_GET into the !_KERNEL section and use it in libmemstat. Reported by: Coverity Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26213	2020-08-28 19:50:40 +00:00
Mark Johnston	aea9103e06	Use a large kmem arena import size on NUMA systems. This helps minimize internal fragmentation that occurs when 2MB imports are interleaved across NUMA domains. Virtually all KVA allocations on direct map platforms consume more than one page, so the fragmentation manifests as runs of 511 4KB page mappings in the kernel. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26050	2020-08-26 14:31:48 +00:00
Conrad Meyer	74f5530d7a	vm_pageout: Scale worker threads with CPUs Autoscale vm_pageout worker threads from r364129 with CPU count. The default is arbitrarily chosen to be 16 CPUs per worker thread, but can be adjusted with the vm.pageout_cpus_per_thread tunable. There will never be less than 1 thread per populated NUMA domain, and the previous arbitrary upper limit (at most ncpus/2 threads per NUMA domain) is preserved. Care is taken to gracefully handle asymmetric NUMA nodes, such as empty node systems (e.g., AMD 2990WX) and systems with nodes of varying size (e.g., some larger >20 core Intel Haswell/Broadwell Xeon). Reviewed by: kib, markj Sponsored by: Isilon Differential Revision: https://reviews.freebsd.org/D26152	2020-08-25 21:36:56 +00:00
Mark Johnston	411096d034	Permit vm_page_wire() to be called on pages not belonging to an object. For such pages ref_count is effectively a consumer-managed field, but there is no harm in calling vm_page_wire() on them. vm_page_unwire_noq() handles them as well. Relax the vm_page_wire() assertions to permit this case which is triggered by some out-of-tree code. [1] Also guard a conditional assertion with INVARIANTS. Otherwise the conditions are evaluated even though the result is unused. [2] Reported by: bz, cem [1], kib [2] Reviewed by: dougm, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26173	2020-08-25 13:45:06 +00:00
Matt Macy	9e5787d228	Merge OpenZFS support in to HEAD. The primary benefit is maintaining a completely shared code base with the community allowing FreeBSD to receive new features sooner and with less effort. I would advise against doing 'zpool upgrade' or creating indispensable pools using new features until this change has had a month+ to soak. Work on merging FreeBSD support in to what was at the time "ZFS on Linux" began in August 2018. I first publicly proposed transitioning FreeBSD to (new) OpenZFS on December 18th, 2018. FreeBSD support in OpenZFS was finally completed in December 2019. A CFT for downstreaming OpenZFS support in to FreeBSD was first issued on July 8th. All issues that were reported have been addressed or, for a couple of less critical matters there are pull requests in progress with OpenZFS. iXsystems has tested and dogfooded extensively internally. The TrueNAS 12 release is based on OpenZFS with some additional features that have not yet made it upstream. Improvements include: project quotas, encrypted datasets, allocation classes, vectorized raidz, vectorized checksums, various command line improvements, zstd compression. Thanks to those who have helped along the way: Ryan Moeller, Allan Jude, Zack Welch, and many others. Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D25872	2020-08-25 02:21:27 +00:00
Mateusz Guzik	feabaaf995	cache: drop the always curthread argument from reverse lookup routines Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs. Tested by: pho	2020-08-24 08:57:02 +00:00
Andrew Gallatin	791dda877f	uma: record allocation failures due to zone limits The zone limit mechanism was recently reworked, and allocation failures due to limits being exceeded were inadvertently no longer being recorded. This would lead to, for example, mbuf allocation failures not being indicated in netstat -m or vmstat -z Reviewed by: markj Sponsored by: Netflix	2020-08-21 18:31:57 +00:00
Mateusz Guzik	7ad2a82da2	vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error Most consumers pass NULL.	2020-08-19 02:51:17 +00:00
Mark Johnston	b21b022a81	Revert r364310. Some of the resulting fallout in CAM does not appear straightforward to fix, so simply revert the commit for now in the absence of a better solution. Discussed with: mjg Reported by: dhw	2020-08-18 14:09:49 +00:00
Gleb Smirnoff	1921bb7b68	With INVARIANTS panic immediately if M_WAITOK is requested in a non-sleepable context. Previously only _sleep() would panic. This will catch misuse of M_WAITOK at development stage rather than at stress load stage. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D26027	2020-08-17 15:37:08 +00:00
Mark Johnston	7efe14cb99	Commit a missing piece of r364302. This had failed to apply due to a merge conflict. Reported by: Jenkins MFC with: r364302	2020-08-17 14:06:51 +00:00
Mark Johnston	7dd979dfef	Remove the VM map zone. Today, the zone is only used to allocate a trio of kernel maps: the kernel map itself, and the exec and pipe submaps. Maps for user processes are dynamically allocated but are embedded in the vmspace structure, which is allocated from its own zone. Make the aforementioned kernel maps statically allocated and get rid of the zone. While here, remove a stale comment above vmspace_alloc() and change the names of locks initialized in vm_map_init() to match vmspace_zinit(). Reported by: alc Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26052	2020-08-17 13:02:01 +00:00
Konstantin Belousov	ffae7ea935	vm_object: allow paging_in_progress to be acquired after object termination. The vm objects are type-stable, and can be accessed even after the last reference is dropped, or in case of vnode objects, after vgone() destroyed it as well. Stop asserting that pip == 0 after vm_object_terminate() waited for existing owners to drop it, we only want to drain them before setting OBJ_DEAD flag. Also stop asserting pip == 0 in object destructor. Update comments explaining the interaction between paging_in_progress and termination. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 20:57:02 +00:00
Konstantin Belousov	419e5698a0	Atomically update vm_object vnp_size, where atomic is available. This will be used later, where it matters on 32bit arches. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 20:52:24 +00:00
Mateusz Guzik	a92a971bbb	vfs: remove the thread argument from vget It was already asserted to be curthread. Semantic patch: @@ expression arg1, arg2, arg3; @@ - vget(arg1, arg2, arg3) + vget(arg1, arg2)	2020-08-16 17:18:54 +00:00
Conrad Meyer	ea7b737a6f	vm_pageout: Correct threshold calculation on single-CPU systems Reported by: Michael Butler X-MFC-With: r364129	2020-08-14 18:48:48 +00:00
Conrad Meyer	b7883452d4	Back out unrelated change Reported by: kib, markj X-MFC-With: r364129	2020-08-12 00:21:30 +00:00
Conrad Meyer	0292c54bdb	Add support for multithreading the inactive queue pageout within a domain. In very high throughput workloads, the inactive scan can become overwhelmed as you have many cores producing pages and a single core freeing. Since Mark's introduction of batched pagequeue operations, we can now run multiple inactive threads working on independent batches. To avoid confusing the pid and other control algorithms, I (Jeff) do this in a mpi-like fan out and collect model that is driven from the primary page daemon. It decides whether the shortfall can be overcome with a single thread and if not dispatches multiple threads and waits for their results. The heuristic is based on timing the pageout activity and averaging a pages-per-second variable which is exponentially decayed. This is visible in sysctl and may be interesting for other purposes. I (Jeff) have verified that this does indeed double our paging throughput when used with two threads. With four we tend to run into other contention problems. For now I would like to commit this infrastructure with only a single thread enabled. The number of worker threads per domain can be controlled with the 'vm.pageout_threads_per_domain' tunable. Submitted by: jeff (earlier version) Discussed with: markj Tested by: pho Sponsored by: probably Netflix (based on contemporary commits) Differential Revision: https://reviews.freebsd.org/D21629	2020-08-11 20:37:45 +00:00
Mark Johnston	af32cefd7c	Check the UMA zone's full bucket cache before short-circuiting an alloc. The global "bucketdisable" flag indicates that we are in a low memory situation and should avoid allocating buckets. However, in the allocation path we were checking it before the full bucket cache and bailing even if the cache is non-empty. Defer the check so that we have a shot at allocating from the cache. This came up because M_NOWAIT allocations from the buf trie node zone must always succeed. In one scenario, all of the preallocated trie nodes were in the bucket list, and a new slab allocation could not succeed due to a memory shortage. The short-circuiting caused an allocation failure which triggered a panic. Reported by: pho Reviewed by: cem Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25980	2020-08-10 20:34:45 +00:00
Brooks Davis	9f9cc3f989	Preserve ASLR vm_map flags across fork In the most common case (fork+execve) this doesn't matter, but further attempts to apply entropy would fail in (e.g.) a pre-fork server. Reported by: Alfredo Mazzinghi Reviewed by: kib, markj Obtained from: CheriBSD MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D25966	2020-08-06 16:20:20 +00:00
Mark Johnston	efec381dd1	Remove most lingering references to the page lock in comments. Finish updating comments to reflect new locking protocols introduced over the past year. In particular, vm_page_lock is now effectively unused. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25868	2020-08-04 14:59:43 +00:00
Mark Johnston	96ad26eefb	Remove free_domain() and uma_zfree_domain(). These functions were introduced before UMA started ensuring that freed memory gets placed in domain-local caches. They no longer serve any purpose since UMA now provides their functionality by default. Remove them to simplify the kernel memory allocator interfaces a bit. Reviewed by: cem, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25937	2020-08-04 13:58:36 +00:00
Mark Johnston	958d8f527c	Remove the volatile qualifier from busy_lock. Use atomic(9) to load the lock state. Some places were doing this already, so it was inconsistent. In initialization code, the lock state is still initialized with plain stores. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25861	2020-07-29 19:38:49 +00:00
Mark Johnston	f72e5be58a	vm_page_xbusy_claim(): Use atomics to update busy lock state. vm_page_xbusy_claim() could clobber the waiter bit. For its original use, kernel memory pages, this was not a problem since nothing would ever block on the busy lock for such pages. r363607 introduced a new use where this could in principle be a problem. Fix the problem by using atomic_cmpset to update the lock owner. Since this macro is defined only for INVARIANTS kernels the extra overhead doesn't seem prohibitive. Reported by: vangyzen Reviewed by: alc, kib, vangyzen Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25859	2020-07-28 19:50:39 +00:00
Mark Johnston	782ebde52e	vm_page_free_invalid(): Relax the xbusy assertion. vm_page_assert_xbusied() asserts that the busying thread is the current thread. For some uses of vm_page_free_invalid() (e.g., error handling in vnode_pager_generic_getpages_done()), this condition might not hold. Reported by: Jenkins via trasz Reviewed by: chs, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25828	2020-07-27 14:25:10 +00:00
Doug Moore	00fd73d2da	Fix an overflow bug in the blist allocator that needlessly capped max swap size by dividing a value, which was always a multiple of 64, by 64. Remove the code that reduced max swap size down to that cap. Eliminate the distinction between BLIST_BMAP_RADIX and BLIST_META_RADIX. Call them both BLIST_RADIX. Make improvements to the blist self-test code to silence compiler warnings and to test larger blists. Reported by: jmallett Reviewed by: alc Discussed with: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D25736	2020-07-25 18:29:10 +00:00
Mateusz Guzik	ee74412269	vm: fix swap reservation leak and clean up surrounding code The code did not subtract from the global counter if per-uid reservation failed. Cleanup highlights: - load overcommit once - move per-uid manipulation to dedicated routines - don't fetch wire count if requested size is below the limit - convert return type from int to bool - ifdef the routines with _KERNEL to keep vm.h compilable by userspace Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D25787	2020-07-24 13:23:32 +00:00
Mateusz Guzik	126a2470b9	vm: annotate swap_reserved with __exclusive_cache_line The counter keeps being updated all the time and variables read afterwards share the cacheline. Note this still fundamentally does not scale and needs to be replaced, in the meantime gets a bandaid. brk1_processes -t 52 ops/s: before: 8598298 after: 9098080	2020-07-23 08:42:16 +00:00
Chuck Silvers	1bd12a3bb2	Fix vnode_pager handling of read ahead/behind pages when a disk read fails. Rather than marking the read ahead/behind pages valid even though they were not initialized, free them using the new function vm_page_free_invalid(). Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430	2020-07-17 23:10:35 +00:00
Chuck Silvers	4dfa06e114	Add a new function vm_page_free_invalid() for freeing invalid pages that might be wired. If the page is wired then it cannot be freed now, but the thread that eventually unwires it will free it at that point. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430	2020-07-17 23:09:36 +00:00
Chuck Silvers	c3dbadc1fd	Revert my change from r361855 in favor of a better fix. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430	2020-07-17 23:08:01 +00:00
Mark Johnston	a7752896f0	Add vm_map_valid_range_KBI(). This is required for standalone module builds. Reported by: hselasky Reviewed by: dougm, hselasky, kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D25650	2020-07-13 16:39:27 +00:00
Scott Long	ffc568ba8b	Revert r362998, r326999 while a better compatibility strategy is devised.	2020-07-09 22:38:36 +00:00
Scott Long	b302c2e5c9	Migrate the feature of excluding RAM pages to use "excludelist" as its nomenclature. MFC after: 1 week	2020-07-07 20:33:11 +00:00
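
Note on commit 847ab36bf2 (mincore(2) psind): the message describes MINCORE_SUPER becoming a two-bit mask that holds a page-size index, with MINCORE_PSIND(1) equal to the old single-bit MINCORE_SUPER value. The following is a minimal sketch of that encoding, not the kernel header itself. The numeric values (old flag 0x20, mask 0x60, shift 5) and the helper names are assumptions made for illustration and should be checked against sys/mman.h.

```python
# Sketch of the two-bit psind field described in commit 847ab36bf2.
# All numeric values below are ASSUMPTIONS for illustration; verify
# them against sys/mman.h before relying on them.

MINCORE_INCORE = 0x01        # page is resident (assumed value)
OLD_MINCORE_SUPER = 0x20     # assumed pre-change single-bit flag
MINCORE_SUPER = 0x60         # assumed post-change two-bit mask
PSIND_SHIFT = 5              # assumed position of the psind field


def mincore_psind(psind: int) -> int:
    """Encode a page-size index into the MINCORE_SUPER field."""
    return (psind << PSIND_SHIFT) & MINCORE_SUPER


def decode_psind(flags: int) -> int:
    """Extract the page-size index from one mincore(2) result byte."""
    return (flags & MINCORE_SUPER) >> PSIND_SHIFT


def describe(flags: int, pagesizes: list) -> str:
    """Render one result byte, given a pagesizes array in bytes
    (hypothetical helper; the array would come from getpagesizes(3))."""
    resident = "resident" if flags & MINCORE_INCORE else "not resident"
    return f"{resident}, mapping size {pagesizes[decode_psind(flags)]} bytes"


if __name__ == "__main__":
    # Example pagesizes array for a machine with 4K, 2M and 1G pages.
    sizes = [4096, 2 * 1024 * 1024, 1024 * 1024 * 1024]
    print(describe(MINCORE_INCORE | mincore_psind(2), sizes))
```

With two bits the field can hold indices 0 through 3, which matches the commit's statement that MAXPAGESIZES values up to 4 are permitted. Code that only tests `flags & MINCORE_SUPER` for non-zero keeps working for any superpage size.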


4454 Commits