Commit Graph

4324 Commits

Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.
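
A minimal in-kernel sketch of the resulting calling convention (illustrative only; PQ_INACTIVE, the helper name, and the fixed-size page array are assumptions, not part of this change):

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <sys/systm.h>
    #include <vm/vm.h>
    #include <vm/vm_extern.h>
    #include <vm/vm_page.h>

    /* Wire a user buffer for I/O and release the wirings afterwards. */
    static int
    do_io_wired(vm_map_t map, vm_offset_t uaddr, vm_size_t len)
    {
        vm_page_t ma[16];
        int count, i;

        /* The returned pages are now wired rather than held. */
        count = vm_fault_quick_hold_pages(map, uaddr, len, VM_PROT_READ,
            ma, nitems(ma));
        if (count == -1)
            return (EFAULT);
        /* ... perform the I/O against ma[0..count-1] ... */
        for (i = 0; i < count; i++)
            vm_page_unwire(ma[i], PQ_INACTIVE);
        return (0);
    }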

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Mark Johnston
46736e306c Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set.
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case.  Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.

Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20859
2019-07-08 19:02:40 +00:00
Mark Johnston
d9a73522e3 Add a per-CPU page cache per VM free pool.
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.

Reviewed by:	dougm, kib
Discussed with:	alc, jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20858
2019-07-08 18:56:30 +00:00
Doug Moore
7b9bcad939 A style-related change, r349791, obscured the meaning of a
comment. Rewrite that comment to improve its clarity.

Reported by: cem
Reviewed by: alc, cem
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20871
2019-07-07 06:57:04 +00:00
Doug Moore
0cab71bcee Fix style(9) violations involving division by PAGE_SIZE.
Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20847
2019-07-06 15:55:16 +00:00
Doug Moore
31c82722c1 Change blist_next_leaf_alloc so that it can examine more than one leaf
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.

Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579
2019-07-06 06:15:03 +00:00
Doug Moore
56948d177e Based on work posted at https://reviews.freebsd.org/D13484, change
swap_pager_swapoff_object and swp_pager_force_pagein so that they can
page in multiple pages at a time from a swap device, rather than doing
one I/O operation for each page.

Tested by: pho
Submitted by: ota_j.email.ne.jp (Yoshihiro Ota)
Reviewed by: alc, markj, kib
Approved by: kib, markj (mentors)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20635
2019-07-05 16:49:34 +00:00
Doug Moore
d2860f22a4 Move an assignment, drop a label, and change gotos to break statements
in vm_map_unwire. The code generated on amd64 is unchanged.

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20850
2019-07-04 19:25:30 +00:00
Doug Moore
b71f9b0de6 Replace a 'goto' with an 'else' in vm_map_wire_locked.
Reviewed by: alc
Approved by: markj (mentor)
Differential Revision:	https://reviews.freebsd.org/D20855
2019-07-04 19:17:55 +00:00
Doug Moore
9a0cdf9440 Change boolean_t variables in vm_map_unwire and vm_map_wire_locked to
bool. Drop result variable. Add holes_ok bool to replace repeated
masking of flags parameter.

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20846
2019-07-04 19:12:13 +00:00
Doug Moore
723413be0c Drop a temp variable from vm_map_insert, with no effect on the
resulting amd64 machine code.

Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20849
2019-07-04 18:28:49 +00:00
Doug Moore
38e220e8df Eliminate a goto and a label in vm_map_wire_locked by inserting an 'else'.
Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20845
2019-07-03 22:41:54 +00:00
Ed Maste
b93a053ca2 correct pmap_ts_referenced return type
pmap_ts_referenced returns a count, not a boolean, and is supposed to
have int as the return type not boolean_t.

This worked previously because boolean_t is an int typedef.

Discussed with:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-07-03 19:59:56 +00:00
Mark Johnston
d70f0ab38d Cache the next queue element when traversing a page queue.
When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element
invalidates its queue linkage pointers.  vm_pageout_collect_batch()
was relying on these pointers remaining valid after a removal, so
modify it to fetch the next queued page before dequeuing the current
page.

Submitted by:	Don Morris <dgmorris@earthlink.net>
Reviewed by:	cem, vangyzen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20842
2019-07-03 18:46:39 +00:00
Mark Johnston
9f74cdbf78 Mark pages allocated from the per-CPU cache.
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase.  In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break.  Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.

The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.

Reviewed by:	alc
Discussed with:	jeff
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20763
2019-07-02 19:51:40 +00:00
Konstantin Belousov
5dc7e31a09 Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement.  Provide procctl(2) verbs to set and query
implicit PROTMAX handling.  The knobs mimic the same per-image flag
and per-process controls for ASLR.
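
A hedged userland sketch of the new per-process control (the PROC_PROTMAX_* constant names below are taken from the procctl(2) interface this adds; treat the exact spellings as assumptions here):

    #include <sys/types.h>
    #include <sys/procctl.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        int data, status;

        /* Query how implicit PROT_MAX is handled for this process. */
        if (procctl(P_PID, getpid(), PROC_PROTMAX_STATUS, &status) == -1)
            err(1, "PROC_PROTMAX_STATUS");
        printf("protmax status: 0x%x\n", status);

        /* Opt this process out of implicit PROT_MAX. */
        data = PROC_PROTMAX_FORCE_DISABLE;
        if (procctl(P_PID, getpid(), PROC_PROTMAX_CTL, &data) == -1)
            err(1, "PROC_PROTMAX_CTL");
        return (0);
    }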

Reviewed by:	emaste, markj (previous version)
Discussed with:	brooks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:07:17 +00:00
Konstantin Belousov
3730695151 Use traditional 'p' local to designate td->td_proc in kern_mmap.
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:01:14 +00:00
Doug Moore
5201cbabf5 Remove a call to vm_map_simplify_entry from _vm_map_clip_start.
Recent changes to vm_map_protect have made it unnecessary.

Reviewed by: alc
Approved by: kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20633
2019-06-30 02:08:13 +00:00
Doug Moore
a72dce340d If vm_map_protect fails with KERN_RESOURCE_SHORTAGE, be sure to
simplify modified entries before returning.

Reviewed by: alc, markj (earlier version), kib (earlier version)
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20753
2019-06-28 02:14:54 +00:00
Mark Johnston
0fd977b3fa Add a return value to vm_page_remove().
Use it to indicate whether the page may be safely freed following
its removal from the object.  Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20758
2019-06-26 17:37:51 +00:00
Doug Moore
d1d3f7e1d1 Revert r349393, which leads to an assertion failure on bootup, in vm_map_stack_locked.
Reported by: ler@lerctr.org
Approved by: kib, markj (mentors, implicit)
2019-06-26 03:12:57 +00:00
Doug Moore
52499d1739 Eliminate some uses of the prev and next fields of vm_map_entry_t.
Since the only caller to vm_map_splay is vm_map_lookup_entry, move the
implementation of vm_map_splay into vm_map_lookup_helper, called by
vm_map_lookup_entry.

vm_map_lookup_entry returns the greatest entry less than or equal to a
given address, but in many cases the caller wants the least entry
greater than or equal to the address and uses the next pointer to get
to it. Provide an alternative interface to lookup,
vm_map_lookup_entry_ge, to provide the latter behavior, and let
callers use one or the other rather than having them use the next
pointer after a lookup miss to get what they really want.

In vm_map_growstack, the caller wants an entry that includes a given
address, and either the preceding or next entry depending on the value
of eflags in the first entry. Incorporate that behavior into
vm_map_lookup_helper, the function that implements all of these
lookups.

Eliminate some temporary variables used with vm_map_lookup_entry, but
inessential.

Reviewed by: markj (earlier version)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20664
2019-06-25 20:25:16 +00:00
Doug Moore
18cd8bb800 vm_map_protect may return an INVALID_ARGUMENT or PROTECTION_FAILURE
error response after clipping the first map entry in the region to be
reserved. This creates a pair of matching entries that should have
been "simplified" back into one, or never created. This change defers
the clipping of that entry until those two vm_map_protect failure
cases have been ruled out.

Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20711
2019-06-25 07:44:37 +00:00
Brooks Davis
74a1b66cf4 Extend mmap/mprotect API to specify the max page protections.
A new macro PROT_MAX() alters a protection value so it can be OR'd with
a regular protection value to specify the maximum permissions.  If
present, these flags specify the maximum permissions.

While these flags are non-portable, they can be used in portable code
with simple ifdefs to expand PROT_MAX() to 0.

This change allows (e.g.) a region that must be writable during run-time
linking or JIT code generation to be made permanently read+execute after
writes are complete.  This complements W^X protections allowing more
precise control by the programmer.

This change alters mprotect argument checking and returns an error when
unhandled protection flags are set.  This differs from POSIX (in that
POSIX only specifies an error), but is the documented behavior on Linux
and more closely matches historical mmap behavior.

In addition to explicit setting of the maximum permissions, an
experimental sysctl vm.imply_prot_max causes mmap to assume that the
initial permissions requested should be the maximum when the sysctl is
set to 1.  PROT_NONE mappings are excluded from this for compatibility
with rtld and other consumers that use such mappings to reserve
address space before mapping contents into part of the reservation.  A
final version of this is expected to provide per-binary and per-process
opt-in/out options, and this sysctl will go away in its current form.
As such it is undocumented.
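
A minimal userland illustration of the macro's semantics (a hypothetical anonymous mapping; the #ifndef fallback is the portability idiom suggested above):

    #include <sys/mman.h>
    #include <err.h>
    #include <stdlib.h>

    #ifndef PROT_MAX
    #define PROT_MAX(p) 0           /* expands to nothing on other systems */
    #endif

    int
    main(void)
    {
        size_t len = 4096;
        void *p;

        /* Readable and writable now, and capped at read+write forever. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE |
            PROT_MAX(PROT_READ | PROT_WRITE), MAP_ANON | MAP_PRIVATE, -1, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");

        /* Exceeds the maximum permissions, so this is expected to fail. */
        if (mprotect(p, len, PROT_READ | PROT_EXEC) == 0)
            errx(1, "mprotect unexpectedly added PROT_EXEC");
        return (0);
    }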

Reviewed by:	emaste, kib (prior version), markj
Additional suggestions from:	alc
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18880
2019-06-20 18:24:16 +00:00
Mark Johnston
ee1f168540 Group vm_page_activate()'s definition with other related functions.
No functional change intended.

MFC after:	3 days
2019-06-19 21:36:00 +00:00
Doug Moore
4766eba1df Critical comments were lost in r349203. This patch seeks to restore
the lost information in new comments.

Reported by: alc
Reviewed by: alc
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20632
2019-06-15 04:30:13 +00:00
Doug Moore
771315283b Avoid using the prev field of vm_map_entry_t in two functions that
iterate over consecutive vm_map entries, and that can easily just
'remember' the prev value instead of looking it up.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20628
2019-06-14 03:15:54 +00:00
Doug Moore
af1d6d6a11 Create a function for creating objects to back map entries, and one
for giving cred to a map entry backed by an object, and use them
instead of the code duplicated inline now.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20370
2019-06-13 20:09:07 +00:00
Doug Moore
e65d58a0fe To test whether a free space is big enough, compare the required
length to the difference of the two offsets that define the gap, to
avoid overflow, rather than adding the length to an offset and
comparing that to another offset.

This addresses an overflow issue reported by Peter Holm on i386.
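
A small standalone illustration of the difference (hypothetical 32-bit offsets near the top of the address space):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t gap_start = 0xfffff000, gap_end = 0xffffffff;
        uint32_t length = 0x2000;

        /* Adding wraps: 0xfffff000 + 0x2000 becomes 0x1000. */
        printf("add-then-compare claims it fits: %d\n",
            gap_start + length <= gap_end);

        /* Subtracting cannot wrap, so the comparison is safe. */
        printf("compare-against-difference says it fits: %d\n",
            length <= gap_end - gap_start);
        return (0);
    }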

Reported by: pho
Tested by: pho
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20594
2019-06-11 22:41:39 +00:00
Doug Moore
f8c8b2e8a0 r348879 introduced a wrong-way comparison that broke mmap.
This change rights that comparison.

Reported by: pho
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20595
2019-06-10 22:06:40 +00:00
Doug Moore
5a0879da80 The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.

Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, finding that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value or the current
max_free value is zero, because in either case we know that the other
child's max_free value is the one we already have. So, the changes to
vm_map_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.

The corresponding change to vm_map_splay_merge makes it simpler, since it's
just reversing pointers and updating running maxima.

In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.

Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
Doug Moore
77555b849d Change the check for 'size' wrapping around to zero in kern_mmap to account
for both the lower and upper bound modifications. Change the error returned
to ENOMEM. Rename the parameter size to len and make size a local variable
that stores the value of len after it has been modified.

This addresses concerns expressed by Bruce Evans after r348843.

Reported by: brde@optusnet.com.au
Reviewed by: kib, markj (mentors)
MFC after: 3 days
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20592
2019-06-10 21:26:14 +00:00
John Baldwin
0b96ca3310 Remove an overly-aggressive assertion.
While it is true that the new vmspace passed to vmspace_switch_aio
will always have a valid reference due to the AIO job or the extra
reference on the original vmspace in the worker thread, it is not true
that the old vmspace being switched away from will have more than one
reference.

Specifically, when a process with queued AIO jobs exits, the exit hook
in aio_proc_rundown will only ensure that all of the AIO jobs have
completed or been cancelled.  However, the last AIO job might have
completed and woken up the exiting process before the worker thread
servicing that job has switched back to its original vmspace.  In that
case, the process might finish exiting and drop its reference to the
vmspace before the worker thread does, resulting in the worker thread
dropping the last reference.

Reported by:	np
Reviewed by:	alc, markj, np, imp
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D20542
2019-06-10 19:01:54 +00:00
Doug Moore
97220a279f There are times when a len==0 parameter to mmap is okay. But on a
32-bit machine, a len parameter just a few bytes short of 4G, rounded
up to a page boundary and thereby wrapping to zero, is not okay. Return
failure in that case.
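
A standalone illustration of that wrap (round_page32() is a local stand-in for the kernel's page-rounding macro; the values are hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE32 4096u
    #define PAGE_MASK32 (PAGE_SIZE32 - 1)

    /* Round a 32-bit length up to a page boundary, as kern_mmap does. */
    static uint32_t
    round_page32(uint32_t len)
    {
        return ((len + PAGE_MASK32) & ~PAGE_MASK32);
    }

    int
    main(void)
    {
        uint32_t len = 0xfffffffe;      /* a few bytes short of 4G */
        uint32_t size = round_page32(len);

        /* The rounded size wrapped to zero although len was non-zero. */
        if (size == 0 && len != 0)
            printf("reject: len 0x%x rounds up to zero\n", len);
        return (0);
    }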

Reported by: pho
Reviewed by: alc, kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20580
2019-06-10 03:07:10 +00:00
Konstantin Belousov
452a2db863 Style MAP_ENTRY_ and MAP_ definitions.
Spell all bits in the hex constants.
Since all lines are modified, consistently use <tab> after #define.

Reviewed by:	alc (previous version), dougm
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20560
2019-06-08 20:28:04 +00:00
Doug Moore
7c022327ab Simple code refactoring originally in D13484.
Extract swp_pager_force_dirty() and swp_pager_force_launder() out of
swp_pager_force_pagein().

Extract swap_pager_swapoff_object() out of swap_pager_swapoff().

Submitted by: ota_j.email.ne.jp
Reviewed by: alc, dougm
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20545
2019-06-08 17:49:17 +00:00
Mark Johnston
88ea538a98 Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m).
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone.  But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place.  No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20470
2019-06-07 18:23:29 +00:00
Alexander Motin
3b2f2cb8e9 Allow UMA hash tables to expand faster than 2x in 20 seconds.
ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With initial hash table size of only 32 elements it takes ~20 expansions
or ~400 seconds to adapt to handling 220GB ZFS ARC.  During that time not
only is the hash table highly inefficient, but each of those expansions
also takes significant time with the lock held, blocking operation.

On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change
reduces the time needed to read 240GB for the first time from ~300-400s, during
which the system is quite busy and unresponsive, to only ~150s with light CPU load
and just 5 sub-second CPU spikes to expand the hash table.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-06 23:57:28 +00:00
Doug Moore
f96e8a0bab The means of finding ranges of free pages was changed for
vm_reserv_break in r348484, where it was found to improve performance
minutely and reduce code size. This change applies a similar change to
vm_reserv_reclaim_contig, expecting similar benefits. This change also
allows quick rejection of page ranges that are unsuitable on account
of alignment or boundary issues, where those issues are processed a
page at a time in the current implementation.  For contrived test
cases, this can make finding a reservation satisfying a major
alignment requirement around 30 times faster.
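
A simplified standalone sketch of that kind of up-front check (not the reservation code itself; a real implementation would also retry past a crossed boundary instead of giving up):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Can a run of 'size' bytes, aligned to 'alignment' (a power of two)
     * and not crossing any 'boundary'-sized boundary (0 = no restriction),
     * possibly start inside [lo, hi)?  Purely illustrative helper.
     */
    static bool
    range_can_fit(uint64_t lo, uint64_t hi, uint64_t size,
        uint64_t alignment, uint64_t boundary)
    {
        uint64_t start;

        start = (lo + alignment - 1) & ~(alignment - 1);    /* align up */
        if (start >= hi || hi - start < size)
            return (false);                 /* too little room left */
        if (boundary != 0 && start / boundary != (start + size - 1) / boundary)
            return (false);                 /* would straddle a boundary */
        return (true);
    }

    int
    main(void)
    {
        /* A 64KB-aligned 64KB run cannot fit in [0x11000, 0x20000)... */
        printf("%d\n", range_can_fit(0x11000, 0x20000, 0x10000, 0x10000, 0));
        /* ...but fits easily in [0, 0x20000). */
        printf("%d\n", range_can_fit(0, 0x20000, 0x10000, 0x10000, 0));
        return (0);
    }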

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20274
2019-06-06 16:28:34 +00:00
Mark Johnston
fbd9585915 Add sysctls for uma_kmem_{limit,total}.
Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20514
2019-06-06 16:26:58 +00:00
Mark Johnston
058f0f7464 Remove the volatile qualifier from uma_kmem_total.
No functional change intended.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20514
2019-06-06 16:23:44 +00:00
Konstantin Belousov
32d2014dde In vm_map_entry_set_vnode_text(), tolerate tmpfs mappings for which
vnode is no longer resident.

Mapping a tmpfs file does not bump the use count on the vnode, because
the backing object has swap type.  As a result, even during normal
operations, and of course on forced unmount, we might end up with a text
mapping from a tmpfs node which has no vnode in memory.  In this case,
there is no v_writecount to clear (this was done during reclaim), and
no reason to assert that the vnode is present.

Restructure the code to silently ignore OBJ_SWAP objects with
OBJ_TMPFS_NODE flag set, but OBJ_TMPFS flag clear.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-05 20:21:17 +00:00
Mark Johnston
2d2748710a Remove an outdated header comment for vm_page.c.
The listed rules were incomplete and outdated.  There is a much more
comprehensive comment in vm_page.h.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20503
2019-06-04 18:38:27 +00:00
Konstantin Belousov
21d7728498 Remove dead store.
sw_flags is set to the function argument several lines later.

Reported by:	danfe using PVS-studio
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-03 15:19:11 +00:00
Alan Cox
2d5039db18 Retire vm_reserv_extend_{contig,page}(). These functions were introduced
as part of a false start toward fine-grained reservation locking.  In the
end, they were not needed, so eliminate them.

Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.

Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}().  They no longer require a free page
queues lock.

Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20492
2019-06-03 05:15:36 +00:00
Mark Johnston
d842aa5114 Add a vm_page_wired() predicate.
Use it instead of accessing the wire_count field directly.  No
functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20485
2019-06-02 01:00:17 +00:00
Doug Moore
b8590dae50 The function vm_phys_free_contig invokes vm_phys_free_pages for every
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time.  The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.

The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.

One beneficiary of this change is in breaking reservations.  For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.
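
For reference, carving a run of pages into the aligned power-of-two blocks that a buddy freelist stores looks roughly like this (standalone illustration, not the vm_phys code):

    #include <stdio.h>

    /* Print the power-of-two blocks covering pages [start, start + npages). */
    static void
    enqueue_contig(unsigned long long start, unsigned long long npages)
    {
        unsigned long long order;

        while (npages > 0) {
            /* Largest order allowed by the current alignment... */
            order = (start == 0) ? 63 : __builtin_ctzll(start);
            /* ...capped by the number of pages still to cover. */
            if (order > (unsigned long long)(63 - __builtin_clzll(npages)))
                order = 63 - __builtin_clzll(npages);
            printf("block: start %llu, %llu pages\n", start, 1ULL << order);
            start += 1ULL << order;
            npages -= 1ULL << order;
        }
    }

    int
    main(void)
    {
        enqueue_contig(3, 13);          /* pages 3..15 */
        return (0);
    }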

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
2019-05-31 21:02:42 +00:00
Mark Johnston
42447bb506 Remove a redundant vm_page_remove() call.
vm_page_free_prep() removes the page from its object.  No functional
change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20469
2019-05-31 14:59:40 +00:00
Gleb Smirnoff
4a9f6ba75b In r343857 the referred comment moved to uma_vm_zone_stats(). 2019-05-29 22:33:37 +00:00
Doug Moore
e67a5068ec Reduce the code size and number of ffsl calls in vm_reserv_break. Use
xor to find where free ranges begin and end.
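
A standalone illustration of the xor trick (plain C, not the vm_reserv popmap code): xor-ing a free-page bitmap with a shifted copy of itself leaves set bits exactly where free runs begin and end, so each boundary costs one ffs-style call instead of a bit-by-bit scan.

    #include <stdio.h>
    #include <stdint.h>
    #include <strings.h>

    int
    main(void)
    {
        /* 1-bits mark free pages; free runs are bits 2-4 and 9-12. */
        uint64_t freemap = 0x1e1c;
        /* A boundary is wherever a bit differs from the bit below it. */
        uint64_t edges = freemap ^ (freemap << 1);
        int bit;

        while (edges != 0) {
            bit = ffsll(edges) - 1;
            printf("run boundary at bit %d\n", bit);
            edges &= edges - 1;     /* clear the lowest set bit */
        }
        return (0);
    }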

Tested by: pho
Reviewed by: alc
Approved by: markj, kib (mentors)
Differential Revision:	https://reviews.freebsd.org/D20256
2019-05-28 00:51:23 +00:00
Doug Moore
73f1145140 Fix typo from r348128: _func__ -> __func__
Reported by: LINT
2019-05-23 02:10:41 +00:00
Doug Moore
fa581662af Cleanups made necessary by r348115, or reactions to it:
1. Change size_t to vm_size_t in some places.
2. Rename vm_map_entry_resize_free to drop the _free part.
3. Fix whitespace errors.
4. Fix screwups in patch-conflict-management that left out important
changes related to growing and shrinking objects.

Reviewed by: alc
Approved by: kib (mentor)
2019-05-22 23:11:16 +00:00
Doug Moore
1895f5202a Passing a parameter to vm_map_entry_resize_free that describes the
amount of resizing reduces the number of functions changing the vm_map
invariants regarding the max_free field of map entries.

Reviewed by: markj (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20356
2019-05-22 17:40:54 +00:00
Conrad Meyer
daec92844e Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Mark Johnston
ccc5d6dd97 Use M_NEXTFIT in memguard(9).
memguard(9) wants to avoid reuse of freed addresses for as long as
possible.  Previously it maintained a racily updated cursor which was
passed to vmem_xalloc(9) as the minimum address.  However, vmem will
not in general return the lowest free address in the arena, so this
trick only really works until the cursor has wrapped around the first
time.

Reported by:	alc
Reviewed by:	alc
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D17227
2019-05-18 02:02:14 +00:00
Mark Johnston
8cd6a80d7d Restore the pre-r347532 behaviour of ignoring wiring failures in mmap().
The error handling added in r347532 is not right when mapping vnodes
and will be fixed separately.

Reported by:	syzbot+1d2cc393bd6c88a548be@syzkaller.appspotmail.com
MFC with:	r347532
2019-05-13 18:40:01 +00:00
Mark Johnston
54a3a11421 Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes.  User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2).  Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks.  In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
short of killing the wiring process.  The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity.  There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded.  For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM.  Users that wish to exceed the limit must tune
vm.max_user_wired.
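
A small userland illustration of the new accounting knob (sysctl name as above; error handling kept minimal):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <sys/mman.h>
    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
        unsigned long max_user_wired;
        size_t len = sizeof(max_user_wired);

        if (sysctlbyname("vm.max_user_wired", &max_user_wired, &len,
            NULL, 0) == -1)
            err(1, "sysctlbyname");
        printf("vm.max_user_wired: %lu pages\n", max_user_wired);

        /*
         * User wirings from mlockall() count against that limit; with
         * MCL_FUTURE set, a later mmap() that would exceed it can now
         * fail with ENOMEM.
         */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
            warn("mlockall");
        return (0);
    }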

Reviewed by:	kib, ngie (mlock() test changes)
Tested by:	pho (earlier version)
MFC after:	45 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
Doug Moore
87ae0686a2 A new parameter to blist_alloc specifies an upper bound on the size of
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated is as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks instead, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.

I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.
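
A standalone sketch of the strategy described above; alloc_range() is a hypothetical stand-in for the ranged blist allocation, not a real KPI.

    #include <stdio.h>

    static int free_blocks = 19;    /* pretend 19 contiguous blocks remain */

    /* Hypothetical allocator: hand out up to maxblks blocks, but only if
     * at least minblks are available. */
    static int
    alloc_range(int minblks, int maxblks)
    {
        int n = free_blocks < maxblks ? free_blocks : maxblks;

        if (n < minblks)
            return (0);
        free_blocks -= n;
        return (n);
    }

    /* Ask for up to 'want' blocks; on failure halve the minimum and cap
     * the maximum below the size that just failed, as described above. */
    static int
    get_swap_space(int want)
    {
        int got, min = want, max = want;

        for (;;) {
            got = alloc_range(min, max);
            if (got != 0 || min <= 1)
                return (got);
            max = min - 1;
            min /= 2;
        }
    }

    int
    main(void)
    {
        printf("got %d blocks\n", get_swap_space(32));  /* prints 19 */
        return (0);
    }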

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
2019-05-11 16:15:13 +00:00
Doug Moore
48e98a2afc Callers of swp_pager_getswapspace get either as many blocks as they
requested, or none, and in the latter case it is up to them to pick a
smaller request to make - which they always do by halving the failed
request. This change to swp_pager_getswapspace leaves the task of
downsizing the request to the function and not its caller. It still
does so by halving the original request.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20228
2019-05-11 10:16:43 +00:00
Konstantin Belousov
12487941f4 Noted by: alc
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2019-05-06 08:46:11 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and the deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Konstantin Belousov
7f1446052f Do not collapse objects with OBJ_NOSPLIT backing swap object.
NOSPLIT swap objects are not anonymous; they are used by tmpfs regular
files and POSIX shared memory.  For such objects, collapse is not
permitted.

Reported by:	mjg
Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:06:19 +00:00
Doug Moore
64f8d2575a fls() should find the most significant bit of an int faster than a
linear search can, so use it to avoid a linear search in isqrt.
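
A standalone sketch of the idea (fls() is from <strings.h>; the Newton loop stands in for whatever refinement the real isqrt uses):

    #include <stdio.h>
    #include <strings.h>

    /* Integer square root seeded by fls() instead of a linear search. */
    static unsigned
    isqrt(unsigned v)
    {
        unsigned x, y;

        if (v == 0)
            return (0);
        /* 2^(ceil(msb/2)) >= sqrt(v), found with a single fls() call. */
        x = 1u << ((fls(v) + 1) / 2);
        for (;;) {
            y = (x + v / x) / 2;    /* Newton-Raphson step */
            if (y >= x)
                return (x);
            x = y;
        }
    }

    int
    main(void)
    {
        printf("isqrt(1000000) = %u\n", isqrt(1000000));
        return (0);
    }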

Approved by: kib (mentor), markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20102
2019-05-03 02:55:54 +00:00
Konstantin Belousov
19f5d9f27f Fix another race between vm_map_protect() and vm_map_wire().
vm_map_wire() increments entry->wire_count; after that it drops the
map lock both for faulting in the entry's pages and for marking the next
entry in the requested region as IN_TRANSITION. Only after all entries
are faulted in, MAP_ENTRY_USER_WIRE flag is set.

This makes it possible for vm_map_protect() to run while another entry's
MAP_ENTRY_IN_TRANSITION flag is handled, and vm_map_busy() lock does
not prevent it. In particular, if the call to vm_map_protect() adds
VM_PROT_WRITE to CoW entry, it would fail to call
vm_fault_copy_entry(). There are at least two consequences of the
race: the top object in the shadow chain is not populated with
writeable pages, and second, the entry eventually gets the contradictory
flags MAP_ENTRY_NEEDS_COPY | MAP_ENTRY_USER_WIRED with VM_PROT_WRITE
set.

Handle it by waiting for all MAP_ENTRY_IN_TRANSITION flags to go away
in vm_map_protect(), which does not drop map lock afterwards. Note
that vm_map_busy_wait() is left as is.

Reported and tested by:	pho (previous version)
Reviewed by:	Doug Moore <dougm@rice.edu>, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20091
2019-05-01 13:15:06 +00:00
Mark Johnston
c4e5de7e75 Disable vm map consistency checking by default on INVARIANTS kernels.
The checks are too expensive for a general-purpose kernel.  Enable the
checks when DIAGNOSTIC is defined and provide a sysctl to enable the
checks in a non-DIAGNOSTIC INVARIANTS kernel.

Reviewed by:	kib
Discussed with:	Doug Moore <dougm@rice.edu>
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19999
2019-04-22 11:23:35 +00:00
Tycho Nightingale
323ad38632 for a cache-only zone the destructor tries to destroy a non-existent keg
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19835
2019-04-12 12:46:25 +00:00
Konstantin Belousov
a5a02ef49f Fix mis-merge.
Amusingly, it is a nop.

Noted by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-rev:	r345702
2019-04-05 16:12:35 +00:00
Konstantin Belousov
9f70117263 Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.

Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj
Tested by:	pho
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
Edward Tomasz Napierala
0b208315f4 Improve error reporting when the swap pager runs out of memory.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D19699
2019-03-26 19:11:15 +00:00
Konstantin Belousov
5019dac98a ASLR: check for max_addr after applying randomization, not before.
Otherwise the resulting address from vm_map_find() might not satisfy the
upper limit.  For instance, it could affect the MAP_32BIT flag for 64bit
processes.

Found by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, Doug Moore <dougm@rice.edu>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19688
2019-03-23 16:36:18 +00:00
Mark Johnston
64087fd7f3 Disallow preemptive creation of wired superpage mappings.
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped.  If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages.  However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails.  The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.

For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.

Reviewed by:	kib
Reported by:	syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19670
2019-03-21 19:52:50 +00:00
Konstantin Belousov
45d72c7d7f vm_fault_copy_entry: accept invalid source pages.
Either msync(MS_INVALIDATE) or the object unlock during vnode
truncation can expose invalid pages backing wired entries.  Accept
them, but do not install them into the destination pmap.  We must create
copied pages in the copy case, because e.g. vm_object_unwire() expects
that the entry is fully backed.

Reported by:	syzkaller, via emaste
Reported by:	syzbot+514d40ce757a3f8b15bc@syzkaller.appspotmail.com
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19615
2019-03-20 13:07:57 +00:00
Mark Johnston
3b5b20292b Implement minidump support for RISC-V.
Submitted by:	Mitchell Horne <mhorne063@gmail.com>
Differential Revision:	https://reviews.freebsd.org/D18320
2019-03-06 00:01:06 +00:00
Mateusz Guzik
75d6d57634 vm: remove seq.h inclusion made obsolete by NUMA rewrite
Sponsored by:	The FreeBSD Foundation
2019-02-27 22:42:29 +00:00
Jason A. Harmening
40a5168449 Fix incorrect assertion in vnode_pager_generic_getpages()
Reviewed by:	kib, glebius
MFC after:	1 week
2019-02-26 04:50:46 +00:00
Mark Johnston
2b6010705c Improve vmem tuning for platforms without a direct map.
On platforms without a direct map (i.e., platforms without
UMA_MD_SMALL_ALLOC defined), the boundary tag allocator reserves a
number of tags for use when allocating a new slab of boundary tags,
as such platforms require free boundary tags in order to allocate
boundary tags.  r327899 increased the number of boundary tags required
for a KVA allocation in the worst case, and the aforementioned
reservation was not updated accordingly.  In some cases, this could
lead to a system hang.  Fix the problem by increasing this reservation.

Also reduce KVA_QUANTUM on systems lacking superpage support.
The previous import quantum (4MB with a 4KB page size) was quite large
for systems with limited KVA, and fragmentation in kernel_arena could
cause kernel memory allocation failures even with a substantial amount
of free KVA.

Reported and tested by:	jhibbits
Reviewed by:	alc, kib
No objections:	jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19337
2019-02-25 19:22:13 +00:00
Mark Johnston
46e39081f4 Clear pointers to indicate that the respective locks are released.
This fixes a problem in r344231: vm_pageout_launder() may scan two
queues when swap is disabled.

Reported by:	pho
MFC with:	r344231
2019-02-21 15:44:32 +00:00
Konstantin Belousov
e7a9df16e6 Add kernel support for Intel userspace protection keys feature on
Skylake Xeons.

See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:51:13 +00:00
Mark Johnston
602566044a Remove a redundant flag variable.
Use the object pointer itself to determine whether the object is locked.
No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19215
2019-02-17 16:35:19 +00:00
Gleb Smirnoff
66fb0b1ad7 For 32-bit machines roll back the default number of vnode pager pbufs
to the level before r343030.  For 64-bit machines reduce it slightly,
too.  Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.

Provide a loader tunable to change vnode pager pbufs count. Document it.
2019-02-15 23:36:22 +00:00
Konstantin Belousov
484e9d0322 Make anon clustering more compatible.
Make the clustering enabling knob more fine-grained by providing a
setting where the allocation with hint is not clustered. This is aimed
to be somewhat more compatible with e.g. go 1.4 which expects that
hinted mmap without MAP_FIXED does not change the allocation address.

Now the vm.cluster_anon can be set to 1 to only cluster when no hints,
and to 2 to always cluster.  Default value is 1.

Requested by: peter
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19194
2019-02-14 15:45:53 +00:00
Mark Johnston
f6893f09d5 Implement transparent 2MB superpage promotion for RISC-V.
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.

The changes are largely modelled after amd64.  arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page.  RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED.  Instead, pmap_remove_l2() always invalidates the entire
2MB address range.

pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits.  Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.

Reviewed by:	kib (earlier version)
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18863
Differential Revision:	https://reviews.freebsd.org/D18864
Differential Revision:	https://reviews.freebsd.org/D18865
Differential Revision:	https://reviews.freebsd.org/D18866
Differential Revision:	https://reviews.freebsd.org/D18867
Differential Revision:	https://reviews.freebsd.org/D18868
2019-02-13 17:19:37 +00:00
Pedro F. Giffuni
6929b7d1ab UMA: unsign some variables related to allocation in hash_alloc().
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.

While here also unsign uh_hashmask, it makes little sense to keep that
signed.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19148
2019-02-12 04:33:05 +00:00
Konstantin Belousov
f6d281e8aa struct xswdev on amd64 requires compat32 shims after ino64.
i386 is the only architecture where uint64_t does not specify 8-byte
alignment, which makes struct xswdev layout not compatible between
64bit and i386.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-10 19:01:05 +00:00
Konstantin Belousov
fa50a3552d Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings.  It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory.  This
implementation is relatively small and happens at the correct
architectural level.  Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection.  It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not change the implementation.  The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in.  The initial location for the anon group range
is, of course, randomized.  This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386.  This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs.  Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
  to force ASLR off for the given binary.  (A tool to edit the feature
  control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
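
A short userland example of the per-process query mentioned above (PROC_ASLR_STATUS as introduced by this change):

    #include <sys/types.h>
    #include <sys/procctl.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        int status;

        /* Ask the kernel whether ASLR is in force for this process. */
        if (procctl(P_PID, getpid(), PROC_ASLR_STATUS, &status) == -1)
            err(1, "PROC_ASLR_STATUS");
        printf("aslr status: 0x%x\n", status);
        return (0);
    }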

PR:	208580 (exp runs)
Exp-runs done by:	antoine
Reviewed by:	markj (previous version)
Discussed with:	emaste
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
Konstantin Belousov
5dddee2d65 i386: honor kern.elf32.read_exec for ommap(2) and break(2), as already
done on amd64.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-09 03:56:48 +00:00
Konstantin Belousov
a7f67facdf Normalize the declaration of i386_read_exec variable.
It is currently re-declared in sys/sysent.h, which is a wrong place for
an MD variable.  This causes a redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are both included.

Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-09 03:51:51 +00:00
Gleb Smirnoff
ad66f95865 Now that there is only one way to allocate a slab, remove uz_slab method.
Discussed with:	jeff
2019-02-07 03:55:05 +00:00
Gleb Smirnoff
b47acb0a4d Report cache zones in the UMA stats sysctl that 'vmstat -z' uses. This
should have been part of r251826.
2019-02-07 03:32:45 +00:00
Konstantin Belousov
d22ff6e6a2 contigmalloc: handle M_EXEC.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19092
2019-02-07 02:00:23 +00:00
Mark Johnston
1e2b3e6f92 Allow vm_page_free_prep() to dequeue pages without the page lock.
This is a step towards being able to free pages without the page
lock held.  The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held.  Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.

No functional change intended.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19065
2019-02-03 18:43:20 +00:00
Mark Johnston
d0488e698f Fix a race in vm_page_dequeue_deferred().
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order.  Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued.  Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.

Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19039
2019-02-03 18:38:58 +00:00
Alexander Motin
59568a0e52 Fix integer math overflow in UMA hash_alloc().
512GB of ZFS ABD ARC means abd_chunk zone of 128M 4KB items.  To manage
them UMA tries to allocate a 2GB hash table, whose size does not fit into
an int variable, causing a later allocation failure, which makes the ARC shrink
back below 512GB, not letting it use more RAM.  With this change I
easily reached >700GB ARC size on 768GB RAM machine.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-02-02 04:11:59 +00:00
Gleb Smirnoff
37125720b9 In zone_alloc_bucket() the max argument was calculated based on uz_count.
Then bucket_alloc() also selects the bucket size based on uz_count. However,
since the zone lock is dropped, uz_count may decrease. In this case max may
be greater than ub_entries, and that would result in writing beyond the end
of the allocation.

Reported by:	pho
2019-01-31 17:52:48 +00:00
Mark Johnston
862203935e Correct uma_prealloc()'s use of domainset iterators after r339925.
The iterator should be reinitialized after every successful slab
allocation.  A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]

Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.

Reported and tested by:	slavah
Diagnosed by:	kib [1]
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-01-23 18:58:15 +00:00
Konstantin Belousov
f2a496d667 MI VM: Make it possible to set size of superpage at boot instead of compile time.
In order to allow a single kernel to use PAE pagetables on i386 if
the hardware supports it, and fall back to classic two-level paging
structures if not, the superpage code should be able to adapt to either a 2M
or 4M superpage size.  To that end, I make MI VM structures large enough to
track the biggest possible superpage, by allowing architecture to
define VM_NFREEORDER_MAX and VM_LEVEL_0_ORDER_MAX constants.
Corresponding VM_NFREEORDER and VM_LEVEL_0_ORDER symbols can be
defined as runtime values and must be less than the _MAX constants.
If architecture does not define _MAXs, it is assumed that _MAX ==
normal constant.

Reviewed by:	markj
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18853
2019-01-18 13:35:06 +00:00
Gleb Smirnoff
46b0292a82 Do not reserve KVA for paging bufs in vm_ksubmap_init(), since now
they allocate it in pbuf_init(). This should have been done together
with r343030.
2019-01-16 20:14:16 +00:00
Konstantin Belousov
ea7e7006db Implement shmat(2) flag SHM_REMAP.
Based on the description in Linux man page.
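
A minimal userland illustration of the flag's intended use, following the Linux semantics referenced above (the address is reserved with a throwaway anonymous mapping first):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
        size_t len = 1024 * 1024;
        int id;
        void *addr, *p;

        /* Reserve an address range with an ordinary anonymous mapping. */
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        if (addr == MAP_FAILED)
            err(1, "mmap");

        id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
        if (id == -1)
            err(1, "shmget");

        /* SHM_REMAP lets shmat() replace the existing mapping at addr. */
        p = shmat(id, addr, SHM_REMAP);
        if (p == (void *)-1)
            err(1, "shmat");
        printf("segment attached at %p\n", p);
        return (0);
    }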

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
Gleb Smirnoff
b68d692a3d Whitespace. 2019-01-16 04:02:08 +00:00
Gleb Smirnoff
396694153f Fix compilation failures on different arches that have vm_machdep.c not
aware of counter_u64_t by including counter.h into uma_int.h. I'm not
happy about this inclusion, but it fixes compilation ASAP.
2019-01-15 19:33:47 +00:00
Gleb Smirnoff
e7e4bcd856 style(9): break long line. 2019-01-15 18:50:11 +00:00
Gleb Smirnoff
f8c86a5fde Remove harmless leftover from code that cycles over zone's kegs. Just use +
instead of +=. There is no functional change.
2019-01-15 18:49:31 +00:00
Gleb Smirnoff
bb45b411e2 Only do uz_items accounting for zones that have a limit set in uz_max_items.
This reduces amount of locking required for these zones.

Also, for cache only zones (UMA_ZFLAG_CACHE) accounting uz_items wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA underflowing uz_items.

Tested by:	pho
2019-01-15 18:32:26 +00:00
Gleb Smirnoff
2efcc8cbca Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces amount of data protected by zone lock.

During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.

Tested by:	pho
2019-01-15 18:24:34 +00:00
Gleb Smirnoff
5a8eee2bb4 Fix compilation on 32-bit. 2019-01-15 03:43:46 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting for how many
  pbufs we are going to have.
  In the various subsystems that are going to utilize pbufs, create private
  zones via a call to pbuf_zsecond_create().  The latter calls
  uma_zsecond_create() and sets a limit on the created zone.  After startup,
  preallocate pbufs according to the requirements of all pbuf zones.

  Subsystems that used to have a private limit with the old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use the shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their own private limits, but changing that is out of
  the scope of this commit.

o Fetch the tunable value of kern.nswbuf from init_param2(), and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, which was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed with respect to modern hardware.

This change removes a tight bottleneck from the sendfile(2) operation, which
uses pbufs in the vnode pager.  Other pagers would also benefit from faster
allocation.
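
A rough sketch of how a subsystem picks up a private pbuf zone under the new
scheme.  The zone name, limit, and helper are made up, and the
pbuf_zsecond_create() signature is assumed from the description above:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/bio.h>
  #include <sys/buf.h>
  #include <sys/malloc.h>
  #include <vm/uma.h>

  static uma_zone_t example_pbuf_zone;

  static void
  example_pbuf_init(void)
  {
      /* Private pbuf zone, capped at an illustrative nswbuf / 4 buffers. */
      example_pbuf_zone = pbuf_zsecond_create("examplepbuf", nswbuf / 4);
  }

  static void
  example_io(void)
  {
      struct buf *bp;

      bp = uma_zalloc(example_pbuf_zone, M_WAITOK);
      /* ... set up and issue the I/O ... */
      uma_zfree(example_pbuf_zone, bp);
  }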

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Gleb Smirnoff
bb15d1c778 o Move the zone limit from the keg level up to the zone level. This means
  that two zones sharing a keg may now have different limits. Now this is
  going to work:

  zone = uma_zcreate();
  uma_zone_set_max(zone, limit);
  zone2 = uma_zsecond_create(zone);
  uma_zone_set_max(zone2, limit2);

  Kegs no longer have a uk_maxpages field, but zones have uz_items. When
  set, it may be rounded up to the minimum possible CPU bucket cache size.
  For small limits the bucket cache can also be reconfigured to be smaller.
  The uz_items counter is updated whenever items transition from the keg to a
  bucket cache or directly to a consumer. If a zone has uz_maxitems set and
  it is reached, then we are going to sleep.

o Since new limits don't play well with multi-keg zones, remove them. The
  idea of multi-keg zones was introduced exactly 10 years ago, and never
  had a practical use. In discussion with Jeff we came to a wild
  agreement that if we ever want to reintroduce the idea of a smart allocator
  that would be able to choose between two (or more) totally different
  backing stores, that choice should be made one level higher than UMA,
  e.g. in malloc(9) or in mget(), or whatever, and the choice should be
  controlled by the caller.

o The sleeping code is improved to account for the number of sleepers and wake
  them one by one, to avoid the thundering herd problem.

o The UMA_ZONE_NOBUCKETCACHE flag is removed; instead a uma_zone_set_maxcache()
  KPI is added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
  sure that struct uma_zone is perfectly aligned.

Reviewed by:	markj, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D17773
2019-01-15 00:02:06 +00:00
Gleb Smirnoff
9cc36b3dab Fix a regression in r331368 that broke dumping of UMA startup pages
when WITNESS is present.

Discussed with:	markj
2019-01-07 23:17:09 +00:00
Konstantin Belousov
3fbc2e00d1 Add a tunable which changes the mincore(2) algorithm to only report data
from the local mapping.

Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161

Reviewed by:	markj
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18764
2019-01-07 22:10:48 +00:00
Konstantin Belousov
7af4985245 Add 'v' modifier to the ddb 'show pginfo' command to display vm_page
backing the provided kernel virtual address.

Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-30 15:58:18 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with coccinelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mateusz Guzik
83764b446a vm: use fcmpset for vmspace reference counting
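The pattern, roughly (a sketch of the fcmpset idiom rather than the committed
vmspace code):

  #include <sys/types.h>
  #include <machine/atomic.h>

  /*
   * Drop one reference.  atomic_fcmpset reloads "old" with the current
   * value on failure, so the retry loop avoids an extra explicit load.
   * Returns 1 when the last reference was released.
   */
  static int
  example_refcount_release(volatile u_int *count)
  {
      u_int old;

      old = *count;
      for (;;) {
          if (atomic_fcmpset_int(count, &old, old - 1))
              return (old == 1);
      }
  }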
Sponsored by:	The FreeBSD Foundation
2018-12-07 16:22:54 +00:00
Brooks Davis
d48719bd96 Normalize COMPAT_43 syscall declarations.
Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15816
2018-12-04 16:48:47 +00:00
Konstantin Belousov
10d9120c44 Change the vm_ooffset_t type to unsigned.
The type represents a byte offset in the vm_object_t data space, which
does not span negative offsets in the FreeBSD VM.  The change matches the byte
offset signedness with the unsignedness of vm_pindex_t, which is the
type of the page indexes in the objects.

This allows removal of the UOFF_TO_IDX() macro, which was used when we
had to forcibly interpret the type as unsigned anyway.  It also fixes
a lot of implicit bugs in the device drivers' d_mmap methods.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2018-12-02 13:16:46 +00:00
Konstantin Belousov
a823302783 Allow creating a swap zone larger than v_page_count / 2.
If the user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt.  Before, it was only
possible to reduce the zone via the tunable.

While there, fix the message, which was incorrect when zone
creation rounded the size up.

Reported by:	jmg
Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18381
2018-12-01 16:50:12 +00:00
Eric van Gyzen
5e38e3f5eb Include path for tmpfs objects in vm.objects sysctl
This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by:	kib, dab
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:59:43 +00:00
Eric van Gyzen
0951bd362c Add assertions and comment to vm_object_vnode()
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:18:31 +00:00
Mark Johnston
e31fc3ab13 Update the free page count when blacklisting pages.
Otherwise the free page count will not accurately reflect the physical
page allocator's state.  On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync.  On 12 the
bug would manifest as threads looping in vm_page_alloc().

PR:		231296
Reported by:	mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by:	alc, kib, mav
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18374
2018-11-29 16:31:01 +00:00
Gleb Smirnoff
0b2e3aead3 Fix yet another edge case in uma_startup_count(). If the zone size fits into
several pages but leaves no space for struct uma_slab at the end, we
miscalculate the number of pages by one. Mimic keg_large_init()'s math exactly
here to cover that problem.

Reported by:	gallatin
2018-11-28 19:54:02 +00:00
Gleb Smirnoff
3d5e3df73f For non-offpage zones the slab is placed at the end of the page. The keg's
uk_pgoff is calculated to guarantee that struct uma_slab is placed at
pointer-size alignment. The calculation of the real struct uma_slab size is
done in keg_ctor() and yet again in keg_large_init(), to check whether we
need an extra page. This calculation can actually be performed at compile time.

- Add a SIZEOF_UMA_SLAB macro to calculate the size of a struct uma_slab placed
  at the end of a page with the alignment requirement (a rough sketch follows
  below).
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not
  a functional change.
- Use SIZEOF_UMA_SLAB in the UMA_SLAB_SPACE definition and in keg_small_init().
  This is a potential bugfix, but in reality I don't think there are any
  systems affected, since the compiler aligns struct uma_slab anyway.
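
The idea, in sketch form (this is not the committed macro; it only illustrates
rounding the slab header up to pointer alignment at compile time):

  /*
   * Illustrative only: space consumed by a struct uma_slab placed at
   * the end of a page, rounded up so the header stays pointer-aligned.
   */
  #define EXAMPLE_SIZEOF_UMA_SLAB \
      (roundup2(sizeof(struct uma_slab), sizeof(void *)))

  /* Usable item space in a non-offpage slab page. */
  #define EXAMPLE_UMA_SLAB_SPACE (PAGE_SIZE - EXAMPLE_SIZEOF_UMA_SLAB)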
2018-11-28 19:17:27 +00:00
Konstantin Belousov
6e00f3a311 Avoid unneeded check in vmspace_alloc().
All vmspace_alloc() callers know which kind of pmap they allocate.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18329
2018-11-25 17:56:49 +00:00
Ben Widawsky
f82dd310bb linuxkpi: Use pageproc instead of vmproc
According to markj@:
pageproc contains the page daemon and laundry threads, which are
responsible for managing the LRU page queues and writing back dirty
pages.  vmproc's main task is to swap out kernel stacks when the system
is under memory pressure, and swap them back in when necessary.  It's a
somewhat legacy component of the system and isn't required.  You can
build a kernel without it by specifying "options NO_SWAPPING" (which is
a somewhat misleading name), in which vm_swapout_dummy.c is compiled
instead of vm_swapout.c.

Based on this, we want pageproc to emulate kswapd, not vmproc.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D18061
2018-11-21 04:34:18 +00:00
Ben Widawsky
c3f4f28c63 linuxkpi: Add some basic swap functions
These are used by kms-drm to determine various heuristics related to
memory conditions.

The number of free swap pages is just a variable, and it could be
obtained much more cheaply by either adding a new getter, or simply
extern'ing swap_total. However, this patch opts to use the more
expensive, existing interface, since this isn't an operation in a
performance-critical path.

This allows us to remove some more GPL linuxkpi code and do the
following in kms-drm:
git rm linuxkpi/gplv2/include/linux/swap.h

Reviewed by:    mmacy, Johannes Lundberg <johalun0@gmail.com>
Approved by:    emaste (mentor)
Differential Revision:  https://reviews.freebsd.org/D18052
2018-11-20 22:49:19 +00:00
Alan Cox
541a117532 Use swp_pager_isondev() throughout.
Change swp_pager_isondev()'s return type to bool.

Submitted by:	ota@j.email.ne.jp
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16712
2018-11-19 17:17:23 +00:00
Alan Cox
92e78c1012 Tidy up vm_map_simplify_entry() and its recently introduced helper
functions.  Notably, reflow the text of some comments so that they
occupy fewer lines, and introduce an assertion in one of the new
helper functions so that it is not misused by a future caller.

In collaboration with:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17635
2018-11-18 01:27:17 +00:00
Mark Johnston
0f9b7bf37a Add accounting to per-domain UMA full bucket caches.
In particular, track the current size of the cache and maintain an
estimate of its working set size.  This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages.  As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with:	alc, glebius, jeff
Tested by:	pho (previous version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D16666
2018-11-13 19:44:40 +00:00
Mark Johnston
0e48e06807 Re-apply r336984, reverting r339934.
r336984 exposed the bug fixed in r340241, leading to the initial revert
while the bug was being hunted down.  Now that the bug is fixed, we
can revert the revert.

Discussed with:	alc
MFC after:	3 days
2018-11-10 20:33:08 +00:00
Mark Johnston
150d384e5c Fix a use-after-free in swp_pager_meta_free().
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934.  In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed.  After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17897
2018-11-07 23:28:11 +00:00
Mark Johnston
07702f72e5 Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map.
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.

Reported by:	jhb
Reviewed by:	alc, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17827
2018-11-06 21:57:03 +00:00
Mark Johnston
8002c3a495 Initialize last_target in the laundry thread control loop.
In practice it is always initialized because nfreed must be positive
in order to trigger background laundering, but this isn't obvious.

CID:		1387997
MFC after:	1 week
2018-11-06 02:52:54 +00:00
Mark Johnston
2203c46d87 Initialize the eflags field of vm_map headers.
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14005
2018-11-02 16:26:44 +00:00
Mark Johnston
5d277a85ad Revert r336984.
It appears to be responsible for random segfaults observed when lots
of paging activity is taking place, but the root cause is not yet
understood.

Requested by:	alc
MFC after:	now
2018-10-30 22:40:40 +00:00
Mark Johnston
9978bd996b Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
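
A short usage sketch of the new KPI (the malloc type and function below are
invented for illustration):

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/kernel.h>
  #include <sys/malloc.h>
  #include <sys/domainset.h>

  MALLOC_DEFINE(M_EXAMPLE, "example", "illustrative allocations");

  /*
   * Prefer "domain" but let the policy fall back to other domains
   * instead of sleeping forever on a depleted one.
   */
  static void *
  example_alloc(size_t size, int domain)
  {
      return (malloc_domainset(size, M_EXAMPLE, DOMAINSET_PREF(domain),
          M_WAITOK));
  }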

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
Mark Johnston
920239efde Fix some problems that manifest when NUMA domain 0 is empty.
- In uma_prealloc(), we need to check for an empty domain before the
  first allocation attempt, not after.  Fix this by switching
  uma_prealloc() to use a vm_domainset iterator, which addresses the
  secondary issue of using a signed domain identifier in round-robin
  iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
  excluding empty domains; otherwise we may frequently specify an empty
  domain when calling in to the page allocator, wasting CPU time.
  Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
  total page counts for now: some vm_phys segments are created before
  the SRAT is parsed and are thus always identified as being in domain 0
  even when they are not.  Then, when bootstrap pages are freed, they
  are added to a domain that we had previously thought was empty.  Until
  this is corrected, we simply exclude them from the per-domain page
  count.

Reported and tested by:	Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17704
2018-10-30 17:57:40 +00:00
Alan Cox
9f1abe3df4 Eliminate typically pointless calls to vm_fault_prefault() on soft, copy-
on-write faults.  On a page fault, when we call vm_fault_prefault(), it
probes the pmap and the shadow chain of vm objects to see if there are
opportunities to create read and/or execute-only mappings to neighboring
pages.  For example, in the case of hard faults, such effort typically pays
off, that is, mappings are created that eliminate future soft page faults.
However, in the case of soft, copy-on-write faults, the effort very
rarely pays off.  (See the review for some specific data.)

Reviewed by:	kib, markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D17367
2018-10-27 17:49:46 +00:00
Mark Johnston
17fbf3cf34 Add a !NUMA definition for vm_domainset_iter_policy_ref_init().
Pointy hat:	markj
X-MFC with:	r339661
Sponsored by:	The FreeBSD Foundation
2018-10-24 17:09:20 +00:00
Mark Johnston
7571e24901 Add an #include required after r339686.
X-MFC with:	r339686
Sponsored by:	The FreeBSD Foundation
2018-10-24 16:49:16 +00:00
Mark Johnston
194a979ee9 Use a vm_domainset iterator in keg_fetch_slab().
Previously, it used a hand-rolled round-robin iterator.  This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with:	jeff
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17420
2018-10-24 16:41:47 +00:00
Mark Johnston
87ab1a10b1 Initialize static domainsets regardless of whether an SRAT is present.
Reported by:	yuripv
X-MFC with:	r339452
Sponsored by:	The FreeBSD Foundation
2018-10-23 18:07:16 +00:00
Mark Johnston
4c29d2de67 Refactor domainset iterators for use by malloc(9) and UMA.
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc".  The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy.  Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations.  To that end, refactor the vm_domainset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.
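
The consumer-side pattern looks roughly like this; the iterator function names
follow the vm_domainset KPI described above, but the exact signatures and the
per-domain backend are assumptions made for the sketch:

  #include <sys/param.h>
  #include <sys/domainset.h>
  #include <vm/vm_domainset.h>

  /* Hypothetical per-domain backend, for illustration only. */
  static vm_offset_t example_alloc_domain(int domain, vm_size_t size, int flags);

  /*
   * Walk the domains permitted by a policy, trying each in turn until
   * an allocation succeeds or the iterator is exhausted.
   */
  static vm_offset_t
  example_alloc_policy(struct domainset *ds, vm_size_t size, int flags)
  {
      struct vm_domainset_iter di;
      vm_offset_t addr;
      int domain;

      vm_domainset_iter_policy_init(&di, ds, &domain, &flags);
      do {
          addr = example_alloc_domain(domain, size, flags);
          if (addr != 0)
              return (addr);
      } while (vm_domainset_iter_policy(&di, &domain) == 0);
      return (0);
  }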

Reviewed by:	jeff (previous version)
Tested by:	pho (part of a larger patch)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17417
2018-10-23 16:35:58 +00:00
Mark Johnston
b61f314290 Make it possible to disable NUMA support with a tunable.
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration.  With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR:		231460
Reviewed by:	alc, gallatin, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17439
2018-10-22 20:13:51 +00:00
Mark Johnston
2801dd08d7 Fix the build after r339601.
I committed some patches out of order and didn't build-test one of them.

Reported by:	Jenkins, O. Hartmann <ohartmann@walstatt.org>
X-MFC with:	r339601
2018-10-22 17:19:48 +00:00
Mark Johnston
2a843ae7d9 Avoid a redundancy in a comment updated by r339601.
Reported by:	alc
X-MFC with:	r339601
2018-10-22 17:17:30 +00:00
Mark Johnston
b00581965d Swap in processes unless there's a global memory shortage.
On NUMA systems, we would not swap in processes unless all domains
had some free pages.  This is too conservative in general.  Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.

Reported and tested by:	pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17304
2018-10-22 17:04:04 +00:00
Gleb Smirnoff
81c0d72c60 If we lost the race or were migrated during bucket allocation for the per-CPU
cache, then we put the new bucket on the generic bucket cache. However, the code
didn't honor the UMA_ZONE_NOBUCKETCACHE flag, so potentially we could start a
cache on a zone that clearly forbids that. Fix this.

Reviewed by:	markj
2018-10-22 15:48:07 +00:00
Konstantin Belousov
17afd2beec Unindent vm_map_simplify_entry() after r339506.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17632
2018-10-21 00:11:56 +00:00
Konstantin Belousov
074244628b Reduce code duplication in merging vm_entry neighbors.
Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17610
2018-10-20 23:08:04 +00:00
Mark Johnston
662e7fa8d9 Create some global domainsets and refactor NUMA registration.
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by:	alc, gallatin, jeff, kib
Tested by:	pho (part of a larger patch)
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17416
2018-10-20 17:36:00 +00:00
Matt Macy
e8bb589d56 eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in an mmap() speedup and a 70% increase in brk calls per second.
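
The gist of the accounting change, as a sketch; the variable names and the
overcommit check are simplified assumptions, not the committed code:

  #include <sys/param.h>
  #include <machine/atomic.h>

  static u_long example_swap_reserved;  /* pages, stands in for swap_reserve */
  static u_long example_swap_total;     /* pages, stands in for swap_total */

  /*
   * Reserve "incr" bytes of swap.  The counter is kept in pages so a
   * single fetchadd replaces the old global mutex; back the addition
   * out if it would overcommit.
   */
  static int
  example_swap_reserve(vm_ooffset_t incr)
  {
      u_long pincr, prev;

      pincr = atop(incr);
      prev = atomic_fetchadd_long(&example_swap_reserved, pincr);
      if (prev + pincr > example_swap_total) {
          atomic_subtract_long(&example_swap_reserved, pincr);
          return (0);     /* reservation denied */
      }
      return (1);
  }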

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
Mark Johnston
93db904d19 Use an unsigned iterator for domain sets.
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).

Reported by:	pho
Reviewed by:	kib
Approved by:	re (rgrimes)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17374
2018-10-01 18:51:39 +00:00
Andrew Gallatin
30c5525b3c Allow empty NUMA memory domains to support Threadripper2
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2  memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
    rather than failing to parse the table
- not running the pageout daemon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
    Thanks to Jeff for suggesting this strategy.

Reviewed by:	alc, markj
Approved by:	re (gjb@)
Differential Revision:	https://reviews.freebsd.org/D1683
2018-10-01 14:14:21 +00:00
Konstantin Belousov
c62637d679 Correct vm_fault_copy_entry() handling of backing file truncation
after the file mapping was wired.

If a wired map entry is backed by a vnode and the file is truncated,
the corresponding pages are invalidated.  vm_fault_copy_entry() should be
aware of this and allow for invalid pages past the end of the file.  Also, such
pages should not be mapped into userspace.  If userspace accesses the
truncated part of the mapping later, it gets a signal; there is no way the
kernel can prevent the page fault.

Reported by:	andrew using syzkaller
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:11:38 +00:00
Konstantin Belousov
9f25ab83f9 In vm_fault_copy_entry(), we should not assert that entry is charged
if the dst_object is not of swap type.

It can only happen when the entry does not require a copy; otherwise
vm_map_protect() already adds the charge. So the assert was right for
the case where the swap object was allocated in vm_fault_copy_entry(),
but not when it was just copied from src_entry and its type is not
swap.

Reported by:	andrew using syzkaller
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:11:01 +00:00
Konstantin Belousov
a60d3db15e In vm_fault_copy_entry(), collect the code to initialize a newly
allocated dst_object in a single place.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:10:12 +00:00
Mark Johnston
463406ac4a Add more NUMA-specific low memory predicates.
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.

Reviewed by:	alc, jeff, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17278
2018-09-24 19:24:17 +00:00
Alan Cox
f5fbe90de4 Passing UMA_ZONE_NOFREE to uma_zcreate() for swpctrie_zone and swblk_zone is
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone.  (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)

Reviewed by:	kib, markj
Approved by:	re (gjb)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17296
2018-09-24 16:49:02 +00:00
Mark Johnston
3d14a7bb43 Ensure that "domain" is initialized when vm_ndomains == 1.
Reported by:	alc
Approved by:	re (gjb)
2018-09-24 15:32:46 +00:00
Mark Johnston
969e147aff Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned.
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.

Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.

Reported by:	alc
Reviewed by:	alc, kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17249
2018-09-20 18:29:55 +00:00
Mark Johnston
25ed23cfbb Change the domain selection policy in kmem_back().
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.

PR:		231038
Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17248
2018-09-20 15:45:12 +00:00
Mark Johnston
1aed6d48a8 Move kernel vmem arena initialization to vm_kern.c.
This keeps the initialization coupled together with the kmem_* KPI
implementation, which is the main user of these arenas.

No functional change intended.

Reviewed by:	alc
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17247
2018-09-19 19:13:43 +00:00
Mateusz Guzik
c035292545 vm: check for empty kstack cache before locking
The current cache logic checks the total number of stacks in the kernel,
which even on small boxes significantly exceeds the 128 limit (e.g. an
8-way box with zfs has almost 800 stacks allocated).

Stacks are cached earlier for each main thread.

As a result the code is rarely executed, but when it is (on boxes like
the above) it always fails. Since no provisions are made for NUMA and
release time is approaching, just do a quick check to avoid acquiring the
lock.

Approved by:	re (kib)
2018-09-19 16:02:33 +00:00
Mark Johnston
26fe2217bf Only update the domain cursor once in keg_fetch_slab().
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor.  This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.

Reported and tested by:	Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, cem
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17209
2018-09-18 17:51:45 +00:00
Mateusz Guzik
2554f86a8d vm: stop taking proc lock in mmap to satisfy racct if it is disabled
Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.

This eliminates the second most common place where the lock is taken during
package building.

While here don't take the lock in mlockall either.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17210
2018-09-18 01:24:30 +00:00
Mark Johnston
7a364d458a Split some checks in vm_page_activate() to make it easier to read.
No functional change intended.

Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17028
2018-09-10 18:59:23 +00:00
Mark Johnston
5a7f993702 Relax an assertion in vm_pqbatch_process_page().
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page.  In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.

Reviewed by:	alc, kib
Reported and tested by:	pho
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17027
2018-09-08 21:49:43 +00:00
Mark Johnston
c56c7299c2 Use the correct terminology.
Reported by:    kib
Approved by:	re (gjb)
Differential revision:  https://reviews.freebsd.org/D16191
2018-09-06 20:02:19 +00:00
Mark Johnston
23984ce5cd Avoid resource deadlocks when one domain has exhausted its memory. Attempt
other allowed domains if the requested domain is below the minimum paging
threshold.  Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.

Submitted by:	jeff
Reported by:	mjg
Reviewed by:	alc, kib, markj
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D16191
2018-09-06 19:28:52 +00:00
Mark Johnston
21f01f4584 Remove vm_page_remque().
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276.  As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.

Reviewed by:	kib
Tested by:	pho
Approved by:	re (marius)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17025
2018-09-06 16:17:45 +00:00
Alan Cox
72aebdd742 Recent changes have created, for the first time, physical memory segments
that can be coalesced.  To be clear, fragmentation of phys_avail[] is not
the cause.  This fragmentation of vm_phys_segs[] arises from the "special"
calls to vm_phys_add_seg(), in other words, not those that derive directly
from phys_avail[], but those that we create for the initial kernel page
table pages and now for the kernel and modules loaded at boot time.  Since
we sometimes iterate over the physical memory segments, coalescing these
segments at initialization time is a worthwhile change.

Reviewed by:	kib, markj
Approved by:	re (rgrimes)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16976
2018-09-02 18:29:38 +00:00
Konstantin Belousov
f0165b1ca6 Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.

Based on the submission by:	royger
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Approved by:	re (marius)
Differential revision:	https://reviews.freebsd.org/D16881
2018-08-29 12:24:19 +00:00
Mark Murray
19fa89e938 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
Alan Cox
49bfa624ac Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
Gleb Smirnoff
306abf0f35 Either "free" or "allocated" is misleading here, since an item
in a bucket is free from the perspective of the UMA consumer, and it is
allocated from the perspective of the keg.

Discussed with:	markj
Approved by:	re (kib)
2018-08-24 18:47:50 +00:00
Gleb Smirnoff
a307fb5b0c Fix comment. The actual meaning of ub_cnt is the opposite. 2018-08-23 23:24:28 +00:00
Mark Johnston
899fe184c7 Add a per-pagequeue pdpages counter.
Expose these counters under the vm.domain sysctl node.  The existing
vm.stats.vm.v_pdpages sysctl is preserved.

Reviewed by:	alc (previous version)
Differential Revision:	https://reviews.freebsd.org/D14666
2018-08-23 21:03:45 +00:00
Mark Johnston
99d92d732f Ensure that queue state is cleared when vm_page_dequeue() returns.
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held.  When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held.  Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.

Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.

While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked().  Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.

Reported and tested by:	pho
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D15980
2018-08-23 20:34:22 +00:00
Alan Cox
83a90bffd8 Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by:	kib, markj
Discussed with:	jeff, re@
Differential Revision:	https://reviews.freebsd.org/D16825
2018-08-21 16:43:46 +00:00
Alan Cox
44d0efb215 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
Alan Cox
db7c2a4822 Eliminate the unused arena parameter from kmem_alloc_attr().
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16793
2018-08-18 22:07:48 +00:00
Alan Cox
067fd85894 Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice.  In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16713
2018-08-18 18:33:50 +00:00
Konstantin Belousov
c1344d2bbe Prevent some parallel swap-ins, rate-limit swapper swap-ins.
If faultin() was called outside the swapper (from PHOLD()), do not allow
the swapper to initiate additional swap-ins.  Swapper-initiated swap-ins
are serialized because they are synchronous and executed in the
context of thread0.  With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage; usually they do not need to.

Rate-limit the swapper's swap-ins to one per MAXSLP / 2 seconds
interval, counting faultin() swap-ins.

Suggested by:	alc
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16610
2018-08-13 16:48:46 +00:00
Mark Johnston
b50a4ea646 Account for the lowmem handlers in the inactive queue scan target.
Before r329882 the target would be computed after lowmem handlers run
and free pages.  On some systems a significant amount of page
reclamation happens this way.  However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.

Instead, adjust the target after running lowmem handlers.  Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.

Reviewed by:	alc, kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16606
2018-08-09 18:25:49 +00:00
Alan Cox
2bf8cb3804 Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping.  (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.

Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).

In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter().  Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[].  All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.

Reviewed by:	kib, markj, mmel
Tested by:	mmel
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16555
2018-08-08 16:55:01 +00:00
Alan Cox
78f1deeffe Defer and aggregate swap_pager_meta_build frees.
Before swp_pager_meta_build replaces an old swapblk with a new one,
it frees the old one.  To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.

Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.
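
Roughly, the helper pair works like this (a sketch based on the description;
the flush call and exact parameters are assumptions, not the committed code):

  #include <sys/types.h>
  #include <sys/blist.h>

  /*
   * Track a contiguous run of swap blocks to free, flushing the run
   * whenever a non-adjacent block shows up.
   */
  static void
  example_init_freerange(daddr_t *start, daddr_t *num)
  {
      *start = SWAPBLK_NONE;
      *num = 0;
  }

  static void
  example_update_freerange(daddr_t *start, daddr_t *num, daddr_t blk)
  {
      if (*start + *num == blk) {
          (*num)++;               /* extends the current run */
      } else {
          if (*start != SWAPBLK_NONE)
              swp_pager_freeswapspace(*start, *num);  /* assumed flusher */
          *start = blk;
          *num = 1;
      }
  }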

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib, markj (an earlier version)
Tested by:	pho
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13707
2018-08-08 02:30:34 +00:00
Konstantin Belousov
a70e9a1388 Swap in WKILLED processes.
A swapped-out process that is WKILLED must be swapped in as soon as
possible.  The reason is that such a process can be killed by OOM and
its pages can only be freed if the process exits.  To exit, the kernel
stack of the process must be mapped.

When allocating pages for the stack of the WKILLED process on swap in,
use VM_ALLOC_SYSTEM requests to increase the chance of the allocation
to succeed.

Add counter of the swapped out processes to avoid unneeded iteration
over the allprocs list when there is no work to do, reducing the
allproc_lock ownership.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16489
2018-08-04 20:45:43 +00:00
Mark Johnston
c16bd872dc Add the required page accounting to kmem_bootstrap_free().
Reviewed by:	alc, kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16581
2018-08-03 16:35:37 +00:00
Konstantin Belousov
e45b89d23d Add pmap_is_valid_memattr(9).
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15583
2018-08-01 18:45:51 +00:00
Konstantin Belousov
6e1d2cf679 For compat32, emulate the same wraparound check as occurs on the real
ILP32 system.

Reported by and discussed with:	asomers
PR:	230162
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16525
2018-07-31 18:00:47 +00:00
Alan Cox
005783a0a6 Allow vm object coalescing to occur in the midst of a vm object when the
OBJ_ONEMAPPING flag is set.  In other words, allow recycling of existing
but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set.

Such situations are increasingly common with jemalloc >= 5.0.  This
change has the expected effect of reducing the number of vm map entry and
object allocations and increasing the number of superpage promotions.

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16501
2018-07-31 17:41:48 +00:00
Alan Cox
737e25f7eb To date, mlockall(MCL_FUTURE) has had the unfortunate side effect of
blocking vm map entry and object coalescing for the calling process.
However, there is no reason that mlockall(MCL_FUTURE) should block
such coalescing.  This change enables it.

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16413
2018-07-28 04:06:33 +00:00
Warner Losh
67d33338c0 Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM.
There's no difference between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).

Differential Revision: https://reviews.freebsd.org/D16290
2018-07-27 18:34:20 +00:00
Mark Johnston
6c85795a25 Fix handling of KVA in kmem_bootstrap_free().
Do not use vm_map_remove() to release KVA back to the system.  Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables.  Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().

Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.

While here, port r329171 to i386.

Reported by:	alc
Reviewed by:	alc, kib
X-MFC with:	r336505
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16426
2018-07-27 15:46:34 +00:00
Li-Wen Hsu
03154ade2a Use __riscv to determine building for RISC-V
Reviewed by:	br
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16398
2018-07-23 19:49:54 +00:00
Mark Johnston
398a929f42 Add support for pmap_enter(psind = 1) to the arm64 pmap.
See the commit log messages for r321378 and r336288 for descriptions of
this functionality.

Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D16303
2018-07-20 16:37:04 +00:00
Mark Johnston
483f692ea6 Have preload_delete_name() free pages backing preloaded data.
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data.  This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded.  Previously,
it would have remained unused.

Reviewed by:	kib, royger
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16330
2018-07-19 20:00:28 +00:00
Alan Cox
103cc0f6ea Revert r329254. The underlying cause for the copy-on-write problem in
multithreaded programs that was addressed by r329254 was in the
implementation of pmap_enter() on some architectures, notably, amd64.
kib@, markj@ and I have audited all of the pmap_enter() implementations,
and fixed the broken ones, specifically, amd64 (r335784, r335971), i386
(r336092), mips (r336248), and riscv (r336294).

To be clear, the reason to address the problem within pmap_enter() and
revert r329254 is not just a matter of principle.  An effect of r329254
was that a copy-on-write fault actually entailed two page faults, not
one, even for single-threaded programs.  Now, in the expected case for
either single- or multithreaded programs, we are back to a single page
fault to complete a copy-on-write operation.  (In extremely rare
circumstances, a multithreaded program could suffer two page faults.)

Reviewed by:	kib, markj
Tested by:	truckman
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D16301
2018-07-19 17:01:10 +00:00
Alan Cox
d7aeb429a0 Test PGA_REFERENCED after calling pmap_ts_referenced(), rather than before,
so that a reference from a concurrently destroyed mapping is observed
during the current scan.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16277
2018-07-15 19:25:15 +00:00
Alan Cox
8c0873714c Add support for pmap_enter(..., psind=1) to the i386 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2 or 4 MB
page mapping.  (Essentially, this feature allows the machine-independent
layer to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.

Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_4mpage(), that is used to prefault 2 or 4 MB read- and/or
execute-only mappings for execve(2), mmap(2), and shmat(2).

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D16246
2018-07-14 17:20:27 +00:00
Mateusz Guzik
efb6d4a479 uma: whack main zone counter update in the slow path, freeing side
See r333052.
2018-07-12 22:35:52 +00:00
Mark Johnston
013072f04c Fix pre-SI_SUB_CPU initialization of per-CPU counters.
r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones.  Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.

r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous.  Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed.  The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.

Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO.  Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.

Reported and tested by:	pho (previous version)
Reviewed by:		kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D16190
2018-07-10 00:18:12 +00:00
Sean Bruno
a03af34228 Wrap the declaration and assignment of "stripe" with #ifdef NUMA declarations
as not all targets are NUMA aware.

Found with gcc.

Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16113
2018-07-07 13:37:44 +00:00
Jeff Roberson
2ef6727edd Use the ticks since the last update to reduce hysteresis in the partpopq and
contention on the vm_reserv_domain lock.

This gives a roughly 8x speedup on will-it-scale fault1 on a 16 core machine.

Reviewed by:	alc, kib, markj
2018-07-07 01:54:45 +00:00
Konstantin Belousov
32f0fefc39 Save a call to pmap_remove() if entry cannot have any pages mapped.
Due to the way rtld creates mappings for the shared objects, each dso
causes unmap of at least three guard map entries.  For instance, in
the buildworld load, this change reduces the amount of pmap_remove()
calls by 1/5.

Profiled by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16148
2018-07-06 12:44:48 +00:00
Konstantin Belousov
be7be41275 Style: no need for braces around single-line then clause.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D16148
2018-07-06 12:37:46 +00:00
Matt Macy
ab3059a8e7 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
Andrew Turner
2bf9501287 Create a new macro for static DPCPU data.
On arm64 (and possibly other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
Konstantin Belousov
a66d7a8ddc Copyout(9) on 4/4 i386 needs correct vm_page_array[].
On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold()
on an arbitrary userspace mapping.  If the mapping is backed by the
non-managed cdev pager or by the sg pager, on dense configs we might
access an arbitrary element of vm_page_array[], in particular, one not
corresponding to a page from the memory segment.  Initialize such pages
as fictitious with the corresponding physical address.

Reported by:	bde
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16085
2018-07-05 16:43:15 +00:00
Alan Cox
370a338a7d Allow callers to vm_phys_split_pages() to specify whether insertion should
occur at the head or the tail of the page queues.
2018-07-05 02:08:57 +00:00
Matt Macy
f4b3640475 inline atomics and allow tied modules to inline locks
- inline atomics in modules on i386 and amd64 (they were always
  inline on other arches)
- allow modules to opt in to inlining locks by specifying
  MODULE_TIED=1 in the makefile

Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
2018-07-02 19:48:38 +00:00
Alan Cox
7493904eca Introduce vm_phys_enq_range(), and call it in vm_phys_alloc_npages()
and vm_phys_alloc_seg_contig() instead of vm_phys_free_contig().  In
short, vm_phys_enq_range() is simpler and faster than the more general
vm_phys_free_contig(), and in the case of vm_phys_alloc_seg_contig(),
vm_phys_free_contig() was placing the excess physical pages at the
wrong end of the queues.

In collaboration with:	Doug Moore <dougm@rice.edu>
2018-07-02 17:18:46 +00:00
Alan Cox
9161b4de54 Three changes to vm_phys_alloc_seg_contig():
1. Optimize the order computation.

2. Update the pool for all of the chunks that are removed from the free
   page lists, and not just the first chunk.

3. Simplify the code for returning excess pages to the free page lists.

Reviewed by:	Doug Moore <dougm@rice.edu>
2018-06-29 04:08:14 +00:00
Alan Cox
32d81f21b9 Reflow one of the comments describing vm_phys_alloc_npages(). 2018-06-28 17:52:06 +00:00
Ed Maste
e8a1ec3e05 Split kern_break from sys_break and use it in linuxulator
Previously the linuxulator's linux_brk invoked the FreeBSD sys_break
syscall implementation directly.  Instead, move the bulk of the existing
implementation to kern_break, and call that from both sys_break and
linux_brk.

This also addresses a minor bug in linux_brk in that we now return the
actual (rounded up) break address, rather than the requested value.

Reviewed by:	brooks (earlier version)
Sponsored by:	Turing Robotic Industries
Differential Revision:	https://reviews.freebsd.org/D16019
2018-06-27 14:45:13 +00:00
Alan Cox
89ea39a727 Update the physical page selection strategy used by vm_page_import() so
that it does not cause rapid fragmentation of the free physical memory.

Reviewed by:	jeff, markj (an earlier version)
Differential Revision:	https://reviews.freebsd.org/D15976
2018-06-26 18:29:56 +00:00
Mateusz Guzik
a3d799fbb5 vm: stop passing M_ZERO when allocating radix nodes
Allocation explicitly initialized the 3 leading fields. The rest is an
array which is supposed to be NULL-ed prior to deallocation.

Delegate zeroing to the infrequently called object initializer.

This gets rid of one of the most common memset consumers.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D15989
2018-06-24 13:08:05 +00:00
Jeff Roberson
63b5557b2f Sort uma_zone fields according to 64 byte cache line with adjacent line
prefetch on 64bit architectures.  Prior to this, two lines were needed
for the fast path and each line may fetch an unused adjacent neighbor.
 - Move fields used by the fast path into a single line.
 - Move constants into the adjacent line which is mostly used for
   the spare bucket alloc 'medium path'.
 - Unpad the mtx which is only used by the fast path and place it in
   a line with rarely used data.  This aligns the cachelines better and
   eliminates 128 bytes of wasted space.

This gives a 45% improvement on a will-it-scale test on a 24 core machine.

Reviewed by:	mmacy
2018-06-23 08:10:09 +00:00
Ian Lepore
c5b7751fa2 Eliminate a spurious panic on non-SMP systems (occurred on shutdown/reboot). 2018-06-22 20:22:26 +00:00
Ruslan Bukin
b47999470d Fix uma_zalloc_pcpu_arg() operation in case of !SMP build.
Reviewed by:	mjg
Sponsored by:	DARPA, AFRL
2018-06-21 11:43:54 +00:00
Brooks Davis
9da5364ed9 Name the implementation of brk and sbrk sys_break().
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestige of a need for it to be called something else (e.g. obreak)
was removed in r225617, which consistently prefixed all syscall
implementations.

Reviewed by:	emaste, kib (older version)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15638
2018-06-14 21:27:25 +00:00
Konstantin Belousov
b7b8a09658 Handle the race between fork/vm_object_split() and faults.
If a fault started before vmspace_fork() locked the map, and then, during
the fork, vm_map_copy_entry()->vm_object_split() is executed, it is
possible that the fault instantiates the page into the original object
when the page was already copied into the new object (see
vm_object_split() for the orig/new objects terminology).  This can happen
if the split found a busy page (e.g. from the fault) and slept, dropping
the object's lock, which allows the swap pager to instantiate
read-behind pages for the fault.  Then the restart of the scan can see
a page in the already-scanned range, where it was already copied to the
upper object.

Fix it by instantiating the read-ahead pages before the
swap_pager_getpages() method drops the lock to allocate the pbuf.  The
object scan would then see the whole range prefilled with busy pages
and would not process that range.

Note that vm_fault rechecks the map generation count after the object
unlock, so that it restarts the handling if raced with split, and
re-lookups the right page from the upper object.

In collaboration with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-06-14 19:41:02 +00:00
Jonathan T. Looney
0766f278d8 Make UMA and malloc(9) return non-executable memory in most cases.
Most kernel memory that is allocated after boot does not need to be
executable.  There are a few exceptions.  For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9).  The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory.  This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory.  This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified.  After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices.  They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interface to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.
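
As a rough illustration of the new interface (the malloc type and buffer
size here are arbitrary choices, not taken from this change), a caller
that genuinely needs executable memory now has to ask for it explicitly:

	#include <sys/param.h>
	#include <sys/malloc.h>

	void *
	alloc_code_buf(size_t len)
	{
		/*
		 * Ordinary malloc(9) calls now return non-executable memory;
		 * code buffers (e.g. for a JIT) must pass M_EXEC explicitly.
		 */
		return (malloc(len, M_TEMP, M_WAITOK | M_EXEC));
	}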

PR:		228927
Reviewed by:	alc, kib, markj, jhb (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15691
2018-06-13 17:04:41 +00:00
Mateusz Guzik
4e180881ae uma: implement provisional api for per-cpu zones
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.

In particular, contrary to the claim in r334824, M_ZERO is sometimes
used for such zones.  But the zeroing method is completely different, and
branching on it in the fast path for regular zones is a waste of time.
2018-06-08 21:40:03 +00:00
Mateusz Guzik
b8af2820f6 uma: fix up r334824
It turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just drop the
flag in that place.
2018-06-08 05:40:36 +00:00
Mateusz Guzik
ea99223ec9 uma: remove M_ZERO support for pcpu zones
Nothing in the tree uses it and pcpu zones have a fundamentally different use
case than the regular zones - they are not supposed to be allocated and freed
all the time.

This reduces pollution in the allocation fast path.
2018-06-08 03:16:16 +00:00
Gleb Smirnoff
c5deaf0452 UMA memory debugging enabled with INVARIANTS consists of two things:
trashing freed memory and checking that allocated memory is properly
trashed, and keeping a bitset of freed items. Trashing/checking
creates a lot of CPU cache poisoning, while keeping the debugging bitsets
consistent creates a lot of contention on the UMA zone lock(s). The performance
difference between an INVARIANTS kernel and a normal one is mostly attributed
to UMA debugging, rather than to all the KASSERT checks in the kernel.

Add a loader tunable, vm.debug.divisor, that allows UMA debugging either
to be turned off completely or enabled only for a fraction of allocations,
while still running all KASSERTs in the kernel. That allows INVARIANTS
kernels to be run in production environments without the achievable load
dropping by orders of magnitude, while still doing useful extra checks.

The default value is 1, meaning every allocation is debugged. A value of 0
disables UMA debugging completely. Values above 1 enable debugging only
for every N-th item. It isn't possible to follow the number strictly, but
the amount of debugging is still reduced by roughly (N-1)/N, i.e. to about
one N-th of the allocations.
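
A minimal sketch of how such a divisor-based decision can be made (the
variable and function names here are illustrative, not the actual
uma_core.c code):

	static u_int dbg_divisor = 1;	/* set from vm.debug.divisor */
	static u_long dbg_counter;	/* allocations seen so far */

	static bool
	debug_this_item(void)
	{
		if (dbg_divisor == 0)
			return (false);	/* debugging disabled */
		if (dbg_divisor == 1)
			return (true);	/* debug every allocation */
		/* Otherwise trash/check roughly every N-th item only. */
		return (atomic_fetchadd_long(&dbg_counter, 1) %
		    dbg_divisor == 0);
	}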

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D15199
2018-06-08 00:15:08 +00:00
Jonathan T. Looney
16e05b3275 Fix a typo in vm_domain_set(). When a domain crosses into the severe range,
we need to set the domain's bit in the vm_severe_domains bitset (instead
of clearing it).

Reviewed by:	jeff, markj
Sponsored by:	Netflix, Inc.
2018-06-07 13:29:54 +00:00
Mark Johnston
9f9c9b22ec Reimplement brk() and sbrk() to avoid the use of _end.
Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section.  Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.

Avoid this problem and future interoperability issues by simply not
relying on _end.  Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface.  As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.

PR:		228574
Reviewed by:	kib (previous version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D15663
2018-06-04 19:35:15 +00:00
Mark Johnston
27e29d103f Correct the description of vm_pageout_scan_inactive() after r334508.
Reported by:	alc
2018-06-04 16:46:36 +00:00
Alan Cox
3e7cb27cdd Use a single, consistent approach to returning success versus failure in
vm_map_madvise().  Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases.  Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic.  vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value.  That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.

Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.

Eliminate a redundant error check from kern_madvise().  Add a comment
explaining where the check is performed.

Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value.  Since MADV_FREE is passed as the
behavior, the return value will always be zero.

Reviewed by:	kib, markj
MFC after:	7 days
2018-06-04 16:28:06 +00:00
Justin Hibbits
12f691959f Align UMA data to 128 byte cacheline size
Suggested by:	mjg
2018-06-04 15:44:17 +00:00
Mark Johnston
49a3710c89 Remove the "pass" variable from the page daemon control loop.
It serves little purpose after r308474 and r329882.  As a side
effect, the removal fixes a bug in r329882 which caused the
page daemon to periodically invoke lowmem handlers even in the
absence of memory pressure.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D15491
2018-06-02 00:01:07 +00:00
Konstantin Belousov
633d3b1c71 Only check for MAP_32BIT when available.
Reported by:	mmacy
Sponsored by:	The FreeBSD Foundation
MFC after:	10 days
2018-06-01 23:50:51 +00:00
Alan Cox
60221a5701 Only a small subset of mmap(2)'s flags should be used in combination with
the flag MAP_GUARD.  Rather than enumerating the flags that are not
allowed, enumerate the flags that are allowed.  The list of allowed flags
is much shorter and less likely to change.  (As an aside, one of the
previously enumerated flags, MAP_PREFAULT, was not even a legal flag for
mmap(2).  However, because of an earlier check within kern_mmap(), this
misuse of MAP_PREFAULT was harmless.)
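
A sketch of the allow-list style of check described above (the set of
permitted flags shown is abbreviated and illustrative, not the exact list
used in kern_mmap()):

	/* Flags that may accompany MAP_GUARD; everything else is rejected. */
	#define	GUARD_ALLOWED_FLAGS \
	    (MAP_GUARD | MAP_FIXED | MAP_EXCL | MAP_ALIGNMENT_MASK)

	if ((flags & MAP_GUARD) != 0 &&
	    (flags & ~GUARD_ALLOWED_FLAGS) != 0)
		return (EINVAL);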

Reviewed by:	kib
MFC after:	10 days
2018-06-01 21:37:42 +00:00
Mark Johnston
6939b4d3b4 Typo.
PR:		228533
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	1 week
2018-05-30 16:48:48 +00:00
Alan Cox
6e1e759c56 Addendum to r334233. In vm_fault_populate(), since the page lock is held,
we must use vm_page_xunbusy_maybelocked() rather than vm_page_xunbusy() to
unbusy the page.

Reviewed by:	kib
X-MFC with:	r334233
2018-05-28 16:23:39 +00:00
Alan Cox
fccdefa1a1 Eliminate duplicate assertions. We assert at the start of vm_fault_hold()
that the map entry is wired if the caller passes the flag VM_FAULT_WIRE.
Eliminate the same assertion, but spelled differently, at the end of
vm_fault_hold() and vm_fault_populate().  Repeat the assertion only if the
map is unlocked and the map lookup must be repeated.

Reviewed by:	kib
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D15582
2018-05-28 04:38:10 +00:00
Alan Cox
70183daa80 Use pmap_enter(..., psind=1) in vm_fault_populate() on amd64. While
superpage mappings were already being created by automatic promotion in
vm_fault_populate(), this change reduces the cost of creating those
mappings.  Essentially, one pmap_enter(..., psind=1) call takes the place
of 512 pmap_enter(..., psind=0) calls, and that one pmap_enter(...,
psind=1) call eliminates the allocation of a page table page.
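
For scale: with 4KB base pages and 2MB superpages on amd64, one
pmap_enter(..., psind=1) call covers 2MB / 4KB = 512 base pages, which is
where the 512:1 reduction in pmap_enter() calls comes from, and the single
superpage mapping needs no 4KB-level page table page at all.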

Reviewed by:	kib
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D15572
2018-05-26 02:59:34 +00:00
Brooks Davis
7351a8bdb5 Make vadvise compat freebsd11.
The vadvise syscall (aka ovadvise) is undocumented and has always been
implemented as returning EINVAL.  Put the syscall under COMPAT11 and
provide a userspace implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15557
2018-05-25 20:40:23 +00:00
Alan Cox
d3f8534e99 Eliminate an unused parameter from vm_fault_populate().
Reviewed by:	kib
MFC after:	10 days
2018-05-24 20:43:41 +00:00
Mark Johnston
7bb4634e18 Update r334154 with review feedback from D15490.
An old revision was committed by accident.

Differential Revision:	https://reviews.freebsd.org/D15490
2018-05-24 20:26:37 +00:00
Brooks Davis
758d46cfb0 Don't implement break(2) at all on aarch64 and riscv.
This should have been done when they were removed from libc, but was
overlooked in the runup to 11.0.  No users should exist.

Approved by:	andrew
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15539
2018-05-24 17:04:27 +00:00
Mark Johnston
be37ee791f Split the active and inactive queue scans into separate subroutines.
The scans are largely independent, so this helps make the code
marginally neater, and makes it easier to incorporate feedback from the
active queue scan into the page daemon control loop.

Improve some comments while here.  No functional change intended.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D15490
2018-05-24 14:16:22 +00:00
Mark Johnston
a99ee60b9a Ensure that "m" is initialized in vm_page_alloc_freelist_domain().
While here, remove a superfluous comment.

Coverity CID:	1383559
MFC after:	3 days
2018-05-22 16:19:48 +00:00
Mark Johnston
23d123c6cf Use the canonical check for reservation support. 2018-05-19 23:49:13 +00:00
Mark Johnston
01f04471f4 Don't increment addl_page_shortage for wired pages.
Such pages are dequeued as they're encountered during the inactive queue
scan, so by the time we get to the active queue scan, they should have
already been subtracted from the inactive queue length.

Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D15479
2018-05-18 16:59:58 +00:00
Mark Johnston
ba2b3349e1 Fix a race in vm_page_pagequeue_lockptr().
The value of m->queue must be cached after comparing it with PQ_NONE,
since it may be concurrently changing.

Reported by:	glebius
Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D15462
2018-05-17 04:27:08 +00:00
Matt Macy
73e37d1deb Fix powerpc64 LINT
vm_object_reserve() == true is impossible on power. Make conditional
on VM_LEVEL_0_ORDER being defined.

Reviewed by:	jeff
Approved by:	sbruno
2018-05-17 03:19:31 +00:00
Mark Johnston
36f8fe9bbb Get rid of vm_pageout_page_queued().
vm_page_queue(), added in r333256, generalizes vm_pageout_page_queued(),
so use it instead.  No functional change intended.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D15402
2018-05-13 13:00:59 +00:00
Mateusz Guzik
782e38aa48 uma: increase alignment to 128 bytes on amd64
Current UMA internals are not suited for efficient operation in
multi-socket environments. In particular, there is very common use of
MAXCPU arrays and other fields which are not always properly aligned and
are not local to the target threads (apart from the first node, of course).
It turns out the existing UMA_ALIGN macro can be used to mostly work around
the problem until the code gets fixed. The current setting of 64 bytes
runs into trouble when the adjacent-cache-line prefetcher gets to work.
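
A minimal sketch of the kind of change involved (names are illustrative;
the point is simply padding per-CPU data out to two 64-byte lines so the
adjacent-line prefetcher cannot drag in a neighbour's data):

	#include <sys/param.h>

	/* Two 64-byte lines, to defeat the adjacent-line prefetcher. */
	#define	PCPU_CACHE_ALIGN	128

	struct pcpu_cache {
		uint64_t	allocs;		/* fast-path counters ... */
		uint64_t	frees;
	} __aligned(PCPU_CACHE_ALIGN);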

An example 128-way benchmark doing a lot of malloc/frees has the following
instruction samples:

before:
kernel`lf_advlockasync+0x43b            32940
          kernel`malloc+0xe5            42380
           kernel`bzero+0x19            47798
   kernel`spinlock_exit+0x26            60423
         kernel`0xffffffff80            78238
                         0x0           136947
   kernel`uma_zfree_arg+0x46           159594
 kernel`uma_zalloc_arg+0x672           180556
   kernel`uma_zfree_arg+0x2a           459923
 kernel`uma_zalloc_arg+0x5ec           489910

after:
            kernel`bzero+0xd            46115
kernel`lf_advlockasync+0x25f            46134
kernel`lf_advlockasync+0x38a            49078
   kernel`fget_unlocked+0xd1            49942
kernel`lf_advlockasync+0x43b            55392
          kernel`copyin+0x4a            56963
           kernel`bzero+0x19            81983
   kernel`spinlock_exit+0x26            91889
         kernel`0xffffffff80           136357
                         0x0           239424

See the review for more details.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D15346
2018-05-11 07:04:57 +00:00
Mark Johnston
1b5c869d64 Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
Konstantin Belousov
a7163bb962 Eliminate some vm object relocks in vm fault.
For the vm_fault_prefault() call from vm_fault_soft_fast(), extend the
scope of the object rlock to avoid re-taking it inside
vm_fault_prefault(). This causes pmap_enter_quick() to sometimes be called
with the shadow object lock held as well as the page lock, but this looks
innocent.

Noted and measured by:	mjg
Reviewed by:	alc, markj (as part of the larger patch)
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15122
2018-04-29 12:43:08 +00:00
Mateusz Guzik
e825ab8d89 uma: whack main zone counter update in the slow path
Cached counters are typically zero at this point, so the update performs
avoidable atomics. Everything that reads them also reads the cached
ones, thus there is really no point.

Reviewed by:		jeff
2018-04-27 05:37:35 +00:00
Mateusz Guzik
23e17f83f1 vm: move vm_cnt to __read_mostly now that it is not written to
While here whack unused locking keys for the struct.

Discussed with:		jeff
2018-04-27 05:36:02 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
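
A simplified sketch of the per-CPU batching idea (the structure, sizes and
helper names are illustrative, not the actual sys/vm definitions):

	#include <sys/param.h>
	#include <vm/vm.h>
	#include <vm/vm_page.h>

	/* Per-CPU staging area for deferred page queue operations. */
	struct pq_batch {
		vm_page_t	pages[32];
		int		cnt;
	};

	static void pq_flush(struct pq_batch *bq);	/* takes the queue lock once */

	/*
	 * The requested operation has already been encoded in the page's
	 * aflags under the page lock; here the page is only staged, so the
	 * page queue lock is taken once per full batch instead of per page.
	 */
	static void
	pq_defer(struct pq_batch *bq, vm_page_t m)
	{
		bq->pages[bq->cnt++] = m;
		if (bq->cnt == nitems(bq->pages))
			pq_flush(bq);
	}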

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
Mark Johnston
7e28037a09 Add a UMA zone flag to disable the use of buckets.
This allows the creation of zones which don't do any caching in front of
the keg. If the zone is a cache zone, this means that UMA will not
attempt any memory allocations when allocating an item from the backend.
This is intended for use after a panic by netdump, but likely has other
applications.
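
As an illustration of the intended use (the flag name UMA_ZONE_NOBUCKET is
taken from uma(9) as an assumption here, and the zone parameters are made
up):

	/*
	 * A zone that never caches items in buckets, so allocating from it
	 * never triggers a further memory allocation -- e.g. for use from a
	 * post-panic path such as netdump.
	 */
	zone = uma_zcreate("panic_items", sizeof(struct panic_item),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOBUCKET);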

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15184
2018-04-24 20:05:45 +00:00
Mark Johnston
64b3893010 Initialize marker pages in vm_page_domain_init().
They were previously initialized by the corresponding page daemon
threads, but for vmd_inacthead this may be too late if
vm_page_deactivate_noreuse() is called during boot.

Reported and tested by:	cperciva
Reviewed by:	alc, kib
MFC after:	1 week
2018-04-19 14:09:44 +00:00
Mark Johnston
9de8fcfddf Ensure that m and skip_m belong to the same object.
Pages allocated from a given reservation may belong to different
objects. It is therefore possible for vm_page_ps_test() to be called
with the base page's object unlocked. Check for this case before
asserting that the object lock is held.

Reported by:	jhb
Reviewed by:	kib
MFC after:	1 week
2018-04-17 18:49:17 +00:00
Konstantin Belousov
e55d32b7b3 Handle Skylake-X errata SKZ63.
SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region

Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.

Move the pages from the range into the blacklist.  Add a tunable to
not waste 4M if local DoS is not the issue.
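
For reference, the blacklisted window 40000000H-403FFFFFH spans 0x400000
bytes, i.e. 4MB or 1024 4KB pages, which is the memory the tunable lets an
administrator keep when the local DoS scenario is not a concern.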

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15001
2018-04-07 17:06:13 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more, so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Mark Johnston
c098768e4d Ensure the background laundering threshold is positive after a scan.
The division added in r331732 meant that we wouldn't attempt a
background laundering until at least v_free_target - v_free_min clean
pages had been freed by the page daemon since the last laundering. If
the inactive queue is depleted but not completely empty (e.g., because
it contains busy pages), it can thus take a long time to meet this
threshold. Restore the pre-r331732 behaviour of using a non-zero
background laundering threshold if at least one inactive queue scan has
elapsed since the last attempt at background laundering.

Submitted by:	tijl (original version)
2018-04-02 15:07:41 +00:00
Gleb Smirnoff
b92b26ad08 Use UMA_SLAB_SPACE macro. No functional change here. 2018-04-02 05:15:25 +00:00
Gleb Smirnoff
96a10340ce In uma_startup_count(), handle the special case when a zone would fit into
a single slab, but with the alignment adjustment it won't. Again, when
there is only one item in a slab, alignment can be ignored. See the
previous revision of this file for more info.

PR:		227116
2018-04-02 05:14:31 +00:00
Gleb Smirnoff
1ca6ed4589 Handle a special case when a slab can fit only one allocation
and the zone has a large alignment. With alignment taken into
account, uk_rsize will be greater than the space in a slab. However,
since we have only one item per slab, it is always naturally
aligned.

Code that would panic before this change with a 4k page size:

	z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
	uma_zalloc(z, M_WAITOK);

A practical scenario to hit the panic is a machine with 56 CPUs
and 2 NUMA domains, which yields a zone size of 3984.

PR:		227116
MFC after:	2 weeks
2018-04-02 05:11:59 +00:00
Jeff Roberson
c33e3a642b Add a uma cache of free pages in the DEFAULT freepool. This gives us
per-cpu alloc and free of pages.  The cache is filled with as few trips
to the phys allocator as possible by the use of a new
vm_phys_alloc_npages() function which allocates as many as N pages.
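
A sketch of how an import function might refill the per-CPU cache in one
shot (the vm_phys_alloc_npages() prototype shown is an assumption inferred
from the description above, not a verified signature):

	/*
	 * Assumed: int vm_phys_alloc_npages(int domain, int pool, int npages,
	 *     vm_page_t ma[]), returning how many pages it managed to get.
	 */
	static int
	pcpu_cache_import(vm_page_t *store, int cnt, int domain)
	{
		return (vm_phys_alloc_npages(domain, VM_FREEPOOL_DEFAULT,
		    cnt, store));
	}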

This code was originally by markj with the import function rewritten by
me.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14905
2018-04-01 04:50:05 +00:00
Jeff Roberson
e8bb2dc7c9 Add the flag ZONE_NOBUCKETCACHE. This flag instructs UMA not to keep
a cache of fully populated buckets.  This will be used in a follow-on
commit.

The flag idea was originally from markj.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-04-01 04:47:05 +00:00
Konstantin Belousov
19ea042eb8 Make vm_map_max/min/pmap KBI stable.
There are out-of-tree consumers of vm_map_min() and vm_map_max(), and
I believe there are consumers of vm_map_pmap(), although the latter is
arguably less in need of a KBI-stable interface. For the consumers'
benefit, make modules using this KPI not depend on the struct vm_map
layout.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14902
2018-03-30 10:55:31 +00:00
Mark Johnston
6068486258 Fix the background laundering mechanism after r329882.
Rather than using the number of inactive queue scans as a metric for
how many clean pages are being freed by the page daemon, have the
page daemon keep a running counter of the number of pages it has freed,
and have the laundry thread use that when computing the background
laundering threshold.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14884
2018-03-29 14:27:40 +00:00
Jeff Roberson
e5818a53db Implement several enhancements to NUMA policies.
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width, keeping contiguity within a multi-page
region.
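
For example, with two domains and a four-page stride, pages 0-3 of a
region would come from domain 0, pages 4-7 from domain 1, pages 8-11 from
domain 0 again, and so on, so each four-page run stays physically
contiguous within a single domain.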

Move the kernel to the dedicated numbered cpuset #2, making it possible
to assign kernel threads and memory policy separately from user.  This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets.  Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both.  This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by:	kib
Discussed with:	jhb (cpuset parts)
Tested by:	pho (before review feedback)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14839
2018-03-29 02:54:50 +00:00
Jeff Roberson
146bf2c66d Move vm_ndomains to vm.h where it can be used with a single header include
rather than requiring a half-dozen.  Many non-vm files may want to know
the number of valid domains.

Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-27 03:27:02 +00:00
Konstantin Belousov
8ec533d336 Allow callers to specify for vm_fault_quick_hold_pages() that nofault mode
should be honored.

We must not sleep or acquire any MI VM locks if TDP_NOFAULTING is
specified.  On the other hand, there were some callers in the tree
which set TDP_NOFAULTING for a larger scope than needed.  I fixed the
code which I wrote, but I suspect that linuxkpi and out-of-tree drm
drivers might still abuse this.

So only enable the mode for vm_fault_quick_hold_pages(), where
vm_fault_hold() is not called, when specifically asked by the caller.  I
decided to use a vm_prot_t flag so as not to change the KPI.  Since the
number of flags in vm_prot_t is limited, I reused the same flag which was
already consumed for vm_map_lookup().

Reported and tested by:	pho (as part of the larger patch)
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14825
2018-03-26 16:31:12 +00:00
Konstantin Belousov
ed9e8bc468 Account for the size of the memory vslock-ed by the thread.
Assert that all such memory is unwired on return to usermode.

The count of the wired memory will be used to detect the copyout mode.

Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:51:27 +00:00
Konstantin Belousov
63b5d112b6 In the vm_zone_stats() sysctl handler, do not drain the sbuf by calling
copyout(9) while owning a zone lock.

Even though the sysctl old buffer is wired, spurious faults might still
occur.

Note that we still own the uma_rwlock there, but this lock does not
participate in sensitive lock orders.

Reported and tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:48:53 +00:00
Jeff Roberson
2d3f4181de Fix two compilation problems on non-amd64 architectures. 2018-03-23 18:24:02 +00:00
Mark Johnston
4046851367 Correct a couple of assertion messages in vm_page_reclaim_run().
MFC after:	3 days
2018-03-23 14:38:56 +00:00
Cy Schubert
72346b2232 Fix build on i386 without INVARIANTS following r331369.
--- vm_reserv.o ---
In file included from /opt/src/svn-current/sys/vm/vm_reserv.c:48:
In file included from /opt/src/svn-current/sys/sys/counter.h:37:
./machine/counter.h:174:3: error: implicit declaration of function
'critical_enter' is invalid in C99 [-Werror,-Wimplicit-function-declarat
ion]
                critical_enter();

Reviewed by:	jeff@
2018-03-23 03:22:30 +00:00
Jeff Roberson
5c930c894d Lock reservations with a dedicated lock in each reservation. Protect the
vmd_free_count with atomics.

This allows us to allocate and free from reservations without the free lock
except where a superpage is allocated from the physical layer, which is
roughly 1/512 of the operations on amd64.

Use the counter API to eliminate cache contention on counters.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14707
2018-03-22 19:21:11 +00:00
Jeff Roberson
9a4b4cd3bc Start witness much earlier in boot so that we can shrink the pend list and
make it more immune to further change.

Reviewed by:	markj, imp (Part of D14707)
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-22 19:11:43 +00:00
Jeff Roberson
cdfeced8ff Use read_mostly and alignment tags to eliminate or limit false sharing.
Reviewed by:	markj (Part of D14707)
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-22 19:06:50 +00:00
Konstantin Belousov
79e9552ebb Check for wrap-around in vm_phys_alloc_seg_contig().
It is possible to provide insane values for the size in a contigmalloc(9)
request, which usually do not reach the phys allocator due to the failing
KVA allocation.  But with the forthcoming 4/4 i386, where the 32-bit
architecture has almost 4G of KVA, contigmalloc(1G) is not outright
unreasonable and KVA might be available sometimes.

Then, the calculation of pa_end could wrap around, depending on the
physical address, and the checks in vm_phys_alloc_seg_contig() would
pass while the iteration in the loop after the 'done' label goes out
of the vm_page_array bounds.

Fix it by detecting the wrap.
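
A minimal sketch of the wrap check (the names are illustrative; the real
test lives in vm_phys_alloc_seg_contig()):

	static bool
	range_wraps(vm_paddr_t pa_start, vm_paddr_t size)
	{
		/*
		 * If pa_start + size overflows vm_paddr_t, the computed end
		 * is numerically below the start and the candidate range must
		 * be rejected before walking vm_page_array.
		 */
		return (pa_start + size < pa_start);
	}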

Reported and tested by:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14767
2018-03-20 16:17:55 +00:00
Mark Johnston
c6a70eaea8 Avoid dequeuing the fault page during a soft fault.
Such pages are re-enqueued at the end of the fault handler, preserving
LRU. Rather than performing two separate operations per fault, simply
requeue the page at the end of the fault (or bump its activation count
if it resides in PQ_ACTIVE, avoiding the page queue lock entirely).
This elides some page lock and page queue lock operations in common
cases, e.g., CoW faults.

Note that we must still dequeue the source page for "optimized" CoW
faults since the page may not remain enqueued while it is moved to
another object.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14625
2018-03-18 16:49:30 +00:00
Mark Johnston
0eb50f9cd2 Have vm_page_{deactivate,launder}() requeue already-queued pages.
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:40:56 +00:00
Mark Johnston
434862acb1 Have vm_page_replace() assert that the new page is not enqueued.
The new page does not belong to a VM object, but the page daemon does
not expect to encounter such pages.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 week
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:35:40 +00:00
Conrad Meyer
5d3b36666b Fix GCC build: Remove redundant pagedaemon_wakeup declaration
Introduced in r331018.

Reported by:	kevans
Sponsored by:	Dell EMC Isilon
2018-03-16 07:05:09 +00:00
Jeff Roberson
30fbfdda6c Eliminate pageout wakeup races. Take another step towards lockless
vmd_free_count manipulation.  Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup.  Only trigger
the pageout daemon on transitions between states.  Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14612
2018-03-15 19:23:07 +00:00
Konstantin Belousov
741e1c9196 Revert the chunk from r330410 in vm_page_reclaim_run().
There, the pages freed might be managed but the page's lock is not
owned.  For KPI correctness, the page lock is required around the call
to vm_page_free_prep(), which is asserted.  The reclaim loop already did
the work which could be done by vm_page_free_prep(), so the lock is
not needed and the only consequence of not owning it is the assert
trigger.

Instead of adding the locking to satisfy the assert, revert to the
code that calls vm_page_free_phys() directly.

Reported by:	pho
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-13 18:27:23 +00:00
Jeff Roberson
f4af595964 Don't assert that the domain free lock is held until we're certain that
there is a valid reservation.  This can trip erroneously when memory
falls within a domain but doesn't have the reservation initialized because
it does not meet size or alignment requirements.

Reported by:	pho, mjg
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-07 22:04:27 +00:00
Konstantin Belousov
2a8e8f7892 Remove redundant test from r330410.
If the input slist is non-empty, counter cannot be zero after freeing.

Noted by:	mjg
MFC after:	2 weeks
2018-03-04 21:15:31 +00:00
Konstantin Belousov
8c8ee2ee1c Unify bulk free operations in several pmaps.
Submitted by:	Yoshihiro Ota
Reviewed by:	markj
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13485
2018-03-04 20:53:20 +00:00
Mark Johnston
3b8cf4acf0 Give the 0th domain's page daemon thread a consistent name.
Page daemon threads for other domains show up in ps(1) output as
"pagedaemon/domN", so let that be the case for domain 0 as well.

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D14518
2018-02-27 16:51:09 +00:00
Mark Johnston
59d3150b58 Restore the pre-r329882 inactive page shortage computation.
With r329882, in the absence of a free page shortage we would only take
len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to
aggressively scan PQ_ACTIVE. Previously we would also include the
number of free pages in this computation, ensuring that we wouldn't scan
PQ_ACTIVE with plenty of free memory available. The change in behaviour
was most noticeable immediately after booting, when PQ_INACTIVE and
PQ_LAUNDRY are nearly empty.

Reviewed by:	jeff
2018-02-24 20:47:22 +00:00
Konstantin Belousov
cd84455f91 Hide all vm/vm_pageout.h content under #ifdef _KERNEL.
There are no parts useful for usermode applications in
vm/vm_pageout.h, even for specific applications like fstat and
lsof.

In my opinion, this protection is redundant and instead userspace
should not include the header at all.  Since there are apparently
broken third-party codebases, give them a bit of slack by providing
a transitional period.

Reported by:	julian
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-02-24 10:26:26 +00:00
Mark Johnston
5f70fb1425 Correct some comments after r328954.
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14486
2018-02-23 23:27:53 +00:00
Mark Johnston
9140bff7ed Remove a bogus assertion from vm_page_launder().
After r328977, a wired page m may have m->queue != PQ_NONE.

Reviewed by:	kib
X-MFC with:	r328977
Differential Revision:	https://reviews.freebsd.org/D14485
2018-02-23 23:25:22 +00:00
Jeff Roberson
5f8cd1c0bf Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload.  This is a reimplementation of work done by myself and mlaier at
Isilon.
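
A generic, textbook-style PID step of the kind described (this is a sketch
with placeholder names and gains, not the actual sys/kern implementation):

	struct pid_sketch {
		int	setpoint;	/* e.g. the free page target */
		long	integral;	/* accumulated error */
		int	prev_error;	/* error from the previous sample */
		int	kpd, kid, kdd;	/* gain divisors for P, I and D */
	};

	static int
	pid_step(struct pid_sketch *pc, int current)
	{
		int error, derivative, output;

		error = pc->setpoint - current;
		pc->integral += error;
		derivative = error - pc->prev_error;
		pc->prev_error = error;

		/* How many pages the page daemon should target this pass. */
		output = error / pc->kpd + pc->integral / pc->kid +
		    derivative / pc->kdd;
		return (output > 0 ? output : 0);
	}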

Reviewed by:	bsdimp
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14402
2018-02-23 22:51:51 +00:00
Konstantin Belousov
2c0f13aa59 vm_wait() rework.
Make vm_wait() take a vm_object argument which specifies the domain
set to wait on for the min condition to pass.  If there is no object
associated with the wait, use curthread's policy domainset.  The
mechanics of the wait in vm_wait() and vm_wait_domain() are supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait on for the min condition to pass.
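
In practice the conversion at a call site looks roughly like the sketch
below (vm_wait() and vm_wait_domain() are the interfaces named above; the
helper function and its context are illustrative, with the relevant vm
headers assumed to be included):

	static void
	wait_for_pages(vm_object_t object)
	{
		/*
		 * Before this change a caller would have used the global
		 * VM_WAIT macro; now it waits only on the domain set backing
		 * the object (or curthread's policy domainset when the
		 * object is NULL).
		 */
		vm_wait(object);
	}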

Eliminate pagedaemon_wait().  vm_domain_clear() handles the same
operations.

Eliminate the VM_WAIT and VM_WAITPFAULT macros; the direct function calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Sketched and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
Differential revision:	https://reviews.freebsd.org/D14384
2018-02-20 10:13:13 +00:00
Mark Johnston
3f060b60b1 Use the conventional name for an array of pages.
No functional change intended.

Discussed with:	kib
MFC after:	3 days
2018-02-16 15:38:22 +00:00
Konstantin Belousov
ada27a3bb8 Cleanup unused page argument for vm_reserv_break().
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14364
2018-02-14 00:34:02 +00:00
Konstantin Belousov
d929ad7f91 Ensure memory consistency on COW.
From the submitter's description:
The process is forked transitioning a map entry to COW
Thread A writes to a page on the map entry, faults, updates the pmap to
  writable at a new phys addr, and starts TLB invalidations...
Thread B acquires a lock, writes to a location on the new phys addr, and
  releases the lock
Thread C acquires the lock, reads from the location on the old phys addr...
Thread A ...continues the TLB invalidations which are completed
Thread C ...reads from the location on the new phys addr, and releases
  the lock

In this example Threads B and C [lock, use and unlock] properly, and
neither owns the lock at the same time.  Thread A was writing somewhere
else on the page and so never had/needed the lock.  Thread C sees a
location (one that is only ever read|modified under a lock) change
beneath it while it is the lock owner.

To fix this, perform the two-stage update of the copied PTE.  First,
the PTE is updated with the address of the new physical page with
copied content, but in read-only mode.  The pmap locking and the page
busy state during PTE update and TLB invalidation IPIs ensure that any
writer to the page cannot upgrade the PTE to the writable state until
all CPUs updated their TLB to not cache old mapping.  Then, after the
busy state of the page is lifted, the faults for write can proceed and
do not violate the consistency of the reads.
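
Schematically, the two-stage update described above looks like this (pure
illustration; the real logic is spread across vm_fault() and the pmap
layer, and the local names are made up):

	/*
	 * Stage 1: while the copied page is still busy, install the new
	 * physical address read-only.  The TLB shootdown completes while no
	 * CPU can yet write through the new mapping.
	 */
	pmap_enter(map->pmap, vaddr, new_m, prot & ~VM_PROT_WRITE, 0, 0);

	/*
	 * Stage 2: only after the page's busy state is released may a later
	 * write fault upgrade the mapping to writable, so no CPU can observe
	 * a stale copy once writes begin.
	 */
	vm_page_xunbusy(new_m);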

The change is done in vm_fault because most architectures do need IPIs
to invalidate remote TLBs.  Moreover, I think that hardware guarantees of
atomicity of the remote TLB invalidation are not enough to prevent
inconsistent non-atomic reads, like multi-word accesses
protected by a lock.  So instead of modifying each pmap's invalidation
code, I did it there.

Discovered and analyzed by: Elliott.Rabe@dell.com
Reviewed by:	markj
PR:	225584 (appeared to have the same cause)
Tested by:	Elliott.Rabe@dell.com, emaste, Mike Tancsa <mike@sentex.net>, truckman
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14347
2018-02-14 00:31:45 +00:00
Konstantin Belousov
607970bc8e Do not call pmap_enter() with invalid protection mode.
If the map entry re-lookup was performed due to the mapping changes, we
need to ensure that there is still some access permission bit
requested which is compatible with the current vm_map_entry mode.  If
not, restart the handler from scratch instead of trying to save the
current progress.

Also adjust fault_type to not include cleared permission bits.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14347
2018-02-14 00:25:18 +00:00