1. Change size_t to vm_size_t in some places.
2. Rename vm_map_entry_resize_free to drop the _free part.
3. Fix whitespace errors.
4. Fix screwups in patch-conflict-management that left out important
changes related to growing and shrinking objects.
Reviewed by: alc
Approved by: kib (mentor)
amount of resizing reduces the number of functions changing the vm_map
invariants regarding the max_free field of map entries.
Reviewed by: markj (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20356
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and
headers no longer pick up definitions they need. The remainder of the
patch adds the appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
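The underlying pattern is the usual forward-declaration split; a
minimal illustrative sketch of the idea (the struct, typedef, and
function names here are hypothetical, not the actual header contents):

    /* sys/_eventfilter.h: type names only; safe for headers to include. */
    #ifndef _SYS__EVENTFILTER_H_
    #define _SYS__EVENTFILTER_H_
    struct eventfilter_ops;                 /* forward declaration only */
    typedef void (*eventfilter_fn_t)(void *arg);
    #endif

    /* sys/eventfilter.h: the full API; included only by consumers. */
    #ifndef _SYS_EVENTFILTER_H_
    #define _SYS_EVENTFILTER_H_
    #include <sys/_eventfilter.h>
    int eventfilter_register(struct eventfilter_ops *ops, eventfilter_fn_t fn);
    #endif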
memguard(9) wants to avoid reuse of freed addresses for as long as
possible. Previously it maintained a racily updated cursor which was
passed to vmem_xalloc(9) as the minimum address. However, vmem will
not in general return the lowest free address in the arena, so this
trick only really works until the cursor has wrapped around the first
time.
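A minimal sketch of the scheme being replaced, assuming a cursor
variable and a vmem arena (paraphrased, not the verbatim memguard
code):

    static vm_offset_t memguard_cursor;     /* racily updated */

    static vm_offset_t
    memguard_alloc_sketch(vmem_t *arena, vm_size_t size)
    {
            vmem_addr_t addr;

            /* Ask vmem for space at or above the cursor... */
            if (vmem_xalloc(arena, size, 0, 0, 0, memguard_cursor,
                VMEM_ADDR_MAX, M_BESTFIT | M_NOWAIT, &addr) != 0)
                    return (0);
            /*
             * ...but vmem need not return the lowest free address
             * above minaddr, so once the cursor wraps, the oldest
             * freed address is no longer what actually gets reused.
             */
            memguard_cursor = addr + size;
            return ((vm_offset_t)addr);
    }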
Reported by: alc
Reviewed by: alc
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D17227
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.
The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit, which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
short of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.
The choice to count virtual user-wired pages rather than physical
pages was made for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.
The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.
Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.
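A sketch of the accounting this implies; the helper name and variables
here are illustrative, not the committed interface:

    static u_long vm_user_wire_count;   /* pages wired via mlock(2) */
    static u_long vm_max_user_wired;    /* vm.max_user_wired */

    static int
    user_wire_add_sketch(u_long npages)
    {
            /* Optimistically add, then back out if over the limit. */
            if (atomic_fetchadd_long(&vm_user_wire_count, npages) +
                npages > vm_max_user_wired) {
                    atomic_subtract_long(&vm_user_wire_count, npages);
                    return (ENOMEM);    /* e.g. mmap() under MCL_FUTURE */
            }
            return (0);
    }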
Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated is as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.
I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
requested, or none, and in the latter case it is up to them to pick a
smaller request to make - which they always do by halving the failed
request. This change to swp_pager_getswapspace leaves the task of
downsizing the request to the function and not its caller. It still
does so by halving the original request.
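Combined with the range allocation above, the retry logic is roughly
the following (a sketch; swap_alloc_range() is a hypothetical stand-in
for the underlying blist allocation):

    static daddr_t
    getswapspace_sketch(int *io_npages)
    {
            daddr_t blk;
            int npages;

            npages = *io_npages;
            /* First try for exactly npages blocks... */
            blk = swap_alloc_range(npages, npages, io_npages);
            /* ...then any run in [npages/2, npages - 1], and so on. */
            while (blk == SWAPBLK_NONE && npages > 1) {
                    blk = swap_alloc_range(npages / 2, npages - 1,
                        io_npages);
                    npages /= 2;
            }
            return (blk);   /* *io_npages holds the run length obtained */
    }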
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20228
kern_execve() locks the text vnode exclusively to be able to set and
clear the VV_TEXT flag. VV_TEXT is mutually exclusive with the
v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as the MAP_ENTRY_VN_TEXT flag,
and v_writecount is incremented on map entry removal.
Operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writechk()
is now racy and its use was eliminated everywhere except access.
The atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around the VOP_SETATTR() call; the previous lack of this is arguably
a bug on its own.
nullfs always bypasses v_writecount changes to the lower vnode, so a
nullfs vnode has its own v_writecount correct, and the lower vnode
gets all references, since object->handle is always the lower vnode.
On the text vnode's vm object deallocation, the v_writecount value is
reset to zero, and the deadfs vop_unset_text short-circuits the
operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
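A sketch of the invariant the VOPs enforce, assuming the scheme above
(the real checks live in the VOP implementations and differ in
detail): v_writecount > 0 counts writers, v_writecount < 0 counts text
references, and the two may never mix.

    static int
    vop_set_text_sketch(struct vnode *vp)
    {
            int error = 0;

            VI_LOCK(vp);
            if (vp->v_writecount > 0)       /* opened for write: refuse */
                    error = ETXTBSY;
            else
                    vp->v_writecount--;     /* one more text reference */
            VI_UNLOCK(vp);
            return (error);
    }

    static int
    vop_add_writecount_sketch(struct vnode *vp, int inc)
    {
            int error = 0;

            VI_LOCK(vp);
            if (vp->v_writecount < 0 && inc > 0)    /* text: no writers */
                    error = ETXTBSY;
            else
                    vp->v_writecount += inc;
            VI_UNLOCK(vp);
            return (error);
    }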
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
NOSPLIT swap objects are not anonymous; they are used by tmpfs regular
files and POSIX shared memory. For such objects, collapse is not
permitted.
Reported by: mjg
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19923
linear search can, so use it to avoid a linear search in isqrt.
Approved by: kib (mentor), markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20102
vm_map_wire() increments entry->wire_count; after that it drops the
map lock both for faulting in the entry's pages and for marking the
next entry in the requested region as IN_TRANSITION. Only after all
entries are faulted in is the MAP_ENTRY_USER_WIRED flag set.
This makes it possible for vm_map_protect() to run while another
entry's MAP_ENTRY_IN_TRANSITION flag is handled, and the vm_map_busy()
lock does not prevent it. In particular, if the call to
vm_map_protect() adds VM_PROT_WRITE to a CoW entry, it would fail to
call vm_fault_copy_entry(). There are at least two consequences of
the race: first, the top object in the shadow chain is not populated
with writeable pages, and second, the entry eventually gets the
contradictory flags MAP_ENTRY_NEEDS_COPY | MAP_ENTRY_USER_WIRED with
VM_PROT_WRITE set.
Handle it by waiting for all MAP_ENTRY_IN_TRANSITION flags to go away
in vm_map_protect(), which does not drop the map lock afterwards.
Note that vm_map_busy_wait() is left as is.
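In outline, the fix looks like this (a paraphrase; the entry lookup
after the sleep and other bookkeeping are elided):

    /* Sketch: start of vm_map_protect(), map lock held throughout. */
    again:
            for (entry = first_entry; entry->start < end;
                entry = entry->next) {
                    if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0) {
                            /*
                             * A wiring is in flight; sleep until it
                             * finishes, then rescan, since the sleep
                             * drops the map lock.
                             */
                            entry->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
                            vm_map_unlock_and_wait(map, 0);
                            vm_map_lock(map);
                            goto again;
                    }
            }
            /* From here on the lock is never dropped. */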
Reported and tested by: pho (previous version)
Reviewed by: Doug Moore <dougm@rice.edu>, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20091
The checks are too expensive for a general-purpose kernel. Enable the
checks when DIAGNOSTIC is defined and provide a sysctl to enable the
checks in a non-DIAGNOSTIC INVARIANTS kernel.
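The shape of the gate is roughly as follows (a sketch; compare the
committed version in vm_map.c):

    #ifdef DIAGNOSTIC
    static int enable_vmmap_check = 1;      /* checks on by default */
    #else
    static int enable_vmmap_check = 0;      /* INVARIANTS-only: opt in */
    #endif
    SYSCTL_INT(_debug, OID_AUTO, vmmap_check, CTLFLAG_RWTUN,
        &enable_vmmap_check, 0, "Enable vm map consistency checking");

    #define VM_MAP_ASSERT_CONSISTENT(map) do {              \
            if (enable_vmmap_check)                         \
                    _vm_map_assert_consistent(map);         \
    } while (0)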
Reviewed by: kib
Discussed with: Doug Moore <dougm@rice.edu>
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19999
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
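The refined invariant can be restated as a bottom-up recurrence; a
sketch, where own_gap is a hypothetical parameter for the size of the
gap adjacent to p itself:

    static vm_size_t
    vm_map_entry_max_free_sketch(vm_map_entry_t p, vm_size_t own_gap)
    {
            vm_size_t max_free;

            /* Largest gap with an endpoint in the subtree rooted at p. */
            max_free = own_gap;
            if (p->left != NULL && p->left->max_free > max_free)
                    max_free = p->left->max_free;
            if (p->right != NULL && p->right->max_free > max_free)
                    max_free = p->right->max_free;
            return (max_free);
    }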
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
Otherwise the resulting address from vm_map_find() might not satisfy
the upper limit. For instance, it could affect the MAP_32BIT flag for
64-bit processes.
Found by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, Doug Moore <dougm@rice.edu>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19688
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped. If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages. However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails. The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.
For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.
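In vm_fault() terms, the guard amounts to something like this (a
paraphrase, not the committed hunk):

    /*
     * Sketch: when picking a page size index for pmap_enter(), never
     * create a wired superpage mapping directly; promotion remains
     * the only path to a wired superpage.
     */
    if (wired && psind > 0)
            psind = 0;              /* fall back to a base-page mapping */
    rv = pmap_enter(map->pmap, vaddr, m, prot,
        fault_type | (wired ? PMAP_ENTER_WIRED : 0), psind);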
Reviewed by: kib
Reported by: syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19670
Either msync(MS_INVALIDATE) or the object unlock during vnode
truncation can expose invalid pages backing wired entries. Accept
them, but do not install them into the destination pmap. We must create
copied pages in the copy case, because e.g. vm_object_unwire() expects
that the entry is fully backed.
Reported by: syzkaller, via emaste
Reported by: syzbot+514d40ce757a3f8b15bc@syzkaller.appspotmail.com
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19615
On platforms without a direct map (i.e., platforms without
UMA_MD_SMALL_ALLOC defined), the boundary tag allocator reserves a
number of tags for use when allocating a new slab of boundary tags,
as such platforms require free boundary tags in order to allocate
boundary tags. r327899 increased the number of boundary tags required
for a KVA allocation in the worst case, and the aforementioned
reservation was not updated accordingly. In some cases, this could
lead to a system hang. Fix the problem by increasing this reservation.
Also reduce KVA_QUANTUM on systems lacking superpage support.
The previous import quantum (4MB with a 4KB page size) was quite large
for systems with limited KVA, and fragmentation in kernel_arena could
cause kernel memory allocation failures even with a substantial amount
of free KVA.
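The superpage-aware quantum choice looks roughly like this
(paraphrased):

    #if VM_NRESERVLEVEL > 0
    /* Superpage support: import KVA in superpage-sized chunks. */
    #define KVA_QUANTUM_SHIFT   (VM_LEVEL_0_ORDER + PAGE_SHIFT)
    #else
    /* No superpages: a smaller quantum wastes less scarce KVA. */
    #define KVA_QUANTUM_SHIFT   (8 + PAGE_SHIFT)
    #endif
    #define KVA_QUANTUM         (1 << KVA_QUANTUM_SHIFT)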
Reported and tested by: jhibbits
Reviewed by: alc, kib
No objections: jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19337
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
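For reference, PKRU is read and written with unprivileged
instructions; the accessors are small inline-asm wrappers along these
lines:

    static __inline uint32_t
    rdpkru(void)
    {
            uint32_t res;

            /* ECX must be zero; EDX is clobbered, per the SDM. */
            __asm __volatile("rdpkru" : "=a" (res) : "c" (0) : "edx");
            return (res);
    }

    static __inline void
    wrpkru(uint32_t mask)
    {

            __asm __volatile("wrpkru" : : "a" (mask), "c" (0), "d" (0));
    }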
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
Use the object pointer itself to determine whether the object is locked.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19215
back to the level before r343030. For 64-bit machines reduce it slightly,
too. Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.
Provide a loader tunable to change vnode pager pbufs count. Document it.
Make the clustering-enable knob more fine-grained by providing a
setting where an allocation with a hint is not clustered. This is
aimed at being somewhat more compatible with e.g. go 1.4, which
expects that a hinted mmap without MAP_FIXED does not change the
allocation address. Now vm.cluster_anon can be set to 1 to cluster
only when there are no hints, and to 2 to always cluster. The default
value is 1.
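The knob's effect reduces to a small predicate (a sketch; the
committed helper in vm_map.c is similar):

    static bool
    clustering_anon_allowed_sketch(vm_offset_t addr)
    {
            switch (cluster_anon) {
            case 0:
                    return (false);         /* never cluster */
            case 1:
                    return (addr == 0);     /* cluster only without a hint */
            default:
                    return (true);          /* always cluster */
            }
    }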
Requested by: peter
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19194
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.
The changes are largely modelled after amd64. arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page. RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED. Instead, pmap_remove_l2() always invalidates the entire
2MB address range.
pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits. Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.
Reviewed by: kib (earlier version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18863
Differential Revision: https://reviews.freebsd.org/D18864
Differential Revision: https://reviews.freebsd.org/D18865
Differential Revision: https://reviews.freebsd.org/D18866
Differential Revision: https://reviews.freebsd.org/D18867
Differential Revision: https://reviews.freebsd.org/D18868
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.
While here, also unsign uh_hashmask; it makes little sense to keep
that signed.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148
i386 is the only architecture where uint64_t does not require 8-byte
alignment, which makes the struct xswdev layout incompatible between
64-bit architectures and i386.
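The incompatibility is mechanical: dev_t is 64-bit, and i386 aligns
64-bit integers to 4 bytes. An illustration with a look-alike struct
(the field names mimic the real ones; this is not the actual header):

    #include <stddef.h>
    #include <stdint.h>

    struct xswdev_like {
            uint32_t xsw_version;
            uint64_t xsw_dev;       /* offset 8 on LP64, 4 on i386 */
            int      xsw_flags;
            int      xsw_nblks;
            int      xsw_used;
    };
    /* Holds on LP64; fails if built for i386 (-m32). */
    _Static_assert(offsetof(struct xswdev_like, xsw_dev) == 8,
        "i386 aligns uint64_t to 4 bytes");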
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
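Schematically, the injection is (a sketch; aslr_pages_rnd is the knob
named above, the rest is paraphrased):

    static vm_offset_t
    aslr_pick_base_sketch(vm_offset_t start)
    {
            vm_offset_t addr;

            /* Shift the search start by a random number of pages. */
            addr = start + ptoa(arc4random() % aslr_pages_rnd);
            /*
             * If no sufficiently large free gap follows addr
             * (fragmentation), the caller falls back to a plain
             * first-fit search from 'start'.
             */
            return (addr);
    }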
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing room on
architectures with small address spaces, such as i386. This is tied
to the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on a per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow mmap(2) to use the sbrk area;
- vm.cluster_anon - enables anon mapping clustering.
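From userland, the per-process control is exercised like this (a
usage example; error handling kept minimal):

    #include <sys/procctl.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <unistd.h>

    int
    main(void)
    {
            int data;

            /* Force ASLR off for this process and future children. */
            data = PROC_ASLR_FORCE_DISABLE;
            if (procctl(P_PID, getpid(), PROC_ASLR_CTL, &data) != 0)
                    err(1, "PROC_ASLR_CTL");
            if (procctl(P_PID, getpid(), PROC_ASLR_STATUS, &data) != 0)
                    err(1, "PROC_ASLR_STATUS");
            return (0);
    }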
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
It is currently re-declared in sys/sysent.h, which is the wrong place
for an MD variable and causes a redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are both included.
Remove it from sys/sysent.h and instead include machine/md_var.h where
needed, under #ifdef for both i386 and amd64.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.
No functional change intended.
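Schematically (a sketch, not the committed function):

    /*
     * Sketch: in vm_page_free_prep() no other references to the page
     * exist, so PGA_DEQUEUE may be set without the page lock.
     */
    static void
    page_dequeue_deferred_free_sketch(vm_page_t m)
    {
            if (m->queue == PQ_NONE)
                    return;
            /* The page daemon completes the dequeue later. */
            vm_page_aflag_set(m, PGA_DEQUEUE);
    }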
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order. Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued. Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.
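The helper that encapsulates the ordered read looks roughly like this
(a paraphrase of vm_page_queue()):

    static inline uint8_t
    page_queue_sketch(vm_page_t m)
    {
            /*
             * PGA_DEQUEUE must be checked before "queue" is read,
             * with an acquire fence in between; otherwise a racing
             * vm_page_dequeue_complete() can be missed.
             */
            if ((m->aflags & PGA_DEQUEUE) != 0)
                    return (PQ_NONE);
            atomic_thread_fence_acq();
            return (m->queue);
    }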
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19039
512GB of ZFS ABD ARC means an abd_chunk zone of 128M 4KB items. To
manage them UMA tries to allocate a 2GB hash table, whose size does not
fit into an int variable, causing a later allocation failure that makes
the ARC shrink back below 512GB, never letting it use more RAM. With
this change I easily reached a >700GB ARC size on a 768GB RAM machine.
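The arithmetic in miniature (a userland illustration with made-up
byte counts, not the UMA code):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint64_t items = 128ULL * 1024 * 1024;  /* 128M 4KB chunks */
            int bad = items * 16;           /* hash bytes held in an int */
            uint64_t good = items * 16;     /* the same, kept 64-bit */

            /* 2GB does not fit in an int: 'bad' wraps. */
            printf("int: %d  uint64_t: %ju\n", bad, (uintmax_t)good);
            return (0);
    }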
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Then bucket_alloc() also selects the bucket size based on uz_count.
However, since the zone lock is dropped, uz_count may decrease. In
that case max may be greater than ub_entries, and that would result in
writing beyond the end of the allocation.
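The shape of the fix is to clamp against the bucket actually obtained
(a sketch of the pattern, not the committed hunk):

    bucket = bucket_alloc(zone, udata, M_NOWAIT); /* zone lock dropped */
    if (bucket == NULL)
            return (NULL);
    /*
     * uz_count may have shrunk while the lock was dropped; never copy
     * more items than the bucket we actually got can hold.
     */
    max = MIN(max, bucket->ub_entries);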
Reported by: pho
The iterator should be reinitialized after every successful slab
allocation. A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]
Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.
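Schematically, the corrected loop is (a sketch; the iterator helpers
and the keg_alloc_slab() signature here are simplified, hypothetical
stand-ins for the domainset iterator API):

    for (;;) {
            slab = keg_alloc_slab(keg, zone, domain, flags);
            if (slab != NULL) {
                    /* Success: reset the iterator, do not advance it. */
                    keg_domain_iter_init(&di, keg, &domain);
                    if (--slabs_wanted == 0)
                            break;
                    continue;
            }
            /* Only a failed allocation advances to the next domain. */
            if (keg_domain_iter_next(&di, &domain) != 0)
                    break;  /* all domains exhausted; may sleep */
    }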
Reported and tested by: slavah
Diagnosed by: kib [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
In order to allow a single kernel to use PAE pagetables on i386 if the
hardware supports it, and fall back to classic two-level paging
structures if not, the superpage code should be able to adapt to either
a 2M or 4M superpage size. To that end, make the MI VM structures
large enough to track the biggest possible superpage, by allowing the
architecture to define VM_NFREEORDER_MAX and VM_LEVEL_0_ORDER_MAX
constants. The corresponding VM_NFREEORDER and VM_LEVEL_0_ORDER
symbols can then be defined as runtime values and must be less than the
_MAX constants. If an architecture does not define the _MAXs, it is
assumed that _MAX == the normal constant.
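Architectures that define no _MAX simply inherit the fixed constants;
the pattern is:

    /* MI fallback: without an MD _MAX, the constant is compile-time. */
    #ifndef VM_NFREEORDER_MAX
    #define VM_NFREEORDER_MAX       VM_NFREEORDER
    #endif
    #ifndef VM_LEVEL_0_ORDER_MAX
    #define VM_LEVEL_0_ORDER_MAX    VM_LEVEL_0_ORDER
    #endif
    /* MI arrays are then sized with the _MAX bound, e.g.:
     * struct vm_freelist fl[VM_NFREEORDER_MAX]; */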
Reviewed by: markj
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18853