freebsd-dev

Author	SHA1	Message	Date
Will Andrews	8311a2b8a4	Add vm.panic_on_oom sysctl, which enables those who would rather panic than kill a process, when the system runs out of memory. Defaults to off. Usually, this is most useful when the OOM condition is due to mismanagement of memory, on a system where the applications in question don't respond well to being killed. In theory, if the system is properly managed, it shouldn't be possible to hit this condition. If it does, the panic can be more desirable for some users (since it can be a good means of finding the root cause) rather than killing the largest process and continuing on its merry way. As kib@ mentions in the differential, there is also protect(1), which uses procctl(PROC_SPROTECT) to ensure that some processes are immune. However, a panic approach is still useful in some environments. This is primarily intended as a development/debugging tool. Differential Revision: D1627 Reviewed by: kib MFC after: 1 week	2015-01-24 17:32:45 +00:00
Ryan Stone	423521aa33	vmspace_release() may sleep if the last reference is being released, so add a WITNESS_WARN() to catch cases where it is called with a non-sleepable lock held. MFC after: 1 month Sponsored by: Sandvine Inc.	2015-01-24 16:59:38 +00:00
Konstantin Belousov	71943c3d35	Avoid calling vmspace_free() while owning the process lock. Freeing of a vm space may require obtaining sleepable locks. Hold the process to keep the pointer valid, and change trylock to lock, since there are no longer two process locks owned simultaneously in vm_pageout_oom(). Note that after the process lock is dropped, the process might exec and no longer qualify as the owner of the biggest vm space. In collaboration with: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-24 15:33:42 +00:00
Alan Cox	5268042bbd	Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. 
Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measurable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks	2015-01-16 18:17:09 +00:00
Konstantin Belousov	18cc2ff047	Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem does not access kernel mappings directly. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 08:58:07 +00:00
Alan Cox	67c44fa359	Eliminate a stale debug message. The per-CPU cache locks were replaced by critical sections in r145686. PR: 193254 Submitted by: luke.tw@gmail.com MFC after: 3 days	2014-12-31 17:44:57 +00:00
Alan Cox	d866a563d4	The physical memory allocator supports the use of distinct free lists for managing pages from different address ranges. Generally speaking, this feature is used to increase the likelihood that physical pages are available that can meet special DMA requirements or can be accessed through a limited-coverage direct mapping (e.g., MIPS). However, prior to this change, the configuration of the free lists was static, i.e., it was determined at compile time. Consequently, free lists could be created for address ranges that held no actual pages, for example, on 32-bit MIPS-based systems with 512 MB or less of physical memory. This change makes the creation of the free lists dynamic, i.e., it is based on the available physical memory at boot time. On 64-bit x86-based systems with 64 GB or more of physical memory, create free lists for managing pages with physical addresses below 4 GB. This change is to address reported problems with initializing devices that require the allocation of physical pages below 4 GB on some systems with 128 GB or more of physical memory. PR: 185727 Differential Revision: https://reviews.freebsd.org/D1274 Reviewed by: jhb, kib MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-12-31 00:54:38 +00:00
Gleb Smirnoff	e3ed82bcf7	Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and allows the function to fail. Reviewed by: kib, alc Sponsored by: Nginx, Inc.	2014-12-22 09:02:21 +00:00
Gleb Smirnoff	6ee80f259c	Do not clear flag that vm_page_alloc() doesn't support. Submitted by: kib	2014-12-22 09:00:47 +00:00
Gleb Smirnoff	89fc8bdbb6	Document flags of vm_page allocation functions. Reviewed by: alc	2014-12-22 08:59:44 +00:00
John Baldwin	01ca58b23c	Always ignore the deprecated MAP_RENAME and MAP_NORESERVE flags to mmap(). Some old libraries may be used even with newer binaries (specifically the Nvidia driver libraries). Differential Revision: https://reviews.freebsd.org/D1262 Reviewed by: kib	2014-12-05 15:24:42 +00:00
Konstantin Belousov	30d57414a0	When the last reference on the vnode's vm object is dropped, read the vp->v_vflag without taking vnode lock and without bypass. We do know that vp is the lowest level in the stack, since the pointer is obtained from the object's handle. Stale VV_TEXT flag read can only happen if parallel execve() is performed and has not yet activated the image, since process takes reference for text mapping. In this case, the execve() code manages the VV_TEXT flag on its own already. It was observed that otherwise read-only sendfile(2) requires exclusive vnode lock and contending on it on some loads for VV_TEXT handling. Reported by: glebius, scottl Tested by: glebius, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-05 15:02:30 +00:00
Konstantin Belousov	95c4bf756a	Provide mutual exclusion between zone allocation/destruction and uma_reclaim(). Reclamation code must not see half-constructed or destructed zones. Do this by bracing uma_zcreate() and uma_zdestroy() into a shared-locked sx, and take the sx exclusively in uma_reclaim(). Usually zones are not created/destroyed during the system operation, but tmpfs mounts do cause zone operations and exposed the bug. Another solution could be to only expose a new keg on uma_kegs list after the corresponding zone is fully constructed, and similar treatment for the destruction. But it probably requires more risky code rearrangement as well. Reported and tested by: pho Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-11-30 20:20:55 +00:00
Gleb Smirnoff	1bb5ad634e	We already have "int i" in this scope. Submitted by: alc	2014-11-24 07:57:20 +00:00
Gleb Smirnoff	d932810143	\n at end of panicstr is redundant. Submitted by: alc	2014-11-23 18:32:21 +00:00
Gleb Smirnoff	90effb2341	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
Alan Cox	09e5f3c4b8	By the time that vm_reserv_init() runs, vm_phys_segs[] is initialized. Use it instead of phys_avail[]. Discussed with: Svatopluk Kraus	2014-11-22 17:46:30 +00:00
Gleb Smirnoff	79f0deb938	Use __func__ in KASSERTs, since the code is about to be moved to other place. Sponsored by: Nginx, Inc.	2014-11-19 16:29:39 +00:00
Gleb Smirnoff	2a5eef69a6	In vnode_pager_generic_getpages() vp->v_mount is dereferenced in the beginning, thus can't be NULL. Sponsored by: Nginx, Inc.	2014-11-19 15:17:19 +00:00
Gleb Smirnoff	e122dfc1ce	Collapse three contiguous comment blocks into one. Remove historical note about wrong assumptions 20 years ago. Use proper casing. Sponsored by: Nginx, Inc.	2014-11-18 13:38:07 +00:00
Alan Cox	271f0f1219	Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and i386 because vm_page_startup() would not create vm_page structures for the kernel page table pages allocated during pmap_bootstrap() but those vm_page structures are needed when the kernel attempts to promote the corresponding kernel virtual addresses to superpage mappings. To address this problem, a new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is updated to reflect the creation of vm_phys_seg structures by calls to vm_phys_add_seg(). Discussed with: Svatopluk Kraus MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-11-15 23:40:44 +00:00
Gleb Smirnoff	e1a4454553	Even better indent struct pagerops.	2014-11-14 18:15:35 +00:00
Gleb Smirnoff	5536922ec0	Constantly indent struct pagerops.	2014-11-14 18:00:00 +00:00
Konstantin Belousov	e065e87c1e	Fix mis-spelling of bits and types names in the default_pager_putpages() and swap_pager_putpages(). It is the same fix as was done for vnode_pager_putpages() in r271586. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-04 19:56:04 +00:00
Alan Cox	5e929009d2	Eliminate a stale, i386-specific comment.	2014-11-04 18:52:59 +00:00
Mark Murray	10cb24248a	This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random. This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware source any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources. The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people. The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors, now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway. Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to. My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise. My Nomex pants are on. Let the feedback commence! Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?) Approved by: so(des)	2014-10-30 21:21:53 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
John Baldwin	5817298f31	Retire the unimplemented MAP_RENAME and MAP_NORESERVE flags to mmap(2). Older binaries are still permitted to use these flags. PR: 193961 (exp-run in ports) Differential Revision: https://reviews.freebsd.org/D848 Reviewed by: kib	2014-10-18 12:28:51 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Konstantin Belousov	a36f55322c	Make MAP_NOSYNC handling in the vm_fault() read-locked object path compatible with write-locked path. Test for MAP_ENTRY_NOSYNC and set VPO_NOSYNC for pages with dirty mask zero (this does not exclude a possibility that the page is dirty, e.g. due to read fault on writeable mapping and consequent write; the same issue exists in the slow path). Use helper vm_fault_dirty() to unify fast and slow path handling of VPO_NOSYNC and setting the dirty mask. Reviewed by: alc Sponsored by: The FreeBSD Foundation	2014-10-10 19:27:36 +00:00
Bryan Venteicher	111fbcd5ed	Change the UMA mutex into a rwlock Acquire the lock in read mode when just needed to ensure the stability of the keg list. The UMA lock may be held for a long time (relatively speaking) in uma_reclaim() on machines with lots of zones/kegs. If the uma_timeout() would fire during that period, subsequent callouts on that CPU may be significantly delayed. Reviewed by: jhb	2014-10-05 21:34:56 +00:00
Bryan Venteicher	6e5254e0d7	Remove stray uma_mtx lock/unlock in zone_drain_wait() Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not) the uma_mtx, but we would attempt to unlock and relock the mutex if we had to sleep because the zone was already draining. The M_NOWAIT callers may hold the uma_mtx, but we do not sleep in that case. Reviewed by: jhb MFC after: 3 days	2014-10-05 03:18:30 +00:00
Konstantin Belousov	b76278407d	Add kernel option KSTACK_USAGE_PROF to sample the stack depth on interrupts and report the largest value seen as sysctl debug.max_kstack_used. Useful to estimate how close the kernel stack size is to overflow. In collaboration with: Larry Baird <lab@gta.com> Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2014-10-04 18:38:14 +00:00
Steven Hartland	14a0d74ea8	Refactor ZFS ARC reclaim checks and limits Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. This restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte and page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay	2014-10-03 20:34:55 +00:00
Steven Hartland	f721133eb9	Fix ticks wrap issue of lowmem test in vm_pageout_scan Reviewed by: jhb (D818) MFC after: 3 days Sponsored by: Multiplay	2014-09-24 14:35:08 +00:00
Konstantin Belousov	54432196db	vm_map_pmap_enter() and pmap_enter_object() are currently not aware of the wired attribute of the mapping. As result, some pmap implementations clear the wired state of the page table entries, which breaks invariants and allows the entries to be lost. Avoid calling vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the pages must be already mapped. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-09-23 18:54:23 +00:00
Konstantin Belousov	10204535af	The vm_mmap_cdev() explicitly converts absence of both MAP_SHARED and MAP_PRIVATE flags to MAP_SHARED. Apparently, some code in tree, in particular, libgeom, relied on this behaviour, see r271721. For regular file types, the absence of the flags is interpreted as MAP_PRIVATE, and libc nlist used this (fixed in r271723). Allow the implicit flags for legacy binaries. Bump __FreeBSD_version to get the ABI note on new binaries to check for in mmap code. Remove the test for presence of one of the MAP_ANON, MAP_SHARED or MAP_PRIVATE flags before fget_mmap(). For MAP_ANON, we already verify that passed fd == -1. For fd != -1, test after fget_mmap() (for newer binaries) covers the case. Reported by: bdrewery, pho Reviewed by: jhb Sponsored by: The FreeBSD Foundation	2014-09-17 21:04:50 +00:00
John Baldwin	8bafac5444	Permit MAP_RENAME and MAP_NORESERVE for now. These flags should be removed, but at least Chromium and OpenJDK use MAP_NORESERVE.	2014-09-16 17:21:06 +00:00
John Baldwin	5fd3f8b3b6	Add stricter checking of some mmap() arguments: - Fail with EINVAL if an invalid protection mask is passed to mmap(). - Fail with EINVAL if an unknown flag is passed to mmap(). - Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap(). - Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous mappings. Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D698	2014-09-15 17:20:13 +00:00
Alan Cox	a7fecb4d3a	Three improvements to vnode_pager_generic_getpages(): Eliminate an exclusive object lock acquisition and release on the expected execution path. Do page zeroing before the object lock is acquired rather than during the time that the object lock is held. Use vm_pager_free_nonreq() to eliminate duplicated code. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-15 17:14:09 +00:00
Gleb Smirnoff	be58a555d2	Remove redundant declaration. vnode.h should be included before vnode_pager.h.	2014-09-15 15:49:29 +00:00
Konstantin Belousov	d15b55c554	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
Alan Cox	396b3e34b4	Avoid an exclusive acquisition of the object lock on the expected execution path through the NFS clients' getpages functions. Introduce vm_pager_free_nonreq(). This function can be used to eliminate code that is duplicated in many getpages functions. Also, in contrast to the code that currently appears in those getpages functions, vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one case. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-14 18:07:55 +00:00
Konstantin Belousov	33cad9e936	Fix mis-spelling of bits and types names in the vnode_pager_putpages(). The changes should not modify the generated code. The pager->pgo_putpages() method takes int flags as its fourth argument, while vnode_pager_putpages() used boolean_t (which is typedef'ed to int). The flags are from VM_PAGER_* namespace, while vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(), which both are numerically equal to VM_PAGER_PUT_SYNC. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-14 10:27:36 +00:00
Alan Cox	81a065058c	Update a stale comment.	2014-09-11 03:16:57 +00:00
Gleb Smirnoff	27ad26d8c7	Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().	2014-09-10 12:36:41 +00:00
Alan Cox	64f096eeb2	Fix a boundary case error in vm_reserv_alloc_contig(): If a reservation isn't being allocated for the last of the requested pages, because a reservation won't fit in the gap between allocated pages, then the reservation structure shouldn't be initialized. While I'm here, improve the nearby comments. Reported by: jeff, pho MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-09-10 05:52:30 +00:00
Alan Cox	0afcd3af8b	Oops. vm_map_simplify_entry() is used by mac_proc_vm_revoke_recurse(), so it can't be static.	2014-09-08 02:25:01 +00:00
Alan Cox	077ec27cd6	Make two functions static and eliminate an unused #define.	2014-09-08 00:19:03 +00:00
John Baldwin	1a83a822d2	Fix a typo.	2014-08-29 21:20:36 +00:00
Steven Hartland	4d19f4ad1f	Refactor ZFS ARC reclaim logic to be more VM cooperative Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead to large amounts of unused RAM e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay	2014-08-28 19:50:08 +00:00
Alan Cox	9452b5eda9	Back in the days when the kernel was single threaded, testing "vm_paging_target() > 0" was a reasonable way of determining if the inactive queue scan met its target. However, now that other threads can be allocating pages while the inactive queue scan is running, it's an unreliable method. The effect of it being unreliable is that we can start swapping out processes when we didn't intend to. This issue has existed since the kernel was multithreaded, but the changes to the inactive queue target in 10.0-RELEASE have made its effects visible. This change introduces a more direct method for determining if the inactive queue scan met its target that is not affected by the actions of other threads. Reported by: Steve Polyack Tested by: pho, Steve Polyack (an earlier version) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-08-26 16:40:20 +00:00
Alan Cox	b9ce8cc2d7	Relax one of the conditions for mapping a page on the fast path. Reviewed by: kib X-MFC with: r270011 Sponsored by: EMC / Isilon Storage Division	2014-08-23 05:24:31 +00:00
Konstantin Belousov	afe55ca373	Implement 'fast path' for the vm page fault handler. Or, it could be called a scalable path. When several preconditions hold, the vm object lock for the object containing the faulted page is taken in read mode, instead of write, which allows parallel faults processing in the region. Namely, the fast path is taken when the faulted page already exists and does not need copy on write, is already fully valid, and not busy. For technical reasons, fast path is avoided when the fault is the first write on the vnode object, or when the fault is for wiring or debugger read or write. On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag, since object lock is kept. Pmap might fail to create the entry, in which case the fallback to slow path is performed. Reviewed by: alc Tested by: pho (previous version) Hardware provided and hosted by: The FreeBSD Foundation and Sentex Data Communications Sponsored by: The FreeBSD Foundation MFC after: 2 week	2014-08-15 07:30:14 +00:00
Alan Cox	9f746b66df	Avoid pointless (but harmless) actions on unmanaged pages. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-08-14 15:46:15 +00:00
Konstantin Belousov	70978c93b8	If vm_page_grab() allocates a new page, the page is not inserted into page queue even when the allocation is not wired. It is the responsibility of the vm_page_grab() caller to ensure that the page does not end up on the vm_object queue but not on the pagedaemon queue, which would effectively create an unpageable unwired page. In exec_map_first_page() and vm_imgact_hold_page(), activate the page immediately after unbusying it, to avoid leak. In the uiomove_object_page(), deactivate page before the object is unlocked. There is no leak, since the page is deactivated after uiomove_fromphys() finished. But allowing non-queued non-wired page in the unlocked object queue makes it impossible to assert that leak does not happen in other places. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-13 05:44:08 +00:00
Konstantin Belousov	afb69e6b3e	Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set when either the page is busied, or the owner object is locked. Update comments, move all assertions about page state when PGA_WRITEABLE flag is set, into new helper vm_page_assert_pga_writeable(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-09 05:00:34 +00:00
Konstantin Belousov	39ffa8c138	Change pmap_enter(9) interface to take flags parameter and superpage mapping size (currently unused). The flags includes the fault access bits, wired flag as PMAP_ENTER_WIRED, and a new flag PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep. For powerpc aim both 32 and 64 bit, fix implementation to ensure that the requested mapping is created when PMAP_ENTER_NOSLEEP is not specified, in particular, wait for the available memory required to proceed. In collaboration with: alc Tested by: nwhitehorn (ppc aim32 and booke) Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division MFC after: 2 weeks	2014-08-08 17:12:03 +00:00
Konstantin Belousov	385b4265fc	The vm_pager_page_unswapped() pager op is only implemented for the swap pager. Swap pager uses a private mutex to protect swap metadata, and does not rely on the vm object lock to ensure integrity of it. Weaken the requirement for the vm object lock by only asserting locked object in vm_pager_page_unswapped(), instead of locked exclusively. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-06 19:34:03 +00:00
Konstantin Belousov	faaf544760	Add wrappers to assert that vm object is unlocked and for try upgrade. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-06 19:30:35 +00:00
Roger Pau Monné	5ebe728d53	vm_phys: improve robustness of fictitious ranges With the current implementation of managed fictitious ranges when also using VM_PHYSSEG_DENSE, a user could try to register a fictitious range that starts inside of vm_page_array, but then overruns it (because the end of the fictitious range is greater than vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE returning unallocated pages from past the end of vm_page_array. The same could happen if a user tried to register a segment that starts outside of vm_page_array but ends inside of it. In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to use a set of pages from vm_page_array, and allocate the rest. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc vm/vm_phys.c: - Allow registering/unregistering fictitious ranges that overrun vm_page_array.	2014-08-05 10:29:01 +00:00
Alan Cox	a695d9b25b	Retire pmap_change_wiring(). We have never used it to wire virtual pages. We continue to use pmap_enter() for that. For unwiring virtual pages, we now use pmap_unwire(), which unwires a range of virtual addresses instead of a single virtual page. Sponsored by: EMC / Isilon Storage Division	2014-08-03 20:40:51 +00:00
Alan Cox	0b69568411	Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable "rv" is uninitialized. Reported by: bz	2014-08-02 17:58:20 +00:00
Alan Cox	66cd575b28	Handle wiring failures in vm_map_wire() with the new functions pmap_unwire() and vm_object_unwire(). Retire vm_fault_{un,}wire(), since they are no longer used. (See r268327 and r269134 for the motivation behind this change.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-08-02 16:10:24 +00:00
Alan Cox	0346250941	When unwiring a region of an address space, do not assume that the underlying physical pages are mapped by the pmap. If, for example, the application has performed an mprotect(..., PROT_NONE) on any part of the wired region, then those pages will no longer be mapped by the pmap. So, using the pmap to lookup the wired pages in order to unwire them doesn't always work, and when it doesn't work wired pages are leaked. To avoid the leak, introduce and use a new function vm_object_unwire() that locates the wired pages by traversing the object and its backing objects. At the same time, switch from using pmap_change_wiring() to the recently introduced function pmap_unwire() for unwiring the region's mappings. pmap_unwire() is faster, because it operates on a range of virtual addresses rather than a single virtual page at a time. Moreover, by operating on a range, it is superpage friendly. It doesn't waste time performing unnecessary demotions. Reported by: markj Reviewed by: kib Tested by: pho, jmg (arm) Sponsored by: EMC / Isilon Storage Division	2014-07-26 18:10:18 +00:00
Konstantin Belousov	4bace8e721	Correct assertion. The shadowing object cannot be tmpfs vm object, and tmpfs object cannot shadow. In other words, tmpfs vm object is always at the bottom of the shadow chain. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-24 10:25:42 +00:00
Konstantin Belousov	f08f7dca40	The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs vnode for the tmpfs node owning this object. The flag is currently used for two purposes. First, it allows to correctly handle VV_TEXT for tmpfs vnode when the ref count on the object is decremented to 1, similar to vnode_pager_dealloc() for regular filesystems. Second, it prevents some operations, which are done on OBJT_SWAP vm objects backing user anonymous memory, but are incorrect for the object owned by tmpfs node. The second kind of use of the OBJ_TMPFS flag is incorrect, since the vnode might be reclaimed, which clears the flag, but vm object operations must still be disallowed. Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test whether vm object collapse and similar actions should be disabled. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:30:37 +00:00
Roger Pau Monné	38d6b2dcb2	vm_phys: remove limitation on number of fictitious regions The number of vm fictitious regions was limited to 8 by default, but Xen will make heavy use of those kinds of regions in order to map memory from foreign domains, so instead of increasing the default number, change the implementation to use a red-black tree to track vm fictitious ranges. The public interface remains the same. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc Approved by: gibbs vm/vm_phys.c: - Replace the vm fictitious static array with a red-black tree. - Use a rwlock instead of a mutex, since now we also need to take the lock in vm_phys_fictitious_to_vm_page, and it can be shared.	2014-07-09 08:12:58 +00:00
Marcel Moolenaar	e7d939bda2	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan	2014-07-07 00:27:09 +00:00
Alan Cox	09132ba6ac	Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are several reasons for this change: pmap_change_wiring() has never (in my memory) been used to set the wired attribute on a virtual page. We have always used pmap_enter() to do that. Moreover, it is not really safe to use pmap_change_wiring() to set the wired attribute on a virtual page. The description of pmap_change_wiring() says that it assumes the existence of a mapping in the pmap. However, non-wired mappings may be reclaimed by the pmap at any time. (See pmap_collect().) Many implementations of pmap_change_wiring() will crash if the mapping does not exist. pmap_unwire() accepts a range of virtual addresses, whereas pmap_change_wiring() acts upon a single virtual page. Since we are typically unwiring a range of virtual addresses, pmap_unwire() will be more efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings. Previously, we were forced to demote the superpage mapping, because pmap_change_wiring() only allowed us to express the unwiring of a single base page mapping at a time. This added to the overhead of unwiring for large ranges of addresses, including the implicit unwiring that occurs at process termination. Implementations for arm and powerpc will follow. Discussed with: jeff, marcel Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-07-06 17:42:38 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types, both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function, allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating the children of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, since some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading kludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection with removing unneeded TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, since malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Alan Cox	60169c88d9	Delay the call to crhold() in vm_map_insert() until we know that we won't have to undo it by calling crfree(). This reduces the total number of calls by vm_map_insert() to crhold() and crfree() by 45% in my tests. Eliminate an unnecessary variable from vm_map_insert(). Reviewed by: kib Tested by: pho	2014-06-26 16:04:03 +00:00
Alan Cox	eaaf9f7fce	Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries that it creates (r267645), we can place the check that blocks map entry coalescing on stack entries in vm_map_simplify_entry() where it properly belongs. Reviewed by: kib	2014-06-25 03:30:03 +00:00
Konstantin Belousov	b5f8c226ab	Use correct names for the flags. MAP_ENTRY_GROWS_* have the same numerical values as MAP_STACK_GROWS_*, but the former is for entries' eflags, while the latter is for the cow argument of vm_map_insert(). Submitted by: alc	2014-06-23 07:03:47 +00:00
Konstantin Belousov	5831f5fc52	Assert that the new entry is inserted into the right location in the map entries list, and that it does not overlap with the previous and next entries. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-20 07:01:53 +00:00
Alan Cox	39c18ce157	Eliminate a pointless call to vm_map_clip_start() from vm_map_growstack(). For this call to do anything at all we would have to have two overlapping map entries. Submitted by: kib	2014-06-19 21:05:07 +00:00
Alan Cox	712efe66e2	When MAP_STACK_GROWS_{DOWN,UP} are passed to vm_map_insert() set the corresponding flag(s) in the new map entry. Previously, the caller was responsible for setting them after vm_map_insert() returned. Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when extending the stack in the downward direction. Together these changes slightly simplify the caller's task when creating a downward growing stack. In particular, the caller no longer needs to clip the previous entry, because the new stack entry can't possibly coalesce with the previous entry. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-19 16:26:16 +00:00
Konstantin Belousov	11c42bcc54	Add MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED, and prevents the request from deleting existing mappings in the region, failing instead. Reviewed by: alc Discussed with: jhb Tested by: markj, pho (previous version, as part of the bigger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-19 05:00:39 +00:00
Attilio Rao	3ae10f7477	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
Alan Cox	33314db034	Tidy up the early parts of vm_map_insert(), in particular, simplify one of the assertions and eliminate a comment that has grown stale. Reviewed by: kib MFC after: 1 week	2014-06-16 16:37:41 +00:00
Alan Cox	e1f92ccc73	One of the intentions behind r267254 was that the global variable "sgrowsiz" would be read once and cached in a local variable so that the resource limit check and map entry insertion would be guaranteed to use the same value. However, the value being passed to vm_map_insert() is still from "sgrowsiz" and not the local variable. Correct this oversight. Reviewed by: kib	2014-06-15 07:52:59 +00:00
Alexander Motin	1aa6c75827	Introduce new "256 Bucket" zone to split requests and reduce congestion on "128 Bucket" zone lock. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 11:57:07 +00:00
Alexander Motin	20d3ab87cd	When allocating a new bucket for a bucket zone, never take it from the zone itself, since it will almost certainly fail. Take the next bigger zone instead. This situation should not happen with the original bucket zone configuration: the "32 Bucket" zone uses "64 Bucket" and vice versa. But if the "64 Bucket" zone lock is congested, the zone may grow its bucket size and start biting itself. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 11:36:22 +00:00
Alan Cox	3180f7573a	Correct a bug in the management of the population map on big-endian machines. Specifically, there was a mismatch between how the routine allocation and deallocation operations accessed the population map and how the aggressively optimized reservation-breaking operation accessed it. So, problems only occurred when reservations were broken. This change makes the routine operations access the population map in the same way as the reservation breaking operation. This bug was introduced in r259999. PR: 187080 Tested by: jmg (on an "armeb" machine) Sponsored by: EMC / Isilon Storage Division	2014-06-11 16:11:12 +00:00
Konstantin Belousov	4648ba0a0f	Make mmap(MAP_STACK) search for the available address space, similar to !MAP_STACK mapping requests. For MAP_STACK | MAP_FIXED, clear any mappings which could previously exist in the used range. For this, teach vm_map_find() and vm_map_fixed() to handle MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new vm_map_stack_locked() helper, which is factored out from vm_map_stack(). The side effect of the change is that MAP_STACK started obeying MAP_ALIGNMENT and MAP_32BIT flags. Reported by: rwatson Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-06-09 03:37:41 +00:00
Alan Cox	dd05fa1945	Add a page size field to struct vm_page. Increase the page size field when a partially populated reservation becomes fully populated, and decrease this field when a fully populated reservation becomes partially populated. Use this field to simplify the implementation of pmap_enter_object() on amd64, arm, and i386. On all architectures where we support superpages, the cost of creating a superpage mapping is roughly the same as creating a base page mapping. For example, both kinds of mappings entail the creation of a single PTE and PV entry. With this in mind, use the page size field to make the implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to vm_map_pmap_enter(), that function would only map base pages. Now, it will create up to 96 base page or superpage mappings. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-07 17:12:26 +00:00
Konstantin Belousov	5930251a9d	Remove the assert which can be triggered by the userspace. The situation checked by assert is verified to not take place in vm_map_wire(), and protection permissions on the wired entry can be revoked afterward. Reported by: markj Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-28 00:45:35 +00:00
Alan Cox	fa2f411c4e	There is no reason to perform the pmap_remove() on the kernel pmap while the kmem object lock is held. Do the pmap_remove() before acquiring the kmem object lock. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-23 16:22:36 +00:00
Konstantin Belousov	2602a2ea88	Remove redundant loop. The inner goto restarts the whole page handling in the situation identical to the loop condition. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-05-21 08:19:04 +00:00
Konstantin Belousov	7032434e98	When exec_new_vmspace() decides that current vmspace cannot be reused on execve(2), it calls vmspace_exec(), which frees the current vmspace. The thread executing an exec syscall gets new vmspace assigned, and old vmspace is freed if only referenced by the current process. The free operation includes pmap_release(), which de-constructs the paging structures used by hardware. If the calling process is multithreaded, other threads are suspended in the thread_suspend_check(), and need to be unsuspended and run to be able to exit on successful exec. Now, since the old vmspace is destroyed, paging structures are invalid, threads are resumed on the non-existent pmaps (page tables), which leads to triple fault on x86. To fix, postpone the free of old vmspace until the threads are resumed and exited. To avoid modifications to all image activators all of which use exec_new_vmspace(), memoize the current (old) vmspace in kern_execve(), and notify it about the need to call vmspace_free() with a thread-private flag TDP_EXECVMSPC. http://bugs.debian.org/743141 Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-05-20 09:19:35 +00:00
Alan Cox	afaa41f6b8	On a fork allow read-only wired pages to be copy-on-write shared between the parent and child processes. Previously, we copied these pages even though they are read only. However, the reason for copying them is historical and no longer exists. In recent times, vm_map_protect() has developed the ability to copy pages when write access is added to wired copy-on-write pages. So, in this case, copy-on-write sharing of wired pages is not to be feared. It is not going to lead to copy-on-write faults on wired memory. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-13 13:20:23 +00:00
Konstantin Belousov	c8f780e3d6	Fix locking. The dst_object must remain locked on the retry of the loop iteration. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 6 days	2014-05-11 18:07:07 +00:00
Alan Cox	dd006a1b14	With the new-and-improved vm_fault_copy_entry() (r265843), we can always avoid soft page faults when adding write access to user wired entries in vm_map_protect(). Previously, we only avoided the soft page fault when the underlying pages were copy-on-write. In other words, we avoided the page faults that might sleep on page allocation, but not the trivial page faults to update the physical map. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-11 17:41:29 +00:00
Alan Cox	d9a9209abe	About 9% of the pmap_protect() calls being performed by vm_map_copy_entry() are unnecessary. Eliminate the unnecessary calls. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-10 19:47:00 +00:00
Konstantin Belousov	0973283d6e	For the upgrade case in vm_fault_copy_entry(), when the entry does not need COW and is writeable (i.e. becoming writeable due to the mprotect(2) operation), do not create a new backing object for the entry. The caller of the function is vm_map_protect(), the call is made to ensure that wired entry has all pages resident and wired in the top level object and to enable the write. We might need to copy read-only page from some backing objects into the top object or remap the page with the write allowed. This fixes the issue with mishandling of the swap accounting when read-only wired mapping is upgraded to write-enabled after fork. The previous code path did not account for the new object, but its creation is redundant anyway and the change provides an optimization for the non-common situation. Reported by: markj Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 17:03:33 +00:00
Konstantin Belousov	44bbc3b77d	When printing the map with the ddb 'show procvm' command, do not dump page queues for the backing objects. The queues are huge and clutter the display, when mostly the map entries and its backing storage is interesting. The page queues can be seen with ddb 'show object' command. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:36:13 +00:00
Konstantin Belousov	3d95614f9d	Print the entry address in addition to the object. The variable is typically optimized out and debuggers cannot find its value. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:30:48 +00:00
Peter Holm	e103f5b1c0	msync(2) must return ENOMEM and not EINVAL when the address is outside the allowed range or when one or more pages are not mapped. This according to The Open Group Base Specifications Issue 7. Discussed with: attilio, Bruce Evans Reviewed by: alc, Garrett Cooper Reported by: ATF MFC after: 2 weeks Sponsored by: EMC / Isilon storage division	2014-05-07 08:38:02 +00:00
Alan Cox	60196cda04	Prior to r254304, a separate function, vm_pageout_page_stats(), was used to periodically update the reference status of the active pages. This function was called, instead of vm_pageout_scan(), when memory was not scarce. The objective was to provide up to date reference status for active pages in case memory did become scarce and active pages needed to be deactivated. The active page queue scan performed by vm_pageout_page_stats() was virtually identical to that performed by vm_pageout_scan(), and so r254304 eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is called with the parameter "pass" set to zero. The intention was that when pass is zero, vm_pageout_scan() would only scan the active queue. However, the variable page_shortage can still be greater than zero when memory is not scarce and vm_pageout_scan() is called with pass equal to zero. Consequently, the inactive queue may be scanned and dirty pages laundered even though that was not intended by r254304. This revision fixes that. Reported by: avg MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-06 03:42:04 +00:00
Konstantin Belousov	a17937bdd0	For the VM_PHYSSEG_DENSE case, checking the requested range to fall into the area backed by vm_page_array wrongly compared end with vm_page_array_size. It should be adjusted by first_page index to be correct. Also, the corner and incorrect case of the requested range extending after the end of the vm_page_array was incorrectly handled by allocating the segment. Fix the comparison for the end of range and return EINVAL if the end extends beyond vm_page_array. Discussed with: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-29 18:42:37 +00:00
Konstantin Belousov	4c74acf76a	When vm_fault_copy_entry() is called from vm_map_protect() for a wired entry and performs the upgrade of the entry permissions from read-only to read-write, we must allow to search for the source pages in the backing object, like we do in the case of forking the read-only wired entry. For the fork case, the behaviour is allowed by src_readonly boolean, which in fact is only used to assert that read-write case provides all source pages in the top-level object. Eliminate the src_readonly variable. Allow for the copy loop to look into the backing objects, add explicit asserts to ensure that only read-only and upgrade case actually does. Expand comments. Change the panic call into assert. Reported by: markj Tested by: markj, pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-27 05:19:01 +00:00
Dag-Erling Smørgrav	612032773a	Add sysctl OIDs showing the actual size and capacity of the swap zone. MFC after: 1 week	2014-04-26 12:18:17 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	52f3c44efe	Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, accesses to the direct map region should check for the limit by which the direct map is instantiated. Second, for accesses to the kernel map, success returned from kernacc(9) does not guarantee that a subsequent attempt to read or write the checked address succeeds, since another thread might invalidate the address in the meantime. Add a new thread private flag TDP_DEVMEMIO, which instructs vm_fault() to return an error when a fault happens on the MAP_ENTRY_NOFAULT entry, instead of panicking. The trap handler would then see a page fault from access, and recover in the normal way, making /dev/mem access safer. Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed and having Giant locked does not solve issues for amd64. Note that at least the second issue exists on other architectures, and requires similar patching for md code. Reported and tested by: clusteradm (gjb, sbruno) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 14:25:09 +00:00
Konstantin Belousov	997ac6905f	Initialize vm_map_entry member wiring_thread on the map entry creation. This was missed in r253190. Reported by: hps, peter Tested by: hps Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-03-21 13:55:57 +00:00
Attilio Rao	0d8243cc34	vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock, then threads can sleep on the pip condition. Avoid deadlocking such threads by correctly awakening the sleeping ones after the pip is finished. The swapoff side of the bug can likely result in shutdown deadlocks. Sponsored by: EMC / Isilon Storage Division Reported by: pho, pluknet Tested by: pho	2014-03-19 01:13:42 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Konstantin Belousov	7253a5ec63	Initialize paddr to handle the case of zero size. Reported and reviewed by: Conrad Meyer <cemeyer@uw.edu> MFC after: 1 week	2014-03-12 16:38:55 +00:00
Konstantin Belousov	2309fa9b92	Do not vdrop() the tmpfs vnode until it is unlocked. The hold reference might be the last, and then vdrop() would free the vnode. Reported and tested by: bdrewery MFC after: 1 week	2014-03-12 15:13:57 +00:00
Dimitry Andric	2367b4ddc4	After r251709, avoid a clang 3.4 warning about an unused static const variable (uma_max_ipers), when asserts are disabled. Reviewed by: glebius MFC after: 3 days	2014-02-14 17:47:18 +00:00
Attilio Rao	14a5dc1780	Fix-up r254141: in the process of making a failing vm_page_rename() a call of pager_swap_freespace() was moved around, now leading to freeing the incorrect page because of the pindex changes after vm_page_rename(). Get back to use the correct pindex when destroying the swap space. Sponsored by: EMC / Isilon storage division Reported by: avg Tested by: pho MFC after: 7 days	2014-02-14 03:34:12 +00:00
Gleb Smirnoff	5f3563b0a5	Fix function name in KASSERT(). Submitted by: hiren	2014-02-12 20:11:20 +00:00
John Baldwin	8add0ced70	Correct assertion to assert that the existing device VM object uses the same type rather than asserting in the case where we just created a new VM object. Reviewed by: kib	2014-02-11 22:05:21 +00:00
Gleb Smirnoff	49fef6a202	Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized. Sponsored by: Nginx, Inc.	2014-02-10 19:59:46 +00:00
Gleb Smirnoff	f947570e35	Style.	2014-02-10 19:51:15 +00:00
Gleb Smirnoff	48343a2f34	Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones. Sponsored by: Nginx, Inc.	2014-02-10 19:48:26 +00:00
Alan Cox	7b9b301c6b	Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time. Successful prefaults after a zero-fill fault are extremely rare.	2014-02-09 01:59:52 +00:00
Gleb Smirnoff	0a5a3ccb81	Provide macros that allow easily export uma(9) zone limits and current usage via sysctl(9): SYSCTL_UMA_MAX() SYSCTL_ADD_UMA_MAX() SYSCTL_UMA_CUR() SYSCTL_ADD_UMA_CUR() Sponsored by: Nginx, Inc.	2014-02-07 14:29:03 +00:00
Alan Cox	63281952f0	Make prefaulting more aggressive on hard faults. Previously, we would only map a fraction of the pages that were fetched by vm_pager_get_pages() from secondary storage. Now, we map them all in order to avoid future soft faults. This effect is most evident when a memory-mapped file is accessed sequentially. Previously, there were 6 soft faults for every hard fault. Now, these soft faults are eliminated. Sponsored by: EMC / Isilon Storage Division	2014-02-02 20:21:53 +00:00
Alan Cox	793d14076a	In an effort to diagnose possible corruption of struct vm_page on some sparc64 machines make the page queue assert in vm_page_dequeue() more precise. While I'm here switch the page lock assert to the newer style.	2014-01-24 19:08:42 +00:00
John Baldwin	ab46f63e8f	Fix a couple of typos.	2014-01-21 03:27:47 +00:00
Gleb Smirnoff	7ebba1f8ff	ANSIfy declarations. Ok'ed by: alc	2014-01-20 18:47:56 +00:00
Alan Cox	86fa24710e	Style changes in vm_pageout_scan(): 1. Be consistent in the style of "act_delta" manipulations between the inactive and active queue scans. 2. Explicitly compare to zero. 3. The deactivation of a page is based on its recent history and not just the current call to vm_pageout_scan(). The variable "act_delta" represents the current state of the page, and not its history. Avoid possible confusion by not (ab)using "act_delta" for making the deactivation decision. Submitted by: kib [1] Reviewed by: kib [2,3]	2014-01-18 20:02:59 +00:00
Alan Cox	9099545af1	Correctly update the count of stuck pages, "addl_page_shortage", in vm_pageout_scan(). There were missing increments in two less common cases. Don't conflate the count of stuck pages and the pageout deficit provided by vm_page_alloc{,_contig}(). (A proposed fix to the OOM code depends on this.) Handle held pages consistently in the inactive queue scan. In the more common case, we did not move the page to the tail of the queue. Whereas, in the less common case, we did. There's no particular reason to move the page in the less common case, so remove it. Perform the calculation of the page shortage for the active queue scan a little earlier, before the active queue lock is acquired. The correctness of this calculation doesn't depend on the active queue lock being held. Eliminate a redundant variable, "pcount". Use the more descriptive variable, "maxscan", in its place. Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess parentheses. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-01-12 19:04:20 +00:00
Alan Cox	000fb817d8	Since the introduction of the popmap to reservations in r259999, there is no longer any need for the page's PG_CACHED and PG_FREE flags to be set and cleared while the free page queues lock is held. Thus, vm_page_alloc(), vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after the free page queues lock is released to clear the page's flags. Moreover, the PG_FREE flag can be retired. Now that the reservation system no longer uses it, its only uses are in a few assertions. Eliminating these assertions is no real loss. Other assertions catch the same types of misbehavior, like doubly freeing a page (see r260032) or dirtying a free page (free pages are invalid and only valid pages can be dirtied). Eliminate an unneeded variable from vm_page_alloc_contig(). Sponsored by: EMC / Isilon Storage Division	2013-12-31 18:25:15 +00:00
Alan Cox	a08c151546	Add "popmap" assertions: The page being freed isn't already free, and the page being allocated isn't already allocated. Sponsored by: EMC / Isilon Storage Division	2013-12-29 04:54:52 +00:00
Alan Cox	ec17932242	MFp4 alc_popmap Change the way that reservations keep track of which pages are in use. Instead of using the page's PG_CACHED and PG_FREE flags, maintain a bit vector within the reservation. This approach has a couple of benefits. First, it makes breaking reservations much cheaper because there are fewer cache misses to identify the unused pages. Second, it is a prerequisite for supporting two or more reservation sizes.	2013-12-28 04:28:35 +00:00
Konstantin Belousov	b61a53d43d	Do not coalesce stack entry, vm_map_stack() asserts that the requested region is claimed by a new entry. Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack(), to really turn off coalescing code and call to vm_map_simplify_entry() [1]. Reported by: avg, peter, many Tested by: avg, peter Noted by: avg [1] Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-27 16:59:47 +00:00
Marcel Moolenaar	938b0f5b75	For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is that we don't have a good way (yet) to iterate over the mapped pages by virtual address and simply try each page within the range. Given that we call pmap_remove() over the entire 2^63 bytes of address space, it takes a while for pmap_remove to have tried all 2^50 pages. By using pmap_remove_pages() we use the PV list to find all mappings. Change derived from a patch by: alc	2013-12-26 05:46:10 +00:00
Dimitry Andric	d395270d06	In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as argument, cast the incoming 0 argument to void *, to silence a warning from clang 3.4 ("expression which evaluates to zero treated as a null pointer constant of type 'void *' [-Wnon-literal-null-conversion]"). MFC after: 3 days	2013-12-25 22:32:34 +00:00
Alan Cox	703b304f33	Eliminate a redundant parameter to vm_radix_replace(). Improve the wording of the comment describing vm_radix_replace(). Reviewed by: attilio MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2013-12-08 20:07:02 +00:00
Craig Rodrigues	a3845534a0	In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty" message printed to the console. This makes it easier to track down the source of certain memory leaks. Suggested by: adrian	2013-11-29 08:04:45 +00:00
Alexander Motin	03175483c2	- Add bucket size column to `show uma` DDB command. - Add `show umacache` command to show alike stats for cache-only UMA zones.	2013-11-28 19:20:49 +00:00
Alexander Motin	cec48e002a	Make UMA not blindly force offpage slab header allocation for large (> PAGE_SIZE) zones. If the zone size is not a multiple of PAGE_SIZE, there may be enough space for the header in the last page, so we may avoid extra header memory allocation and hash table update/lookup. ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336). This change makes good use of at least some of the otherwise lost memory there. Reviewed by: avg	2013-11-27 20:56:10 +00:00
Alexander Motin	f7104ccd94	Don't count bucket allocation failures for UMA zones as their own failures. There are good reasons for this to happen, such as recursion prevention, etc. and they are not fatal since buckets are just an optimization mechanism. Real bucket allocation failures are any way counted by the bucket zones themselves, and we don't need double accounting there.	2013-11-27 20:16:18 +00:00
Alexander Motin	e8a720fe60	Fix bug introduced at r252226, when udata argument passed to bucket_alloc() was used without making sure first that it was really passed for us. On some of my systems this bug made user argument passed by ZFS code to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.	2013-11-27 19:55:42 +00:00
Alexander Motin	8a8d9d1475	When purging per-CPU UMA caches, do not return empty buckets to the global full bucket cache, so as not to trigger an assertion if an allocation happens before that global cache gets purged.	2013-11-23 13:42:56 +00:00
Konstantin Belousov	79e9451f07	The vm map code performs clipping when a map entry covers a region which is larger than the operational region. If the op region size is zero, clipping would create a zero-sized map entry. The result is that the vm map splay starts behaving inconsistently, sometimes returning the zero-sized entry, sometimes the next (or previous) entry. One step further, it could result in e.g. vm_map_wire() setting MAP_ENTRY_IN_TRANSITION on the zero-sized entry, but failing to clear it in the done part. The vm_map_delete() then hangs forever waiting for the flag removal. Check for zero-length requests and act as if they are always successful, without performing any action on the address space. Diagnosed by: pho Tested by: pho (previous version) Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-20 09:03:48 +00:00
Konstantin Belousov	ff3ae454c0	Add assertions to cover all places in the wiring and unwiring code where MAP_ENTRY_IN_TRANSITION is set or cleared. Tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-20 08:47:54 +00:00
Konstantin Belousov	7e14088d93	Revert to using int for the page counts. In vn_io_fault(), the i/o is chunked into pieces limited by the integer io_hold_cnt tunable, while vm_fault_quick_hold_pages() takes the integer max_count as the upper bound. Rearrange the checks to correctly handle overflowing address arithmetic. Submitted by: bde Tested by: pho Discussed with: alc MFC after: 1 week	2013-11-20 08:45:26 +00:00
Alexander Motin	a2de44abf5	Implement a mechanism to safely but slowly purge UMA per-CPU caches. This is a last resort for very low memory conditions, in case other measures to free memory were ineffective. Sequentially cycle through all CPUs and extract per-CPU cache buckets into the zone cache, from where they can be freed.	2013-11-19 10:51:46 +00:00
Alexander Motin	4d104ba024	Grow UMA zone bucket size also on lock congestion during item free. Lock congestion is the same, whether it happens on alloc or free, so handle it equally. Now that we have back pressure, there is no problem with growing buckets a bit faster. Anyway, growth is much slower than in 9.x.	2013-11-19 10:17:10 +00:00
Alexander Motin	f3932e9025	Add two new UMA bucket zones to store 3 and 9 items per bucket. These new buckets make bucket size self-tuning softer and more precise. Without them there are buckets for 1, 5, 13, 29, ... items. While at bigger sizes a difference of about 2x is fine, at the smallest ones it is 5x and 2.6x respectively. The new buckets make that line look like 1, 3, 5, 9, 13, 29, reducing the jumps between steps and making the algorithm work more softly, allocating and freeing memory in better fitting chunks. Otherwise there is quite a big gap between allocating 128K and 5x128K of RAM at once.	2013-11-19 10:10:44 +00:00
Alexander Motin	ace66b568e	Implement soft pressure on UMA cache bucket sizes. Every time the system detects a low memory condition, decrease bucket sizes for each zone by one item. As a result, higher memory pressure will push toward smaller bucket sizes, and so smaller per-CPU caches, and so more efficient memory use. Before this change there was no force to oppose bucket growth resulting from practically inevitable zone lock conflicts, and after some run time per-CPU caches could consume enough RAM to kill the system.	2013-11-19 10:05:53 +00:00
Konstantin Belousov	d005ed537c	Avoid overflow for the page counts. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-12 08:47:58 +00:00
Konstantin Belousov	1bd7d0b7db	If a filesystem declares that it supports shared locking for writes, use a shared vnode lock for VOP_PUTPAGES() as well. The only such filesystem in the tree is ZFS, and it uses vnode_pager_generic_putpages(), which performs the pageout with VOP_WRITE(). Reviewed by: alc Discussed with: avg Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-11-09 20:36:29 +00:00
Konstantin Belousov	9ded9474d3	Do not coalesce if the swap object belongs to a tmpfs vnode. The coalesce would extend the object to keep pages for the anonymous mapping created by the process. The pages have no relation to the tmpfs file content which could be written into the corresponding range, causing anonymous mapping and file content aliasing and subsequent corruption. Another lesser problem created by coalescing is over-accounting on the tmpfs node destruction, since the object size is subtracted from the total count of the pages owned by the tmpfs mount. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-05 06:18:50 +00:00
Alan Cox	eb2f42fbb0	Tidy up the output of "sysctl vm.phys_free". Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-10-10 16:11:45 +00:00

3399 Commits