freebsd-nq

Author	SHA1	Message	Date
John Baldwin	ed95805e90	Remove support for Xen PV domU kernels. Support for HVM domU kernels remains. Xen is planning to phase out support for PV upstream since it is harder to maintain and has more overhead. Modern x86 CPUs include virtualization extensions that support HVM guests instead of PV guests. In addition, the PV code was i386 only and not as well maintained recently as the HVM code. - Remove the i386-only NATIVE option that was used to disable certain components for PV kernels. These components are now standard as they are on amd64. - Remove !XENHVM bits from PV drivers. - Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3, etc.) - Remove duplicate copy of <xen/features.h>. - Remove unused, i386-only xenstored.h. Differential Revision: https://reviews.freebsd.org/D2362 Reviewed by: royger Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0) Relnotes: yes	2015-04-30 15:48:48 +00:00
Scott Long	affc4a4bff	Improve support for blacklisting bad memory locations. The user can supply a text file with a list of physical memory addresses to exclude, and have it loaded at boot time via the provided example in loader.conf. The tunable 'vm.blacklist' remains, but using an external file means that there's no practical limit to the size of the list. This change also improves the scanning algorithm for processing the list, scanning the list only once instead of scanning it for every page in the system. Both the sysctl and the file can be unsorted and contain duplicates so long as each entry is numeric (decimal or hex) and is separated by a space, comma, or newline character. The sysctl 'vm.page_blacklist' is now provided to report what memory locations were successfully excluded. Reviewed by: imp, emax Obtained from: Netflix, Inc. MFC after: 3 days	2015-04-29 15:57:14 +00:00
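Note on affc4a4bff: the list format described above (decimal or hex numbers separated by spaces, commas, or newlines, possibly unsorted and with duplicates) can be parsed in a single pass. The following is a minimal userland sketch of that idea, not the kernel code; the function name `parse_blacklist` and its signature are hypothetical. It relies on `strtoull()` with base 0, which also treats a leading `0` as octal, something the commit message does not mention.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a list of physical addresses separated by
 * spaces, commas, or newlines into out[]. Returns the number of entries
 * stored. Parsing stops at the first token that is not numeric or when
 * out[] is full. Duplicates and ordering are left untouched.
 */
static size_t
parse_blacklist(const char *list, uint64_t *out, size_t max)
{
	const char *p = list;
	char *end;
	size_t n = 0;

	while (*p != '\0' && n < max) {
		/* Skip the separators named in the commit message. */
		if (*p == ' ' || *p == ',' || *p == '\n') {
			p++;
			continue;
		}
		/* Base 0 accepts decimal and 0x-prefixed hex. */
		unsigned long long v = strtoull(p, &end, 0);
		if (end == p)
			break;		/* not a number: stop parsing */
		out[n++] = (uint64_t)v;
		p = end;
	}
	return (n);
}
```

Sorting the resulting array once would then allow the page scan to walk the list a single time, which is the kind of improvement the commit describes.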
Edward Tomasz Napierala	4b5c9cf62f	Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation	2015-04-29 10:23:02 +00:00
Konstantin Belousov	85af31a464	Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with the vnode locked. Review: https://reviews.freebsd.org/D2381 Submitted by: Conrad Meyer, Attilio Rao MFC after: 1 week	2015-04-28 08:20:23 +00:00
Scott Long	43ffa928f9	Revert r281451. It causes a panic/hang early in boot for a number of users, myself included. The original code is likely papering over a larger bug that needs to be explored, but for now get things back to a working state. Obtained from: Netflix, Inc. MFC after: immediately	2015-04-24 17:03:53 +00:00
John Baldwin	179fa75e6e	Reassign copyright statements on several files from Advanced Computing Technologies LLC to Hudson River Trading LLC. Approved by: Hudson River Trading LLC (who owns ACT LLC) MFC after: 1 week	2015-04-23 14:22:20 +00:00
Alan Cox	d74e6a1d27	Eliminate an unused variable. MFC after: 1 week	2015-04-20 16:48:21 +00:00
Alan Cox	74f944344b	Eliminate an unused variable. MFC after: 1 week	2015-04-19 00:29:02 +00:00
Konstantin Belousov	0538aafc41	The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x kernels which required padding before the off_t parameter. The fcntl(2) contains compatibility code to handle kernels before the struct flock was changed during the 8.x CURRENT development. The shims were reasonable to allow easier revert to the older kernel at that time. Now, two or three major releases later, shims do not serve any purpose. Such old kernels cannot handle current libc, so revert the compatibility code. Make padded syscalls support conditional under the COMPAT6 config option. For COMPAT32, the syscalls were under COMPAT6 already. Remove WITHOUT_SYSCALL_COMPAT build option, whose only purpose was to (partially) disable the removed shims. Reviewed by: jhb, imp (previous versions) Discussed with: peter Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-18 21:50:13 +00:00
Dmitry Chagin	51cfb0be84	Rework r281162. Indeed, the flexible array member is preferable here. Suggested by: Justin T. Gibbs MFC after: 3 days	2015-04-12 06:21:58 +00:00
Alan Cox	6a93e36b5f	Correct an off-by-one error in vm_reserv_reclaim_contig() that results in an infinite loop. Submitted by: Svatopluk Kraus MFC after: 1 week	2015-04-11 22:57:13 +00:00
Gleb Smirnoff	16be9f54e7	The UMA zone limit can be lowered, so remove the protection against lowering it from sysctl_handle_uma_zone_max(). Sponsored by: Nginx, Inc.	2015-04-10 06:56:49 +00:00
Alexander Motin	0ada3afc25	Remove sleeps from geom_up thread on device destruction. MFC after: 3 days.	2015-04-09 13:09:05 +00:00
Jeff Roberson	34d8b7ea3b	- Simplify vm_pageout_scan() by introducing a new vm_pageout_clean() function that does the locking and validation associated with cleaning a page. This moves 150 lines of code into its own function. - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it really does: clustering nearby pages for pageout optimization. Reviewed by: alc, kib, kmacy Tested by: pho (earlier version) Sponsored by: EMC / Isilon	2015-04-07 02:18:52 +00:00
Dmitry Chagin	6723fdfe46	Properly calculate "UMA Zones" per cpu cache size. Avoid allocating an extra struct uma_cache since the struct uma_zone already has one. PR: 199169 Submitted by: luke.tw gmail com MFC after: 1 week	2015-04-06 18:45:41 +00:00
Alan Cox	b5ab20c066	Until the lock assertions in vm_page_advise() are properly reevaluated, vm_fault_dontneed() should acquire a write lock on the first object in the shadow chain. Reported by: gleb, David Wolfskill	2015-04-05 20:07:33 +00:00
Dmitry Chagin	1d2c0c460c	Fix wrong kassert msg in uma. PR: 199172 Submitted by: luke.tw gmail com MFC after: 1 week	2015-04-05 18:25:23 +00:00
Alan Cox	a8b0f1009d	Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division	2015-04-04 19:10:22 +00:00
Ryan Stone	f2c2231e0c	Fix integer truncation bug in malloc(9) A couple of internal functions used by malloc(9) and uma truncated a size_t down to an int. This could cause any number of issues (e.g. indefinite sleeps, memory corruption) if any kernel subsystem tried to allocate 2GB or more through malloc. zfs would attempt such an allocation when run on a system with 2TB or more of RAM. Note to self: When this is MFCed, sparc64 needs the same fix. Differential revision: https://reviews.freebsd.org/D2106 Reviewed by: kib Reported by: Michael Fuckner <michael@fuckner.net> Tested by: Michael Fuckner <michael@fuckner.net> MFC after: 2 weeks	2015-04-01 12:42:26 +00:00
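Note on f2c2231e0c: the failure mode is a `size_t` request being narrowed to `int` on its way through internal helpers. The sketch below illustrates the general pattern only; the helper names `size_fits_in_int` and `size_to_pages` are hypothetical and are not the functions touched by the commit. It assumes a platform where `int` is 32 bits.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: true if 'size' can be stored in an int without
 * changing value. A 2 GB request fails this test when int is 32 bits.
 */
static bool
size_fits_in_int(size_t size)
{
	return (size <= (size_t)INT_MAX);
}

/*
 * Hypothetical page-count helper that keeps the size as a size_t from
 * end to end, so large requests are never narrowed.
 */
static size_t
size_to_pages(size_t size, size_t page_size)
{
	return (size / page_size + (size % page_size != 0 ? 1 : 0));
}
```

Keeping the parameter type as `size_t` throughout, as in `size_to_pages`, avoids the narrowing entirely rather than detecting it after the fact.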
Gleb Smirnoff	f6d6b5e262	Catch up on r271387 and remove unused parameter from VOP_GETPAGES_ASYNC().	2015-03-30 22:49:26 +00:00
Jeff Roberson	b3de46ab23	- Eliminate pagequeue locking in the dirty code in vm_pageout_scan(). - Use a more precise series of tests to see if the page changed while we were locking the vnode. Reviewed by: alc Sponsored by: EMC / Isilon	2015-03-28 02:36:49 +00:00
Alexander Motin	3398491b2f	Make swapper release orphaned (lost) GEOM provider. Swap device is still reported as enabled, and system still may crash later if some swapped-out kernel pages were lost with the device, but at least GEOM and CAM can now release the lost disk, allowing it to be reconnected. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2015-03-26 17:21:12 +00:00
Rui Paulo	b575067a21	Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH. Requested by: julian	2015-03-26 05:20:18 +00:00
Rui Paulo	57e5a8b184	Use TUNABLE_INT_FETCH for boot_pages. vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM before the sysctl subsystem is initialised. We manually fetch the variable from the environment to work around this problem. Tested by: Keith White kwhite at uottawa.ca MFC after: 1 week	2015-03-24 20:09:55 +00:00
Rui Paulo	b0bce0aef2	Remove whitespace.	2015-03-24 20:07:27 +00:00
Alan Cox	3d653db063	Introduce vm_object_color() and use it in mmap(2) to set the color of named objects to zero before the virtual address is selected. Previously, the color setting was delayed until after the virtual address was selected. In rtld, this delay effectively prevented the mapping of a shared library's code section using superpages. Now, for example, we see the first 1 MB of libc's code on armv6 mapped by a superpage after we've gotten through the initial cold misses that bring the first 1 MB of code into memory. (With the page clustering that we perform on read faults, this happens quickly.) Differential Revision: https://reviews.freebsd.org/D2013 Reviewed by: jhb, kib Tested by: Svatopluk Kraus (armv6) MFC after: 6 weeks	2015-03-21 17:56:55 +00:00
Alan Cox	dfdf9abd94	Fix the root cause of the "vm_reserv_populate: reserv <address> is already promoted" panics. The sequence of events that leads to a panic is rather long and circuitous. First, suppose that process P has a promoted superpage S within vm object O that it can write to. Then, suppose that P forks, which leads to S being write protected. Now, before P's child exits, suppose that P writes to another virtual page within O. Since the pages within O are copy on write, a shadow object for O is created to house the new physical copy of the faulted on virtual page. Then, before P can fault on S, P's child exits. Now, when P faults on S, it will follow the "optimized" path for copy-on-write faults in vm_fault(), wherein the underlying physical page is moved from O to its shadow object rather than allocating a new page and copying the new page's contents from the old page. Moreover, suppose that every 4 KB physical page making up S is moved to the shadow object in this way. However, the optimized path does not move the underlying superpage reservation, which is the root cause of the panics! Ultimately, P performs vm_object_collapse() on O's shadow object, which destroys O and in doing so breaks any reservations still belonging to O. This leaves the reservation underlying S in an inconsistent state: It's simultaneously not in use and promoted. Breaking a reservation does not demote it because I never intended for a promoted reservation to be broken. It makes little sense. Finally, this inconsistency leads to an assertion failure the next time that the reservation is used. The failing assertion does not (currently) exist in FreeBSD 10.x or earlier. There, we will quietly break the promoted reservation. While illogical and unintended, breaking the reservation is essentially harmless. PR: 198163 Reviewed by: kib Tested by: pho X-MFC after: r267213 Sponsored by: EMC / Isilon Storage Division	2015-03-19 01:40:43 +00:00
Gleb Smirnoff	4d6481a4c9	o Enhance vm_pager_free_nonreq() function: - Allow to call the function with vm object lock held. - Allow to specify reqpage that doesn't match any page in the region, meaning freeing all pages. o Utilize the new function in couple more places in vnode pager. Reviewed by: alc, kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-03-17 19:19:19 +00:00
Gleb Smirnoff	41c895a888	Provide a comment explaining r279688. Suggested by: alc	2015-03-16 14:24:47 +00:00
Ian Lepore	1eafc07856	Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl strings returned to userland include the nulterm byte. Some uses of sbuf_new_for_sysctl() write binary data rather than strings; clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in those cases. (Note that the sbuf code still automatically adds a nulterm byte in sbuf_finish(), but since it's not included in the length it won't get copied to userland along with the binary data.) Remove explicit adding of a nulterm byte in a couple places now that it gets done automatically by the sbuf drain code. PR: 195668	2015-03-14 17:08:28 +00:00
Ian Lepore	ed9dd64b8c	Revert r279932; this is going to be fixed in the sbuf code instead. PR: 195668	2015-03-14 13:00:37 +00:00
Ian Lepore	f3b9fcf251	Nullterminate strings returned via sysctl. PR: 195668	2015-03-12 18:06:30 +00:00
Gleb Smirnoff	2c0cb02607	Fix function name in comment.	2015-03-10 13:06:54 +00:00
Konstantin Belousov	79d7993d98	Fix function name in the panic message. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-03-08 02:13:46 +00:00
Alan Cox	6a24058fab	Correct a typo in vm_object_backing_scan() that originated in r254141. Specifically, change a lock acquire into a lock release. MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2015-03-07 04:18:40 +00:00
Gleb Smirnoff	73e9030e61	- In vnode_pager_generic_getpages() use different free counters for synchronous and asynchronous requests. The latter can saturate the I/O and we do not want them to affect regular paging. - Allocate the pbuf at the very beginning of the function, so that if we are low on certain kind of pbufs don't even proceed to BMAP, but sleep. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-03-06 14:15:30 +00:00
Alan Cox	777a36c5e3	Use RW_NEW rather than calling bzero().	2015-03-01 05:18:02 +00:00
Alan Cox	f81b73f3aa	Eliminate a variable that became unused when VFS_LOCK_GIANT() was eliminated. MFC after: 3 days	2015-02-28 19:11:37 +00:00
Enji Cooper	b2ecae3fec	Some minor style(9) fixes (whitespace + comment) MFC after: 3 days	2015-02-17 08:50:26 +00:00
Konstantin Belousov	f40cb1c645	Update mtime for tmpfs files modified through memory mapping. Similar to UFS, perform updates during syncer scans, which in particular means that tmpfs now performs scan on sync. Also, this means that a mtime update may be delayed up to 30 seconds after the write. The vm_object's OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar to the OBJ_MIGHTBEDIRTY flag for the vnode object: it indicates that object could have been dirtied. Adapt fast page fault handler and vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as OBJT_VNODE. Reported by: Ronald Klop <ronald-lists@klop.ws> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-28 10:37:23 +00:00
Will Andrews	8311a2b8a4	Add vm.panic_on_oom sysctl, which enables those who would rather panic than kill a process, when the system runs out of memory. Defaults to off. Usually, this is most useful when the OOM condition is due to mismanagement of memory, on a system where the applications in question don't respond well to being killed. In theory, if the system is properly managed, it shouldn't be possible to hit this condition. If it does, the panic can be more desirable for some users (since it can be a good means of finding the root cause) rather than killing the largest process and continuing on its merry way. As kib@ mentions in the differential, there is also protect(1), which uses procctl(PROC_SPROTECT) to ensure that some processes are immune. However, a panic approach is still useful in some environments. This is primarily intended as a development/debugging tool. Differential Revision: D1627 Reviewed by: kib MFC after: 1 week	2015-01-24 17:32:45 +00:00
Ryan Stone	423521aa33	vmspace_release() may sleep if the last reference is being released, so add a WITNESS_WARN() to catch cases where it is called with a non-sleepable lock held. MFC after: 1 month Sponsored by: Sandvine Inc.	2015-01-24 16:59:38 +00:00
Konstantin Belousov	71943c3d35	Avoid calling vmspace_free() while owning the process lock. Freeing of a vm space may require obtaining sleepable locks. Hold the process to keep the pointer valid, and change trylock to lock, since there are no longer two process locks owned simultaneously in vm_pageout_oom(). Note that after the process lock is dropped, process might exec, and no longer qualify as the owner of biggest vm space. In collaboration with: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-24 15:33:42 +00:00
Alan Cox	5268042bbd	Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. 
Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measurable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks	2015-01-16 18:17:09 +00:00
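Note on 5268042bbd: the 32-page example in the message can be replayed with a small simulation. This is a simplified sketch, not the fault handler's logic. It assumes the old strategy reads 8 pages behind and 7 ahead of the faulting page, truncated at the object boundaries, and that the new strategy reads the 16-page cluster whose start is a multiple of the cluster size. The function name `count_faults` is hypothetical.

```c
#include <stdbool.h>

#define NPAGES	32			/* pages in the code section */
#define BEHIND	8			/* pages read behind the fault */
#define AHEAD	7			/* pages read ahead of the fault */
#define CLUSTER	(BEHIND + 1 + AHEAD)	/* full cluster: 16 pages */

/*
 * Replay a sequence of page accesses and count the hard faults, each of
 * which stands for one I/O operation. With aligned == false the cluster
 * is centered on the faulting page (old default); with aligned == true
 * the cluster starts at a multiple of CLUSTER (revised default).
 */
static int
count_faults(const int *accesses, int n, bool aligned)
{
	bool resident[NPAGES] = { false };
	int faults = 0;

	for (int i = 0; i < n; i++) {
		int page = accesses[i];
		int first, last;

		if (resident[page])
			continue;	/* already read in by an earlier fault */
		faults++;
		if (aligned) {
			first = page - page % CLUSTER;
			last = first + CLUSTER - 1;
		} else {
			first = page - BEHIND;
			last = page + AHEAD;
		}
		/* Truncate the cluster at the object boundaries. */
		if (first < 0)
			first = 0;
		if (last >= NPAGES)
			last = NPAGES - 1;
		for (int p = first; p <= last; p++)
			resident[p] = true;
	}
	return (faults);
}
```

Under these assumptions the access order 4, 28, 16 costs three faults with the centered cluster and two with the aligned cluster, while 8, 24, 16 costs two faults either way, matching the counts given in the message.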
Konstantin Belousov	18cc2ff047	Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem does not access kernel mappings directly. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 08:58:07 +00:00
Alan Cox	67c44fa359	Eliminate a stale debug message. The per-CPU cache locks were replaced by critical sections in r145686. PR: 193254 Submitted by: luke.tw@gmail.com MFC after: 3 days	2014-12-31 17:44:57 +00:00
Alan Cox	d866a563d4	The physical memory allocator supports the use of distinct free lists for managing pages from different address ranges. Generally speaking, this feature is used to increase the likelihood that physical pages are available that can meet special DMA requirements or can be accessed through a limited-coverage direct mapping (e.g., MIPS). However, prior to this change, the configuration of the free lists was static, i.e., it was determined at compile time. Consequently, free lists could be created for address ranges that held no actual pages, for example, on 32-bit MIPS-based systems with 512 MB or less of physical memory. This change makes the creation of the free lists dynamic, i.e., it is based on the available physical memory at boot time. On 64-bit x86-based systems with 64 GB or more of physical memory, create free lists for managing pages with physical addresses below 4 GB. This change is to address reported problems with initializing devices that require the allocation of physical pages below 4 GB on some systems with 128 GB or more of physical memory. PR: 185727 Differential Revision: https://reviews.freebsd.org/D1274 Reviewed by: jhb, kib MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-12-31 00:54:38 +00:00
Gleb Smirnoff	e3ed82bcf7	Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and allows the function to fail. Reviewed by: kib, alc Sponsored by: Nginx, Inc.	2014-12-22 09:02:21 +00:00
Gleb Smirnoff	6ee80f259c	Do not clear flag that vm_page_alloc() doesn't support. Submitted by: kib	2014-12-22 09:00:47 +00:00
Gleb Smirnoff	89fc8bdbb6	Document flags of vm_page allocation functions. Reviewed by: alc	2014-12-22 08:59:44 +00:00
John Baldwin	01ca58b23c	Always ignore the deprecated MAP_RENAME and MAP_NORESERVE flags to mmap(). Some old libraries may be used even with newer binaries (specifically the Nvidia driver libraries). Differential Revision: https://reviews.freebsd.org/D1262 Reviewed by: kib	2014-12-05 15:24:42 +00:00
Konstantin Belousov	30d57414a0	When the last reference on the vnode's vm object is dropped, read the vp->v_vflag without taking vnode lock and without bypass. We do know that vp is the lowest level in the stack, since the pointer is obtained from the object's handle. Stale VV_TEXT flag read can only happen if parallel execve() is performed and not yet activated the image, since process takes reference for text mapping. In this case, the execve() code manages the VV_TEXT flag on its own already. It was observed that otherwise read-only sendfile(2) requires exclusive vnode lock and contending on it on some loads for VV_TEXT handling. Reported by: glebius, scottl Tested by: glebius, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-05 15:02:30 +00:00
Konstantin Belousov	95c4bf756a	Provide mutual exclusion between zone allocation/destruction and uma_reclaim(). Reclamation code must not see half-constructed or destructed zones. Do this by bracing uma_zcreate() and uma_zdestroy() into a shared-locked sx, and take the sx exclusively in uma_reclaim(). Usually zones are not created/destroyed during the system operation, but tmpfs mounts do cause zone operations and exposed the bug. Another solution could be to only expose a new keg on uma_kegs list after the corresponding zone is fully constructed, and similar treatment for the destruction. But it probably requires more risky code rearrangement as well. Reported and tested by: pho Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-11-30 20:20:55 +00:00
Gleb Smirnoff	1bb5ad634e	We already have "int i" in this scope. Submitted by: alc	2014-11-24 07:57:20 +00:00
Gleb Smirnoff	d932810143	\n at end of panicstr is redundant. Submitted by: alc	2014-11-23 18:32:21 +00:00
Gleb Smirnoff	90effb2341	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
Alan Cox	09e5f3c4b8	By the time that vm_reserv_init() runs, vm_phys_segs[] is initialized. Use it instead of phys_avail[]. Discussed with: Svatopluk Kraus	2014-11-22 17:46:30 +00:00
Gleb Smirnoff	79f0deb938	Use __func__ in KASSERTs, since the code is about to be moved to other place. Sponsored by: Nginx, Inc.	2014-11-19 16:29:39 +00:00
Gleb Smirnoff	2a5eef69a6	In vnode_pager_generic_getpages() vp->v_mount is dereferenced in the beginning, thus can't be NULL. Sponsored by: Nginx, Inc.	2014-11-19 15:17:19 +00:00
Gleb Smirnoff	e122dfc1ce	Collapse three contiguous comment blocks into one. Remove historical note about wrong assumptions 20 years ago. Use proper casing. Sponsored by: Nginx, Inc.	2014-11-18 13:38:07 +00:00
Alan Cox	271f0f1219	Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and i386 because vm_page_startup() would not create vm_page structures for the kernel page table pages allocated during pmap_bootstrap() but those vm_page structures are needed when the kernel attempts to promote the corresponding kernel virtual addresses to superpage mappings. To address this problem, a new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is updated to reflect the creation of vm_phys_seg structures by calls to vm_phys_add_seg(). Discussed with: Svatopluk Kraus MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-11-15 23:40:44 +00:00
Gleb Smirnoff	e1a4454553	Even better indent struct pagerops.	2014-11-14 18:15:35 +00:00
Gleb Smirnoff	5536922ec0	Consistently indent struct pagerops.	2014-11-14 18:00:00 +00:00
Konstantin Belousov	e065e87c1e	Fix mis-spelling of bits and types names in the default_pager_putpages() and swap_pager_putpages(). It is the same fix as was done for vnode_pager_putpages() in r271586. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-04 19:56:04 +00:00
Alan Cox	5e929009d2	Eliminate a stale, i386-specific comment.	2014-11-04 18:52:59 +00:00
Mark Murray	10cb24248a	This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random. This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware source any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources. The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people. The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors, now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway. Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to. My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise. My Nomex pants are on. Let the feedback commence! Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?) Approved by: so(des)	2014-10-30 21:21:53 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, since there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
John Baldwin	5817298f31	Retire the unimplemented MAP_RENAME and MAP_NORESERVE flags to mmap(2). Older binaries are still permitted to use these flags. PR: 193961 (exp-run in ports) Differential Revision: https://reviews.freebsd.org/D848 Reviewed by: kib	2014-10-18 12:28:51 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_getenv()/kern_setenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Konstantin Belousov	a36f55322c	Make MAP_NOSYNC handling in the vm_fault() read-locked object path compatible with write-locked path. Test for MAP_ENTRY_NOSYNC and set VPO_NOSYNC for pages with dirty mask zero (this does not exclude a possibility that the page is dirty, e.g. due to read fault on writeable mapping and consequent write; the same issue exists in the slow path). Use helper vm_fault_dirty() to unify fast and slow path handling of VPO_NOSYNC and setting the dirty mask. Reviewed by: alc Sponsored by: The FreeBSD Foundation	2014-10-10 19:27:36 +00:00
Bryan Venteicher	111fbcd5ed	Change the UMA mutex into a rwlock Acquire the lock in read mode when just needed to ensure the stability of the keg list. The UMA lock may be held for a long time (relatively speaking) in uma_reclaim() on machines with lots of zones/kegs. If the uma_timeout() would fire during that period, subsequent callouts on that CPU may be significantly delayed. Reviewed by: jhb	2014-10-05 21:34:56 +00:00
Bryan Venteicher	6e5254e0d7	Remove stray uma_mtx lock/unlock in zone_drain_wait() Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not) the uma_mtx, but we would attempt to unlock and relock the mutex if we had to sleep because the zone was already draining. The M_NOWAIT callers may hold the uma_mtx, but we do not sleep in that case. Reviewed by: jhb MFC after: 3 days	2014-10-05 03:18:30 +00:00
Konstantin Belousov	b76278407d	Add kernel option KSTACK_USAGE_PROF to sample the stack depth on interrupts and report the largest value seen as sysctl debug.max_kstack_used. Useful to estimate how close the kernel stack size is to overflow. In collaboration with: Larry Baird <lab@gta.com> Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2014-10-04 18:38:14 +00:00
Steven Hartland	14a0d74ea8	Refactor ZFS ARC reclaim checks and limits Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. This restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte and page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay	2014-10-03 20:34:55 +00:00
Steven Hartland	f721133eb9	Fix ticks wrap issue of lowmem test in vm_pageout_scan Reviewed by: jhb (D818) MFC after: 3 days Sponsored by: Multiplay	2014-09-24 14:35:08 +00:00
Konstantin Belousov	54432196db	vm_map_pmap_enter() and pmap_enter_object() are currently not aware of the wired attribute of the mapping. As result, some pmap implementations clear the wired state of the page table entries, which breaks invariants and allows the entries to be lost. Avoid calling vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the pages must be already mapped. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-09-23 18:54:23 +00:00
Konstantin Belousov	10204535af	The vm_mmap_cdev() explicitly converts absence of both MAP_SHARED and MAP_PRIVATE flags to MAP_SHARED. Apparently, some code in tree, in particular, libgeom, relied on this behaviour, see r271721. For regular file types, the absence of the flags is interpreted as MAP_PRIVATE, and libc nlist used this (fixed in r271723). Allow the implicit flags for legacy binaries. Bump __FreeBSD_version to get the ABI note on new binaries to check for in mmap code. Remove the test for presence of one of the MAP_ANON, MAP_SHARED or MAP_PRIVATE flags before fget_mmap(). For MAP_ANON, we already verify that passed fd == -1. For fd != -1, test after fget_mmap() (for newer binaries) covers the case. Reported by: bdrewery, pho Reviewed by: jhb Sponsored by: The FreeBSD Foundation	2014-09-17 21:04:50 +00:00
John Baldwin	8bafac5444	Permit MAP_RENAME and MAP_NORESERVE for now. These flags should be removed, but at least Chromium and OpenJDK use MAP_NORESERVE.	2014-09-16 17:21:06 +00:00
John Baldwin	5fd3f8b3b6	Add stricter checking of some mmap() arguments: - Fail with EINVAL if an invalid protection mask is passed to mmap(). - Fail with EINVAL if an unknown flag is passed to mmap(). - Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap(). - Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous mappings. Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D698	2014-09-15 17:20:13 +00:00
Alan Cox	a7fecb4d3a	Three improvements to vnode_pager_generic_getpages(): Eliminate an exclusive object lock acquisition and release on the expected execution path. Do page zeroing before the object lock is acquired rather than during the time that the object lock is held. Use vm_pager_free_nonreq() to eliminate duplicated code. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-15 17:14:09 +00:00
Gleb Smirnoff	be58a555d2	Remove redundant declaration. vnode.h should be included before vnode_pager.h.	2014-09-15 15:49:29 +00:00
Konstantin Belousov	d15b55c554	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
Alan Cox	396b3e34b4	Avoid an exclusive acquisition of the object lock on the expected execution path through the NFS clients' getpages functions. Introduce vm_pager_free_nonreq(). This function can be used to eliminate code that is duplicated in many getpages functions. Also, in contrast to the code that currently appears in those getpages functions, vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one case. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-14 18:07:55 +00:00
Konstantin Belousov	33cad9e936	Fix mis-spelling of bits and types names in the vnode_pager_putpages(). The changes should not modify the generated code. The pager->pgo_putpages() method takes int flags as its fourth argument, while vnode_pager_putpages() used boolean_t (which is typedef'ed to int). The flags are from VM_PAGER_* namespace, while vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(), which both are numerically equal to VM_PAGER_PUT_SYNC. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-14 10:27:36 +00:00
Alan Cox	81a065058c	Update a stale comment.	2014-09-11 03:16:57 +00:00
Gleb Smirnoff	27ad26d8c7	Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().	2014-09-10 12:36:41 +00:00
Alan Cox	64f096eeb2	Fix a boundary case error in vm_reserv_alloc_contig(): If a reservation isn't being allocated for the last of the requested pages, because a reservation won't fit in the gap between allocated pages, then the reservation structure shouldn't be initialized. While I'm here, improve the nearby comments. Reported by: jeff, pho MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-09-10 05:52:30 +00:00
Alan Cox	0afcd3af8b	Oops. vm_map_simplify_entry() is used by mac_proc_vm_revoke_recurse(), so it can't be static.	2014-09-08 02:25:01 +00:00
Alan Cox	077ec27cd6	Make two functions static and eliminate an unused #define.	2014-09-08 00:19:03 +00:00
John Baldwin	1a83a822d2	Fix a typo.	2014-08-29 21:20:36 +00:00
Steven Hartland	4d19f4ad1f	Refactor ZFS ARC reclaim logic to be more VM cooperative Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead to large amounts of unused RAM e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay	2014-08-28 19:50:08 +00:00
Alan Cox	9452b5eda9	Back in the days when the kernel was single threaded, testing "vm_paging_target() > 0" was a reasonable way of determining if the inactive queue scan met its target. However, now that other threads can be allocating pages while the inactive queue scan is running, it's an unreliable method. The effect of it being unreliable is that we can start swapping out processes when we didn't intend to. This issue has existed since the kernel was multithreaded, but the changes to the inactive queue target in 10.0-RELEASE have made its effects visible. This change introduces a more direct method for determining if the inactive queue scan met its target that is not affected by the actions of other threads. Reported by: Steve Polyack Tested by: pho, Steve Polyack (an earlier version) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-08-26 16:40:20 +00:00
Alan Cox	b9ce8cc2d7	Relax one of the conditions for mapping a page on the fast path. Reviewed by: kib X-MFC with: r270011 Sponsored by: EMC / Isilon Storage Division	2014-08-23 05:24:31 +00:00
Konstantin Belousov	afe55ca373	Implement 'fast path' for the vm page fault handler. Or, it could be called a scalable path. When several preconditions hold, the vm object lock for the object containing the faulted page is taken in read mode, instead of write, which allows parallel faults processing in the region. Namely, the fast path is taken when the faulted page already exists and does not need copy on write, is already fully valid, and not busy. For technical reasons, fast path is avoided when the fault is the first write on the vnode object, or when the fault is for wiring or debugger read or write. On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag, since object lock is kept. Pmap might fail to create the entry, in which case the fallback to slow path is performed. Reviewed by: alc Tested by: pho (previous version) Hardware provided and hosted by: The FreeBSD Foundation and Sentex Data Communications Sponsored by: The FreeBSD Foundation MFC after: 2 week	2014-08-15 07:30:14 +00:00
Alan Cox	9f746b66df	Avoid pointless (but harmless) actions on unmanaged pages. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-08-14 15:46:15 +00:00
Konstantin Belousov	70978c93b8	If vm_page_grab() allocates a new page, the page is not inserted into page queue even when the allocation is not wired. It is responsibility of the vm_page_grab() caller to ensure that the page does not end on the vm_object queue but not on the pagedaemon queue, which would effectively create unpageable unwired page. In exec_map_first_page() and vm_imgact_hold_page(), activate the page immediately after unbusying it, to avoid leak. In the uiomove_object_page(), deactivate page before the object is unlocked. There is no leak, since the page is deactivated after uiomove_fromphys() finished. But allowing non-queued non-wired page in the unlocked object queue makes it impossible to assert that leak does not happen in other places. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-13 05:44:08 +00:00
Konstantin Belousov	afb69e6b3e	Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set when either the page is busied, or the owner object is locked. Update comments, move all assertions about page state when PGA_WRITEABLE flag is set, into new helper vm_page_assert_pga_writeable(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-09 05:00:34 +00:00
Konstantin Belousov	39ffa8c138	Change pmap_enter(9) interface to take flags parameter and superpage mapping size (currently unused). The flags includes the fault access bits, wired flag as PMAP_ENTER_WIRED, and a new flag PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep. For powerpc aim both 32 and 64 bit, fix implementation to ensure that the requested mapping is created when PMAP_ENTER_NOSLEEP is not specified, in particular, wait for the available memory required to proceed. In collaboration with: alc Tested by: nwhitehorn (ppc aim32 and booke) Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division MFC after: 2 weeks	2014-08-08 17:12:03 +00:00
Konstantin Belousov	385b4265fc	The vm_pager_page_unswapped() pager op is only implemented for the swap pager. Swap pager uses a private mutex to protect swap metadata, and does not rely on the vm object lock to ensure integrity of it. Weaken the requirement for the vm object lock by only asserting locked object in vm_pager_page_unswapped(), instead of locked exclusively. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-06 19:34:03 +00:00
Konstantin Belousov	faaf544760	Add wrappers to assert that vm object is unlocked and for try upgrade. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-06 19:30:35 +00:00
Roger Pau Monné	5ebe728d53	vm_phys: improve robustness of fictitious ranges With the current implementation of managed fictitious ranges when also using VM_PHYSSEG_DENSE, a user could try to register a fictitious range that starts inside of vm_page_array, but then overruns it (because the end of the fictitious range is greater than vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE returning unallocated pages from past the end of vm_page_array. The same could happen if a user tried to register a segment that starts outside of vm_page_array but ends inside of it. In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to use a set of pages from vm_page_array, and allocate the rest. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc vm/vm_phys.c: - Allow registering/unregistering fictitious ranges that overrun vm_page_array.	2014-08-05 10:29:01 +00:00
Alan Cox	a695d9b25b	Retire pmap_change_wiring(). We have never used it to wire virtual pages. We continue to use pmap_enter() for that. For unwiring virtual pages, we now use pmap_unwire(), which unwires a range of virtual addresses instead of a single virtual page. Sponsored by: EMC / Isilon Storage Division	2014-08-03 20:40:51 +00:00
Alan Cox	0b69568411	Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable "rv" is uninitialized. Reported by: bz	2014-08-02 17:58:20 +00:00
Alan Cox	66cd575b28	Handle wiring failures in vm_map_wire() with the new functions pmap_unwire() and vm_object_unwire(). Retire vm_fault_{un,}wire(), since they are no longer used. (See r268327 and r269134 for the motivation behind this change.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-08-02 16:10:24 +00:00
Alan Cox	0346250941	When unwiring a region of an address space, do not assume that the underlying physical pages are mapped by the pmap. If, for example, the application has performed an mprotect(..., PROT_NONE) on any part of the wired region, then those pages will no longer be mapped by the pmap. So, using the pmap to lookup the wired pages in order to unwire them doesn't always work, and when it doesn't work wired pages are leaked. To avoid the leak, introduce and use a new function vm_object_unwire() that locates the wired pages by traversing the object and its backing objects. At the same time, switch from using pmap_change_wiring() to the recently introduced function pmap_unwire() for unwiring the region's mappings. pmap_unwire() is faster, because it operates on a range of virtual addresses rather than a single virtual page at a time. Moreover, by operating on a range, it is superpage friendly. It doesn't waste time performing unnecessary demotions. Reported by: markj Reviewed by: kib Tested by: pho, jmg (arm) Sponsored by: EMC / Isilon Storage Division	2014-07-26 18:10:18 +00:00
Konstantin Belousov	4bace8e721	Correct assertion. The shadowing object cannot be tmpfs vm object, and tmpfs object cannot shadow. In other words, tmpfs vm object is always at the bottom of the shadow chain. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-24 10:25:42 +00:00
Konstantin Belousov	f08f7dca40	The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs vnode for the tmpfs node owning this object. The flag is currently used for two purposes. First, it allows to correctly handle VV_TEXT for tmpfs vnode when the ref count on the object is decremented to 1, similar to vnode_pager_dealloc() for regular filesystems. Second, it prevents some operations, which are done on OBJT_SWAP vm objects backing user anonymous memory, but are incorrect for the object owned by tmpfs node. The second kind of use of the OBJ_TMPFS flag is incorrect, since the vnode might be reclaimed, which clears the flag, but vm object operations must still be disallowed. Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test whether vm object collapse and similar actions should be disabled. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:30:37 +00:00
Roger Pau Monné	38d6b2dcb2	vm_phys: remove limitation on number of fictitious regions The number of vm fictitious regions was limited to 8 by default, but Xen will make heavy usage of those kind of regions in order to map memory from foreign domains, so instead of increasing the default number, change the implementation to use a red-black tree to track vm fictitious ranges. The public interface remains the same. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc Approved by: gibbs vm/vm_phys.c: - Replace the vm fictitious static array with a red-black tree. - Use a rwlock instead of a mutex, since now we also need to take the lock in vm_phys_fictitious_to_vm_page, and it can be shared.	2014-07-09 08:12:58 +00:00
Marcel Moolenaar	e7d939bda2	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan	2014-07-07 00:27:09 +00:00
Alan Cox	09132ba6ac	Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are several reasons for this change: pmap_change_wiring() has never (in my memory) been used to set the wired attribute on a virtual page. We have always used pmap_enter() to do that. Moreover, it is not really safe to use pmap_change_wiring() to set the wired attribute on a virtual page. The description of pmap_change_wiring() says that it assumes the existence of a mapping in the pmap. However, non-wired mappings may be reclaimed by the pmap at any time. (See pmap_collect().) Many implementations of pmap_change_wiring() will crash if the mapping does not exist. pmap_unwire() accepts a range of virtual addresses, whereas pmap_change_wiring() acts upon a single virtual page. Since we are typically unwiring a range of virtual addresses, pmap_unwire() will be more efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings. Previously, we were forced to demote the superpage mapping, because pmap_change_wiring() only allowed us to express the unwiring of a single base page mapping at a time. This added to the overhead of unwiring for large ranges of addresses, including the implicit unwiring that occurs at process termination. Implementations for arm and powerpc will follow. Discussed with: jeff, marcel Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-07-06 17:42:38 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating children of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, since some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading kludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, since malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Alan Cox	60169c88d9	Delay the call to crhold() in vm_map_insert() until we know that we won't have to undo it by calling crfree(). This reduces the total number of calls by vm_map_insert() to crhold() and crfree() by 45% in my tests. Eliminate an unnecessary variable from vm_map_insert(). Reviewed by: kib Tested by: pho	2014-06-26 16:04:03 +00:00
Alan Cox	eaaf9f7fce	Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries that it creates (r267645), we can place the check that blocks map entry coalescing on stack entries in vm_map_simplify_entry() where it properly belongs. Reviewed by: kib	2014-06-25 03:30:03 +00:00
Konstantin Belousov	b5f8c226ab	Use correct names for the flags. MAP_ENTRY_GROWS_* have the same numerical values as MAP_STACK_GROWS_*, but the former is for entries' eflags, while the latter is for the cow argument of vm_map_insert(). Submitted by: alc	2014-06-23 07:03:47 +00:00
Konstantin Belousov	5831f5fc52	Assert that the new entry is inserted into the right location in the map entries list, and that it does not overlap with the previous and next entries. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-20 07:01:53 +00:00
Alan Cox	39c18ce157	Eliminate a pointless call to vm_map_clip_start() from vm_map_growstack(). For this call to do anything at all we would have to have two overlapping map entries. Submitted by: kib	2014-06-19 21:05:07 +00:00
Alan Cox	712efe66e2	When MAP_STACK_GROWS_{DOWN,UP} are passed to vm_map_insert() set the corresponding flag(s) in the new map entry. Previously, the caller was responsible for setting them after vm_map_insert() returned. Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when extending the stack in the downward direction. Together these changes slightly simplify the caller's task when creating a downward growing stack. In particular, the caller no longer needs to clip the previous entry, because the new stack entry can't possibly coalesce with the previous entry. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-19 16:26:16 +00:00
Konstantin Belousov	11c42bcc54	Add MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED, and prevents the request from deleting existing mappings in the region, failing instead. Reviewed by: alc Discussed with: jhb Tested by: markj, pho (previous version, as part of the bigger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-19 05:00:39 +00:00
Attilio Rao	3ae10f7477	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
Alan Cox	33314db034	Tidy up the early parts of vm_map_insert(), in particular, simplify one of the assertions and eliminate a comment that has grown stale. Reviewed by: kib MFC after: 1 week	2014-06-16 16:37:41 +00:00
Alan Cox	e1f92ccc73	One of the intentions behind r267254 was that the global variable "sgrowsiz" would be read once and cached in a local variable so that the resource limit check and map entry insertion would be guaranteed to use the same value. However, the value being passed to vm_map_insert() is still from "sgrowsiz" and not the local variable. Correct this oversight. Reviewed by: kib	2014-06-15 07:52:59 +00:00
Alexander Motin	1aa6c75827	Introduce new "256 Bucket" zone to split requests and reduce congestion on "128 Bucket" zone lock. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 11:57:07 +00:00
Alexander Motin	20d3ab87cd	When allocating a new bucket for a bucket zone, never take it from the zone itself, since it will almost certainly fail. Take the next bigger zone instead. This situation should not happen with original bucket zones configuration: "32 Bucket" zone uses "64 Bucket" and vice versa. But if "64 Bucket" zone lock is congested, zone may grow its bucket size and start biting itself. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 11:36:22 +00:00
Alan Cox	3180f7573a	Correct a bug in the management of the population map on big-endian machines. Specifically, there was a mismatch between how the routine allocation and deallocation operations accessed the population map and how the aggressively optimized reservation-breaking operation accessed it. So, problems only occurred when reservations were broken. This change makes the routine operations access the population map in the same way as the reservation breaking operation. This bug was introduced in r259999. PR: 187080 Tested by: jmg (on an "armeb" machine) Sponsored by: EMC / Isilon Storage Division	2014-06-11 16:11:12 +00:00
Konstantin Belousov	4648ba0a0f	Make mmap(MAP_STACK) search for the available address space, similar to !MAP_STACK mapping requests. For MAP_STACK | MAP_FIXED, clear any mappings which could previously exist in the used range. For this, teach vm_map_find() and vm_map_fixed() to handle MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new vm_map_stack_locked() helper, which is factored out from vm_map_stack(). The side effect of the change is that MAP_STACK started obeying MAP_ALIGNMENT and MAP_32BIT flags. Reported by: rwatson Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-06-09 03:37:41 +00:00
Alan Cox	dd05fa1945	Add a page size field to struct vm_page. Increase the page size field when a partially populated reservation becomes fully populated, and decrease this field when a fully populated reservation becomes partially populated. Use this field to simplify the implementation of pmap_enter_object() on amd64, arm, and i386. On all architectures where we support superpages, the cost of creating a superpage mapping is roughly the same as creating a base page mapping. For example, both kinds of mappings entail the creation of a single PTE and PV entry. With this in mind, use the page size field to make the implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to vm_map_pmap_enter(), that function would only map base pages. Now, it will create up to 96 base page or superpage mappings. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-07 17:12:26 +00:00
Konstantin Belousov	5930251a9d	Remove the assert which can be triggered by the userspace. The situation checked by assert is verified to not take place in vm_map_wire(), and protection permissions on the wired entry can be revoked afterward. Reported by: markj Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-28 00:45:35 +00:00
Alan Cox	fa2f411c4e	There is no reason to perform the pmap_remove() on the kernel pmap while the kmem object lock is held. Do the pmap_remove() before acquiring the kmem object lock. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-23 16:22:36 +00:00
Konstantin Belousov	2602a2ea88	Remove redundant loop. The inner goto restarts the whole page handling in the situation identical to the loop condition. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-05-21 08:19:04 +00:00
Konstantin Belousov	7032434e98	When exec_new_vmspace() decides that current vmspace cannot be reused on execve(2), it calls vmspace_exec(), which frees the current vmspace. The thread executing an exec syscall gets new vmspace assigned, and old vmspace is freed if only referenced by the current process. The free operation includes pmap_release(), which de-constructs the paging structures used by hardware. If the calling process is multithreaded, other threads are suspended in the thread_suspend_check(), and need to be unsuspended and run to be able to exit on successful exec. Now, since the old vmspace is destroyed, paging structures are invalid, threads are resumed on the non-existent pmaps (page tables), which leads to triple fault on x86. To fix, postpone the free of old vmspace until the threads are resumed and exited. To avoid modifications to all image activators all of which use exec_new_vmspace(), memoize the current (old) vmspace in kern_execve(), and notify it about the need to call vmspace_free() with a thread-private flag TDP_EXECVMSPC. http://bugs.debian.org/743141 Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-05-20 09:19:35 +00:00
Alan Cox	afaa41f6b8	On a fork allow read-only wired pages to be copy-on-write shared between the parent and child processes. Previously, we copied these pages even though they are read only. However, the reason for copying them is historical and no longer exists. In recent times, vm_map_protect() has developed the ability to copy pages when write access is added to wired copy-on-write pages. So, in this case, copy-on-write sharing of wired pages is not to be feared. It is not going to lead to copy-on-write faults on wired memory. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-13 13:20:23 +00:00
Konstantin Belousov	c8f780e3d6	Fix locking. The dst_object must remain locked on the retry of the loop iteration. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 6 days	2014-05-11 18:07:07 +00:00
Alan Cox	dd006a1b14	With the new-and-improved vm_fault_copy_entry() (r265843), we can always avoid soft page faults when adding write access to user wired entries in vm_map_protect(). Previously, we only avoided the soft page fault when the underlying pages were copy-on-write. In other words, we avoided the page faults that might sleep on page allocation, but not the trivial page faults to update the physical map. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-11 17:41:29 +00:00
Alan Cox	d9a9209abe	About 9% of the pmap_protect() calls being performed by vm_map_copy_entry() are unnecessary. Eliminate the unnecessary calls. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-10 19:47:00 +00:00
Konstantin Belousov	0973283d6e	For the upgrade case in vm_fault_copy_entry(), when the entry does not need COW and is writeable (i.e. becoming writeable due to the mprotect(2) operation), do not create a new backing object for the entry. The caller of the function is vm_map_protect(), the call is made to ensure that wired entry has all pages resident and wired in the top level object and to enable the write. We might need to copy read-only page from some backing objects into the top object or remap the page with the write allowed. This fixes the issue with mishandling of the swap accounting when read-only wired mapping is upgraded to write-enabled after fork. The previous code path did not account for the new object, but its creation is redundant anyway and the change provides an optimization for the non-common situation. Reported by: markj Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 17:03:33 +00:00
Konstantin Belousov	44bbc3b77d	When printing the map with the ddb 'show procvm' command, do not dump page queues for the backing objects. The queues are huge and clutter the display, when mostly the map entries and its backing storage is interesting. The page queues can be seen with ddb 'show object' command. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:36:13 +00:00
Konstantin Belousov	3d95614f9d	Print the entry address in addition to the object. The variable is typically optimized out and debuggers cannot find its value. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:30:48 +00:00
Peter Holm	e103f5b1c0	msync(2) must return ENOMEM and not EINVAL when the address is outside the allowed range or when one or more pages are not mapped. This according to The Open Group Base Specifications Issue 7. Discussed with: attilio, Bruce Evans Reviewed by: alc, Garrett Cooper Reported by: ATF MFC after: 2 weeks Sponsored by: EMC / Isilon storage division	2014-05-07 08:38:02 +00:00
Alan Cox	60196cda04	Prior to r254304, a separate function, vm_pageout_page_stats(), was used to periodically update the reference status of the active pages. This function was called, instead of vm_pageout_scan(), when memory was not scarce. The objective was to provide up to date reference status for active pages in case memory did become scarce and active pages needed to be deactivated. The active page queue scan performed by vm_pageout_page_stats() was virtually identical to that performed by vm_pageout_scan(), and so r254304 eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is called with the parameter "pass" set to zero. The intention was that when pass is zero, vm_pageout_scan() would only scan the active queue. However, the variable page_shortage can still be greater than zero when memory is not scarce and vm_pageout_scan() is called with pass equal to zero. Consequently, the inactive queue may be scanned and dirty pages laundered even though that was not intended by r254304. This revision fixes that. Reported by: avg MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-06 03:42:04 +00:00
Konstantin Belousov	a17937bdd0	For the VM_PHYSSEG_DENSE case, checking the requested range to fall into the area backed by vm_page_array wrongly compared end with vm_page_array_size. It should be adjusted by first_page index to be correct. Also, the corner and incorrect case of the requested range extending after the end of the vm_page_array was incorrectly handled by allocating the segment. Fix the comparison for the end of range and return EINVAL if the end extends beyond vm_page_array. Discussed with: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-29 18:42:37 +00:00
Konstantin Belousov	4c74acf76a	When vm_fault_copy_entry() is called from vm_map_protect() for a wired entry and performs the upgrade of the entry permissions from read-only to read-write, we must allow searching for the source pages in the backing object, like we do in the case of forking the read-only wired entry. For the fork case, the behaviour is allowed by the src_readonly boolean, which in fact is only used to assert that the read-write case provides all source pages in the top-level object. Eliminate the src_readonly variable. Allow the copy loop to look into the backing objects, and add explicit asserts to ensure that only the read-only and upgrade cases actually do. Expand comments. Change the panic call into an assert. Reported by: markj Tested by: markj, pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-27 05:19:01 +00:00
Dag-Erling Smørgrav	612032773a	Add sysctl OIDs showing the actual size and capacity of the swap zone. MFC after: 1 week	2014-04-26 12:18:17 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff, the struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	52f3c44efe	Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, accesses to the direct map region should check for the limit by which the direct map is instantiated. Second, for accesses to the kernel map, success returned from kernacc(9) does not guarantee that a subsequent attempt to read or write to the checked address succeeds, since another thread might invalidate the address in the meantime. Add a new thread private flag TDP_DEVMEMIO, which instructs vm_fault() to return an error when a fault happens on the MAP_ENTRY_NOFAULT entry, instead of panicking. The trap handler would then see a page fault from the access, and recover in the normal way, making /dev/mem access safer. Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed and having Giant locked does not solve issues for amd64. Note that at least the second issue exists on other architectures, and requires similar patching for md code. Reported and tested by: clusteradm (gjb, sbruno) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 14:25:09 +00:00
Konstantin Belousov	997ac6905f	Initialize vm_map_entry member wiring_thread on the map entry creation. This was missed in r253190. Reported by: hps, peter Tested by: hps Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-03-21 13:55:57 +00:00
Attilio Rao	0d8243cc34	vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock, then threads can sleep on the pip condition. Avoid deadlocking such threads by correctly awakening the sleeping ones after the pip is finished. The swapoff side of the bug can likely result in shutdown deadlocks. Sponsored by: EMC / Isilon Storage Division Reported by: pho, pluknet Tested by: pho	2014-03-19 01:13:42 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Konstantin Belousov	7253a5ec63	Initialize paddr to handle the case of zero size. Reported and reviewed by: Conrad Meyer <cemeyer@uw.edu> MFC after: 1 week	2014-03-12 16:38:55 +00:00


3439 Commits