freebsd-skq

Author	SHA1	Message	Date
markj	c78ab1ef94	Add a local variable initialization needed in the OBJT_DEFAULT case. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D2992	2015-07-05 22:26:19 +00:00
mjg	93aba404ec	vm: don't lock proc around accesses to vm_{t,d}addr and RLIMIT_DATA in sys_mmap vm_{t,d}addr are constant and we can use thread's copy of resource limits	2015-07-02 18:30:12 +00:00
kib	eee3eeb0fa	Account for the main process stack being one page below the highest user address when ABI uses shared page. Note that the change is no-op for correctness, since shared page does not fault. The mapping for the shared page is installed at the address space creation, the page is unmanaged and its pte/pv entry cannot be reclaimed. Submitted by: Oliver Pinter Review: https://reviews.freebsd.org/D2954 MFC after: 1 week	2015-07-02 15:22:13 +00:00
markm	d586165577	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. * src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct start_cache'. - Tighten up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
jmg	8df3676103	If INVARIANTS is specified, add ctor/dtor to junk memory if they are unspecified... Submitted by: Suresh Gumpula at Netapp Differential Revision: https://reviews.freebsd.org/D2725	2015-06-25 20:44:46 +00:00
alc	29dd35a1c6	Avoid pmap_is_modified() on pages that can't be mapped. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-06-21 01:22:35 +00:00
glebius	5b81a20433	o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async(). o Provide an extensive set of assertions for input array of pages. o Remove now duplicate assertions from different pagers. Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-06-17 22:44:27 +00:00
kib	2038792788	Invalid pages do not need an update of the activation count, nor could they be dirty. Move the handling of the invalid pages in the inactive scan earlier. Remove some code duplication in the scan by introducing the 'drop_page' label, which centralizes the object and the page unlock. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-14 20:23:41 +00:00
alc	927e89d882	As the next step in eliminating PG_CACHE pages, free rather than cache pages in vm_pageout_scan(). The reactivation rate of cache pages created by vm_pageout_scan() is extremely low; typically no more than 0.5% to 2.25% of the pages are ever reactivated. At the same time, caching pages is more expensive than freeing them. For example, in a test with PostgreSQL, this change reduced the amount of time spent in the inactive queue scan by 1/6. Differential Revision: https://reviews.freebsd.org/D2805 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-06-14 05:23:39 +00:00
glebius	519f1ccd36	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-06-12 11:32:20 +00:00
mjg	d7bc9285a6	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thread's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.	2015-06-10 10:48:12 +00:00
alc	73775f21e2	Correct a type error in kmem_unback(). Previously, kmem_unback() did not correctly handle deallocation requests of two or more gigabytes in size. Eventually, this would lead to a panic elsewhere in the kernel, such as "vm_radix_insert: key <vm_pindex_t> is already present". Reported by: Ilias Marinos MFC after: 1 week	2015-06-10 05:17:14 +00:00
alc	263927b83e	Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHE pages. Differential Revision: https://reviews.freebsd.org/D2712 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-06-08 04:59:32 +00:00
jhb	bba1e1e047	Add a new file operations hook for mmap operations. File type-specific logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptor's fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct. Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio	2015-06-04 19:41:15 +00:00
vangyzen	597cee37df	Provide vnode in memory map info for files on tmpfs When providing memory map information to userland, populate the vnode pointer for tmpfs files. Set the memory mapping to appear as a vnode type, to match FreeBSD 9 behavior. This fixes the use of tmpfs files with the dtrace pid provider, procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY). Submitted by: Eric Badger <eric@badgerio.us> (initial revision) Obtained from: Dell Inc. PR: 198431 MFC after: 2 weeks Reviewed by: jhb Approved by: kib (mentor)	2015-06-02 18:37:04 +00:00
alc	a7a4e853c7	Document vm_page_alloc_contig()'s support for the VM_ALLOC_NODUMP option. MFC after: 3 days	2015-05-30 23:37:47 +00:00
jhb	4a4be98eae	Export a list of VM objects in the system via a sysctl. The list can be examined via 'vmstat -o'. It can be used to determine which files are using physical pages of memory and how much each is using. Differential Revision: https://reviews.freebsd.org/D2277 Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Norse Corp, Inc. (forward porting to HEAD/10)	2015-05-27 18:11:05 +00:00
jkim	318c4f97e6	CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten years for head. However, it is continuously misused as the mpsafe argument for callout_init(9). Deprecate the flag and clean up callout_init() calls to make them more consistent. Differential Revision: https://reviews.freebsd.org/D2613 Reviewed by: jhb MFC after: 2 weeks	2015-05-22 17:05:21 +00:00
kib	3718535142	Do grammar fix in the comment to record the right commit message for r283162. Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq() with the page not completely satisfying vm_page_free() assertions. The page is not owned by the object, since insertion failed. But besides m->object reset to NULL, we should also set VPO_UNMANAGED flag for consistency. Reported by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-20 23:15:56 +00:00
kib	f37cc175ef	Remove the write-only variable phent. We currently do not check the size of the program header's entries. Reported by: adrian (by using gcc 4.9) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-20 23:03:22 +00:00
kib	cb622d9740	Satisfy vm_object uma zone destructor requirements after r282660 when vnode object creation raced. Reported by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2015-05-10 08:21:03 +00:00
kib	71cf7d735d	The vmem callback to reclaim kmem arena address space on low or fragmented conditions currently just wakes up the pagedaemon. The kmem arena is significantly smaller than the total available physical memory, which means that there are loads where kmem arena space could be exhausted, while there are a lot of pages available still. The woken up pagedaemon sees vm_pages_needed != 0, verifies the condition vm_paging_needed() which is false, clears the pass and returns back to sleep, calling neither uma_reclaim() nor the lowmem handler. To handle low kmem arena conditions, create an additional pagedaemon thread which calls uma_reclaim() directly. The thread sleeps on the dedicated channel and kmem_reclaim() wakes the thread in addition to the pagedaemon. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-09 20:08:36 +00:00
jhb	0bf260e595	Place VM objects on the object list when created and never remove them. This is ok since objects come from a NOFREE zone and allows objects to be locked while traversing the object list without triggering a LOR. Ensure that objects on the list are marked DEAD while free or stillborn, and that they have a refcount of zero. This required updating most of the pagers to explicitly mark an object as dead when deallocating it. (Only the vnode pager did this previously.) Differential Revision: https://reviews.freebsd.org/D2423 Reviewed by: alc, kib (earlier version) MFC after: 2 weeks Sponsored by: Norse Corp, Inc.	2015-05-08 19:43:37 +00:00
adrian	d1e90611f7	oops - how'd i miss this. Sorry!	2015-05-08 06:02:23 +00:00
adrian	2cad336903	Add initial memory locality cost awareness to the VM, and include a basic ACPI SLIT table parser. For now this just exports the map via sysctl; it'll eventually be useful to userland when there's more useful NUMA support in -HEAD. * Add an optional mem_locality map; * add a mapping function taking from/to domain and returning the relative cost, or -1 if it's not available; * Add a very basic SLIT parser to x86 ACPI. Differential Revision: https://reviews.freebsd.org/D2460 Reviewed by: rpaulo, stas, jhb Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)	2015-05-08 00:56:56 +00:00
glebius	a7d547aa97	Fix the KASSERT and improve wording in r282426. Submitted by: alc	2015-05-06 08:07:11 +00:00
glebius	acfa186500	Fix arithmetical bug in vnode_pager_haspage(). The check against object size should be done not with the number of pages in the first block, but with the overall number of pages. While here, add KASSERT that makes sure that BMAP doesn't return completely irrelevant blocks. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-05-04 18:49:25 +00:00
glebius	0a56d25a94	Instead of reading, validating and adjusting value of the vm.swap_async_max in the main swapper work cycle, do it in the sysctl handler. This removes extra mutex acquisition from the main cycle and makes the sysctl knob return error on an invalid value, instead of accepting and fixing it. Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-05-02 20:27:37 +00:00
jhb	9c4c8b62fb	Remove support for Xen PV domU kernels. Support for HVM domU kernels remains. Xen is planning to phase out support for PV upstream since it is harder to maintain and has more overhead. Modern x86 CPUs include virtualization extensions that support HVM guests instead of PV guests. In addition, the PV code was i386 only and not as well maintained recently as the HVM code. - Remove the i386-only NATIVE option that was used to disable certain components for PV kernels. These components are now standard as they are on amd64. - Remove !XENHVM bits from PV drivers. - Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3, etc.) - Remove duplicate copy of <xen/features.h>. - Remove unused, i386-only xenstored.h. Differential Revision: https://reviews.freebsd.org/D2362 Reviewed by: royger Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0) Relnotes: yes	2015-04-30 15:48:48 +00:00
scottl	cac1f63fc8	Improve support for blacklisting bad memory locations. The user can supply a text file with a list of physical memory addresses to exclude, and have it loaded at boot time via the provided example in loader.conf. The tunable 'vm.blacklist' remains, but using an external file means that there's no practical limit to the size of the list. This change also improves the scanning algorithm for processing the list, scanning the list only once instead of scanning it for every page in the system. Both the sysctl and the file can be unsorted and contain duplicates so long as each entry is numeric (decimal or hex) and is separated by a space, comma, or newline character. The sysctl 'vm.page_blacklist' is now provided to report what memory locations were successfully excluded. Reviewed by: imp, emax Obtained from: Netflix, Inc. MFC after: 3 days	2015-04-29 15:57:14 +00:00
trasz	802017a04b	Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation	2015-04-29 10:23:02 +00:00
kib	dfce070d88	Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with the vnode locked. Review: https://reviews.freebsd.org/D2381 Submitted by: Conrad Meyer, Attilio Rao MFC after: 1 week	2015-04-28 08:20:23 +00:00
scottl	ede23391e8	Revert r281451. It causes a panic/hang early in boot for a number of users, myself included. The original code is likely papering over a larger bug that needs to be explored, but for now get things back to a working state. Obtained from: Netflix, Inc. MFC after: immediately	2015-04-24 17:03:53 +00:00
jhb	e4683250d1	Reassign copyright statements on several files from Advanced Computing Technologies LLC to Hudson River Trading LLC. Approved by: Hudson River Trading LLC (who owns ACT LLC) MFC after: 1 week	2015-04-23 14:22:20 +00:00
alc	3b5965fb8f	Eliminate an unused variable. MFC after: 1 week	2015-04-20 16:48:21 +00:00
alc	d6d560db51	Eliminate an unused variable. MFC after: 1 week	2015-04-19 00:29:02 +00:00
kib	2254748ed0	The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x kernels which required padding before the off_t parameter. The fcntl(2) contains compatibility code to handle kernels before the struct flock was changed during the 8.x CURRENT development. The shims were reasonable to allow easier revert to the older kernel at that time. Now, two or three major releases later, shims do not serve any purpose. Such old kernels cannot handle current libc, so revert the compatibility code. Make padded syscalls support conditional under the COMPAT6 config option. For COMPAT32, the syscalls were under COMPAT6 already. Remove WITHOUT_SYSCALL_COMPAT build option, whose only purpose was to (partially) disable the removed shims. Reviewed by: jhb, imp (previous versions) Discussed with: peter Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-18 21:50:13 +00:00
dchagin	02e5fc3da7	Rework r281162. Indeed, the flexible array member is preferable here. Suggested by: Justin T. Gibbs MFC after: 3 days	2015-04-12 06:21:58 +00:00
alc	f116ec9943	Correct an off-by-one error in vm_reserv_reclaim_contig() that results in an infinite loop. Submitted by: Svatopluk Kraus MFC after: 1 week	2015-04-11 22:57:13 +00:00
glebius	662cdec164	UMA zone limit can be lowered, so remove protection against from the sysctl_handle_uma_zone_max(). Sponsored by: Nginx, Inc.	2015-04-10 06:56:49 +00:00
mav	2e38078077	Remove sleeps from geom_up thread on device destruction. MFC after: 3 days.	2015-04-09 13:09:05 +00:00
jeff	fa9eb8e1ea	- Simplify vm_pageout_scan() by introducing a new vm_pageout_clean() function that does the locking and validation associated with cleaning a page. This moves 150 lines of code into its own function. - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it really does; clustering nearby pages for pageout optimization. Reviewed by: alc, kib, kmacy Tested by: pho (earlier version) Sponsored by: EMC / Isilon	2015-04-07 02:18:52 +00:00
dchagin	fd38dba27d	Properly calculate "UMA Zones" per cpu cache size. Avoid allocating an extra struct uma_cache since the struct uma_zone already has one. PR: 199169 Submitted by: luke.tw gmail com MFC after: 1 week	2015-04-06 18:45:41 +00:00
alc	7028695829	Until the lock assertions in vm_page_advise() are properly reevaluated, vm_fault_dontneed() should acquire a write lock on the first object in the shadow chain. Reported by: gleb, David Wolfskill	2015-04-05 20:07:33 +00:00
dchagin	1e93730b8c	Fix wrong kassert msg in uma. PR: 199172 Submitted by: luke.tw gmail com MFC after: 1 week	2015-04-05 18:25:23 +00:00
alc	e8bbef8d0b	Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division	2015-04-04 19:10:22 +00:00
rstone	57feb6fb43	Fix integer truncation bug in malloc(9) A couple of internal functions used by malloc(9) and uma truncated a size_t down to an int. This could cause any number of issues (e.g. indefinite sleeps, memory corruption) if any kernel subsystem tried to allocate 2GB or more through malloc. zfs would attempt such an allocation when run on a system with 2TB or more of RAM. Note to self: When this is MFCed, sparc64 needs the same fix. Differential revision: https://reviews.freebsd.org/D2106 Reviewed by: kib Reported by: Michael Fuckner <michael@fuckner.net> Tested by: Michael Fuckner <michael@fuckner.net> MFC after: 2 weeks	2015-04-01 12:42:26 +00:00
glebius	e4390a8823	Catch up on r271387 and remove unused parameter from VOP_GETPAGES_ASYNC().	2015-03-30 22:49:26 +00:00
jeff	2ef1578319	- Eliminate pagequeue locking in the dirty code in vm_pageout_scan(). - Use a more precise series of tests to see if the page changed while we were locking the vnode. Reviewed by: alc Sponsored by: EMC / Isilon	2015-03-28 02:36:49 +00:00
mav	ee2fe1ad5c	Make swapper release orphaned (lost) GEOM provider. Swap device is still reported as enabled, and system still may crash later if some swapped-out kernel pages were lost with the device, but at least GEOM and CAM can now release the lost disk, allowing it to be reconnected. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2015-03-26 17:21:12 +00:00
rpaulo	4a05532670	Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH. Requested by: julian	2015-03-26 05:20:18 +00:00
rpaulo	2ef1cdce8e	Use TUNABLE_INT_FETCH for boot_pages. vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM before the sysctl subsystem is initialised. We manually fetch the variable from the environment to work around this problem. Tested by: Keith White kwhite at uottawa.ca MFC after: 1 week	2015-03-24 20:09:55 +00:00
rpaulo	5ab5a7c167	Remove whitespace.	2015-03-24 20:07:27 +00:00
alc	b131a2abc8	Introduce vm_object_color() and use it in mmap(2) to set the color of named objects to zero before the virtual address is selected. Previously, the color setting was delayed until after the virtual address was selected. In rtld, this delay effectively prevented the mapping of a shared library's code section using superpages. Now, for example, we see the first 1 MB of libc's code on armv6 mapped by a superpage after we've gotten through the initial cold misses that bring the first 1 MB of code into memory. (With the page clustering that we perform on read faults, this happens quickly.) Differential Revision: https://reviews.freebsd.org/D2013 Reviewed by: jhb, kib Tested by: Svatopluk Kraus (armv6) MFC after: 6 weeks	2015-03-21 17:56:55 +00:00
alc	2c4b57d486	Fix the root cause of the "vm_reserv_populate: reserv <address> is already promoted" panics. The sequence of events that leads to a panic is rather long and circuitous. First, suppose that process P has a promoted superpage S within vm object O that it can write to. Then, suppose that P forks, which leads to S being write protected. Now, before P's child exits, suppose that P writes to another virtual page within O. Since the pages within O are copy on write, a shadow object for O is created to house the new physical copy of the faulted on virtual page. Then, before P can fault on S, P's child exits. Now, when P faults on S, it will follow the "optimized" path for copy-on-write faults in vm_fault(), wherein the underlying physical page is moved from O to its shadow object rather than allocating a new page and copying the new page's contents from the old page. Moreover, suppose that every 4 KB physical page making up S is moved to the shadow object in this way. However, the optimized path does not move the underlying superpage reservation, which is the root cause of the panics! Ultimately, P performs vm_object_collapse() on O's shadow object, which destroys O and in doing so breaks any reservations still belonging to O. This leaves the reservation underlying S in an inconsistent state: It's simultaneously not in use and promoted. Breaking a reservation does not demote it because I never intended for a promoted reservation to be broken. It makes little sense. Finally, this inconsistency leads to an assertion failure the next time that the reservation is used. The failing assertion does not (currently) exist in FreeBSD 10.x or earlier. There, we will quietly break the promoted reservation. While illogical and unintended, breaking the reservation is essentially harmless. PR: 198163 Reviewed by: kib Tested by: pho X-MFC after: r267213 Sponsored by: EMC / Isilon Storage Division	2015-03-19 01:40:43 +00:00
glebius	398be53682	o Enhance vm_pager_free_nonreq() function: - Allow to call the function with vm object lock held. - Allow to specify reqpage that doesn't match any page in the region, meaning freeing all pages. o Utilize the new function in couple more places in vnode pager. Reviewed by: alc, kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-03-17 19:19:19 +00:00
glebius	df5d850742	Provide a comment explaining r279688. Suggested by: alc	2015-03-16 14:24:47 +00:00
ian	0dd684d23f	Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl strings returned to userland include the nulterm byte. Some uses of sbuf_new_for_sysctl() write binary data rather than strings; clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in those cases. (Note that the sbuf code still automatically adds a nulterm byte in sbuf_finish(), but since it's not included in the length it won't get copied to userland along with the binary data.) Remove explicit adding of a nulterm byte in a couple places now that it gets done automatically by the sbuf drain code. PR: 195668	2015-03-14 17:08:28 +00:00
ian	eae109babf	Revert r279932; this is going to be fixed in the sbuf code instead. PR: 195668	2015-03-14 13:00:37 +00:00
ian	037188bda9	Nullterminate strings returned via sysctl. PR: 195668	2015-03-12 18:06:30 +00:00
glebius	6bbfdd570a	Fix function name in comment.	2015-03-10 13:06:54 +00:00
kib	c539cecb43	Fix function name in the panic message. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-03-08 02:13:46 +00:00
alc	56c9a1b2f6	Correct a typo in vm_object_backing_scan() that originated in r254141. Specifically, change a lock acquire into a lock release. MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2015-03-07 04:18:40 +00:00
glebius	3a52ecfa66	- In vnode_pager_generic_getpages() use different free counters for synchronous and asynchronous requests. The latter can saturate the I/O and we do not want them to affect regular paging. - Allocate the pbuf at the very beginning of the function, so that if we are low on certain kind of pbufs don't even proceed to BMAP, but sleep. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-03-06 14:15:30 +00:00
alc	2ab42594ef	Use RW_NEW rather than calling bzero().	2015-03-01 05:18:02 +00:00
alc	37e48c6e3a	Eliminate a variable that became unused when VFS_LOCK_GIANT() was eliminated. MFC after: 3 days	2015-02-28 19:11:37 +00:00
ngie	d54589fec4	Some minor style(9) fixes (whitespace + comment) MFC after: 3 days	2015-02-17 08:50:26 +00:00
kib	19abfd4698	Update mtime for tmpfs files modified through memory mapping. Similar to UFS, perform updates during syncer scans, which in particular means that tmpfs now performs scan on sync. Also, this means that a mtime update may be delayed up to 30 seconds after the write. The vm_object's OBJ_TMPFS_DIRTY flag for the tmpfs swap object is similar to the OBJ_MIGHTBEDIRTY flag for the vnode object; it indicates that the object could have been dirtied. Adapt fast page fault handler and vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as OBJT_VNODE. Reported by: Ronald Klop <ronald-lists@klop.ws> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-28 10:37:23 +00:00
will	2e062600fe	Add vm.panic_on_oom sysctl, which enables those who would rather panic than kill a process, when the system runs out of memory. Defaults to off. Usually, this is most useful when the OOM condition is due to mismanagement of memory, on a system where the applications in question don't respond well to being killed. In theory, if the system is properly managed, it shouldn't be possible to hit this condition. If it does, the panic can be more desirable for some users (since it can be a good means of finding the root cause) rather than killing the largest process and continuing on its merry way. As kib@ mentions in the differential, there is also protect(1), which uses procctl(PROC_SPROTECT) to ensure that some processes are immune. However, a panic approach is still useful in some environments. This is primarily intended as a development/debugging tool. Differential Revision: D1627 Reviewed by: kib MFC after: 1 week	2015-01-24 17:32:45 +00:00
rstone	520ad84555	vmspace_release() may sleep if the last reference is being released, so add a WITNESS_WARN() to catch cases where it is called with a non-sleepable lock held. MFC after: 1 month Sponsored by: Sandvine Inc.	2015-01-24 16:59:38 +00:00
kib	7cbc6347a2	Avoid calling vmspace_free() while owning the process lock. Freeing a vm space may require obtaining sleepable locks. Hold the process to keep the pointer valid, and change trylock to lock, since there are no longer two process locks owned simultaneously in vm_pageout_oom(). Note that after the process lock is dropped, the process might exec and no longer qualify as the owner of the biggest vm space. In collaboration with: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-24 15:33:42 +00:00
alc	fb32c103c7	Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. 
Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measurable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks	2015-01-16 18:17:09 +00:00
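The 32-page example in the commit message above can be replayed with a small userland model. This is only a sketch under stated assumptions: the function names are hypothetical, the cluster size is taken as 16 pages (8 behind, the faulting page, 7 ahead, as the message describes), and none of the real fault handler's logic is reproduced.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_PAGES 16	/* 8 behind + the faulting page + 7 ahead */
#define READ_BEHIND    8
#define READ_AHEAD     7
#define MAX_PAGES     64	/* demo limit on the object size */

/* Old default: a window centered on the fault, truncated at the object bounds. */
void
cluster_centered(int fault, int npages, int *first, int *last)
{
	*first = fault - READ_BEHIND < 0 ? 0 : fault - READ_BEHIND;
	*last = fault + READ_AHEAD >= npages ? npages - 1 : fault + READ_AHEAD;
}

/* Revised default: the cluster starts at a multiple of the cluster size. */
void
cluster_aligned(int fault, int npages, int *first, int *last)
{
	*first = fault - fault % CLUSTER_PAGES;
	*last = *first + CLUSTER_PAGES - 1;
	if (*last >= npages)
		*last = npages - 1;
}

/* Count hard faults for an access sequence, starting with no resident pages. */
int
count_faults(const int *accesses, int n, int npages, bool aligned)
{
	bool resident[MAX_PAGES] = { false };
	int faults = 0;

	assert(npages > 0 && npages <= MAX_PAGES);
	for (int i = 0; i < n; i++) {
		int first, last;

		assert(accesses[i] >= 0 && accesses[i] < npages);
		if (resident[accesses[i]])
			continue;	/* already read in by an earlier cluster */
		faults++;
		if (aligned)
			cluster_aligned(accesses[i], npages, &first, &last);
		else
			cluster_centered(accesses[i], npages, &first, &last);
		for (int p = first; p <= last; p++)
			resident[p] = true;
	}
	return (faults);
}
```

With a 32-page object, the sequence 4, 28, 16 costs three faults in the centered model and two in the aligned one, while 8, 24, 16 costs two faults even in the centered model, which is the behaviour the message walks through.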
kib	79db3369f9	Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem does not access kernel mappings directly. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 08:58:07 +00:00
alc	b48d1f4410	Eliminate a stale debug message. The per-CPU cache locks were replaced by critical sections in r145686. PR: 193254 Submitted by: luke.tw@gmail.com MFC after: 3 days	2014-12-31 17:44:57 +00:00
alc	369e66acd7	The physical memory allocator supports the use of distinct free lists for managing pages from different address ranges. Generally speaking, this feature is used to increase the likelihood that physical pages are available that can meet special DMA requirements or can be accessed through a limited-coverage direct mapping (e.g., MIPS). However, prior to this change, the configuration of the free lists was static, i.e., it was determined at compile time. Consequently, free lists could be created for address ranges that held no actual pages, for example, on 32-bit MIPS-based systems with 512 MB or less of physical memory. This change makes the creation of the free lists dynamic, i.e., it is based on the available physical memory at boot time. On 64-bit x86-based systems with 64 GB or more of physical memory, create free lists for managing pages with physical addresses below 4 GB. This change is to address reported problems with initializing devices that require the allocation of physical pages below 4 GB on some systems with 128 GB or more of physical memory. PR: 185727 Differential Revision: https://reviews.freebsd.org/D1274 Reviewed by: jhb, kib MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-12-31 00:54:38 +00:00
glebius	2d07787e22	Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and allows the function to fail. Reviewed by: kib, alc Sponsored by: Nginx, Inc.	2014-12-22 09:02:21 +00:00
glebius	159bb47471	Do not clear flag that vm_page_alloc() doesn't support. Submitted by: kib	2014-12-22 09:00:47 +00:00
glebius	ccd01d8847	Document flags of vm_page allocation functions. Reviewed by: alc	2014-12-22 08:59:44 +00:00
jhb	d3e566151f	Always ignore the deprecated MAP_RENAME and MAP_NORESERVE flags to mmap(). Some old libraries may be used even with newer binaries (specifically the Nvidia driver libraries). Differential Revision: https://reviews.freebsd.org/D1262 Reviewed by: kib	2014-12-05 15:24:42 +00:00
kib	f690968933	When the last reference on the vnode's vm object is dropped, read the vp->v_vflag without taking the vnode lock and without bypass. We do know that vp is the lowest level in the stack, since the pointer is obtained from the object's handle. A stale VV_TEXT flag read can only happen if a parallel execve() is being performed and has not yet activated the image, since the process takes a reference for the text mapping. In this case, the execve() code manages the VV_TEXT flag on its own already. It was observed that an otherwise read-only sendfile(2) requires the exclusive vnode lock for VV_TEXT handling and contends on it under some loads. Reported by: glebius, scottl Tested by: glebius, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-05 15:02:30 +00:00
kib	a114aae162	Provide mutual exclusion between zone allocation/destruction and uma_reclaim(). Reclamation code must not see half-constructed or destructed zones. Do this by bracing uma_zcreate() and uma_zdestroy() into a shared-locked sx, and take the sx exclusively in uma_reclaim(). Usually zones are not created/destroyed during the system operation, but tmpfs mounts do cause zone operations and exposed the bug. Another solution could be to only expose a new keg on uma_kegs list after the corresponding zone is fully constructed, and similar treatment for the destruction. But it probably requires more risky code rearrangement as well. Reported and tested by: pho Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-11-30 20:20:55 +00:00
glebius	5e67a4071d	We already have "int i" in this scope. Submitted by: alc	2014-11-24 07:57:20 +00:00
glebius	3f7fdcc87f	\n at end of panicstr is redundant. Submitted by: alc	2014-11-23 18:32:21 +00:00
glebius	b4ef8e602d	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
alc	f4c161ccb6	By the time that vm_reserv_init() runs, vm_phys_segs[] is initialized. Use it instead of phys_avail[]. Discussed with: Svatopluk Kraus	2014-11-22 17:46:30 +00:00
glebius	20a46e6241	Use __func__ in KASSERTs, since the code is about to be moved to other place. Sponsored by: Nginx, Inc.	2014-11-19 16:29:39 +00:00
glebius	c73b720d6a	In vnode_pager_generic_getpages() vp->v_mount is dereferenced in the beginning, thus can't be NULL. Sponsored by: Nginx, Inc.	2014-11-19 15:17:19 +00:00
glebius	b87d94c5df	Collapse three contiguous comment blocks into one. Remove historical note about wrong assumptions 20 years ago. Use proper casing. Sponsored by: Nginx, Inc.	2014-11-18 13:38:07 +00:00
alc	aeebd38e4b	Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and i386 because vm_page_startup() would not create vm_page structures for the kernel page table pages allocated during pmap_bootstrap() but those vm_page structures are needed when the kernel attempts to promote the corresponding kernel virtual addresses to superpage mappings. To address this problem, a new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is updated to reflect the creation of vm_phys_seg structures by calls to vm_phys_add_seg(). Discussed with: Svatopluk Kraus MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-11-15 23:40:44 +00:00
glebius	1a17154ef9	Even better indent struct pagerops.	2014-11-14 18:15:35 +00:00
glebius	7d3b226a7b	Consistently indent struct pagerops.	2014-11-14 18:00:00 +00:00
kib	938d2505e3	Fix mis-spelling of bits and types names in the default_pager_putpages() and swap_pager_putpages(). It is the same fix as was done for vnode_pager_putpages() in r271586. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-04 19:56:04 +00:00
alc	26666baef9	Eliminate a stale, i386-specific comment.	2014-11-04 18:52:59 +00:00
markm	fce6747f55	This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random. This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware source any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources. The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people. The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors, now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway. Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to. My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise. My Nomex pants are on. Let the feedback commence! Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?) Approved by: so(des)	2014-10-30 21:21:53 +00:00
hselasky	49c137f7be	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, since there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
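The "logical OR where binary OR was expected" class of bug is easy to reproduce outside the kernel. The sketch below uses hypothetical flag values and function names (they are not the real CTLFLAG_* definitions) purely to show why the two operators give different results when combining flag bits.

```c
#include <assert.h>

/* Hypothetical access bits for the demo; not the kernel's CTLFLAG_* values. */
#define DEMO_FLAG_RD	0x2u
#define DEMO_FLAG_WR	0x4u

static_assert((DEMO_FLAG_RD & DEMO_FLAG_WR) == 0, "demo flags must be distinct bits");

/* Buggy: "||" yields 0 or 1, so the individual flag bits are lost. */
unsigned
combine_flags_logical(unsigned a, unsigned b)
{
	return (a || b);
}

/* Intended: "|" keeps every bit that is set in either operand. */
unsigned
combine_flags_bitwise(unsigned a, unsigned b)
{
	return (a | b);
}
```

Here combine_flags_logical(DEMO_FLAG_RD, DEMO_FLAG_WR) evaluates to 1, which matches neither flag, whereas the bitwise version returns 0x6 with both bits set.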
jhb	a619f7cffc	Retire the unimplemented MAP_RENAME and MAP_NORESERVE flags to mmap(2). Older binaries are still permitted to use these flags. PR: 193961 (exp-run in ports) Differential Revision: https://reviews.freebsd.org/D848 Reviewed by: kib	2014-10-18 12:28:51 +00:00
davide	e88bd26b3f	Follow up to r225617. In order to maximize the re-usability of kernel code in userland, rename in-kernel getenv()/setenv() to kern_getenv()/kern_setenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
kib	2b6232c7be	Make MAP_NOSYNC handling in the vm_fault() read-locked object path compatible with write-locked path. Test for MAP_ENTRY_NOSYNC and set VPO_NOSYNC for pages with dirty mask zero (this does not exclude a possibility that the page is dirty, e.g. due to read fault on writeable mapping and consequent write; the same issue exists in the slow path). Use helper vm_fault_dirty() to unify fast and slow path handling of VPO_NOSYNC and setting the dirty mask. Reviewed by: alc Sponsored by: The FreeBSD Foundation	2014-10-10 19:27:36 +00:00
bryanv	41e2fe5645	Change the UMA mutex into a rwlock Acquire the lock in read mode when just needed to ensure the stability of the keg list. The UMA lock may be held for a long time (relatively speaking) in uma_reclaim() on machines with lots of zones/kegs. If the uma_timeout() would fire during that period, subsequent callouts on that CPU may be significantly delayed. Reviewed by: jhb	2014-10-05 21:34:56 +00:00
bryanv	0b86b14507	Remove stray uma_mtx lock/unlock in zone_drain_wait() Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not) the uma_mtx, but we would attempt to unlock and relock the mutex if we had to sleep because the zone was already draining. The M_NOWAIT callers may hold the uma_mtx, but we do not sleep in that case. Reviewed by: jhb MFC after: 3 days	2014-10-05 03:18:30 +00:00
kib	feb6ca868c	Add kernel option KSTACK_USAGE_PROF to sample the stack depth on interrupts and report the largest value seen as sysctl debug.max_kstack_used. Useful to estimate how close the kernel stack size is to overflow. In collaboration with: Larry Baird <lab@gta.com> Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2014-10-04 18:38:14 +00:00
smh	f2543cb01c	Refactor ZFS ARC reclaim checks and limits Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. This restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte and page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay	2014-10-03 20:34:55 +00:00
smh	a51f20a2c3	Fix ticks wrap issue of lowmem test in vm_pageout_scan Reviewed by: jhb (D818) MFC after: 3 days Sponsored by: Multiplay	2014-09-24 14:35:08 +00:00
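The commit message does not show the fix itself, so the following is only a generic illustration of how a wrapping tick counter is usually compared safely: subtract first, then test the sign of the difference. The function names are hypothetical, and the cast of an out-of-range unsigned value to int is implementation-defined in C (it behaves as two's-complement wraparound on mainstream compilers).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

static_assert(UINT_MAX == 2u * (unsigned)INT_MAX + 1u,
    "sketch assumes int and unsigned have the same width");

/* Naive test: gives the wrong answer once the counter has wrapped. */
bool
deadline_passed_naive(unsigned now, unsigned deadline)
{
	return (now >= deadline);
}

/*
 * Wrap-safe test: the unsigned subtraction is well defined modulo 2^N, and
 * the sign of the result says which side of the deadline we are on, as long
 * as the two values are less than half the counter range apart.
 */
bool
deadline_passed(unsigned now, unsigned deadline)
{
	return ((int)(now - deadline) >= 0);
}
```

For a deadline set just before the wrap (UINT_MAX - 5) and a current value just after it (5), the naive comparison reports that the deadline has not passed, while the difference-based comparison reports that it has.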
kib	74ff69d0ef	vm_map_pmap_enter() and pmap_enter_object() are currently not aware of the wired attribute of the mapping. As result, some pmap implementations clear the wired state of the page table entries, which breaks invariants and allows the entries to be lost. Avoid calling vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the pages must be already mapped. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-09-23 18:54:23 +00:00
kib	bec113e57f	The vm_mmap_cdev() explicitly converts absence of both MAP_SHARED and MAP_PRIVATE flags to MAP_SHARED. Apparently, some code in tree, in particular, libgeom, relied on this behaviour, see r271721. For regular file types, the absence of the flags is interpreted as MAP_PRIVATE, and libc nlist used this (fixed in r271723). Allow the implicit flags for legacy binaries. Bump __FreeBSD_version to get the ABI note on new binaries to check for in mmap code. Remove the test for presence of one of the MAP_ANON, MAP_SHARED or MAP_PRIVATE flags before fget_mmap(). For MAP_ANON, we already verify that passed fd == -1. For fd != -1, test after fget_mmap() (for newer binaries) covers the case. Reported by: bdrewery, pho Reviewed by: jhb Sponsored by: The FreeBSD Foundation	2014-09-17 21:04:50 +00:00
jhb	da9629f8a4	Permit MAP_RENAME and MAP_NORESERVE for now. These flags should be removed, but at least Chromium and OpenJDK use MAP_NORESERVE.	2014-09-16 17:21:06 +00:00
jhb	bd469e35e2	Add stricter checking of some mmap() arguments: - Fail with EINVAL if an invalid protection mask is passed to mmap(). - Fail with EINVAL if an unknown flag is passed to mmap(). - Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap(). - Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous mappings. Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D698	2014-09-15 17:20:13 +00:00
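The four checks listed in the commit message above can be sketched as a standalone validation function. Everything here is an assumption made for illustration: the DEMO_* constants are hypothetical stand-ins for the values in <sys/mman.h>, the set of "known" flags is deliberately tiny, and the real sys_mmap() code is not reproduced.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the PROT_* and MAP_* constants. */
#define DEMO_PROT_READ		0x1
#define DEMO_PROT_WRITE		0x2
#define DEMO_PROT_EXEC		0x4
#define DEMO_PROT_ALL		(DEMO_PROT_READ | DEMO_PROT_WRITE | DEMO_PROT_EXEC)

#define DEMO_MAP_SHARED		0x0001
#define DEMO_MAP_PRIVATE	0x0002
#define DEMO_MAP_FIXED		0x0010
#define DEMO_MAP_ANON		0x1000
#define DEMO_MAP_KNOWN		(DEMO_MAP_SHARED | DEMO_MAP_PRIVATE | \
				 DEMO_MAP_FIXED | DEMO_MAP_ANON)

static_assert((DEMO_MAP_SHARED & DEMO_MAP_PRIVATE) == 0,
    "sharing flags must be distinct bits");

/* Returns 0 if the arguments pass the four checks, EINVAL otherwise. */
int
demo_check_mmap_args(int prot, int flags)
{
	if ((prot & ~DEMO_PROT_ALL) != 0)
		return (EINVAL);	/* invalid protection mask */
	if ((flags & ~DEMO_MAP_KNOWN) != 0)
		return (EINVAL);	/* unknown flag */
	if ((flags & DEMO_MAP_SHARED) != 0 && (flags & DEMO_MAP_PRIVATE) != 0)
		return (EINVAL);	/* both sharing modes requested */
	if ((flags & DEMO_MAP_ANON) == 0 &&
	    (flags & (DEMO_MAP_SHARED | DEMO_MAP_PRIVATE)) == 0)
		return (EINVAL);	/* file mapping without a sharing mode */
	return (0);
}
```

Note that later entries in this log relax some of these rules again (implicit sharing flags for legacy binaries, and tolerance of MAP_RENAME and MAP_NORESERVE), so the sketch reflects only this one commit's description.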
alc	17021239d9	Three improvements to vnode_pager_generic_getpages(): Eliminate an exclusive object lock acquisition and release on the expected execution path. Do page zeroing before the object lock is acquired rather than during the time that the object lock is held. Use vm_pager_free_nonreq() to eliminate duplicated code. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-15 17:14:09 +00:00
glebius	fb532b9858	Remove redundant declaration. vnode.h should be included before vnode_pager.h.	2014-09-15 15:49:29 +00:00
kib	ebd8a253bb	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
alc	810772ca9c	Avoid an exclusive acquisition of the object lock on the expected execution path through the NFS clients' getpages functions. Introduce vm_pager_free_nonreq(). This function can be used to eliminate code that is duplicated in many getpages functions. Also, in contrast to the code that currently appears in those getpages functions, vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one case. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-14 18:07:55 +00:00
kib	6044029377	Fix mis-spelling of bits and types names in the vnode_pager_putpages(). The changes should not modify the generated code. The pager->pgo_putpages() method takes int flags as its fourth argument, while vnode_pager_putpages() used boolean_t (which is typedef'ed to int). The flags are from VM_PAGER_* namespace, while vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(), which both are numerically equal to VM_PAGER_PUT_SYNC. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-14 10:27:36 +00:00
alc	6af056918d	Update a stale comment.	2014-09-11 03:16:57 +00:00
glebius	5939c729a8	Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().	2014-09-10 12:36:41 +00:00
alc	996a2e8d68	Fix a boundary case error in vm_reserv_alloc_contig(): If a reservation isn't being allocated for the last of the requested pages, because a reservation won't fit in the gap between allocated pages, then the reservation structure shouldn't be initialized. While I'm here, improve the nearby comments. Reported by: jeff, pho MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-09-10 05:52:30 +00:00
alc	1771ced43b	Oops. vm_map_simplify_entry() is used by mac_proc_vm_revoke_recurse(), so it can't be static.	2014-09-08 02:25:01 +00:00
alc	6a1691365d	Make two functions static and eliminate an unused #define.	2014-09-08 00:19:03 +00:00
jhb	74edf8c62a	Fix a typo.	2014-08-29 21:20:36 +00:00
smh	502601a540	Refactor ZFS ARC reclaim logic to be more VM cooperative Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead to large amounts of unused RAM e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay	2014-08-28 19:50:08 +00:00
alc	e934ec28fb	Back in the days when the kernel was single threaded, testing "vm_paging_target() > 0" was a reasonable way of determining if the inactive queue scan met its target. However, now that other threads can be allocating pages while the inactive queue scan is running, it's an unreliable method. The effect of it being unreliable is that we can start swapping out processes when we didn't intend to. This issue has existed since the kernel was multithreaded, but the changes to the inactive queue target in 10.0-RELEASE have made its effects visible. This change introduces a more direct method for determining if the inactive queue scan met its target that is not affected by the actions of other threads. Reported by: Steve Polyack Tested by: pho, Steve Polyack (an earlier version) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-08-26 16:40:20 +00:00
alc	5690b644d4	Relax one of the conditions for mapping a page on the fast path. Reviewed by: kib X-MFC with: r270011 Sponsored by: EMC / Isilon Storage Division	2014-08-23 05:24:31 +00:00
kib	ff74afd0bd	Implement 'fast path' for the vm page fault handler. Or, it could be called a scalable path. When several preconditions hold, the vm object lock for the object containing the faulted page is taken in read mode, instead of write, which allows parallel fault processing in the region. Namely, the fast path is taken when the faulted page already exists and does not need copy on write, is already fully valid, and not busy. For technical reasons, fast path is avoided when the fault is the first write on the vnode object, or when the fault is for wiring or debugger read or write. On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag, since object lock is kept. Pmap might fail to create the entry, in which case the fallback to slow path is performed. Reviewed by: alc Tested by: pho (previous version) Hardware provided and hosted by: The FreeBSD Foundation and Sentex Data Communications Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-15 07:30:14 +00:00
alc	e11aaa26b0	Avoid pointless (but harmless) actions on unmanaged pages. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-08-14 15:46:15 +00:00
kib	065b68902b	If vm_page_grab() allocates a new page, the page is not inserted into page queue even when the allocation is not wired. It is the responsibility of the vm_page_grab() caller to ensure that the page does not end up on the vm_object queue but not on the pagedaemon queue, which would effectively create an unpageable unwired page. In exec_map_first_page() and vm_imgact_hold_page(), activate the page immediately after unbusying it, to avoid a leak. In the uiomove_object_page(), deactivate the page before the object is unlocked. There is no leak, since the page is deactivated after uiomove_fromphys() finished. But allowing a non-queued non-wired page in the unlocked object queue makes it impossible to assert that a leak does not happen in other places. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-13 05:44:08 +00:00
kib	41802e2c86	Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set when either the page is busied, or the owner object is locked. Update comments, move all assertions about page state when PGA_WRITEABLE flag is set, into new helper vm_page_assert_pga_writeable(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-09 05:00:34 +00:00
kib	094158b3f2	Change pmap_enter(9) interface to take flags parameter and superpage mapping size (currently unused). The flags includes the fault access bits, wired flag as PMAP_ENTER_WIRED, and a new flag PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep. For powerpc aim both 32 and 64 bit, fix implementation to ensure that the requested mapping is created when PMAP_ENTER_NOSLEEP is not specified, in particular, wait for the available memory required to proceed. In collaboration with: alc Tested by: nwhitehorn (ppc aim32 and booke) Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division MFC after: 2 weeks	2014-08-08 17:12:03 +00:00
kib	61cc19eab8	The vm_pager_page_unswapped() pager op is only implemented for the swap pager. Swap pager uses a private mutex to protect swap metadata, and does not rely on the vm object lock to ensure integrity of it. Weaken the requirement for the vm object lock by only asserting locked object in vm_pager_page_unswapped(), instead of locked exclusively. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-06 19:34:03 +00:00
kib	a39918dab4	Add wrappers to assert that vm object is unlocked and for try upgrade. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-06 19:30:35 +00:00
royger	b86acea372	vm_phys: improve robustness of fictitious ranges With the current implementation of managed fictitious ranges when also using VM_PHYSSEG_DENSE, a user could try to register a fictitious range that starts inside of vm_page_array, but then overruns it (because the end of the fictitious range is greater than vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE returning unallocated pages from past the end of vm_page_array. The same could happen if a user tried to register a segment that starts outside of vm_page_array but ends inside of it. In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to use a set of pages from vm_page_array, and allocate the rest. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc vm/vm_phys.c: - Allow registering/unregistering fictitious ranges that overrun vm_page_array.	2014-08-05 10:29:01 +00:00
alc	38b6c535da	Retire pmap_change_wiring(). We have never used it to wire virtual pages. We continue to use pmap_enter() for that. For unwiring virtual pages, we now use pmap_unwire(), which unwires a range of virtual addresses instead of a single virtual page. Sponsored by: EMC / Isilon Storage Division	2014-08-03 20:40:51 +00:00
alc	eacf4d0259	Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable "rv" is uninitialized. Reported by: bz	2014-08-02 17:58:20 +00:00
alc	f873c17deb	Handle wiring failures in vm_map_wire() with the new functions pmap_unwire() and vm_object_unwire(). Retire vm_fault_{un,}wire(), since they are no longer used. (See r268327 and r269134 for the motivation behind this change.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-08-02 16:10:24 +00:00
alc	2b42c1ddc8	When unwiring a region of an address space, do not assume that the underlying physical pages are mapped by the pmap. If, for example, the application has performed an mprotect(..., PROT_NONE) on any part of the wired region, then those pages will no longer be mapped by the pmap. So, using the pmap to look up the wired pages in order to unwire them doesn't always work, and when it doesn't work wired pages are leaked. To avoid the leak, introduce and use a new function vm_object_unwire() that locates the wired pages by traversing the object and its backing objects. At the same time, switch from using pmap_change_wiring() to the recently introduced function pmap_unwire() for unwiring the region's mappings. pmap_unwire() is faster, because it operates on a range of virtual addresses rather than a single virtual page at a time. Moreover, by operating on a range, it is superpage friendly. It doesn't waste time performing unnecessary demotions. Reported by: markj Reviewed by: kib Tested by: pho, jmg (arm) Sponsored by: EMC / Isilon Storage Division	2014-07-26 18:10:18 +00:00
kib	b50b5e1d49	Correct assertion. The shadowing object cannot be tmpfs vm object, and tmpfs object cannot shadow. In other words, tmpfs vm object is always at the bottom of the shadow chain. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-24 10:25:42 +00:00
kib	8664d64bc3	The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs vnode for the tmpfs node owning this object. The flag is currently used for two purposes. First, it allows to correctly handle VV_TEXT for tmpfs vnode when the ref count on the object is decremented to 1, similar to vnode_pager_dealloc() for regular filesystems. Second, it prevents some operations, which are done on OBJT_SWAP vm objects backing user anonymous memory, but are incorrect for the object owned by tmpfs node. The second kind of use of the OBJ_TMPFS flag is incorrect, since the vnode might be reclaimed, which clears the flag, but vm object operations must still be disallowed. Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test whether vm object collapse and similar actions should be disabled. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:30:37 +00:00
royger	c9ce58e137	vm_phys: remove limitation on number of fictitious regions The number of vm fictitious regions was limited to 8 by default, but Xen will make heavy usage of those kind of regions in order to map memory from foreign domains, so instead of increasing the default number, change the implementation to use a red-black tree to track vm fictitious ranges. The public interface remains the same. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc Approved by: gibbs vm/vm_phys.c: - Replace the vm fictitious static array with a red-black tree. - Use a rwlock instead of a mutex, since now we also need to take the lock in vm_phys_fictitious_to_vm_page, and it can be shared.	2014-07-09 08:12:58 +00:00
marcel	9f28abd980	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan	2014-07-07 00:27:09 +00:00
alc	d74e85dbb9	Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are several reasons for this change: pmap_change_wiring() has never (in my memory) been used to set the wired attribute on a virtual page. We have always used pmap_enter() to do that. Moreover, it is not really safe to use pmap_change_wiring() to set the wired attribute on a virtual page. The description of pmap_change_wiring() says that it assumes the existence of a mapping in the pmap. However, non-wired mappings may be reclaimed by the pmap at any time. (See pmap_collect().) Many implementations of pmap_change_wiring() will crash if the mapping does not exist. pmap_unwire() accepts a range of virtual addresses, whereas pmap_change_wiring() acts upon a single virtual page. Since we are typically unwiring a range of virtual addresses, pmap_unwire() will be more efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings. Previously, we were forced to demote the superpage mapping, because pmap_change_wiring() only allowed us to express the unwiring of a single base page mapping at a time. This added to the overhead of unwiring for large ranges of addresses, including the implicit unwiring that occurs at process termination. Implementations for arm and powerpc will follow. Discussed with: jeff, marcel Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-07-06 17:42:38 +00:00
hselasky	35b126e324	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
gjb	fc21f40567	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
hselasky	bd1ed65f0f	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating children of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, since some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading kludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, since malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
alc	50c074c333	Delay the call to crhold() in vm_map_insert() until we know that we won't have to undo it by calling crfree(). This reduces the total number of calls by vm_map_insert() to crhold() and crfree() by 45% in my tests. Eliminate an unnecessary variable from vm_map_insert(). Reviewed by: kib Tested by: pho	2014-06-26 16:04:03 +00:00
alc	aeda9d4c41	Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries that it creates (r267645), we can place the check that blocks map entry coalescing on stack entries in vm_map_simplify_entry() where it properly belongs. Reviewed by: kib	2014-06-25 03:30:03 +00:00
kib	c2a4e94982	Use correct names for the flags. MAP_ENTRY_GROWS_* have the same numerical values as MAP_STACK_GROWS_*, but the former is for entries' eflags, while the latter is for the cow argument of vm_map_insert(). Submitted by: alc	2014-06-23 07:03:47 +00:00
kib	bded3c3768	Assert that the new entry is inserted into the right location in the map entries list, and that it does not overlap with the previous and next entries. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-20 07:01:53 +00:00
alc	2cd5489f2c	Eliminate a pointless call to vm_map_clip_start() from vm_map_growstack(). For this call to do anything at all we would have to have two overlapping map entries. Submitted by: kib	2014-06-19 21:05:07 +00:00
alc	1fa56d3c4e	When MAP_STACK_GROWS_{DOWN,UP} are passed to vm_map_insert() set the corresponding flag(s) in the new map entry. Previously, the caller was responsible for setting them after vm_map_insert() returned. Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when extending the stack in the downward direction. Together these changes slightly simplify the caller's task when creating a downward growing stack. In particular, the caller no longer needs to clip the previous entry, because the new stack entry can't possibly coalesce with the previous entry. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-19 16:26:16 +00:00
kib	893100aa10	Add MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED, and prevents the request from deleting existing mappings in the region, failing instead. Reviewed by: alc Discussed with: jhb Tested by: markj, pho (previous version, as part of the bigger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-19 05:00:39 +00:00
attilio	2802c525ad	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
alc	7b0b901024	Tidy up the early parts of vm_map_insert(), in particular, simplify one of the assertions and eliminate a comment that has grown stale. Reviewed by: kib MFC after: 1 week	2014-06-16 16:37:41 +00:00
alc	5badd21ab6	One of the intentions behind r267254 was that the global variable "sgrowsiz" would be read once and cached in a local variable so that the resource limit check and map entry insertion would be guaranteed to use the same value. However, the value being passed to vm_map_insert() is still from "sgrowsiz" and not the local variable. Correct this oversight. Reviewed by: kib	2014-06-15 07:52:59 +00:00
mav	2bc26491c3	Introduce new "256 Bucket" zone to split requests and reduce congestion on "128 Bucket" zone lock. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 11:57:07 +00:00
mav	c7987fc583	When allocating a new bucket for a bucket zone, never take it from the zone itself, since it will almost certainly fail. Take the next bigger zone instead. This situation should not happen with the original bucket zones configuration: "32 Bucket" zone uses "64 Bucket" and vice versa. But if the "64 Bucket" zone lock is congested, the zone may grow its bucket size and start biting itself. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 11:36:22 +00:00
alc	f133c95804	Correct a bug in the management of the population map on big-endian machines. Specifically, there was a mismatch between how the routine allocation and deallocation operations accessed the population map and how the aggressively optimized reservation-breaking operation accessed it. So, problems only occurred when reservations were broken. This change makes the routine operations access the population map in the same way as the reservation breaking operation. This bug was introduced in r259999. PR: 187080 Tested by: jmg (on an "armeb" machine) Sponsored by: EMC / Isilon Storage Division	2014-06-11 16:11:12 +00:00
kib	7f8a65c9fc	Make mmap(MAP_STACK) search for the available address space, similar to !MAP_STACK mapping requests. For MAP_STACK | MAP_FIXED, clear any mappings which could previously exist in the used range. For this, teach vm_map_find() and vm_map_fixed() to handle MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new vm_map_stack_locked() helper, which is factored out from vm_map_stack(). The side effect of the change is that MAP_STACK started obeying MAP_ALIGNMENT and MAP_32BIT flags. Reported by: rwatson Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-06-09 03:37:41 +00:00
alc	39548e640f	Add a page size field to struct vm_page. Increase the page size field when a partially populated reservation becomes fully populated, and decrease this field when a fully populated reservation becomes partially populated. Use this field to simplify the implementation of pmap_enter_object() on amd64, arm, and i386. On all architectures where we support superpages, the cost of creating a superpage mapping is roughly the same as creating a base page mapping. For example, both kinds of mappings entail the creation of a single PTE and PV entry. With this in mind, use the page size field to make the implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to vm_map_pmap_enter(), that function would only map base pages. Now, it will create up to 96 base page or superpage mappings. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-07 17:12:26 +00:00
kib	b03014a79c	Remove the assert which can be triggered by the userspace. The situation checked by assert is verified to not take place in vm_map_wire(), and protection permissions on the wired entry can be revoked afterward. Reported by: markj Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-28 00:45:35 +00:00
alc	549a5817c0	There is no reason to perform the pmap_remove() on the kernel pmap while the kmem object lock is held. Do the pmap_remove() before acquiring the kmem object lock. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-23 16:22:36 +00:00
kib	62a5a6e7ef	Remove redundant loop. The inner goto restarts the whole page handling in the situation identical to the loop condition. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-05-21 08:19:04 +00:00
kib	235f4b1083	When exec_new_vmspace() decides that current vmspace cannot be reused on execve(2), it calls vmspace_exec(), which frees the current vmspace. The thread executing an exec syscall gets new vmspace assigned, and old vmspace is freed if only referenced by the current process. The free operation includes pmap_release(), which de-constructs the paging structures used by hardware. If the calling process is multithreaded, other threads are suspended in the thread_suspend_check(), and need to be unsuspended and run to be able to exit on successful exec. Now, since the old vmspace is destroyed, paging structures are invalid, threads are resumed on the non-existent pmaps (page tables), which leads to triple fault on x86. To fix, postpone the free of old vmspace until the threads are resumed and exited. To avoid modifications to all image activators all of which use exec_new_vmspace(), memoize the current (old) vmspace in kern_execve(), and notify it about the need to call vmspace_free() with a thread-private flag TDP_EXECVMSPC. http://bugs.debian.org/743141 Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-05-20 09:19:35 +00:00
alc	cf3c13370d	On a fork allow read-only wired pages to be copy-on-write shared between the parent and child processes. Previously, we copied these pages even though they are read only. However, the reason for copying them is historical and no longer exists. In recent times, vm_map_protect() has developed the ability to copy pages when write access is added to wired copy-on-write pages. So, in this case, copy-on-write sharing of wired pages is not to be feared. It is not going to lead to copy-on-write faults on wired memory. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-13 13:20:23 +00:00
kib	ba20870cbd	Fix locking. The dst_object must remain locked on the retry of the loop iteration. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 6 days	2014-05-11 18:07:07 +00:00
alc	b716a32dc2	With the new-and-improved vm_fault_copy_entry() (r265843), we can always avoid soft page faults when adding write access to user wired entries in vm_map_protect(). Previously, we only avoided the soft page fault when the underlying pages were copy-on-write. In other words, we avoided the pages faults that might sleep on page allocation, but not the trivial page faults to update the physical map. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-11 17:41:29 +00:00
alc	d283b621dc	About 9% of the pmap_protect() calls being performed by vm_map_copy_entry() are unnecessary. Eliminate the unnecessary calls. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-10 19:47:00 +00:00
kib	7798d7f7f4	For the upgrade case in vm_fault_copy_entry(), when the entry does not need COW and is writeable (i.e. becoming writeable due to the mprotect(2) operation), do not create a new backing object for the entry. The caller of the function is vm_map_protect(), the call is made to ensure that wired entry has all pages resident and wired in the top level object and to enable the write. We might need to copy read-only page from some backing objects into the top object or remap the page with the write allowed. This fixes the issue with mishandling of the swap accounting when read-only wired mapping is upgraded to write-enabled after fork. The previous code path did not account for the new object, but its creation is redundant anyway and the change provides an optimization for the non-common situation. Reported by: markj Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 17:03:33 +00:00
kib	32f811c3b5	When printing the map with the ddb 'show procvm' command, do not dump page queues for the backing objects. The queues are huge and clutter the display, when mostly the map entries and its backing storage is interesting. The page queues can be seen with ddb 'show object' command. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:36:13 +00:00
kib	90e904e481	Print the entry address in addition to the object. The variable is typically optimized out and debuggers cannot find its value. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:30:48 +00:00
pho	1108602cbc	msync(2) must return ENOMEM and not EINVAL when the address is outside the allowed range or when one or more pages are not mapped. This according to The Open Group Base Specifications Issue 7. Discussed with: attilio, Bruce Evans Reviewed by: alc, Garrett Cooper Reported by: ATF MFC after: 2 weeks Sponsored by: EMC / Isilon storage division	2014-05-07 08:38:02 +00:00
alc	c77af15259	Prior to r254304, a separate function, vm_pageout_page_stats(), was used to periodically update the reference status of the active pages. This function was called, instead of vm_pageout_scan(), when memory was not scarce. The objective was to provide up to date reference status for active pages in case memory did become scarce and active pages needed to be deactivated. The active page queue scan performed by vm_pageout_page_stats() was virtually identical to that performed by vm_pageout_scan(), and so r254304 eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is called with the parameter "pass" set to zero. The intention was that when pass is zero, vm_pageout_scan() would only scan the active queue. However, the variable page_shortage can still be greater than zero when memory is not scarce and vm_pageout_scan() is called with pass equal to zero. Consequently, the inactive queue may be scanned and dirty pages laundered even though that was not intended by r254304. This revision fixes that. Reported by: avg MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-06 03:42:04 +00:00
kib	0c45ba8eb0	For the VM_PHYSSEG_DENSE case, checking the requested range to fall into the area backed by vm_page_array wrongly compared end with vm_page_array_size. It should be adjusted by first_page index to be correct. Also, the corner and incorrect case of the requested range extending after the end of the vm_page_array was incorrectly handled by allocating the segment. Fix the comparison for the end of range and return EINVAL if the end extends beyond vm_page_array. Discussed with: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-29 18:42:37 +00:00
kib	c36743d23f	When vm_fault_copy_entry() is called from vm_map_protect() for a wired entry and performs the upgrade of the entry permissions from read-only to read-write, we must allow to search for the source pages in the backing object, like we do in the case of forking the read-only wired entry. For the fork case, the behaviour is allowed by src_readonly boolean, which in fact is only used to assert that read-write case provides all source pages in the top-level object. Eliminate the src_readonly variable. Allow for the copy loop to look into the backing objects, add explicit asserts to ensure that only read-only and upgrade case actually does. Expand comments. Change the panic call into assert. Reported by: markj Tested by: markj, pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-27 05:19:01 +00:00
des	65493094e9	Add sysctl OIDs showing the actual size and capacity of the swap zone. MFC after: 1 week	2014-04-26 12:18:17 +00:00
bdrewery	6fcf6199a4	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
kib	24c4e4a548	Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, accesses to the direct map region should check for the limit by which direct map is instantiated. Second, for accesses to the kernel map, success returned from the kernacc(9) does not guarantee that a subsequent attempt to read or write to the checked address succeeds, since another thread might invalidate the address in the meantime. Add a new thread private flag TDP_DEVMEMIO, which instructs vm_fault() to return error when fault happens on the MAP_ENTRY_NOFAULT entry, instead of panicking. The trap handler would then see a page fault from access, and recover in normal way, making /dev/mem access safer. Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed and having Giant locked does not solve issues for amd64. Note that at least the second issue exists on other architectures, and requires similar patching for md code. Reported and tested by: clusteradm (gjb, sbruno) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 14:25:09 +00:00
kib	07e1d8d74f	Initialize vm_map_entry member wiring_thread on the map entry creation. This was missed in r253190. Reported by: hps, peter Tested by: hps Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-03-21 13:55:57 +00:00
attilio	f19bbde667	vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock, then threads can sleep on the pip condition. Avoid to deadlock such threads by correctly awakening the sleeping ones after the pip is finished. swapoff side of the bug can likely result in shutdown deadlocks. Sponsored by: EMC / Isilon Storage Division Reported by: pho, pluknet Tested by: pho	2014-03-19 01:13:42 +00:00
rwatson	33fdc14c0c	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
kib	12c58a8fb0	Initialize paddr to handle the case of zero size. Reported and reviewed by: Conrad Meyer <cemeyer@uw.edu> MFC after: 1 week	2014-03-12 16:38:55 +00:00
kib	e4111a6b71	Do not vdrop() the tmpfs vnode until it is unlocked. The hold reference might be the last, and then vdrop() would free the vnode. Reported and tested by: bdrewery MFC after: 1 week	2014-03-12 15:13:57 +00:00
dim	e21b440a4c	After r251709, avoid a clang 3.4 warning about an unused static const variable (uma_max_ipers), when asserts are disabled. Reviewed by: glebius MFC after: 3 days	2014-02-14 17:47:18 +00:00
attilio	6a31e25bb9	Fix-up r254141: in the process of making a failing vm_page_rename() a call of pager_swap_freespace() was moved around, now leading to freeing the incorrect page because of the pindex changes after vm_page_rename(). Get back to use the correct pindex when destroying the swap space. Sponsored by: EMC / Isilon storage division Reported by: avg Tested by: pho MFC after: 7 days	2014-02-14 03:34:12 +00:00
glebius	b1650f2d1e	Fix function name in KASSERT(). Submitted by: hiren	2014-02-12 20:11:20 +00:00
jhb	ee403b8a8c	Correct assertion to assert that the existing device VM object uses the same type rather than asserting in the case where we just created a new VM object. Reviewed by: kib	2014-02-11 22:05:21 +00:00
glebius	45bf1cc683	Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized. Sponsored by: Nginx, Inc.	2014-02-10 19:59:46 +00:00
glebius	613c5f4e53	Style.	2014-02-10 19:51:15 +00:00
glebius	1861286fed	Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones. Sponsored by: Nginx, Inc.	2014-02-10 19:48:26 +00:00
alc	43e9da37b5	Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time. Successful prefaults after a zero-fill fault are extremely rare.	2014-02-09 01:59:52 +00:00
glebius	e8c2426587	Provide macros that allow easily export uma(9) zone limits and current usage via sysctl(9): SYSCTL_UMA_MAX() SYSCTL_ADD_UMA_MAX() SYSCTL_UMA_CUR() SYSCTL_ADD_UMA_CUR() Sponsored by: Nginx, Inc.	2014-02-07 14:29:03 +00:00
alc	cf63b11b17	Make prefaulting more aggressive on hard faults. Previously, we would only map a fraction of the pages that were fetched by vm_pager_get_pages() from secondary storage. Now, we map them all in order to avoid future soft faults. This effect is most evident when a memory-mapped file is accessed sequentially. Previously, there were 6 soft faults for every hard fault. Now, these soft faults are eliminated. Sponsored by: EMC / Isilon Storage Division	2014-02-02 20:21:53 +00:00
alc	50a7eacf05	In an effort to diagnose possible corruption of struct vm_page on some sparc64 machines make the page queue assert in vm_page_dequeue() more precise. While I'm here switch the page lock assert to the newer style.	2014-01-24 19:08:42 +00:00
jhb	14337759da	Fix a couple of typos.	2014-01-21 03:27:47 +00:00
glebius	681dcc3c57	ANSIfy declarations. Ok'ed by: alc	2014-01-20 18:47:56 +00:00
alc	d98c6ca3a1	Style changes in vm_pageout_scan(): 1. Be consistent in the style of "act_delta" manipulations between the inactive and active queue scans. 2. Explicitly compare to zero. 3. The deactivation of a page is based on its recent history and not just the current call to vm_pageout_scan(). The variable "act_delta" represents the current state of the page, and not its history. Avoid possible confusion by not (ab)using "act_delta" for making the deactivation decision. Submitted by: kib [1] Reviewed by: kib [2,3]	2014-01-18 20:02:59 +00:00
alc	ed1e11749f	Correctly update the count of stuck pages, "addl_page_shortage", in vm_pageout_scan(). There were missing increments in two less common cases. Don't conflate the count of stuck pages and the pageout deficit provided by vm_page_alloc{,_contig}(). (A proposed fix to the OOM code depends on this.) Handle held pages consistently in the inactive queue scan. In the more common case, we did not move the page to the tail of the queue. Whereas, in the less common case, we did. There's no particular reason to move the page in the less common case, so remove it. Perform the calculation of the page shortage for the active queue scan a little earlier, before the active queue lock is acquired. The correctness of this calculation doesn't depend on the active queue lock being held. Eliminate a redundant variable, "pcount". Use the more descriptive variable, "maxscan", in its place. Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess parentheses. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-01-12 19:04:20 +00:00
alc	d0ccfff2c5	Since the introduction of the popmap to reservations in r259999, there is no longer any need for the page's PG_CACHED and PG_FREE flags to be set and cleared while the free page queues lock is held. Thus, vm_page_alloc(), vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after the free page queues lock is released to clear the page's flags. Moreover, the PG_FREE flag can be retired. Now that the reservation system no longer uses it, its only uses are in a few assertions. Eliminating these assertions is no real loss. Other assertions catch the same types of misbehavior, like doubly freeing a page (see r260032) or dirtying a free page (free pages are invalid and only valid pages can be dirtied). Eliminate an unneeded variable from vm_page_alloc_contig(). Sponsored by: EMC / Isilon Storage Division	2013-12-31 18:25:15 +00:00
alc	4e0447c7bb	Add "popmap" assertions: The page being freed isn't already free, and the page being allocated isn't already allocated. Sponsored by: EMC / Isilon Storage Division	2013-12-29 04:54:52 +00:00
alc	0850ce80b6	MFp4 alc_popmap Change the way that reservations keep track of which pages are in use. Instead of using the page's PG_CACHED and PG_FREE flags, maintain a bit vector within the reservation. This approach has a couple benefits. First, it makes breaking reservations much cheaper because there are fewer cache misses to identify the unused pages. Second, it is a prerequisite for supporting two or more reservation sizes.	2013-12-28 04:28:35 +00:00
kib	2e35793d0f	Do not coalesce stack entry, vm_map_stack() asserts that the requested region is claimed by a new entry. Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack(), to really turn off coalescing code and call to vm_map_simplify_entry() [1]. Reported by: avg, peter, many Tested by: avg, peter Noted by: avg [1] Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-27 16:59:47 +00:00
marcel	df5d69cc0d	For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is that we don't have a good way (yet) to iterate over the mapped pages by virtual address and simply try each page within the range. Given that we call pmap_remove() over the entire 2^63 bytes of address space, it takes a while for pmap_remove to have tried all 2^50 pages. By using pmap_remove_pages() we use the PV list to find all mappings. Change derived from a patch by: alc	2013-12-26 05:46:10 +00:00
dim	7f980c6758	In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as argument, cast the incoming 0 argument to void *, to silence a warning from clang 3.4 ("expression which evaluates to zero treated as a null pointer constant of type 'void *' [-Wnon-literal-null-conversion]"). MFC after: 3 days	2013-12-25 22:32:34 +00:00
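
Commit 0850ce80b6 above describes tracking a reservation's in-use pages with a bit vector instead of per-page flags. The following is a minimal userland sketch of that idea only, not the kernel's code: the struct layout, the `RV_NPAGES` value, and every `rv_*` function name are hypothetical and chosen for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes for illustration; the kernel derives its own. */
#define RV_NPAGES 512
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define RV_NWORDS ((RV_NPAGES + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical reservation: one bit per page, set when the page is in use. */
struct reservation {
	unsigned long popmap[RV_NWORDS];
	size_t popcnt;			/* number of bits currently set */
};

static void
rv_init(struct reservation *rv)
{
	memset(rv, 0, sizeof(*rv));
}

static bool
rv_is_populated(const struct reservation *rv, size_t i)
{
	return ((rv->popmap[i / WORD_BITS] >> (i % WORD_BITS)) & 1UL) != 0;
}

/* Mark page i in use; returns false if it was already in use. */
static bool
rv_populate(struct reservation *rv, size_t i)
{
	if (rv_is_populated(rv, i))
		return false;
	rv->popmap[i / WORD_BITS] |= 1UL << (i % WORD_BITS);
	rv->popcnt++;
	return true;
}

/* Mark page i unused; returns false if it was already unused. */
static bool
rv_depopulate(struct reservation *rv, size_t i)
{
	if (!rv_is_populated(rv, i))
		return false;
	rv->popmap[i / WORD_BITS] &= ~(1UL << (i % WORD_BITS));
	rv->popcnt--;
	return true;
}

/*
 * "Break" step: list the unused page indices by scanning only the bit
 * vector, without touching any per-page structure. Returns the count
 * written to out (at most max).
 */
static size_t
rv_unused_pages(const struct reservation *rv, size_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < RV_NPAGES && n < max; i++)
		if (!rv_is_populated(rv, i))
			out[n++] = i;
	return n;
}
```

The boolean results of `rv_populate()` and `rv_depopulate()` stand in for the "already allocated" and "already free" conditions that commit 4e0447c7bb says it turns into assertions.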

3725 Commits