freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	c8f780e3d6	Fix locking. The dst_object must remain locked on the retry of the loop iteration. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 6 days	2014-05-11 18:07:07 +00:00
Alan Cox	dd006a1b14	With the new-and-improved vm_fault_copy_entry() (r265843), we can always avoid soft page faults when adding write access to user wired entries in vm_map_protect(). Previously, we only avoided the soft page fault when the underlying pages were copy-on-write. In other words, we avoided the pages faults that might sleep on page allocation, but not the trivial page faults to update the physical map. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-11 17:41:29 +00:00
Alan Cox	d9a9209abe	About 9% of the pmap_protect() calls being performed by vm_map_copy_entry() are unnecessary. Eliminate the unnecessary calls. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-10 19:47:00 +00:00
Konstantin Belousov	0973283d6e	For the upgrade case in vm_fault_copy_entry(), when the entry does not need COW and is writeable (i.e. becoming writeable due to the mprotect(2) operation), do not create a new backing object for the entry. The caller of the function is vm_map_protect(), the call is made to ensure that wired entry has all pages resident and wired in the top level object and to enable the write. We might need to copy read-only page from some backing objects into the top object or remap the page with the write allowed. This fixes the issue with mishandling of the swap accounting when read-only wired mapping is upgraded to write-enabled after fork. The previous code path did not accounted the new object, but it creation is redundand anyway and the change provides an optimization for the non-common situation. Reported by: markj Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 17:03:33 +00:00
Konstantin Belousov	44bbc3b77d	When printing the map with the ddb 'show procvm' command, do not dump page queues for the backing objects. The queues are huge and clutter the display, when mostly the map entries and its backing storage is interesting. The page queues can be seen with ddb 'show object' command. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:36:13 +00:00
Konstantin Belousov	3d95614f9d	Print the entry address in addition to the object. The variable is typically optimized out and debuggers cannot find its value. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-10 16:30:48 +00:00
Peter Holm	e103f5b1c0	msync(2) must return ENOMEM and not EINVAL when the address is outside the allowed range or when one or more pages are not mapped. This according to The Open Group Base Specifications Issue 7. Discussed with: attilio, Bruce Evans Reviewed by: alc, Garrett Cooper Reported by: ATF MFC after: 2 weeks Sponsored by: EMC / Isilon storage division	2014-05-07 08:38:02 +00:00
Alan Cox	60196cda04	Prior to r254304, a separate function, vm_pageout_page_stats(), was used to periodically update the reference status of the active pages. This function was called, instead of vm_pageout_scan(), when memory was not scarce. The objective was to provide up to date reference status for active pages in case memory did become scarce and active pages needed to be deactivated. The active page queue scan performed by vm_pageout_page_stats() was virtually identical to that performed by vm_pageout_scan(), and so r254304 eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is called with the parameter "pass" set to zero. The intention was that when pass is zero, vm_pageout_scan() would only scan the active queue. However, the variable page_shortage can still be greater than zero when memory is not scarce and vm_pageout_scan() is called with pass equal to zero. Consequently, the inactive queue may be scanned and dirty pages laundered even though that was not intended by r254304. This revision fixes that. Reported by: avg MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2014-05-06 03:42:04 +00:00
Konstantin Belousov	a17937bdd0	For the VM_PHYSSEG_DENSE case, checking the requested range to fall into the area backed by vm_page_array wrongly compared end with vm_page_array_size. It should be adjusted by first_page index to be correct. Also, the corner and incorrect case of the requested range extending after the end of the vm_page_array was incorrectly handled by allocating the segment. Fix the comparision for the end of range and return EINVAL if the end extends beyond vm_page_array. Discussed with: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-29 18:42:37 +00:00
Konstantin Belousov	4c74acf76a	When vm_fault_copy_entry() is called from vm_map_protect() for a wired entry and performs the upgrade of the entry permissions from read-only to read-write, we must allow to search for the source pages in the backing object, like we do in the case of forking the read-only wired entry. For the fork case, the behaviour is allowed by src_readonly boolean, which in fact is only used to assert that read-write case provides all source pages in the top-level object. Eliminate the src_readonly variable. Allow for the copy loop to look into the backing objects, add explicit asserts to ensure that only read-only and upgrade case actually does. Expand comments. Change the panic call into assert. Reported by: markj Tested by: markj, pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-27 05:19:01 +00:00
Dag-Erling Smørgrav	612032773a	Add sysctl OIDs showing the actual size and capacity of the swap zone. MFC after: 1 week	2014-04-26 12:18:17 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	52f3c44efe	Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, for accesses to direct map region should check for the limit by which direct map is instantiated. Second, for accesses to the kernel map, success returned from the kernacc(9) does not guarantee that consequent attempt to read or write to the checked address succeed, since other thread might invalidate the address meantime. Add a new thread private flag TDP_DEVMEMIO, which instructs vm_fault() to return error when fault happens on the MAP_ENTRY_NOFAULT entry, instead of panicing. The trap handler would then see a page fault from access, and recover in normal way, making /dev/mem access safer. Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed and having Giant locked does not solve issues for amd64. Note that at least the second issue exists on other architectures, and requires similar patching for md code. Reported and tested by: clusteradm (gjb, sbruno) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 14:25:09 +00:00
Konstantin Belousov	997ac6905f	Initialize vm_map_entry member wiring_thread on the map entry creation. This was missed in r253190. Reported by: hps, peter Tested by: hps Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-03-21 13:55:57 +00:00
Attilio Rao	0d8243cc34	vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock, then threads can sleep on the pip condition. Avoid to deadlock such threads by correctly awakening the sleeping ones after the pip is finished. swapoff side of the bug can likely result in shutdown deadlocks. Sponsored by: EMC / Isilon Storage Division Reported by: pho, pluknet Tested by: pho	2014-03-19 01:13:42 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Konstantin Belousov	7253a5ec63	Initialize paddr to handle the case of zero size. Reported and reviewed by: Conrad Meyer <cemeyer@uw.edu> MFC after: 1 week	2014-03-12 16:38:55 +00:00
Konstantin Belousov	2309fa9b92	Do not vdrop() the tmpfs vnode until it is unlocked. The hold reference might be the last, and then vdrop() would free the vnode. Reported and tested by: bdrewery MFC after: 1 week	2014-03-12 15:13:57 +00:00
Dimitry Andric	2367b4ddc4	After r251709, avoid a clang 3.4 warning about an unused static const variable (uma_max_ipers), when asserts are disabled. Reviewed by: glebius MFC after: 3 days	2014-02-14 17:47:18 +00:00
Attilio Rao	14a5dc1780	Fix-up r254141: in the process of making a failing vm_page_rename() a call of pager_swap_freespace() was moved around, now leading to freeing the incorrect page because of the pindex changes after vm_page_rename(). Get back to use the correct pindex when destroying the swap space. Sponsored by: EMC / Isilon storage division Reported by: avg Tested by: pho MFC after: 7 days	2014-02-14 03:34:12 +00:00
Gleb Smirnoff	5f3563b0a5	Fix function name in KASSERT(). Submitted by: hiren	2014-02-12 20:11:20 +00:00
John Baldwin	8add0ced70	Correct assertion to assert that the existing device VM object uses the same type rather than asserting in the case where we just created a new VM object. Reviewed by: kib	2014-02-11 22:05:21 +00:00
Gleb Smirnoff	49fef6a202	Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized. Sponsored by: Nginx, Inc.	2014-02-10 19:59:46 +00:00
Gleb Smirnoff	f947570e35	Style.	2014-02-10 19:51:15 +00:00
Gleb Smirnoff	48343a2f34	Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones. Sponsored by: Nginx, Inc.	2014-02-10 19:48:26 +00:00
Alan Cox	7b9b301c6b	Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time. Successful prefaults after a zero-fill fault are extremely rare.	2014-02-09 01:59:52 +00:00
Gleb Smirnoff	0a5a3ccb81	Provide macros that allow easily export uma(9) zone limits and current usage via sysctl(9): SYSCTL_UMA_MAX() SYSCTL_ADD_UMA_MAX() SYSCTL_UMA_CUR() SYSCTL_ADD_UMA_CUR() Sponsored by: Nginx, Inc.	2014-02-07 14:29:03 +00:00
Alan Cox	63281952f0	Make prefaulting more aggressive on hard faults. Previously, we would only map a fraction of the pages that were fetched by vm_pager_get_pages() from secondary storage. Now, we map them all in order to avoid future soft faults. This effect is most evident when a memory-mapped file is accessed sequentially. Previously, there were 6 soft faults for every hard fault. Now, these soft faults are eliminated. Sponsored by: EMC / Isilon Storage Division	2014-02-02 20:21:53 +00:00
Alan Cox	793d14076a	In an effort to diagnose possible corruption of struct vm_page on some sparc64 machines make the page queue assert in vm_page_dequeue() more precise. While I'm here switch the page lock assert to the newer style.	2014-01-24 19:08:42 +00:00
John Baldwin	ab46f63e8f	Fix a couple of typos.	2014-01-21 03:27:47 +00:00
Gleb Smirnoff	7ebba1f8ff	ANSIfy declarations. Ok'ed by: alc	2014-01-20 18:47:56 +00:00
Alan Cox	86fa24710e	Style changes in vm_pageout_scan(): 1. Be consistent in the style of "act_delta" manipulations between the inactive and active queue scans. 2. Explicitly compare to zero. 3. The deactivation of a page is based is based on its recent history and not just the current call to vm_pageout_scan(). The variable "act_delta" represents the current state of the page, and not its history. Avoid possible confusion by not (ab)using "act_delta" for the making the deactivation decision. Submitted by: kib [1] Reviewed by: kib [2,3]	2014-01-18 20:02:59 +00:00
Alan Cox	9099545af1	Correctly update the count of stuck pages, "addl_page_shortage", in vm_pageout_scan(). There were missing increments in two less common cases. Don't conflate the count of stuck pages and the pageout deficit provided by vm_page_alloc{,_contig}(). (A proposed fix to the OOM code depends on this.) Handle held pages consistently in the inactive queue scan. In the more common case, we did not move the page to the tail of the queue. Whereas, in the less common case, we did. There's no particular reason to move the page in the less common case, so remove it. Perform the calculation of the page shortage for the active queue scan a little earlier, before the active queue lock is acquired. The correctness of this calculation doesn't depend on the active queue lock being held. Eliminate a redundant variable, "pcount". Use the more descriptive variable, "maxscan", in its place. Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess parentheses. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-01-12 19:04:20 +00:00
Alan Cox	000fb817d8	Since the introduction of the popmap to reservations in r259999, there is no longer any need for the page's PG_CACHED and PG_FREE flags to be set and cleared while the free page queues lock is held. Thus, vm_page_alloc(), vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after the free page queues lock is released to clear the page's flags. Moreover, the PG_FREE flag can be retired. Now that the reservation system no longer uses it, its only uses are in a few assertions. Eliminating these assertions is no real loss. Other assertions catch the same types of misbehavior, like doubly freeing a page (see r260032) or dirtying a free page (free pages are invalid and only valid pages can be dirtied). Eliminate an unneeded variable from vm_page_alloc_contig(). Sponsored by: EMC / Isilon Storage Division	2013-12-31 18:25:15 +00:00
Alan Cox	a08c151546	Add "popmap" assertions: The page being freed isn't already free, and the page being allocated isn't already allocated. Sponsored by: EMC / Isilon Storage Division	2013-12-29 04:54:52 +00:00
Alan Cox	ec17932242	MFp4 alc_popmap Change the way that reservations keep track of which pages are in use. Instead of using the page's PG_CACHED and PG_FREE flags, maintain a bit vector within the reservation. This approach has a couple benefits. First, it makes breaking reservations much cheaper because there are fewer cache misses to identify the unused pages. Second, it is a pre- requisite for supporting two or more reservation sizes.	2013-12-28 04:28:35 +00:00
Konstantin Belousov	b61a53d43d	Do not coalesce stack entry, vm_map_stack() asserts that the requested region is claimed by a new entry. Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack(), to really turn off coalescing code and call to vm_map_simplify_entry() [1]. Reported by: avg, peter, many Tested by: avg, peter Noted by: avg [1] Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-27 16:59:47 +00:00
Marcel Moolenaar	938b0f5b75	For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is that we don't have a good way (yet) to iterate over the mapped pages by virtual address and simply try each page within the range. Given that we call pmap_remove() over the entire 2^63 bytes of address space, it takes a while for pmap_remove to have tried all 2^50 pages. By using pmap_remove_pages() we use the PV list to find all mappings. Change derived from a patch by: alc	2013-12-26 05:46:10 +00:00
Dimitry Andric	d395270d06	In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as argument, cast the incoming 0 argument to void , to silence a warning from clang 3.4 ("expression which evaluates to zero treated as a null pointer constant of type 'void ' [-Wnon-literal-null-conversion]"). MFC after: 3 days	2013-12-25 22:32:34 +00:00
Alan Cox	703b304f33	Eliminate a redundant parameter to vm_radix_replace(). Improve the wording of the comment describing vm_radix_replace(). Reviewed by: attilio MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2013-12-08 20:07:02 +00:00
Craig Rodrigues	a3845534a0	In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty" message printed to the console. This makes it easier to track down the source of certain memory leaks. Suggested by: adrian	2013-11-29 08:04:45 +00:00
Alexander Motin	03175483c2	- Add bucket size column to `show uma` DDB command. - Add `show umacache` command to show alike stats for cache-only UMA zones.	2013-11-28 19:20:49 +00:00
Alexander Motin	cec48e002a	Make UMA to not blindly force offpage slab header allocation for large (> PAGE_SIZE) zones. If zone is not multiple to PAGE_SIZE, there may be enough space for the header at the last page, so we may avoid extra header memory allocation and hash table update/lookup. ZFS creates bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336). This change gives good use to at least some of otherwise lost memory there. Reviewed by: avg	2013-11-27 20:56:10 +00:00
Alexander Motin	f7104ccd94	Don't count bucket allocation failures for UMA zones as their own failures. There are good reasons for this to happen, such as recursion prevention, etc. and they are not fatal since buckets are just an optimization mechanism. Real bucket allocation failures are any way counted by the bucket zones themselves, and we don't need double accounting there.	2013-11-27 20:16:18 +00:00
Alexander Motin	e8a720fe60	Fix bug introduced at r252226, when udata argument passed to bucket_alloc() was used without making sure first that it was really passed for us. On some of my systems this bug made user argument passed by ZFS code to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.	2013-11-27 19:55:42 +00:00
Alexander Motin	8a8d9d1475	When purging per-CPU UMA caches do not return empty buckets into the global full bucket cache to not trigger assertion if allocation happen before that global cache get purged.	2013-11-23 13:42:56 +00:00
Konstantin Belousov	79e9451f07	Vm map code performs clipping when map entry covers region which is larger than the operational region. If the op region size is zero, clipping would create a zero-sized map entry. The result is that vm map splay starts behaving inconsistently, sometimes returning zero-sized entry, sometimes the next (or previous) entry. One step further, it could result in e.g. vm_map_wire() setting MAP_ENTRY_IN_TRANSITION on the zero-sized entry, but failing to clear it in the done part. The vm_map_delete() than hangs forever waiting for the flag removal. Verify for zero-length requests and act as if it is always successfull without performing any action on the address space. Diagnosed by: pho Tested by: pho (previous version) Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-20 09:03:48 +00:00
Konstantin Belousov	ff3ae454c0	Add assertions to cover all places in the wiring and unwiring code where MAP_ENTRY_IN_TRANSITION is set or cleared. Tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-20 08:47:54 +00:00
Konstantin Belousov	7e14088d93	Revert back to use int for the page counts. In vn_io_fault(), the i/o is chunked to pieces limited by integer io_hold_cnt tunable, while vm_fault_quick_hold_pages() takes integer max_count as the upper bound. Rearrange the checks to correctly handle overflowing address arithmetic. Submitted by: bde Tested by: pho Discussed with: alc MFC after: 1 week	2013-11-20 08:45:26 +00:00
Alexander Motin	a2de44abf5	Implement mechanism to safely but slowly purge UMA per-CPU caches. This is a last resort for very low memory condition in case other measures to free memory were ineffective. Sequentially cycle through all CPUs and extract per-CPU cache buckets into zone cache from where they can be freed.	2013-11-19 10:51:46 +00:00
Alexander Motin	4d104ba024	Grow UMA zone bucket size also on lock congestion during item free. Lock congestion is the same, whether it happens on alloc or free, so handle it equally. Now that we have back pressure, there is no problem to grow buckets a bit faster. Any way growth is much slower then in 9.x.	2013-11-19 10:17:10 +00:00
Alexander Motin	f3932e9025	Add two new UMA bucket zones to store 3 and 9 items per bucket. These new buckets make bucket size self-tuning more soft and precise. Without them there are buckets for 1, 5, 13, 29, ... items. While at bigger sizes difference about 2x is fine, at smallest ones it is 5x and 2.6x respectively. New buckets make that line look like 1, 3, 5, 9, 13, 29, reducing jumps between steps, making algorithm work softer, allocating and freeing memory in better fitting chunks. Otherwise there is quite a big gap between allocating 128K and 5x128K of RAM at once.	2013-11-19 10:10:44 +00:00
Alexander Motin	ace66b568e	Implement soft pressure on UMA cache bucket sizes. Every time system detects low memory condition decrease bucket sizes for each zone by one item. As result, higher memory pressure will push to smaller bucket sizes and so smaller per-CPU caches and so more efficient memory use. Before this change there was no force to oppose buckets growth as result of practically inevitable zone lock conflicts, and after some run time per-CPU caches could consume enough RAM to kill the system.	2013-11-19 10:05:53 +00:00
Konstantin Belousov	d005ed537c	Avoid overflow for the page counts. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-12 08:47:58 +00:00
Konstantin Belousov	1bd7d0b7db	If filesystem declares that it supports shared locking for writes, use shared vnode lock for VOP_PUTPAGES() as well. The only such filesystem in the tree is ZFS, and it uses vnode_pager_generic_putpages(), which performs the pageout with VOP_WRITE(). Reviewed by: alc Discussed with: avg Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-11-09 20:36:29 +00:00
Konstantin Belousov	9ded9474d3	Do not coalesce if the swap object belongs to tmpfs vnode. The coalesce would extend the object to keep pages for the anonymous mapping created by the process. The pages has no relations to the tmpfs file content which could be written into the corresponding range, causing anonymous mapping and file content aliasing and subsequent corruption. Another lesser problem created by coalescing is over-accounting on the tmpfs node destruction, since the object size is substracted from the total count of the pages owned by the tmpfs mount. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-05 06:18:50 +00:00
Alan Cox	eb2f42fbb0	Tidy up the output of "sysctl vm.phys_free". Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-10-10 16:11:45 +00:00
Alan Cox	f872f6eaf5	Both the vm_map and vmspace zones are defined as "no free". So, there is no point in defining a fini function for these zones. Reviewed by: kib Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-09-22 17:48:10 +00:00
Neel Natu	74d1d2b7cc	Merge the following changes from projects/bhyve_npt_pmap: - add fields to 'struct pmap' that are required to manage nested page tables. - add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'. These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0. Reviewed by: kib@, alc@ Approved by: re (glebius)	2013-09-20 17:06:49 +00:00
Alan Cox	deb179bb4c	The pmap function pmap_clear_reference() is no longer used. Remove it. pmap_clear_reference() has had exactly one caller in the kernel for several years, more precisely, since FreeBSD 8. Now, that call no longer exists. Approved by: re (kib) Sponsored by: EMC / Isilon Storage Division	2013-09-20 04:30:18 +00:00
John Baldwin	55648840de	Extend the support for exempting processes from being killed when swap is exhausted. - Add a new protect(1) command that can be used to set or revoke protection from arbitrary processes. Similar to ktrace it can apply a change to all existing descendants of a process as well as future descendants. - Add a new procctl(2) system call that provides a generic interface for control operations on processes (as opposed to the debugger-specific operations provided by ptrace(2)). procctl(2) uses a combination of idtype_t and an id to identify the set of processes on which to operate similar to wait6(). - Add a PROC_SPROTECT control operation to manage the protection status of a set of processes. MADV_PROTECT still works for backwards compatability. - Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc) the first bit of which is used to track if P_PROTECT should be inherited by new child processes. Reviewed by: kib, jilles (earlier version) Approved by: re (delphij) MFC after: 1 month	2013-09-19 18:53:42 +00:00
Konstantin Belousov	9eab548476	PG_SLAB no longer serves a useful purpose, since m->object is no longer abused to store pointer to slab. Remove it. Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (hrs)	2013-09-17 07:35:26 +00:00
Konstantin Belousov	3846a82284	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)	2013-09-16 06:25:54 +00:00
Konstantin Belousov	196beb5359	If the last page of the file is partially full and whole valid portion is invalidated, invalidate the whole page. Otherwise, partially valid page appears on a page queue, which is wrong. This could only happen for the last page, because only then buffer which triggered invalidation could not cover the whole page. Reported and tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (delphij) MFC after: 2 weeks	2013-09-14 10:11:38 +00:00
John Baldwin	6a87d217e2	Fix an off-by-one error when populating mincore(2) entries for skipped entries. lastvecindex references the last valid byte, so the new bytes should come after it. Approved by: re (kib) MFC after: 1 week	2013-09-12 20:46:32 +00:00
John Baldwin	edb572a38c	Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use an address in the first 2GB of the process's address space. This flag should have the same semantics as the same flag on Linux. To facilitate this, add a new parameter to vm_map_find() that specifies an optional maximum virtual address. While here, fix several callers of vm_map_find() to use a VMFS_* constant for the findspace argument instead of TRUE and FALSE. Reviewed by: alc Approved by: re (kib)	2013-09-09 18:11:59 +00:00
Konstantin Belousov	3aaea6efd5	Drain for the xbusy state for two places which potentially do pmap_remove_all(). Not doing the drain allows the pmap_enter() to proceed in parallel, making the pmap_remove_all() effects void. The race results in an invalidated page mapped wired by usermode. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-08 17:51:22 +00:00
Konstantin Belousov	7a4b2bc56c	The vm_page_trysbusy() should not fail when shared busy counter or VPB_BIT_WAITERS flag were changed between reading of busy_lock and the cas. The vm_page_sbusy(), which is the only user of vm_page_trysbusy() in the tree, panics on the failure, which in these cases is transient and do not mean that the current page state prevents sbusying. Retry the operation inside vm_page_trysbusy() if cas failed, only return a failure when VPB_BIT_SHARED is cleared. Reported and tested by: pho Reviewed by: attilio Sponsored by: The FreeBSD Foundation	2013-09-05 12:54:40 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Kirk McKusick	1645995b97	Fix bug introduced in rewrite of keg_free_slab in -r251894. The consequence of the bug is that fini calls are not done when a slab is freed by a call-back from the page daemon. It went unnoticed for two months because fini is little used. I spotted the bug while reading the code to learn how it works so I could write it up for the next edition of the Design and Implementation of FreeBSD book. No MFC needed as this code exists only in HEAD. Reviewed by: kib, jeff Tested by: pho	2013-08-31 15:40:15 +00:00
Alan Cox	51321f7c31	Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2). Since our malloc(3) makes heavy use of madvise(2), this change can have a measureable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%. Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise(). Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division	2013-08-29 15:49:05 +00:00
Gleb Smirnoff	133dae887b	Remove comment that is no longer relevant since r254182.	2013-08-26 14:14:25 +00:00
Alan Cox	776cad90ff	Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can reclaim the last preexisting cached page in the object, resulting in a call to vdrop(). Detect this scenario so that the vnode's hold count is correctly maintained. Otherwise, we panic. Reported by: scottl Tested by: pho Discussed with: attilio, jeff, kib	2013-08-23 17:27:12 +00:00
Konstantin Belousov	e68c64f0ba	Revert r254501. Instead, reuse the type stability of the struct pmap which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE zone. Initialize the pmap lock in the vmspace zone init function, and remove pmap lock initialization and destruction from pmap_pinit() and pmap_release(). Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:12:24 +00:00
Konstantin Belousov	5944de8ecd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
Jeff Roberson	274132ac23	- Eliminate the vm object lock from the active queue scan. It is not necessary since we do not free or cache the page from active anymore. Document the one possible race that is harmless. Sponsored by: EMC / Isilon Storage Division Discussed with: alc	2013-08-21 22:39:19 +00:00
Alan Cox	28a288cbaa	Addendum to r254141: Allow recursion on the free pages queues lock in vm_page_alloc_freelist(). Reported and tested by: sbruno Sponsored by: EMC / Isilon Storage Division	2013-08-21 15:31:43 +00:00
Jeff Roberson	c9612b2db8	- Increase the active lru refresh interval to 10 minutes. This has been shown to negatively impact some workloads and the goal is only to eliminate worst case behaviors for very long periods of paging inactivity. Eventually we should determine a more complex scaling factor for this feature. - Rate limit low memory callback handlers to limit thrashing. Set the default to 10 seconds. Sponsored by: EMC / Isilon Storage Division	2013-08-19 23:54:24 +00:00
Jeff Roberson	d91722fb27	- Use an arbitrary but reasonably large import size for kva on architectures that don't support superpages. This keeps the number of spans and internal fragmentation lower. - When the user asks for alignment from vmem_xalloc adjust the imported size by 2*align to be certain we can satisfy the allocation. This comes at the expense of potential failures when the backend can't supply enough memory but could supply the requested size and alignment. Sponsored by: EMC / Isilon Storage Division	2013-08-19 23:02:39 +00:00
Konstantin Belousov	949c918635	Remove the arbitrary binding of the pagedaemon threads to the domains, update the comment accordingly and make it more precise. Requested and reviewed by: jeff (previous version)	2013-08-17 07:10:01 +00:00
John Baldwin	5aa60b6f21	Add new mmap(2) flags to permit applications to request specific virtual address alignment of mappings. - MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n). Requests for n >= number of bits in a pointer or less than the size of a page fail with EINVAL. This matches the API provided by NetBSD. - MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used to optimize the chances of using large pages. By default it will align the mapping on a large page boundary (the system is free to choose any large page size to align to that seems best for the mapping request). However, if the object being mapped is already using large pages, then it will align the virtual mapping to match the existing large pages in the object instead. - Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment. MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE. - mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than explicitly using VMFS_SUPER_SPACE. All device objects are forced to use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively equivalent. Reviewed by: alc MFC after: 1 month	2013-08-16 21:13:55 +00:00
Jeff Roberson	114f62c6df	- Fix bug in r254304. Use the ACTIVE pq count for the active list processing, not inactive. This was the result of a bad merge. Reported by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-15 22:29:49 +00:00
Attilio Rao	a834cbaec8	On the recovery path for vm_page_alloc(), if a page had been requested wired, unwind back the wiring bits otherwise we can end up freeing a page that is considered wired. Sponsored by: EMC / Isilon storage division Reported by: alc	2013-08-15 11:01:25 +00:00
Jeff Roberson	8441d1e842	- Add a statically allocated memguard arena since it is needed very early on. - Pass the appropriate flags to vmem_xalloc() when allocating space for the arena from kmem_arena. Sponsored by: EMC / Isilon Storage Division	2013-08-13 22:40:43 +00:00
Jeff Roberson	d9e232109f	Improve pageout flow control to wakeup more frequently and do less work while maintaining better LRU of active pages. - Change v_free_target to include the quantity previously represented by v_cache_min so we don't need to add them together everywhere we use them. - Add a pageout_wakeup_thresh that sets the free page count trigger for waking the page daemon. Set this 10% above v_free_min so we wakeup before any phase transitions in vm users. - Adjust down v_free_target now that we're willing to accept more pagedaemon wakeups. This means we process fewer pages in one iteration as well, leading to shorter lock hold times and less overall disruption. - Eliminate vm_pageout_page_stats(). This was a minor variation on the PQ_ACTIVE segment of the normal pageout daemon. Instead we now process 1 / vm_pageout_update_period pages every second. This causes us to visit the whole active list every 60 seconds. Previously we would only maintain the active LRU when we were short on pages which would mean it could be woefully out of date. Reviewed by: alc (slight variant of this) Discussed with: alc, kib, jhb Sponsored by: EMC / Isilon Storage Division	2013-08-13 21:56:16 +00:00
Attilio Rao	6006884122	Correct the recovery logic in vm_page_alloc_contig: what is really needed on this code snipped is that all the pages that are already fully inserted gets fully freed, while for the others the object removal itself might be skipped, hence the object might be set to NULL. Sponsored by: EMC / Isilon storage division Reported by: alc, kib Reviewed by: alc	2013-08-11 21:15:04 +00:00
Konstantin Belousov	c325e866f4	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-10 17:36:42 +00:00
Andrey Zonov	767cfe52cc	Remove unused definition for CTL_VM_NAMES. Suggested by: bde	2013-08-09 23:47:43 +00:00
John Baldwin	cdc00bf7d2	Revert the addition of VPO_BUSY and instead update vm_page_replace() to properly unbusy the page. Submitted by: alc	2013-08-09 21:14:55 +00:00
David E. O'Brien	5d82a21469	Add missing 'VPO_BUSY' from r254141 to fix kernel build break.	2013-08-09 16:43:50 +00:00
Attilio Rao	e946b94934	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Konstantin Belousov	449c2e92c9	Split the pagequeues per NUMA domains, and split pageademon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-07 16:36:38 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Mark Johnston	c0432fc38b	Fill in the description fields for M_FICT_PAGES. Reviewed by: kib MFC after: 3 days	2013-08-07 00:20:30 +00:00
Attilio Rao	be99683637	Revert r253939: We cannot busy a page before doing pagefaults. Infact, it can deadlock against vnode lock, as it tries to vget(). Other functions, right now, have an opposite lock ordering, like vm_object_sync(), which acquires the vnode lock first and then sleeps on the busy mechanism. Before this patch is reinserted we need to break this ordering. Sponsored by: EMC / Isilon storage division Reported by: kib	2013-08-05 08:55:35 +00:00
Attilio Rao	3b6714cacb	The page hold mechanism is fast but it has couple of fallouts: - It does not let pages respect the LRU policy - It bloats the active/inactive queues of few pages Try to avoid it as much as possible with the long-term target to completely remove it. Use the soft-busy mechanism to protect page content accesses during short-term operations (like uiomove_fromphys()). After this change only vm_fault_quick_hold_pages() is still using the hold mechanism for page content access. There is an additional complexity there as the quick path cannot immediately access the page object to busy the page and the slow path cannot however busy more than one page a time (to avoid deadlocks). Fixing such primitive can bring to complete removal of the page hold mechanism. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff Tested by: pho	2013-08-04 21:07:24 +00:00
Andrey Zonov	23b5c8fe3d	Unbreak sysctl ABI changes introduced in r253662 Requested by: bde	2013-07-29 18:48:51 +00:00
Jeff Roberson	bb7858ea20	Improve page LRU quality and simplify the logic. - Don't short-circuit aging tests for unmapped objects. This biases against unmapped file pages and transient mappings. - Always honor PGA_REFERENCED. We can now use this after soft busying to lazily restart the LRU. - Don't transition directly from active to cached bypassing the inactive queue. This frees recently used data much too early. - Rename actcount to act_delta to be more consistent with use and meaning. Reviewed by: kib, alc Sponsored by: EMC / Isilon Storage Division	2013-07-26 23:22:05 +00:00
Andrey Zonov	20dd2f38dc	Remove define and documentation for vm_pageout_algorithm missed in r253587	2013-07-26 02:00:06 +00:00
Tim Kientzle	763d9566fe	Clear entire map structure including locks so that the locks don't accidentally appear to have been already initialized. In particular, this fixes a consistent kernel crash on armv6 with: panic: lock "vm map (user)" 0xc09cc050 already initialized that appeared with r251709. PR: arm/180820	2013-07-25 03:48:37 +00:00
Andriy Gapon	785797c341	rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST Also directly call swapper() at the end of mi_startup instead of relying on swapper being the last thing in sysinits order. Rationale: - "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage - "scheduler" was misleading, the function swaps in the swapped out processes - another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be invoked depending on its relative order with scheduler; this was not obvious and the bug actually used to exist Reviewed by: kib (ealier version) MFC after: 14 days	2013-07-24 09:45:31 +00:00
Gleb Smirnoff	dab12c75bb	Since r251709 a slab no longer use 8-bit indicies to manage items, thus remove a stale comment. Reviewed by: jeff	2013-07-24 06:13:00 +00:00
Jeff Roberson	90776bd730	- Remove the long obsolete 'vm_pageout_algorithm' experiment. Discussed with: alc Sponsored by: EMC / Isilon Storage Division	2013-07-24 01:25:56 +00:00
Jeff Roberson	c93dcf2235	- Correct a stale comment. We don't have vclean() anymore. The work is done by vgonel() and destroy_vobject() should only be called once from VOP_INACTIVE(). Sponsored by: EMC / Isilon Storage Division	2013-07-23 22:52:38 +00:00
Gleb Smirnoff	e28a647db6	Revert r249590 and in case if mp_ncpus isn't initialized use MAXCPU. This allows us to init counter zone at early stage of boot. Reviewed by: kib Tested by: Lytochkin Boris <lytboris gmail.com>	2013-07-23 11:16:40 +00:00
Jeremie Le Hen	fc2b167929	Fix previous commit when option RACCT is not used. MFC after: 7 days	2013-07-22 22:16:47 +00:00
Jeremie Le Hen	c92b506977	Fix a panic in the racct code when munlock(2) is called with incorrect values. The racct code in sys_munlock() assumed that the boundaries provided by the userland were correct as long as vm_map_unwire() returned successfully. However the latter contains its own logic and sometimes manages to do something out of those boundaries, even if they are buggy. This change makes the racct code to use the accounting done by the vm layer, as it is done in other places such as vm_mlock(). Despite fixing the panic, Alan Cox pointed that this code is still race-y though: two simultaneous callers will produce incorrect values. Reviewed by: alc MFC after: 7 days	2013-07-22 21:47:14 +00:00
John Baldwin	ff74a3fa6b	Be more aggressive in using superpages in all mappings of objects: - Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for vm_map_find() that will try to alter the alignment of a mapping to match any existing superpage mappings of the object being mapped. If no suitable address range is found with the necessary alignment, vm_map_find() will fall back to using the simple first-fit strategy (VMFS_ANY_SPACE). - Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE. Reviewed by: alc (earlier version) MFC after: 2 weeks	2013-07-19 19:06:15 +00:00
Konstantin Belousov	5a3c920f45	When swap pager allocates metadata in the pagedaemon context, allow it to drain the reserve. This was broken in r243040, causing deadlock. Note that VM_WAIT call in case of uma_zalloc() failure from pagedaemon would only wait for the v_pageout_free_min anyway. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-07-11 20:33:57 +00:00
Konstantin Belousov	4f9c9114a3	The vm_fault() should not be allowed to proceed on the map entry which is being wired now. The entry wired count is changed to non-zero in advance, before the map lock is dropped. This makes the vm_fault() to perceive the entry as wired, and breaks the fragment which moves the wire count from the shadowed page, to the upper page, making the code unwiring non-wired page. On the other hand, the vm_fault() calls from vm_fault_wire() should be allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from vm_fault() when wiring_thread is not current. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:58:28 +00:00
Konstantin Belousov	0acea7dfde	The mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with parallel creation of the map entries, e.g. by mmap() or stack growing. It also breaks when other entry is wired in parallel. The vm_map_wire() iterates over the map entries in the region, and assumes that map entries it finds are marked as in transition before, also that any entry marked as in transition, are marked by the current invocation of vm_map_wire(). This is not true for new entries in the holes. Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct vm_map_entry. In vm_map_wire() and vm_map_unwire(), only process the entries which transition owner is the current thread. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:55:08 +00:00
Konstantin Belousov	ebf5d94e82	Never remove user-wired pages from an object when doing msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object range which corresponds to the user-wired vm_map_entry, is always fully populated. Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the preserving behaviour, use it when calling vm_object_page_remove() from vm_object_sync(). Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:47:26 +00:00
Konstantin Belousov	3abeb8113d	In the vm_page_set_invalid() function, do not assert that the page is not busy, since its only caller brelse() can legitimately call it on busy page. This happens for VOP_PUTPAGES() on filesystems that use buffers and which VOP_WRITE() method marked the buffer containing page as non-cacheable. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:38:39 +00:00
Konstantin Belousov	56ce850bf8	Fix typo in comment. MFC after: 3 days	2013-07-09 13:22:30 +00:00
Neel Natu	6b5fbc1225	vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be reset by pmap_page_init() right after being initialized in vm_page_initfake(). The statement above is with reference to the amd64 implementation of pmap_page_init(). Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing the 'memattr'. Reviewed by: kib MFC after: 2 weeks	2013-07-03 23:38:37 +00:00
Davide Italiano	a1dff92058	Remove a spurious keg lock acquisition.	2013-06-28 21:13:19 +00:00
Jeff Roberson	5f51836645	- Add a general purpose resource allocator, vmem, from NetBSD. It was originally inspired by the Solaris vmem detailed in the proceedings of usenix 2001. The NetBSD version was heavily refactored for bugs and simplicity. - Use this resource allocator to allocate the buffer and transient maps. Buffer cache defrags are reduced by 25% when used by filesystems with mixed block sizes. Ultimately this may permit dynamic buffer cache sizing on low KVA machines. Discussed with: alc, kib, attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-06-28 03:51:20 +00:00
Jeff Roberson	6fd34d6f67	- Resolve bucket recursion issues by passing a cookie with zone flags through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg(). - Make some smaller buckets for large zones to further reduce memory waste. - Implement uma_zone_reserve(). This holds aside a number of items only for callers who specify M_USE_RESERVE. buckets will never be filled from reserve allocations. Sponsored by: EMC / Isilon Storage Division	2013-06-26 00:57:38 +00:00
Gleb Smirnoff	4aa4cd8e92	Typo in comment.	2013-06-24 13:36:16 +00:00
Jeff Roberson	af5263743c	- Add a per-zone lock for zones without kegs. - Be more explicit about zone vs keg locking. This functionally changes almost nothing. - Add a size parameter to uma_zcache_create() so we can size the buckets. - Pass the zone to bucket_alloc() so it can modify allocation flags as appropriate. - Fix a bug in zone_alloc_bucket() where I missed an address of operator in a failure case. (Found by pho) Sponsored by: EMC / Isilon Storage Division	2013-06-20 19:08:12 +00:00
Jeff Roberson	8aaf680e90	- Persist the caller's flags in the bucket allocation flags so we don't lose a M_NOVM when we recurse into a bucket allocation. Sponsored by: EMC / Isilon Storage Division	2013-06-19 02:30:32 +00:00
Dag-Erling Smørgrav	5b3e02570a	Fix a bug that allowed a tracing process (e.g. gdb) to write to a memory-mapped file in the traced process's address space even if neither the traced process nor the tracing process had write access to that file. Security: CVE-2013-2171 Security: FreeBSD-SA-13:06.mmap Approved by: so	2013-06-18 07:02:35 +00:00
Jeff Roberson	fc03d22b17	Refine UMA bucket allocation to reduce space consumption and improve performance. - Always free to the alloc bucket if there is space. This gives LIFO allocation order to improve hot-cache performance. This also allows for zones with a single bucket per-cpu rather than a pair if the entire working set fits in one bucket. - Enable per-cpu caches of buckets. To prevent recursive bucket allocation one bucket zone still has per-cpu caches disabled. - Pick the initial bucket size based on a table driven maximum size per-bucket rather than the number of items per-page. This gives more sane initial sizes. - Only grow the bucket size when we face contention on the zone lock, this causes bucket sizes to grow more slowly. - Adjust the number of items per-bucket to account for the header space. This packs the buckets more efficiently per-page while making them not quite powers of two. - Eliminate the per-zone free bucket list. Always return buckets back to the bucket zone. This ensures that as zones grow into larger bucket sizes they eventually discard the smaller sizes. It persists fewer buckets in the system. The locking is slightly trickier. - Only switch buckets in zalloc, not zfree, this eliminates pathological cases where we ping-pong between two buckets. - Ensure that the thread that fills a new bucket gets to allocate from it to give a better upper bound on allocation time. Sponsored by: EMC / Isilon Storage Division	2013-06-18 04:50:20 +00:00
Jeff Roberson	0095a78419	- Add a new UMA API: uma_zcache_create(). This makes a zone without any backing memory that is only a container for per-cpu caches of arbitrary pointer items. These zones have no kegs. - Convert the regular keg based allocator to use the new import/release functions. - Move some stats to be atomics since they would require excessive zone locking/unlocking with the new import/release paradigm. Make zone_free_item simpler now that callers can manage more stats. - Check for these cache-only zones in the public APIs and debugging code by checking zone_first_keg() against NULL. Sponsored by: EMC / Isilong Storage Division	2013-06-17 03:43:47 +00:00
Jeff Roberson	ef72505e6d	- Convert the slab free item list from a linked array of indices to a bitmap using sys/bitset. This is much simpler, has lower space overhead and is cheaper in most cases. - Use a second bitmap for invariants asserts and improve the quality of the asserts as well as the number of erroneous conditions that we will catch. - Drastically simplify sizing code. Special case refcnt zones since they will be going away. - Update stale comments. Sponsored by: EMC / Isilon Storage Division	2013-06-13 21:05:38 +00:00
Alan Cox	2051980f97	Revise the interface between vm_object_madvise() and vm_page_dontneed() so that pointless calls to pmap_is_modified() can be easily avoided when performing madvise(..., MADV_FREE). Sponsored by: EMC / Isilon Storage Division	2013-06-10 01:48:21 +00:00
Gleb Smirnoff	995d706909	Make sys_mlock() function just a wrapper around vm_mlock() function that does all the job. Reviewed by: kib, jilles Sponsored by: Nginx, Inc.	2013-06-08 13:13:40 +00:00
Attilio Rao	002f377ab2	Complete r251452: Avoid to busy/unbusy a page in cases where there is no need to drop the vm_obj lock, more nominally when the page is full valid after vm_page_grab(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-06-06 18:19:26 +00:00
Attilio Rao	dfd55c0c7b	In vm_object_split(), busy and consequently unbusy the pages only when swap_pager_copy() is invoked, otherwise there is no reason to do so. This will eliminate the necessity to busy pages most of the times. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-06-04 22:47:01 +00:00
Alan Cox	da38420832	Update a comment.	2013-06-04 05:44:52 +00:00
Alan Cox	e23b0a193e	Relax the object locking in vm_pageout_map_deactivate_pages() and vm_pageout_object_deactivate_pages(). A read lock suffices. Sponsored by: EMC / Isilon Storage Division	2013-06-04 02:28:47 +00:00
Konstantin Belousov	be6ec55376	Remove irrelevant comments. Discussed with: alc MFC after: 3 days	2013-06-03 17:30:40 +00:00
Alan Cox	b417181250	Require that the page lock is held, instead of the object lock, when clearing the page's PGA_REFERENCED flag. Since we are typically manipulating the page's act_count field when we are clearing its PGA_REFERENCED flag, the page lock is already held everywhere that we clear the PGA_REFERENCED flag. So, in fact, this revision only changes some comments and an assertion. Nonetheless, it will enable later changes to object locking in the pageout code. Introduce vm_page_assert_locked(), which completely hides the implementation details of the page lock from the caller, and use it in vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll either eliminate or replace the various uses of vm_page_lock_assert() with vm_page_assert_locked(). Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-06-03 01:22:54 +00:00
Alan Cox	b4e498071d	Now that access to the page's "act_count" field is synchronized by the page lock instead of the object lock, there is no reason for vm_page_activate() to assert that the object is locked for either read or write access. (The "VPO_UNMANAGED" flag never changes after page allocation.) Sponsored by: EMC / Isilon Storage Division	2013-06-01 20:32:34 +00:00
Alan Cox	ef5ba5a31d	Simplify the definition of vm_page_lock_assert(). There is no compelling reason to inline the implementation of vm_page_lock_assert() in the !KLD_MODULES case. Use the same implementation for both KLD_MODULES and !KLD_MODULES. Reviewed by: kib	2013-05-31 16:00:42 +00:00
Konstantin Belousov	7560005c41	After the object lock was dropped, the object' reference count could change. Retest the ref_count and return from the function to not execute the further code which assumes that ref_count == 1 if it is not. Also, do not leak vnode lock if other thread cleared OBJ_TMPFS flag meantime. Reported by: bdrewery Tested by: bdrewery, pho Sponsored by: The FreeBSD Foundation	2013-05-30 20:00:19 +00:00
Konstantin Belousov	782d4a636b	Remove the capitalization in the assertion message. Print the address of the object to get useful information from optimizated kernels dump.	2013-05-30 19:53:31 +00:00
Attilio Rao	c25673ffd6	o Change the locking scheme for swp_bcount. It can now be accessed with a write lock on the object containing it OR with a read lock on the object containing it along with the swhash_mtx. o Remove some duplicate assertions for swap_pager_freespace() and swap_pager_unswapped() but keep the object locking references for documentation. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-28 22:07:23 +00:00
Attilio Rao	83b375ea16	Acquire read lock on the src object for vm_fault_copy_entry(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-22 15:11:00 +00:00
Attilio Rao	9af6d512f5	o Relax locking assertions for vm_page_find_least() o Relax locking assertions for pmap_enter_object() and add them also to architectures that currently don't have any o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade operation on the per-object rwlock o Use all the mechanisms above to make vm_map_pmap_enter() to work mostl of the times only with readlocks. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-21 20:38:19 +00:00
Konstantin Belousov	4fab678be2	Add ddb command 'show pginfo' which provides useful information about a vm page, denoted either by an address of the struct vm_page, or, if the '/p' modifier is specified, by a physical address of the corresponding frame. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-21 11:04:00 +00:00
Alan Cox	c141ae7f49	Relax the object locking in vm_fault_prefault(). A read lock suffices. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-05-17 19:02:36 +00:00
Alan Cox	767a6420bc	Relax the object locking assertion in vm_page_lookup(). Now that a radix tree is used to maintain the object's collection of resident pages, vm_page_lookup() no longer needs an exclusive lock. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-05-17 18:49:43 +00:00
Attilio Rao	7e226537c7	o Add accessor functions to add and remove pages from a specific freelist. o Split the pool of free pages queues really by domain and not rely on definition of VM_RAW_NFREELIST. o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific function that is called when calculating the allocation domain. The RR counter is kept, currently, per-thread. In the future it is expected that such function evolves in a real policy decision referee, based on specific informations retrieved by per-thread and per-vm_object attributes. o Add the concept of "probed domains" under the form of vm_ndomains. It is responsibility for every architecture willing to support multiple memory domains to correctly probe vm_ndomains along with mem_affinity segments attributes. Those two values are supposed to remain always consistent. Please also note that vm_ndomains and td_dom_rr_idx are both int because segments already store domains as int. Ideally u_int would have much more sense. Probabilly this should be cleaned up in the future. o Apply RR domain selection also to vm_phys_zero_pages_idle(). Sponsored by: EMC / Isilon storage division Partly obtained from: jeff Reviewed by: alc Tested by: jeff	2013-05-13 15:40:51 +00:00
Peter Wemm	df839389c5	Bandaid for compiling with gcc, which happens to be the default compiler for a number of platforms still.	2013-05-13 07:09:31 +00:00
Alan Cox	404eb1b3fd	Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the free page queues lock is held and (2) vm_radix_lookup_le() is called at most once. This change reduces the average time that the free page queues lock is held by vm_page_alloc() as well as vm_page_alloc()'s average overall running time. Sponsored by: EMC / Isilon Storage Division	2013-05-12 16:50:18 +00:00
Alan Cox	9f2e600890	To reduce the amount of arithmetic performed in the various radix tree functions, reverse the numbering scheme for the levels. The highest numbered level in the tree now appears near the root instead of the leaves. Sponsored by: EMC / Isilon Storage Division	2013-05-11 18:01:41 +00:00
Attilio Rao	d0b5855eb2	Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of MAXMEMDOM. This unbreak builds. Sponsored by: EMC / Isilon storage division Reported by: adrian, jeli	2013-05-08 10:55:39 +00:00
Attilio Rao	941646f5ec	Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency. Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc	2013-05-07 22:46:24 +00:00
Alan Cox	bb0e1de4ab	Remove a redundant call to panic() from vm_radix_keydiff(). The assertion before the loop accomplishes the same thing. Sponsored by: EMC / Isilon Storage Division	2013-05-07 18:45:34 +00:00
Alan Cox	2d4b9a6438	Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically, change the way that these functions ascend the tree when the search for a matching leaf fails at an interior node. Rather than returning to the root of the tree and repeating the lookup with an updated key, maintain a stack of interior nodes that were visited during the descent and use that stack to resume the lookup at the closest ancestor that might have a matching descendant. Sponsored by: EMC / Isilon Storage Division Reviewed by: attilio Tested by: pho	2013-05-04 22:50:15 +00:00
John Baldwin	f5c4b077be	Fix two bugs in the current NUMA-aware allocation code: - vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist() to allocate a page from a specific freelist. In the NUMA case it did not properly map the public VM_FREELIST_* constants to the correct backing freelists, nor did it try all NUMA domains for allocations from VM_FREELIST_DEFAULT. - vm_phys_alloc_pages() did not pin the thread and each call to vm_phys_alloc_freelist_pages() fetched the current domain to choose which freelist to use. If a thread migrated domains during the loop in vm_phys_alloc_pages() it could skip one of the freelists. If the other freelists were out of memory then it is possible that vm_phys_alloc_pages() would fail to allocate a page even though pages were available resulting in a panic in vm_page_alloc(). Reviewed by: alc MFC after: 1 week	2013-05-03 18:58:37 +00:00
Konstantin Belousov	53f5f8a0e1	Add a hint suggesting why tmpfs does not need a special case there.	2013-05-02 18:35:12 +00:00
Konstantin Belousov	6f2af3fcf3	Rework the handling of the tmpfs node backing swap object and tmpfs vnode v_object to avoid double-buffering. Use the same object both as the backing store for tmpfs node and as the v_object. Besides reducing memory use up to 2x times for situation of mapping files from tmpfs, it also makes tmpfs read and write operations copy twice bytes less. VM subsystem was already slightly adapted to tolerate OBJT_SWAP object as v_object. Now the vm_object_deallocate() is modified to not reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle VV_TEXT flag on the last dereference of the tmpfs backing object. Reviewed by: alc Tested by: pho, bf MFC after: 1 month	2013-04-28 19:38:59 +00:00
Konstantin Belousov	e5f299ff76	Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode' v_object of non OBJT_VNODE type. For vm_object_page_clean(), simply do not assert that object type must be OBJT_VNODE, and add a comment explaining how the check for OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such objects. For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it to be for swap pager (or default), handle the bypass filesystems, and correctly acquire the object reference in this case. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:25:09 +00:00
Konstantin Belousov	9b8851faae	Assert that the object type for the vnode' non-NULL v_object, passed to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was already reclaimed, OBJT_DEAD. Note that the later is only possible due to some filesystems, in particular, nfsiods from nfs clients, call vnode_pager_setsize() with unlocked vnode. More, if the object is terminated, do not perform the resizing operation. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:19:26 +00:00
Konstantin Belousov	6ded84276d	Convert panic() into KASSERT(). Reviewed by: alc MFC after: 1 week	2013-04-28 18:40:55 +00:00
Alan Cox	82af926a57	Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le(). This call is clearing bits from the key that will be set again by the next line. Sponsored by: EMC / Isilon Storage Division	2013-04-28 08:29:00 +00:00
Alan Cox	40076ebc5c	Avoid some lookup restarts in vm_radix_lookup_{ge,le}(). Sponsored by: EMC / Isilon Storage Division	2013-04-27 16:44:59 +00:00
Gleb Smirnoff	08a3102c0b	Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus isn't yet initialized. Otherwise we will panic at first allocation later. Sponsored by: Nginx, Inc.	2013-04-22 09:02:23 +00:00
Alan Cox	384875a3a6	Simplify vm_radix_{add,dec}lev(). Sponsored by: EMC / Isilon Storage Division	2013-04-22 01:26:13 +00:00
Alan Cox	880659fe81	When calculating the number of reserved nodes, discount the pages that will be used to store the nodes. Sponsored by: EMC / Isilon Storage Division	2013-04-18 05:34:33 +00:00
Alan Cox	a08f2cf69e	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we have previously created a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the second (and final) step in eliminating those unnecessary level zero interior nodes. Specifically, it updates the deletion and insertion functions so that they do not require a level zero interior node at the root of the trie. For a "buildworld" workload, this change results in a 16.8% reduction in the number of interior nodes allocated and a similar reduction in the average execution time for lookup functions. For example, the average execution time for a call to vm_radix_lookup_ge() is reduced by 22.9%. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-15 06:12:00 +00:00
Alan Cox	6f9c0b15bb	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we always create a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the first step in eliminating those unnecessary level zero interior nodes. Specifically, it updates all of the lookup functions so that they do not require a level zero interior node at the root. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-12 20:21:28 +00:00
Gleb Smirnoff	85dcf349c1	Convert UMA code to C99 uintXX_t types.	2013-04-09 17:43:48 +00:00
Gleb Smirnoff	04fc5741e0	Swap us_freecount and us_flags, achieving same structure size as before previous commit. Submitted by: alc	2013-04-09 17:25:15 +00:00
Gleb Smirnoff	8cf455b8d9	Since now we support 256 items per slab, we need more bits for us_freecount. This grows uma_slab_head on 32-bit arches, but growth isn't significant. Taking kmem zones as example, only the 32 byte zone is affected, ipers is reduced from 113 to 112. In collaboration with: kib	2013-04-09 15:15:52 +00:00
Gleb Smirnoff	025071f2af	Fix KASSERTs: maximum number of items per slab is 256.	2013-04-09 12:20:44 +00:00
Konstantin Belousov	b9781cf650	Fix the assertions for the state of the object under the map entry with the MAP_ENTRY_VN_WRITECNT flag: - Move the assertion that verifies the state of the v_writecount and vnp.writecount, under the block where the object is locked. - Check that the object type is OBJT_VNODE before asserting. Reported by: avg Reviewed by: alc MFC after: 1 week	2013-04-09 10:04:10 +00:00
Attilio Rao	a15f7df5de	The per-page act_count can be made very-easily protected by the per-page lock rather than vm_object lock, without any further overhead. Make the formal switch. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-04-08 20:02:27 +00:00
Gleb Smirnoff	ad97af7ebd	Merge from projects/counters: UMA_ZONE_PCPU zones. These zones have slab size == sizeof(struct pcpu), but request from VM enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such zone would have a separate twin for each CPU in the system, and these twins are at a distance of sizeof(struct pcpu) from each other. This magic value of distance would allow us to make some optimizations later. To address private item from a CPU simple arithmetics should be used: item = (type )((char )base + sizeof(struct pcpu) * curcpu) These arithmetics are available as zpcpu_get() macro in pcpu.h. To introduce non-page size slabs a new field had been added to uma_keg uk_slabsize. This shifted some frequently used fields of uma_keg to the fourth cache line on amd64. To mitigate this pessimization, uma_keg fields were a bit rearranged and least frequently used uk_name and uk_link moved down to the fourth cache line. All other fields, that are dereferenced frequently fit into first three cache lines. Sponsored by: Nginx, Inc.	2013-04-08 19:10:45 +00:00
Alan Cox	2c899fede2	Micro-optimize the order of struct vm_radix_node's fields. Specifically, arrange for all of the fields to start at a short offset from the beginning of the structure. Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in vm_radix_getroot(). Sponsored by: EMC / Isilon Storage Division	2013-04-07 01:30:51 +00:00
Jeff Roberson	26089666b6	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
Alan Cox	c1c82b36ad	Simplify vm_radix_keybarr(). Sponsored by: EMC / Isilon Storage Division	2013-04-06 18:04:35 +00:00
Alan Cox	72abda6466	Simplify vm_radix_insert(). Reviewed by: attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-06 06:02:55 +00:00
Alan Cox	96f1a84272	Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and vm_radix_topage(). This transformation eliminates some unnecessary conditional branches from the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove(). Simplify the control flow of vm_radix_lookup_{ge,le}(). Reviewed by: attilio (an earlier version) Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-03 06:37:25 +00:00
Konstantin Belousov	bafa6cfc93	Release the v_writecount reference on the vnode in case of error, before the vnode is vput() in vm_mmap_vnode(). Error return means that there is no use reference on the vnode from the vm object reference, and failing to restore v_writecount breaks the invariant that v_writecount is less or equal to the usecount. The situation observed when nfs client returns ESTALE for VOP_GETATTR() after the open. In collaboration with: pho MFC after: 1 week	2013-03-28 06:39:27 +00:00
Alan Cox	3fc10b7363	Introduce vm_radix_isleaf() and use it in a couple places. As compared to using vm_radix_node_page() == NULL, the compiler is able to generate one less conditional branch when vm_radix_isleaf() is used. More use cases involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove() will follow. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-26 17:30:40 +00:00
Alan Cox	652615dcb7	Micro-optimize the control flow in a few places. Eliminate a panic call that could never be reached in vm_radix_insert(). (If the pointer being checked by the panic call were ever NULL, the immmediately preceding loop would have already crashed on a NULL pointer dereference.) Reviewed by: attilio (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-03-24 16:43:07 +00:00
Konstantin Belousov	7db07e1c85	Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550. Sponsored by: The FreeBSD Foundation	2013-03-21 07:28:15 +00:00
Konstantin Belousov	6991ee13a6	Fix the logic inversion in the r248512. Noted by: mckay	2013-03-20 09:44:23 +00:00
Konstantin Belousov	2cc718a11c	Do not map the swap i/o pbufs if the geom provider for the swap partition accepts unmapped requests. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:39:27 +00:00
Konstantin Belousov	6ce697dc73	Pass unmapped buffers for page in requests if the filesystem indicated support for the unmapped i/o. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:36:28 +00:00
Konstantin Belousov	ee75e7de7b	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminate the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitely requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitely specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested. In the rework, filesystem metadata is not the subject to maxbufspace limit anymore. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, is accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is forced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not worked, because buffer_map fragmentation does not allow the limit to be reached. By Jeff Roberson request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks	2013-03-19 14:13:12 +00:00
Attilio Rao	774d251d99	Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie. This change is supposed to do the following: - Allowing the acquisition of read locking for lookup operations of the resident/cached pages collections as the per-vm_page_t splay iterators are now removed. - Increase the scalability of the operations on the page collections. The radix trie does rely on the consumers locking to ensure atomicity of its operations. In order to avoid deadlocks the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at maximum one new node per insert which means the maximum number of the desired nodes is the number of available physical frames themselves. However, not all the times a new bisection node is really needed. The radix trie implements path-compression because UFS indirect blocks can lead to several objects with a very sparse trie, increasing the number of levels to usually scan. It also helps in the nodes pre-fetching by introducing the single node per-insert property. This code is not generalized (yet) because of the possible loss of performance by having much of the sizes in play configurable. However, efforts to make this code more general and then reusable in further different consumers might be really done. The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped. Further technical notes broken into mealpieces can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/ Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)	2013-03-18 00:25:02 +00:00
Konstantin Belousov	70e198dd07	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00
Konstantin Belousov	e8a4a618cf	Add pmap function pmap_copy_pages(), which copies the content of the pages around, taking array of vm_page_t both for source and destination. Starting offsets and total transfer size are specified. The function implements optimal algorithm for copying using the platform-specific optimizations. For instance, on the architectures were the direct map is available, no transient mappings are created, for i386 the per-cpu ephemeral page frame is used. The code was typically borrowed from the pmap_copy_page() for the same architecture. Only i386/amd64, powerpc aim and arm/arm-v6 implementations were tested at the time of commit. High-level code, not committed yet to the tree, ensures that the use of the function is only allowed after explicit enablement. For sparc64, the existing code has known issues and a stab is added instead, to allow the kernel linking. Sponsored by: The FreeBSD Foundation Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6) MFC after: 2 weeks	2013-03-14 20:18:12 +00:00
Konstantin Belousov	e7788a47e3	Remove excessive and inconsistent initializers for the various kernel maps and submaps. MFC after: 2 weeks	2013-03-14 19:50:09 +00:00
Attilio Rao	4bc80a3402	Simplify vm_page_is_valid(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-03-12 12:20:49 +00:00
Alan Cox	34496b53ee	Update a comment: The object lock is no longer a mutex.	2013-03-09 21:32:24 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Attilio Rao	c934116100	Merge from vmc-playground: Introduce a new KPI that verifies if the page cache is empty for a specified vm_object. This KPI does not make assumptions about the locking in order to be used also for building assertions at init and destroy time. It is mostly used to hide implementation details of the page cache. Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: alc (vm_radix based version) Tested by: flo, pho, jhb, davide	2013-03-09 02:05:29 +00:00
Andre Oppermann	15ae0c9af9	Move the callout subsystem initialization to its own SYSINIT() from being indirectly called via cpu_startup()+vm_ksubmap_init(). The boot order position remains the same at SI_SUB_CPU. Allocation of the callout array is changed to stardard kernel malloc from a slightly obscure direct kernel_map allocation. kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init() to better describe its purpose. kern_timeout_callwheel_init() is removed simplifying the per-cpu initialization. Reviewed by: davide	2013-03-08 10:37:17 +00:00
Attilio Rao	198da1b2fa	Merge from vmcontention: As vm objects are type-stable there is no need to initialize the resident splay tree pointer and the cache splay tree pointer in _vm_object_allocate() but this could be done in the init UMA zone handler. The destructor UMA zone handler, will further check if the condition is retained at every destruction and catch for bugs. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-03-04 13:10:59 +00:00
Alan Cox	55f33f2caf	The value held by the vm object's field pg_color is only considered valid if the flag OBJ_COLORED is set. Since _vm_object_allocate() doesn't set this flag, it needn't initialize pg_color. Sponsored by: EMC / Isilon Storage Division	2013-03-02 18:07:29 +00:00
Pawel Jakub Dawidek	2609222ab4	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
Attilio Rao	dc1558d1cd	Merge from vmobj-rwlock: VM_OBJECT_LOCKED() macro is only used to implement a custom version of lock assertions right now (which likely spread out thanks to copy and paste). Remove it and implement actual assertions. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-27 18:12:13 +00:00
Attilio Rao	a4915c21d9	Merge from vmc-playground branch: Replace the sub-optimal uma_zone_set_obj() primitive with more modern uma_zone_reserve_kva(). The new primitive reserves before hand the necessary KVA space to cater the zone allocations and allocates pages with ALLOC_NOOBJ. More specifically: - uma_zone_reserve_kva() does not need an object to cater the backend allocator. - uma_zone_reserve_kva() can cater M_WAITOK requests, in order to serve zones which need to do uma_prealloc() too. - When possible, uma_zone_reserve_kva() uses directly the direct-mapping by uma_small_alloc() rather than relying on the KVA / offset combination. The removal of the object attribute allows 2 further changes: 1) _vm_object_allocate() becomes static within vm_object.c 2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by direct calls to mtx_init() as there is no need to export it anymore and the calls aren't either homogeneous anymore: there are now small differences between arguments passed to mtx_init(). Sponsored by: EMC / Isilon storage division Reviewed by: alc (which also offered almost all the comments) Tested by: pho, jhb, davide	2013-02-26 23:35:27 +00:00
Attilio Rao	64a3476f0c	Remove white spaces. Sponsored by: EMC / Isilon storage division	2013-02-26 20:35:40 +00:00
Attilio Rao	0dde287b20	Wrap the sleeps synchronized by the vm_object lock into the specific macro VM_OBJECT_SLEEP(). This hides some implementation details like the usage of the msleep() primitive and the necessity to access to the lock address directly. For this reason VM_OBJECT_MTX() macro is now retired. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-26 17:22:08 +00:00
Alan Cox	fc23011bc3	On arm, like sparc64, the end of the kernel map varies from one type of machine to another. Therefore, VM_MAX_KERNEL_ADDRESS can't be a constant. Instead, #define it to be a variable, vm_max_kernel_address, just like we do on sparc64. Reviewed by: kib Tested by: ian	2013-02-18 01:02:48 +00:00
John Baldwin	174b5f3850	Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel config file. Requested by: phk (ages ago) MFC after: 1 month	2013-02-14 19:38:04 +00:00
Marius Strobl	94bfd5b1a0	Try to improve r242655 take III: move these SYSCTLs describing the kernel map, which is defined and initialized in vm/vm_kern.c, to the latter. Submitted by: alc	2013-02-04 09:35:48 +00:00
Gleb Smirnoff	3caae6ca60	Fix typo in debug printf.	2013-01-29 19:06:16 +00:00
Andrey Zonov	b3a01bdf1f	- Add system wide page faults requiring I/O counter. Reviewed by: alc MFC after: 2 weeks	2013-01-28 12:54:53 +00:00
Andrey Zonov	536368691a	- Add sysctls to show number of stats scans. MFC after: 2 weeks	2013-01-28 12:20:20 +00:00
Andrey Zonov	4a36532940	- Style. MFC after: 2 weeks	2013-01-28 12:08:29 +00:00
Andrey Zonov	1cc20081df	- Get rid of unused function vmspace_wired_count(). Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week	2013-01-14 12:12:56 +00:00
Andrey Zonov	cde4a72547	- Improve readability of sys_obreak(). Suggested by: alc Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week	2013-01-11 09:58:35 +00:00
Andrey Zonov	3ac7d29722	- Reduce kernel size by removing unnecessary pointer indirections. GENERIC kernel size reduced in 16 bytes and RACCT kernel in 336 bytes. Suggested by: alc Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week	2013-01-10 12:43:58 +00:00
Kenneth D. Merry	43ab9660c5	Fix a bug in the device pager code that can trigger an assertion in devfs if a particular race condition is hit in the device pager code. This was a side effect of change 227530 which changed the device pager interface to call a new destructor routine for the cdev. That destructor routine, old_dev_pager_dtor(), takes a VM object handle. The object handle is cast to a struct cdev *, and passed into dev_rel(). That works in most cases, except the case in cdev_pager_allocate() where there is a race condition between two threads allocating an object backed by the same device. The loser of the race deallocates its object at the end of the function. The problem is that before inserting the object into the dev_pager_object_list, the object's handle is changed from the struct cdev pointer to the object's own address. This is to avoid conflicts with the winner of the race, which already inserted an object in the list with a handle that is a pointer to the same cdev structure. The object is then passed to vm_object_deallocate(), and eventually makes its way down to old_dev_pager_dtor(). That function passes the handle pointer (which is actually a VM object, not a struct cdev as usual) into dev_rel(). dev_rel() decrements the reference count in the assumed struct cdev (which happens to be 0), and that triggers the assertion in dev_rel() that the reference count is greater than or equal to 0. The fix is to add a cdev pointer to the VM object, and use that pointer when calling the cdev_pg_dtor() routine. vm_object.h: Add a struct cdev pointer to the VM object structure. device_pager.c: In cdev_pager_allocate(), populate the new cdev pointer. In dev_pager_dealloc(), use the new cdev pointer when calling the object's cdev_pg_dtor() routine. Reviewed by: kib Sponsored by: Spectra Logic Corporation MFC after: 1 week	2013-01-09 16:48:38 +00:00
Gleb Smirnoff	936c747be0	Comment fix: there is no ub_ptr, instead explain meaning of uz_count field verbally.	2012-12-21 10:09:45 +00:00
Andrey Zonov	7e19eda4aa	- Fix locked memory accounting for maps with MAP_WIREFUTURE flag. - Add sysctl vm.old_mlock which may turn such accounting off. Reviewed by: avg, trasz Approved by: kib (mentor) MFC after: 1 week	2012-12-18 07:35:01 +00:00
Alan Cox	2863482058	In the past four years, we've added two new vm object types. Each time, similar changes had to be made in various places throughout the machine- independent virtual memory layer to support the new vm object type. However, in most of these places, it's actually not the type of the vm object that matters to us but instead certain attributes of its pages. For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain fictitious pages. In other words, in most of these places, we were testing the vm object's type to determine if it contained fictitious (or unmanaged) pages. To both simplify the code in these places and make the addition of future vm object types easier, this change introduces two new vm object flags that describe attributes of the vm object's pages, specifically, whether they are fictitious or unmanaged. Reviewed and tested by: kib	2012-12-09 00:32:38 +00:00
Pawel Jakub Dawidek	b0ae014466	White-space cleanups.	2012-12-08 09:23:05 +00:00
Pawel Jakub Dawidek	2f891cd504	Implemented uma_zone_set_warning(9) function that sets a warning, which will be printed once the given zone becomes full and cannot allocate an item. The warning will not be printed more often than every five minutes. All UMA warnings can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks	2012-12-07 22:27:13 +00:00
Alan Cox	96b0b92ac1	Add support for the (relatively) new object type OBJT_MGTDEVICE to vm_object_set_memattr(). Also, add a "safety belt" so that vm_object_set_memattr() doesn't silently modify undefined object types. Reviewed by: kib MFC after: 10 days	2012-11-28 18:29:34 +00:00
Alan Cox	a922d312b0	Make a few small changes to vm_map_pmap_enter(): Add detail to the comment describing this function. In particular, describe what MAP_PREFAULT_PARTIAL does. Eliminate the abrupt change in behavior when the specified address range grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages. Instead of doing nothing, i.e., preloading no mappings whatsoever, map any resident pages that fall within the start of the specified address range, i.e., [addr, addr + ulmin(size, ptoa(MAX_INIT_PT))). Long ago, the vm object's list of resident pages was not ordered, so this function had to choose between probing the global hash table of all resident pages and iterating over the vm object's unordered list of resident pages. Now, the list is ordered, so there is no reason for MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of resident changes. MFC after: 14 days	2012-11-25 19:42:36 +00:00
Alan Cox	0d69690e8f	Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the allocated page when it happened to be pre-zeroed. MFC after: 5 days	2012-11-21 06:26:18 +00:00
Jaakko Heinonen	02c62349c9	- Don't pass geom and provider names as format strings. - Add __printflike() attributes. - Remove an extra argument for the g_new_geomf() call in swapongeom_ev(). Reviewed by: pjd	2012-11-20 12:32:18 +00:00
Alan Cox	969a0af09d	Update a comment to reflect the elimination of the hold queue in r242300.	2012-11-17 04:00:19 +00:00
Konstantin Belousov	43f48b65c0	Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h to vm/vm_phys.h, where it belongs. Requested and reviewed by: alc MFC after: 2 weeks	2012-11-16 05:55:56 +00:00
Konstantin Belousov	962b064afe	Explicitely state that M_USE_RESERVE requires M_NOWAIT, using assertion. Reviewed by: alc MFC after: 2 weeks	2012-11-16 05:49:56 +00:00
Konstantin Belousov	b32ecf44bc	Flip the semantic of M_NOWAIT to only require the allocation to not sleep, and perform the page allocations with VM_ALLOC_SYSTEM class. Previously, the allocation was also allowed to completely drain the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT request class for vm_page_alloc() and similar functions. Allow the caller of malloc* to request the 'deep drain' semantic by providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM allocation class. Centralize the translation of the M_* malloc(9) flags in the single inline function malloc2vm_flags(). Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com> Reviewed by: alc, mdf (previous version) Tested by: pho (previous version) MFC after: 2 weeks	2012-11-14 20:01:40 +00:00
Alan Cox	8d22020384	Replace the single, global page queues lock with per-queue locks on the active and inactive paging queues. Reviewed by: kib	2012-11-13 02:50:39 +00:00
Attilio Rao	2ebcd458e3	Fix DDB command "show map XXX": - Check that an argument is always available, otherwise current map printing before to recurse is garbage. - Spit out a message if an argument is not provided. - Remove unread nlines variable. - Use an explicit recursive function, disassociated from the DB_SHOW_COMMAND() body, in order to make clear prototype and recursion of the above mentioned function. The code results now much less obscure. Submitted by: gianni	2012-11-12 00:30:40 +00:00
Konstantin Belousov	140dedb81c	The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already has file descriptors opened for write, but it can be executed from the nullfs overlay. Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks. Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Caling the VOPs instead of directly accessing v_writecount provide the fix described in the previous paragraph. Tested by: pho MFC after: 3 weeks	2012-11-02 13:56:36 +00:00
Alan Cox	9fc4739d2a	In general, we call pmap_remove_all() before calling vm_page_cache(). So, the call to pmap_remove_all() within vm_page_cache() is usually redundant. This change eliminates that call to pmap_remove_all() and introduces a call to pmap_remove_all() before vm_page_cache() in the one place where it didn't already exist. When iterating over a paging queue, if the object containing the current page has a zero reference count, then the page can't have any managed mappings. So, a call to pmap_remove_all() is pointless. Change a panic() call in vm_page_cache() to a KASSERT(). MFC after: 6 weeks	2012-11-01 16:20:02 +00:00
Attilio Rao	4ceaf45de5	Rework the known mutexes to benefit about staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. The sole exception being nvme and sxfge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately. Reviwed by: jimharris, alc	2012-10-31 18:07:18 +00:00
Alan Cox	081a488159	Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE, because the queue itself serves no purpose. When a held page is freed, inserting the page into the hold queue has the side effect of setting the page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is used as a flag, not a queue. So, this change replaces it with a flag. To accomodate the new page flag, make the page's "flags" field wider and "oflags" field narrower. Reviewed by: kib	2012-10-29 06:15:04 +00:00
Edward Tomasz Napierala	a406d8c319	Remove useless check; vm_pindex_t is unsigned on all architectures. CID: 3701 Found with: Coverity Prevent	2012-10-28 20:03:57 +00:00
Matthew D Fleming	bb196eb480	Const-ify the zone name argument to uma_zcreate(9). MFC after: 3 days	2012-10-26 17:51:05 +00:00
Andre Oppermann	25c1e16409	Move the corresponding MTX_SYSINIT() next to their struct mtx declaration to make their relationship more obvious as done with the other such mutexs.	2012-10-26 17:31:35 +00:00
Konstantin Belousov	ef45823eba	Commit the actual text provided by Alan, instead of the wrong update in r242011. MFC after: 1 week	2012-10-24 18:32:37 +00:00
Konstantin Belousov	bc79b37f2c	Dirty the newly copied anonymous pages after the wired region is forked. Otherwise, pagedaemon might reclaim the page without saving its content into the swap file, resulting in the valid content replaced by zeroes. Reported and tested by: pho Reviewed and comment update by: alc MFC after: 1 week	2012-10-24 18:21:59 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Eitan Adler	0b80c1e400	Print flags as hex instead of an integer. PR: kern/168210 Submitted by: linimon Reviewed by: alc Approved by: cperciva MFC after: 3 days	2012-10-22 02:11:57 +00:00
Alan Cox	7ecfabc7bb	Move vm_page_requeue() to the only file that uses it. MFC after: 3 weeks	2012-10-13 20:19:43 +00:00
Alan Cox	9af47af64a	Eliminate the conditional for releasing the page queues lock in vm_page_sleep(). vm_page_sleep() is no longer called with this lock held. Eliminate assertions that the page queues lock is NOT held. These assertions won't translate well to having distinct locks on the active and inactive page queues, and they really aren't that useful. MFC after: 3 weeks	2012-10-13 18:46:46 +00:00
Alan Cox	4db2c4b8c7	Tidy up a bit: Update some of the comments. In particular, use "sleep" in preference to "block" where appropriate. Eliminate some unnecessary casts. Make a few whitespace changes for consistency. Reviewed by: kib MFC after: 3 days	2012-10-03 05:06:45 +00:00
Konstantin Belousov	877d24ac8a	Fix the mis-handling of the VV_TEXT on the nullfs vnodes. If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write. Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode. Tested by: pho (previous version) MFC after: 2 weeks	2012-09-28 11:25:02 +00:00
Alan Cox	1f11f2bff4	Address a race condition that was introduced in r238212. Unless the page queues lock is acquired before the page lock is released, there is no guarantee that the page will still be in that same page queue when vm_page_requeue() is called. Reported by: pho In collaboration with: kib MFC after: 3 days	2012-09-23 17:42:39 +00:00
Konstantin Belousov	5f9c767b19	Plug the accounting leak for the wired pages when msync(MS_INVALIDATE) is performed on the vnode mapping which is wired in other address space. While there, explicitely assert that the page is unwired and zero the wire_count instead of substract. The condition is rechecked later in vm_page_free(_toq) already. Reported and tested by: zont Reviewed by: alc (previous version) MFC after: 1 week	2012-09-20 09:52:57 +00:00
Gleb Smirnoff	2864dbbfc1	If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory in an allocation for a slab. Reviewed by: jeff	2012-09-18 20:28:55 +00:00
Eitan Adler	96240c89f0	Correct double "the the" Approved by: cperciva MFC after: 3 days	2012-09-14 21:28:56 +00:00
Andrey Zonov	c4e357e8d3	- Simplify VM code by using vmspace_wired_count() for counting wired memory of a process. Reviewed by: avg Approved by: kib (mentor) MFC after: 2 weeks	2012-09-05 18:19:54 +00:00
Dag-Erling Smørgrav	f379b823bc	Whitespace cleanup.	2012-09-05 12:24:50 +00:00
Dag-Erling Smørgrav	dc1b35b525	No memory barrier is required. This was pointed out by kib@ a while ago, but I got distracted by other matters. (for real this time)	2012-09-04 22:19:33 +00:00
Dag-Erling Smørgrav	22a5e6b972	Revert previous commit, which was performed in the wrong tree.	2012-09-04 21:06:53 +00:00
Dag-Erling Smørgrav	db0390e833	No memory barrier is required. This was pointed out by kib@ a while ago, but I got distracted by other matters.	2012-09-04 19:04:02 +00:00
Andrey Zonov	cfe52ecf0e	- After r240026 sgrowsiz should be used in a safer maner. Approved by: kib (mentor) MCF after: 1 week	2012-09-03 09:34:46 +00:00
Andrey Zonov	e145130e71	- Remove accounting of locked memory from vsunlock(9) that I missed in r239818. Approved by: kib (mentor)	2012-08-30 08:03:33 +00:00
Andrey Zonov	126a63ce6c	- Don't take an account of locked memory for current process in vslock(9). There are two consumers of vslock(9): sysctl code and drm driver. These consumers are using locked memory as transient memory, it doesn't belong to a process's memory. Suggested by: avg Reviewed by: alc Approved by: kib (mentor) MFC after: 2 weeks	2012-08-29 11:23:59 +00:00
Sergey Kandaurov	9462305cbe	Typo in previous change: print half the theoretical maximum as maximum recommended amount. Reported by: <site freebsd at orientalsensation com> Reviewed by: des	2012-08-27 10:59:49 +00:00
Gleb Smirnoff	42321809c4	Fix function name in keg_cachespread_init() assert.	2012-08-26 09:54:11 +00:00
Dag-Erling Smørgrav	3ff863f1aa	- When running out of swzone, instead of spewing an error message every tick until the situation is resolved (if ever), just print a single message when running out and another when space becomes available. - When adding more swap, warn if the total amount exceeds half the theoretical maximum we can handle.	2012-08-16 08:29:49 +00:00
Konstantin Belousov	ee4116b8f7	For old mmap syscall, when executing on amd64 or ia64, enforce the PROT_EXEC if prot is non-zero, process is 32bit and kern.elf32.i386_read_exec syscal is enabled. This workaround is needed for old i386 a.out binaries, where dynamic linker did not specified PROT_EXEC for mapping of the text. The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries, but I reused the existing knob which already has the needed semantic. MFC after: 1 week	2012-08-14 12:11:48 +00:00
Konstantin Belousov	7707ccabfb	Adjust the r205536, by allowing a non-zero offset for anonymous mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD 1.1.5.1 can issue such requests. Reported and tested by: Dan Plassche <dplassche@gmail.com> MFC after: 1 week	2012-08-14 11:47:07 +00:00
Konstantin Belousov	b6c00483e9	Do not leave invalid pages in the object after the short read for a network file systems (not only NFS proper). Short reads cause pages other then the requested one, which were not filled by read response, to stay invalid. Change the vm_page_readahead_finish() interface to not take the error code, but instead to make a decision to free or to (de)activate the page only by its validity. As result, not requested invalid pages are freed even if the read RPC indicated success. Noted and reviewed by: alc MFC after: 1 week	2012-08-14 11:45:47 +00:00
Alan Cox	18f55b0171	Never sleep on busy pages in vm_pageout_launder(), always skip them. Long ago, sleeping on busy pages in vm_pageout_launder() made sense. The call to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages blocked vm_pageout_launder() until the flush had completed. However, in CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was changed to request synchronous I/O, but the sleep on busy pages was not removed.	2012-08-07 04:48:14 +00:00
Konstantin Belousov	1c771f9222	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. Other big dependency of vm_page.h on vm_param.h are PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitely for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks	2012-08-05 14:11:42 +00:00
Konstantin Belousov	0055cbd3c5	Reduce code duplication and exposure of direct access to struct vm_page oflags by providing helper function vm_page_readahead_finish(), which handles completed reads for pages with indexes other then the requested one, for VOP_GETPAGES(). Reviewed by: alc MFC after: 1 week	2012-08-04 18:16:43 +00:00
Alan Cox	369763e31a	Inline vm_page_aflags_clear() and vm_page_aflags_set(). Add comments stating that neither these functions nor the flags that they are used to manipulate are part of the KBI.	2012-08-03 01:48:15 +00:00
Alan Cox	d26a90a8b2	Eliminate an unneeded declaration. (I should have removed this as part of r227568.)	2012-07-30 20:38:37 +00:00
Konstantin Belousov	311e34e260	Do not requeue held page or page for which locking failed, just leave them alone. Process the act_count updates for the held pages in the vm_pageout loop over the inactive queue, instead of refusing to do anything with such page. Clarify the intent of the addl_page_shortage counter and change its use for pages which are not processed in the loop according to the description. Reviewed by: alc MFC after: 2 weeks	2012-07-26 09:06:48 +00:00
Alan Cox	2cba1ccd0e	Addendum to r238604. If the inactive queue scan isn't restarted, then the variable "addl_page_shortage_init" isn't needed. X-MFC after: r238604	2012-07-24 02:35:30 +00:00
Konstantin Belousov	d4961bcb3a	Do not restart scan of the inactive queue when non-inactive page is found. Rather, we shall not find such pages on inactive queue at all. Requested and reviewed by: alc MFC after: 2 weeks	2012-07-18 21:47:50 +00:00
Alan Cox	85eeca35b9	Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar code resides. Rename vm_contig_grow_cache() to vm_pageout_grow_cache(). Reviewed by: kib	2012-07-18 05:21:34 +00:00
Alan Cox	da1ab8a4a0	Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP.	2012-07-17 02:36:59 +00:00
Alan Cox	907e4524dc	Various improvements to vm_contig_grow_cache(). Most notably, even when it can't sleep, it can still move clean pages from the inactive queue to the cache. Also, when a page is cached, there is no need to restart the scan. The "next" page pointer held by vm_contig_launder() is still valid. Finally, add a comment summarizing what vm_contig_grow_cache() does based upon the value of "tries". MFC after: 3 weeks	2012-07-16 18:13:43 +00:00
Alan Cox	476f7f2423	Correct an off-by-one error in vm_reserv_alloc_contig() that resulted in the last reservation of a multi-reservation allocation not being initialized.	2012-07-15 21:46:19 +00:00
Matthew D Fleming	f806cdcf99	Fix a bug with memguard(9) on 32-bit architectures without a VM_KMEM_MAX_SIZE. The code was not taking into account the size of the kernel_map, which the kmem_map is allocated from, so it could produce a sub-map size too large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely and base the memguard map's size off the kernel_map's size, since this is always relevant and always smaller. Found by: Justin Hibbits	2012-07-15 20:29:48 +00:00
Alan Cox	9757857c4f	If vm_contig_grow_cache() is allowed to sleep, then invoke the vm_lowmem handlers.	2012-07-14 20:14:03 +00:00
Alan Cox	0ff0fc84c2	Move kmem_alloc_{attr,contig}() to vm/vm_kern.c, where similarly named functions reside. Correct the comment describing kmem_alloc_contig().	2012-07-14 18:10:44 +00:00
Attilio Rao	571a1e92aa	Document the object type movements, related to swp_pager_copy(), in vm_object_collapse() and vm_object_split(). In collabouration with: alc MFC after: 3 days	2012-07-11 01:04:59 +00:00
Konstantin Belousov	a4156419ca	Avoid vm page queues lock leak after r238212. Reported and tested by: Michael Butler <imb protected-networks net> Reviewed by: alc Pointy hat to: kib MFC after: 20 days	2012-07-08 18:04:26 +00:00
Konstantin Belousov	48cc2fc774	Drop page queues mutex on each iteration of vm_pageout_scan over the inactive queue, unless busy page is found. Dropping the mutex often should allow the other lock acquires to proceed without waiting for whole inactive scan to finish. On machines with lot of physical memory scan often need to iterate a lot before it finishes or finds a page which requires laundring, causing high latency for other lock waiters. Suggested and reviewed by: alc MFC after: 3 weeks	2012-07-07 19:39:08 +00:00
Eitan Adler	c288b54837	Add missing sleep stat increase PR: kern/168211 Submitted by: linimon Reviewed by: alc Approved by: cperciva MFC after: 3 days	2012-07-07 17:46:11 +00:00
Konstantin Belousov	5d10ef2096	Style. Reviewed by: alc (previous version) MFC after: 1 week	2012-07-06 20:13:16 +00:00
John Baldwin	687c94aac9	Honor db_pager_quit in 'show uma' and 'show malloc'. MFC after: 1 month	2012-07-02 16:14:52 +00:00
Alan Cox	e30df26e7b	Add new pmap layer locks to the predefined lock order. Change the names of a few existing VM locks to follow a consistent naming scheme.	2012-06-27 03:45:25 +00:00
Attilio Rao	ffdd0c7db3	- Add a comment explaining the locking of the cached pages pool held by vm_objects. - Add flags for the per-object lock and free pages queue mutex lock. Use the newly added flags to mark the cache root within the vm_object structure. Please note that other vm_object members should be marked with correct locking but they are left for other commits. In collabouration with: alc MFC after: 3 days3 days3 days	2012-06-22 18:34:11 +00:00
Alan Cox	eddc92918e	Selectively inline vm_page_dirty().	2012-06-20 23:25:47 +00:00
John Baldwin	6fbe60fa8b	Move the per-thread deferred user map entries list into a private list in vm_map_process_deferred() which is then iterated to release map entries. This avoids having a nested vm map unlock operation called from the loop body attempt to recuse into vm_map_process_deferred(). This can happen if the vm_map_remove() triggers the OOM killer. Reviewed by: alc, kib MFC after: 1 week	2012-06-20 18:00:26 +00:00
Attilio Rao	db9ba57895	Do a more targeted check on the page cache and avoid to check the cache pointer directly in vnode_pager_setsize() by using newly introduced vm_page_is_cached() function. Reviewed by: alc MFC after: 2 weeks X-MFC: r234039,234064	2012-06-16 21:39:00 +00:00
Alan Cox	6031c68de4	The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer. Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer. As an added bonus, tidy up some nearby comments concerning page flags. Reviewed by: kib MFC after: 6 weeks	2012-06-16 18:56:19 +00:00
Konstantin Belousov	83ce08538a	Use the previous stack entry protection and max protection to correctly propagate the stack execution permissions when stack is grown down. First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack protection for current ABI, so the new stack entry was typically marked executable always. Second, for non-main stack MAP_STACK mapping, the PROT_ flags should be used which were specified at the mmap(2) call time, and not sv_stackprot. MFC after: 1 week	2012-06-10 11:31:50 +00:00
Eitan Adler	0a4a2b8e62	Revert r236380 PR: kern/166780 Requested by: many Approved by: cperciva (implicit)	2012-06-01 18:58:50 +00:00
Eitan Adler	71ee98c97c	Add sysctl to query amount of swap space free PR: kern/166780 Submitted by: Radim Kolar <hsn@sendmail.cz> Approved by: cperciva MFC after: 1 week	2012-06-01 04:42:52 +00:00
Maksim Yevmenkin	251386b4b2	Tweak condition for disabling allocation from per-CPU buckets in low memory situation. I've observed a situation where per-CPU allocations were disabled while there were enough free cached pages. Basically, cnt.v_free_count was sitting stable at a value lower than cnt.v_free_min and that caused massive performance drop. Reviewed by: alc MFC after: 1 week	2012-05-23 18:56:29 +00:00
Konstantin Belousov	4d34e019c4	Calculate the count of per-process cow faults. Export the count to userspace using the obscure spare int field in struct kinfo_proc. Submitted by: Andrey Zonov <andrey zonov org> MFC after: 1 week	2012-05-23 18:10:54 +00:00
Andriy Gapon	b6062382be	vm_pager_object_lookup: small performance optimization do not needlessly lock an object if its handle doesn't match Reviewed by: kib, alc MFC after: 1 week	2012-05-23 12:51:49 +00:00
Andrew Turner	c415e17250	Fix booting on ARM. In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past the end of vm_page_array was incorrect causing it to return NULL. This value is then used in vm_phys_add_page causing a data abort. Reviewed by: alc, kib, imp Tested by: stas	2012-05-22 07:04:23 +00:00
Nathan Whitehorn	ccc4a5c761	Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies range operations like pmap_remove() and pmap_protect() as well as allowing simple operations like pmap_extract() not to involve any global state. This substantially reduces lock coverages for the global table lock and improves concurrency.	2012-05-20 14:33:28 +00:00
Konstantin Belousov	df2f557df6	Do not double-reference the found vm object in cdev_pager_lookup(). vm_pager_object_lookup() already referenced the object. Note that there is no in-tree consumers of cdev_pager_lookup(). The only known user of the function is i915 gem driver, which is not yet imported. This should make the KPI change minor. Submitted by: avg MFC after: 1 week	2012-05-18 10:23:47 +00:00
Konstantin Belousov	b7ac5a8571	Add new pager type, OBJT_MGTDEVICE. It provides the device pager which carries fictitous managed pages. In particular, the consumers of the new object type can remove all mappings of the device page with pmap_remove_all(). The range of physical addresses used for fake page allocation shall be registered with vm_phys_fictitious_reg_range() interface to allow the PHYS_TO_VM_PAGE() to work in pmap. Most likely, only i386 and amd64 pmaps can handle fictitious managed pages right now. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:49:58 +00:00
Konstantin Belousov	b6de32bd9b	Add a facility to register a range of physical addresses to be used for allocation of fictitious pages, for which PHYS_TO_VM_PAGE() returns proper fictitious vm_page_t. The range should be de-registered after consumer stopped using it. De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate over registered ranges. A hash container might be developed instead of range registration interface, and fake pages could be put automatically into the hash, were PHYS_TO_VM_PAGE() could look them up later. This should be considered before the MFC of the commit is done. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:42:56 +00:00
Konstantin Belousov	e461aae747	Split the code from vm_page_getfake() to initialize the fake page struct vm_page into new interface vm_page_initfake(). Handle the case of fake page re-initialization with changed memattr. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:34:22 +00:00
Konstantin Belousov	116c213502	Assert that the page passed to vm_page_putfake() is unmanaged. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:27:51 +00:00
Konstantin Belousov	7900f95d88	Assert that fictitious or unmanaged pages do not appear on active/inactive lists. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:24:46 +00:00
Konstantin Belousov	13a0b7bcc4	Commit the change forgotten in r235356. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:10:18 +00:00
Konstantin Belousov	0c26bb71f6	Make the vm_page_array_size long. Remove redundand zero initialization for vm_page_array_size and nearby variablees. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:03:06 +00:00
Alan Cox	13458803f4	Give vm_fault()'s sequential access optimization a makeover. There are two aspects to the sequential access optimization: (1) read ahead of pages that are expected to be accessed in the near future and (2) unmap and cache behind of pages that are not expected to be accessed again. This revision changes both aspects. The read ahead optimization is now more effective. It starts with the same initial read window as before, but arithmetically grows the window on sequential page faults. This can yield increased read bandwidth. For example, on one of my machines, a program using mmap() to read a file that is several times larger than the machine's physical memory takes about 17% less time to complete. The unmap and cache behind optimization is now more selectively applied. The read ahead window must grow to its maximum size before unmap and cache behind is performed. This significantly reduces the number of times that pages are unmapped and cached only to be reactivated a short time later. The unmap and cache behind optimization now clears each page's referenced flag. Previously, in the case of dirty pages, if the containing file was still mapped at the time that the page daemon examined the dirty pages, they would be reactivated. From a stylistic standpoint, this revision also cleanly separates the implementation of the read ahead and unmap/cache behind optimizations. Glanced at: kib MFC after: 2 weeks	2012-05-10 15:16:42 +00:00
Nathan Whitehorn	0b852c03eb	Avoid a lock order reversal in pmap_extract_and_hold() from relocking the page. This PMAP requires an additional lock besides the PMAP lock in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release. Suggested by: kib MFC after: 4 days	2012-04-22 17:58:30 +00:00
Konstantin Belousov	1472f4f4b9	When MAP_STACK mapping is created, the map entry is created only to cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent call to vm_map_wire() to wire the whole stack region fails due to VM_MAP_WIRE_NOHOLES flag. Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack. Reported and tested by: Sushanth Rai <sushanth_rai yahoo com> Reviewed by: alc MFC after: 1 week	2012-04-21 18:36:53 +00:00
Alan Cox	2aa163dc57	As documented in vm_page.h, updates to the vm_page's flags no longer require the page queues lock. MFC after: 1 week	2012-04-21 18:26:16 +00:00
Attilio Rao	a0f2c37b6f	- Introduce a cache-miss optimization for consistency with other accesses of the cache member of vm_object objects. - Use novel vm_page_is_cached() for checks outside of the vm subsystem. Reviewed by: alc MFC after: 2 weeks X-MFC: r234039	2012-04-09 17:05:18 +00:00
Alan Cox	1c8279e4e7	Fix mincore(2) so that it reports PG_CACHED pages as resident. MFC after: 2 weeks	2012-04-08 18:25:12 +00:00
Alan Cox	908e3da10e	If a page belonging a reservation is cached, then mark the reservation so that it will be freed to the cache pool rather than the default pool. Otherwise, the cached pages within the reservation may be recycled sooner than necessary. Reported by: Andrey Zonov	2012-04-08 17:00:46 +00:00
Attilio Rao	d1aa86e151	Staticize vm_page_cache_remove(). Reviewed by: alc	2012-04-06 20:34:00 +00:00
Nathan Whitehorn	57bd5cce62	Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction caches, by invalidating kernel icaches only when needed and not flushing user caches for shared pages. Suggested by: kib MFC after: 2 weeks	2012-04-06 16:03:38 +00:00
John Baldwin	35818d2e94	Add new ktrace records for the start and end of VM faults. This gives a pair of records similar to syscall entry and return that a user can use to determine how long page faults take. The new ktrace records are enabled via the 'p' trace type, and are enabled in the default set of trace points. Reviewed by: kib MFC after: 2 weeks	2012-04-05 17:13:14 +00:00
Kirk McKusick	1faacf5d09	Keep track of the mount point associated with a special device to enable the collection of counts of synchronous and asynchronous reads and writes for its associated filesystem. The counts are displayed using `mount -v'. Ensure that buffers used for paging indicate the vnode from which they are operating so that counts of paging I/O operations from the filesystem are collected. This checkin only adds the setting of the mount point for the UFS/FFS filesystem, but it would be trivial to add the setting and clearing of the mount point at filesystem mount/unmount time for other filesystems too. Reviewed by: kib	2012-03-28 20:49:11 +00:00
Alan Cox	5730afc9b6	Handle spurious page faults that may occur in no-fault sections of the kernel. When access restrictions are added to a page table entry, we flush the corresponding virtual address mapping from the TLB. In contrast, when access restrictions are removed from a page table entry, we do not flush the virtual address mapping from the TLB. This is exactly as recommended in AMD's documentation. In effect, when access restrictions are removed from a page table entry, AMD's MMUs will transparently refresh a stale TLB entry. In short, this saves us from having to perform potentially costly TLB flushes. In contrast, Intel's MMUs are allowed to generate a spurious page fault based upon the stale TLB entry. Usually, such spurious page faults are handled by vm_fault() without incident. However, when we are executing no-fault sections of the kernel, we are not allowed to execute vm_fault(). This change introduces special-case handling for spurious page faults that occur in no-fault sections of the kernel. In collaboration with: kib Tested by: gibbs (an earlier version) I would also like to acknowledge Hiroki Sato's assistance in diagnosing this problem. MFC after: 1 week	2012-03-22 04:52:51 +00:00
John Baldwin	d6e9b97b3f	Bah, just revert my earlier change entirely. (Missed alc's request to do this earlier.) Requested by: alc	2012-03-19 19:06:40 +00:00
John Baldwin	92a5994685	Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger than 4GB. Specifically, the inlined version of 'ptoa' of the the 'int' count of pages overflowed on 64-bit platforms. While here, change vm_object_madvise() to accept two vm_pindex_t parameters (start and end) rather than a (start, count) tuple to match other VM APIs as suggested by alc@.	2012-03-19 18:47:34 +00:00
John Baldwin	8407f69657	Alter the previous commit to use vm_size_t instead of vm_pindex_t. vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t, but a page index instead of a byte offset.	2012-03-19 18:43:44 +00:00
Konstantin Belousov	126d60823a	In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag if the filesystem performed short write and we are skipping the page due to this. Propogate write error from the pager back to the callers of vm_pageout_flush(). Report the failure to write a page from the requested range as the FALSE return value from vm_object_page_clean(), and propagate it back to msync(2) to return EIO to usermode. While there, convert the clearobjflags variable in the vm_object_page_clean() and arguments of the helper functions to boolean. PR: kern/165927 Reviewed by: alc MFC after: 2 weeks	2012-03-17 23:00:32 +00:00
John Baldwin	df96bc9713	Pedantic nit: use vm_pindex_t instead of long for a count of pages.	2012-03-14 20:57:48 +00:00
John Baldwin	b47f624183	Add KTR_VFS traces to track modifications to a vnode's writecount.	2012-03-08 20:27:20 +00:00
Alan Cox	83cbe16ff4	Eliminate stale incorrect ARGSUSED comments. Submitted by: bde	2012-03-02 17:33:51 +00:00
Alan Cox	9ed54e79b5	Simplify kmem_alloc() by eliminating code that existed on account of external pagers in Mach. FreeBSD doesn't implement external pagers. Moreover, it don't pageout the kernel object. So, the reasons for having code don't hold. Reviewed by: kib MFC after: 6 weeks	2012-02-29 05:41:29 +00:00
Alan Cox	f9230ad6b8	Simplify vm_mmap()'s control flow. Add a comment describing what vm_mmap_to_errno() does. Reviewed by: kib MFC after: 3 weeks X-MFC after: r232071	2012-02-25 21:06:39 +00:00
Alan Cox	79e538388f	Simplify vmspace_fork()'s control flow by copying immutable data before the vm map locks are acquired. Also, eliminate redundant initialization of the new vm map's timestamp. Reviewed by: kib MFC after: 3 weeks	2012-02-25 17:49:59 +00:00
Konstantin Belousov	9d22083da8	Place the if() at the right location, to activate the v_writecount accounting for shared writeable mappings for all filesystems, not only for the bypass layers. Submitted by: alc Pointy hat to: kib MFC after: 20 days	2012-02-24 10:41:58 +00:00
Konstantin Belousov	84110e7e0b	Account the writeable shared mappings backed by file in the vnode v_writecount. Keep the amount of the virtual address space used by the mappings in the new vm_object un_pager.vnp.writemappings counter. The vnode v_writecount is incremented when writemappings gets non-zero value, and decremented when writemappings is returned to zero. Writeable shared vnode-backed mappings are accounted for in vm_mmap(), and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on the created map entry. During deferred map entry deallocation, vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and decrements writemappings for the vm object. Now, the writeable mount cannot be demoted to read-only while writeable shared mappings of the vnodes from the mount point exist. Also, execve(2) fails for such files with ETXTBUSY, as it should be. Noted by: tegge Reviewed by: tegge (long time ago, early version), alc Tested by: pho MFC after: 3 weeks	2012-02-23 21:07:16 +00:00
Konstantin Belousov	501f538675	Remove wrong comment. Discussed with: alc MFC after: 3 days	2012-02-22 20:01:38 +00:00
Alan Cox	a649296959	When vm_mmap() is used to map a vm object into a kernel vm_map, it makes no sense to check the size of the kernel vm_map against the user-level resource limits for the calling process. Reviewed by: kib	2012-02-16 06:45:51 +00:00
Konstantin Belousov	8211bd45bc	Close a race due to dropping of the map lock between creating map entry for a shared mapping and marking the entry for inheritance. Other thread might execute vmspace_fork() in between (e.g. by fork(2)), resulting in the mapping becoming private. Noted and reviewed by: alc MFC after: 1 week	2012-02-11 17:29:07 +00:00
Ed Schouten	7870adb640	Remove direct access to si_name. Code should just use the devtoname() function to obtain the name of a character device. Also add const keywords to pieces of code that need it to build properly. MFC after: 2 weeks	2012-02-10 12:35:57 +00:00
Alexander Motin	8f12d83ad9	Fix NULL dereference panic on attempt to turn off (on system shutdown) disconnected swap device. This is quick and imperfect solution, as swap device will still be opened and GEOM will not be able to destroy it. Proper solution would be to automatically turn off and close disconnected swap device, but with existing code it will cause panic if there is at least one page on device, even if it is unimportant page of the user-level process. It needs some work. Reviewed by: kib@ MFC after: 1 week	2012-02-01 20:12:44 +00:00
Kip Macy	263811f724	exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64 excluding other allocations including UMA now entails the addition of a single flag to kmem_alloc or uma zone create Reviewed by: alc, avg MFC after: 2 weeks	2012-01-27 20:18:31 +00:00
Nathan Whitehorn	8d01a3b281	Revert r212360 now that PowerPC can handle large sparse arguments to pmap_remove() (changed in r228412). MFC after: 2 weeks	2012-01-17 00:31:09 +00:00
Konstantin Belousov	5dda2db9c8	Change the type of the paging_in_progress refcounter from u_short to u_int. With the auto-sized buffer cache on the modern machines, UFS metadata can generate more the 65535 pages belonging to the buffers undergoing i/o, overflowing the counter. Reported and tested by: jimharris Reviewed by: alc MFC after: 1 week	2012-01-10 18:05:44 +00:00
Konstantin Belousov	e65919f9fc	Do not restart the scan in vm_object_page_clean() on the object generation change if requested mode is async. The object generation is only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async mode it is enough to write each dirty page, not to make a guarantee that all pages are cleared after the vm_object_page_clean() returned. Diagnosed by: truckman Tested by: flo Reviewed by: alc, truckman MFC after: 2 weeks	2012-01-04 16:04:20 +00:00
Alan Cox	b5f359b7c3	Optimize vm_object_split()'s handling of reservations.	2011-12-28 20:27:18 +00:00
Konstantin Belousov	75ff604a78	Optimize the common case of msyncing the whole file mapping with MS_SYNC flag. The system must guarantee that all writes are finished before syscalls returned. Schedule the writes in async mode, which is much faster and allows the clustering to occur. Wait for writes using VOP_FSYNC(), since we are syncing the whole file mapping. Potentially, the restriction to only apply the optimization can be relaxed by not requiring that the mapping cover whole file, as it is done by other OSes. Reported and tested by: az Reviewed by: alc MFC after: 2 weeks	2011-12-23 09:09:42 +00:00
Konstantin Belousov	e878d99718	Move kstack_cache_entry into the private header, and make the stack cache list header accessible outside vm_glue.c. MFC after: 1 week	2011-12-16 10:56:16 +00:00
Eitan Adler	33fd7c5628	- The previous commit (r228449) accidentally moved the vm.stats.vm.* sysctls to vm.stats.sys. Move them back. Noticed by: pho Reviewed by: bde (earlier version) Approved by: bz MFC after: 1 week Pointy hat to: me	2011-12-14 13:25:00 +00:00
Eitan Adler	3eb9ab5255	Document a large number of currently undocumented sysctls. While here fix some style(9) issues and reduce redundancy. PR: kern/155491 PR: kern/155490 PR: kern/155489 Submitted by: Galimov Albert <wtfcrap@mail.ru> Approved by: bde Reviewed by: jhb MFC after: 1 week	2011-12-13 00:38:50 +00:00
Konstantin Belousov	134465d732	Fix printf. Submitted by: az MFC after: 1 week	2011-12-12 10:04:04 +00:00
Alan Cox	c68c35372e	Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to use superpage reservations. So, for the first time, kernel virtual memory that is allocated by contigmalloc(), kmem_alloc_attr(), and kmem_alloc_contig() can be promoted to superpages. In fact, even a series of small contigmalloc() allocations may collectively result in a promoted superpage. Eliminate some duplication of code in vm_reserv_alloc_page(). Change the type of vm_reserv_reclaim_contig()'s first parameter in order that it be consistent with other vm_*_contig() functions. Tested by: marius (sparc64)	2011-12-05 18:29:25 +00:00
Konstantin Belousov	dc874f9881	Rename vm_page_set_valid() to vm_page_set_valid_range(). The vm_page_set_valid() is the most reasonable name for the m->valid accessor. Reviewed by: attilio, alc	2011-11-30 17:39:00 +00:00
Konstantin Belousov	cf1911a9ad	Hide the internals of vm_page_lock(9) from the loadable modules. Since the address of vm_page lock mutex depends on the kernel options, it is easy for module to get out of sync with the kernel. No vm_page_lockptr() accessor is provided for modules. It can be added later if needed, unless proper KPI is developed to serve the needs. Reviewed by: attilio, alc MFC after: 3 weeks	2011-11-29 13:07:32 +00:00
Attilio Rao	9fde98bba3	Introduce the same mutex-wise fix in r227758 for sx locks. The functions that offer file and line specifications are: - sx_assert_ - sx_downgrade_ - sx_slock_ - sx_slock_sig_ - sx_sunlock_ - sx_try_slock_ - sx_try_xlock_ - sx_try_upgrade_ - sx_unlock_ - sx_xlock_ - sx_xlock_sig_ - sx_xunlock_ Now vm_map locking is fully converted and can avoid to know specifics about locking procedures. Reviewed by: kib MFC after: 1 month	2011-11-21 12:59:52 +00:00
Attilio Rao	ccdf233323	Introduce macro stubs in the mutex implementation that will be always defined and will allow consumers, willing to provide options, file and line to locking requests, to not worry about options redefining the interfaces. This is typically useful when there is the need to build another locking interface on top of the mutex one. The introduced functions that consumers can use are: - mtx_lock_flags_ - mtx_unlock_flags_ - mtx_lock_spin_flags_ - mtx_unlock_spin_flags_ - mtx_assert_ - thread_lock_flags_ Spare notes: - Likely we can get rid of all the 'INVARIANTS' specification in the ppbus code by using the same macro as done in this patch (but this is left to the ppbus maintainer) - all the other locking interfaces may require a similar cleanup, where the most notable case is sx which will allow a further cleanup of vm_map locking facilities - The patch should be fully compatible with older branches, thus a MFC is previewed (infact it uses all the underlying mechanisms already present). Comments review by: eadler, Ben Kaduk Discussed with: kib, jhb MFC after: 1 month	2011-11-20 16:33:09 +00:00
Alan Cox	5ff276b7f4	Eliminate end-of-line white space.	2011-11-17 06:54:49 +00:00
Alan Cox	fbd80bd047	Refactor the code that performs physically contiguous memory allocation, yielding a new public interface, vm_page_alloc_contig(). This new function addresses some of the limitations of the current interfaces, contigmalloc() and kmem_alloc_contig(). For example, the physically contiguous memory that is allocated with those interfaces can only be allocated to the kernel vm object and must be mapped into the kernel virtual address space. It also provides functionality that vm_phys_alloc_contig() doesn't, such as wiring the returned pages. Moreover, unlike that function, it respects the low water marks on the paging queues and wakes up the page daemon when necessary. That said, at present, this new function can't be applied to all types of vm objects. However, that restriction will be eliminated in the coming weeks. From a design standpoint, this change also addresses an inconsistency between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions. Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew about vnodes and reservations. Now, vm_page_alloc_contig() is responsible for these things. Reviewed by: kib Discussed with: jhb	2011-11-16 16:46:09 +00:00
Konstantin Belousov	286790a7dd	Update the device pager interface, while keeping the compatibility layer for old KPI and KBI. New interface should be used together with d_mmap_single cdevsw method. Device pager can be allocated with the cdev_pager_allocate(9) function, which takes struct cdev_pager_ops, containing constructor/destructor and page fault handler methods supplied by driver. Constructor and destructor, called at the pager allocation and deallocation time, allow the driver to handle per-object private data. The pager handler is called to handle page fault on the vm map entry backed by the driver pager. Driver shall return either the vm_page_t which should be mapped, or error code (which does not cause kernel panic anymore). The page handler interface has a placeholder to specify the access mode causing the fault, but currently PROT_READ is always passed there. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2011-11-15 14:40:00 +00:00

... 5 6 7 8 9 ...

3506 Commits