freebsd-skq

Author	SHA1	Message	Date
kib	12c58a8fb0	Initialize paddr to handle the case of zero size. Reported and reviewed by: Conrad Meyer <cemeyer@uw.edu> MFC after: 1 week	2014-03-12 16:38:55 +00:00
kib	e4111a6b71	Do not vdrop() the tmpfs vnode until it is unlocked. The hold reference might be the last, and then vdrop() would free the vnode. Reported and tested by: bdrewery MFC after: 1 week	2014-03-12 15:13:57 +00:00
dim	e21b440a4c	After r251709, avoid a clang 3.4 warning about an unused static const variable (uma_max_ipers), when asserts are disabled. Reviewed by: glebius MFC after: 3 days	2014-02-14 17:47:18 +00:00
attilio	6a31e25bb9	Fix-up r254141: in the process of making a failing vm_page_rename() a call of pager_swap_freespace() was moved around, now leading to freeing the incorrect page because of the pindex changes after vm_page_rename(). Get back to use the correct pindex when destroying the swap space. Sponsored by: EMC / Isilon storage division Reported by: avg Tested by: pho MFC after: 7 days	2014-02-14 03:34:12 +00:00
glebius	b1650f2d1e	Fix function name in KASSERT(). Submitted by: hiren	2014-02-12 20:11:20 +00:00
jhb	ee403b8a8c	Correct assertion to assert that the existing device VM object uses the same type rather than asserting in the case where we just created a new VM object. Reviewed by: kib	2014-02-11 22:05:21 +00:00
glebius	45bf1cc683	Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized. Sponsored by: Nginx, Inc.	2014-02-10 19:59:46 +00:00
glebius	613c5f4e53	Style.	2014-02-10 19:51:15 +00:00
glebius	1861286fed	Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones. Sponsored by: Nginx, Inc.	2014-02-10 19:48:26 +00:00
alc	43e9da37b5	Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time. Successful prefaults after a zero-fill fault are extremely rare.	2014-02-09 01:59:52 +00:00
glebius	e8c2426587	Provide macros that allow easily export uma(9) zone limits and current usage via sysctl(9): SYSCTL_UMA_MAX() SYSCTL_ADD_UMA_MAX() SYSCTL_UMA_CUR() SYSCTL_ADD_UMA_CUR() Sponsored by: Nginx, Inc.	2014-02-07 14:29:03 +00:00
alc	cf63b11b17	Make prefaulting more aggressive on hard faults. Previously, we would only map a fraction of the pages that were fetched by vm_pager_get_pages() from secondary storage. Now, we map them all in order to avoid future soft faults. This effect is most evident when a memory-mapped file is accessed sequentially. Previously, there were 6 soft faults for every hard fault. Now, these soft faults are eliminated. Sponsored by: EMC / Isilon Storage Division	2014-02-02 20:21:53 +00:00
alc	50a7eacf05	In an effort to diagnose possible corruption of struct vm_page on some sparc64 machines make the page queue assert in vm_page_dequeue() more precise. While I'm here switch the page lock assert to the newer style.	2014-01-24 19:08:42 +00:00
jhb	14337759da	Fix a couple of typos.	2014-01-21 03:27:47 +00:00
glebius	681dcc3c57	ANSIfy declarations. Ok'ed by: alc	2014-01-20 18:47:56 +00:00
alc	d98c6ca3a1	Style changes in vm_pageout_scan(): 1. Be consistent in the style of "act_delta" manipulations between the inactive and active queue scans. 2. Explicitly compare to zero. 3. The deactivation of a page is based on its recent history and not just the current call to vm_pageout_scan(). The variable "act_delta" represents the current state of the page, and not its history. Avoid possible confusion by not (ab)using "act_delta" for making the deactivation decision. Submitted by: kib [1] Reviewed by: kib [2,3]	2014-01-18 20:02:59 +00:00
alc	ed1e11749f	Correctly update the count of stuck pages, "addl_page_shortage", in vm_pageout_scan(). There were missing increments in two less common cases. Don't conflate the count of stuck pages and the pageout deficit provided by vm_page_alloc{,_contig}(). (A proposed fix to the OOM code depends on this.) Handle held pages consistently in the inactive queue scan. In the more common case, we did not move the page to the tail of the queue. Whereas, in the less common case, we did. There's no particular reason to move the page in the less common case, so remove it. Perform the calculation of the page shortage for the active queue scan a little earlier, before the active queue lock is acquired. The correctness of this calculation doesn't depend on the active queue lock being held. Eliminate a redundant variable, "pcount". Use the more descriptive variable, "maxscan", in its place. Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess parentheses. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-01-12 19:04:20 +00:00
alc	d0ccfff2c5	Since the introduction of the popmap to reservations in r259999, there is no longer any need for the page's PG_CACHED and PG_FREE flags to be set and cleared while the free page queues lock is held. Thus, vm_page_alloc(), vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after the free page queues lock is released to clear the page's flags. Moreover, the PG_FREE flag can be retired. Now that the reservation system no longer uses it, its only uses are in a few assertions. Eliminating these assertions is no real loss. Other assertions catch the same types of misbehavior, like doubly freeing a page (see r260032) or dirtying a free page (free pages are invalid and only valid pages can be dirtied). Eliminate an unneeded variable from vm_page_alloc_contig(). Sponsored by: EMC / Isilon Storage Division	2013-12-31 18:25:15 +00:00
alc	4e0447c7bb	Add "popmap" assertions: The page being freed isn't already free, and the page being allocated isn't already allocated. Sponsored by: EMC / Isilon Storage Division	2013-12-29 04:54:52 +00:00
alc	0850ce80b6	MFp4 alc_popmap Change the way that reservations keep track of which pages are in use. Instead of using the page's PG_CACHED and PG_FREE flags, maintain a bit vector within the reservation. This approach has a couple benefits. First, it makes breaking reservations much cheaper because there are fewer cache misses to identify the unused pages. Second, it is a prerequisite for supporting two or more reservation sizes.	2013-12-28 04:28:35 +00:00
kib	2e35793d0f	Do not coalesce stack entry, vm_map_stack() asserts that the requested region is claimed by a new entry. Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack(), to really turn off coalescing code and call to vm_map_simplify_entry() [1]. Reported by: avg, peter, many Tested by: avg, peter Noted by: avg [1] Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-27 16:59:47 +00:00
marcel	df5d69cc0d	For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is that we don't have a good way (yet) to iterate over the mapped pages by virtual address and simply try each page within the range. Given that we call pmap_remove() over the entire 2^63 bytes of address space, it takes a while for pmap_remove to have tried all 2^50 pages. By using pmap_remove_pages() we use the PV list to find all mappings. Change derived from a patch by: alc	2013-12-26 05:46:10 +00:00
dim	7f980c6758	In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as argument, cast the incoming 0 argument to void *, to silence a warning from clang 3.4 ("expression which evaluates to zero treated as a null pointer constant of type 'void *' [-Wnon-literal-null-conversion]"). MFC after: 3 days	2013-12-25 22:32:34 +00:00
alc	1fd1edcf18	Eliminate a redundant parameter to vm_radix_replace(). Improve the wording of the comment describing vm_radix_replace(). Reviewed by: attilio MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2013-12-08 20:07:02 +00:00
rodrigc	cc89f06998	In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty" message printed to the console. This makes it easier to track down the source of certain memory leaks. Suggested by: adrian	2013-11-29 08:04:45 +00:00
mav	78de33b79a	- Add bucket size column to `show uma` DDB command. - Add `show umacache` command to show alike stats for cache-only UMA zones.	2013-11-28 19:20:49 +00:00
mav	ffd93c315f	Make UMA to not blindly force offpage slab header allocation for large (> PAGE_SIZE) zones. If zone is not multiple to PAGE_SIZE, there may be enough space for the header at the last page, so we may avoid extra header memory allocation and hash table update/lookup. ZFS creates bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336). This change gives good use to at least some of otherwise lost memory there. Reviewed by: avg	2013-11-27 20:56:10 +00:00
mav	8937d14f8f	Don't count bucket allocation failures for UMA zones as their own failures. There are good reasons for this to happen, such as recursion prevention, etc. and they are not fatal since buckets are just an optimization mechanism. Real bucket allocation failures are any way counted by the bucket zones themselves, and we don't need double accounting there.	2013-11-27 20:16:18 +00:00
mav	8773f9e310	Fix bug introduced at r252226, when udata argument passed to bucket_alloc() was used without making sure first that it was really passed for us. On some of my systems this bug made user argument passed by ZFS code to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.	2013-11-27 19:55:42 +00:00
mav	21d101ccc5	When purging per-CPU UMA caches, do not return empty buckets into the global full bucket cache, so as not to trigger an assertion if an allocation happens before that global cache gets purged.	2013-11-23 13:42:56 +00:00
kib	732157ce21	Vm map code performs clipping when map entry covers region which is larger than the operational region. If the op region size is zero, clipping would create a zero-sized map entry. The result is that vm map splay starts behaving inconsistently, sometimes returning zero-sized entry, sometimes the next (or previous) entry. One step further, it could result in e.g. vm_map_wire() setting MAP_ENTRY_IN_TRANSITION on the zero-sized entry, but failing to clear it in the done part. The vm_map_delete() then hangs forever waiting for the flag removal. Verify for zero-length requests and act as if it is always successful without performing any action on the address space. Diagnosed by: pho Tested by: pho (previous version) Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-20 09:03:48 +00:00
kib	2f482609db	Add assertions to cover all places in the wiring and unwiring code where MAP_ENTRY_IN_TRANSITION is set or cleared. Tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-20 08:47:54 +00:00
kib	3c8b0e8428	Revert back to use int for the page counts. In vn_io_fault(), the i/o is chunked to pieces limited by integer io_hold_cnt tunable, while vm_fault_quick_hold_pages() takes integer max_count as the upper bound. Rearrange the checks to correctly handle overflowing address arithmetic. Submitted by: bde Tested by: pho Discussed with: alc MFC after: 1 week	2013-11-20 08:45:26 +00:00
mav	ff33031e0d	Implement mechanism to safely but slowly purge UMA per-CPU caches. This is a last resort for very low memory condition in case other measures to free memory were ineffective. Sequentially cycle through all CPUs and extract per-CPU cache buckets into zone cache from where they can be freed.	2013-11-19 10:51:46 +00:00
mav	073851700e	Grow UMA zone bucket size also on lock congestion during item free. Lock congestion is the same, whether it happens on alloc or free, so handle it equally. Now that we have back pressure, there is no problem to grow buckets a bit faster. Any way growth is much slower than in 9.x.	2013-11-19 10:17:10 +00:00
mav	3e43d6e71a	Add two new UMA bucket zones to store 3 and 9 items per bucket. These new buckets make bucket size self-tuning more soft and precise. Without them there are buckets for 1, 5, 13, 29, ... items. While at bigger sizes difference about 2x is fine, at smallest ones it is 5x and 2.6x respectively. New buckets make that line look like 1, 3, 5, 9, 13, 29, reducing jumps between steps, making algorithm work softer, allocating and freeing memory in better fitting chunks. Otherwise there is quite a big gap between allocating 128K and 5x128K of RAM at once.	2013-11-19 10:10:44 +00:00
mav	bdb3c9c41b	Implement soft pressure on UMA cache bucket sizes. Every time system detects low memory condition decrease bucket sizes for each zone by one item. As result, higher memory pressure will push to smaller bucket sizes and so smaller per-CPU caches and so more efficient memory use. Before this change there was no force to oppose buckets growth as result of practically inevitable zone lock conflicts, and after some run time per-CPU caches could consume enough RAM to kill the system.	2013-11-19 10:05:53 +00:00
kib	3cdb6ad1ff	Avoid overflow for the page counts. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-12 08:47:58 +00:00
kib	9b734f0e00	If filesystem declares that it supports shared locking for writes, use shared vnode lock for VOP_PUTPAGES() as well. The only such filesystem in the tree is ZFS, and it uses vnode_pager_generic_putpages(), which performs the pageout with VOP_WRITE(). Reviewed by: alc Discussed with: avg Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-11-09 20:36:29 +00:00
kib	79db892faa	Do not coalesce if the swap object belongs to tmpfs vnode. The coalesce would extend the object to keep pages for the anonymous mapping created by the process. The pages have no relation to the tmpfs file content which could be written into the corresponding range, causing anonymous mapping and file content aliasing and subsequent corruption. Another lesser problem created by coalescing is over-accounting on the tmpfs node destruction, since the object size is subtracted from the total count of the pages owned by the tmpfs mount. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-05 06:18:50 +00:00
alc	83e71fe4f7	Tidy up the output of "sysctl vm.phys_free". Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-10-10 16:11:45 +00:00
alc	6e2676ddc1	Both the vm_map and vmspace zones are defined as "no free". So, there is no point in defining a fini function for these zones. Reviewed by: kib Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-09-22 17:48:10 +00:00
neel	44c4dbefdb	Merge the following changes from projects/bhyve_npt_pmap: - add fields to 'struct pmap' that are required to manage nested page tables. - add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'. These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0. Reviewed by: kib@, alc@ Approved by: re (glebius)	2013-09-20 17:06:49 +00:00
alc	88a4d0f31a	The pmap function pmap_clear_reference() is no longer used. Remove it. pmap_clear_reference() has had exactly one caller in the kernel for several years, more precisely, since FreeBSD 8. Now, that call no longer exists. Approved by: re (kib) Sponsored by: EMC / Isilon Storage Division	2013-09-20 04:30:18 +00:00
jhb	d3ef75b6c7	Extend the support for exempting processes from being killed when swap is exhausted. - Add a new protect(1) command that can be used to set or revoke protection from arbitrary processes. Similar to ktrace it can apply a change to all existing descendants of a process as well as future descendants. - Add a new procctl(2) system call that provides a generic interface for control operations on processes (as opposed to the debugger-specific operations provided by ptrace(2)). procctl(2) uses a combination of idtype_t and an id to identify the set of processes on which to operate similar to wait6(). - Add a PROC_SPROTECT control operation to manage the protection status of a set of processes. MADV_PROTECT still works for backwards compatibility. - Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc) the first bit of which is used to track if P_PROTECT should be inherited by new child processes. Reviewed by: kib, jilles (earlier version) Approved by: re (delphij) MFC after: 1 month	2013-09-19 18:53:42 +00:00
kib	8ca067efb2	PG_SLAB no longer serves a useful purpose, since m->object is no longer abused to store pointer to slab. Remove it. Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (hrs)	2013-09-17 07:35:26 +00:00
kib	6796656333	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)	2013-09-16 06:25:54 +00:00
kib	889b9d0e0b	If the last page of the file is partially full and whole valid portion is invalidated, invalidate the whole page. Otherwise, partially valid page appears on a page queue, which is wrong. This could only happen for the last page, because only then buffer which triggered invalidation could not cover the whole page. Reported and tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (delphij) MFC after: 2 weeks	2013-09-14 10:11:38 +00:00
jhb	3c31e1fb75	Fix an off-by-one error when populating mincore(2) entries for skipped entries. lastvecindex references the last valid byte, so the new bytes should come after it. Approved by: re (kib) MFC after: 1 week	2013-09-12 20:46:32 +00:00
jhb	04bb6e10cd	Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use an address in the first 2GB of the process's address space. This flag should have the same semantics as the same flag on Linux. To facilitate this, add a new parameter to vm_map_find() that specifies an optional maximum virtual address. While here, fix several callers of vm_map_find() to use a VMFS_* constant for the findspace argument instead of TRUE and FALSE. Reviewed by: alc Approved by: re (kib)	2013-09-09 18:11:59 +00:00
kib	56cc686058	Drain for the xbusy state for two places which potentially do pmap_remove_all(). Not doing the drain allows the pmap_enter() to proceed in parallel, making the pmap_remove_all() effects void. The race results in an invalidated page mapped wired by usermode. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-08 17:51:22 +00:00
kib	7ab18d4990	The vm_page_trysbusy() should not fail when shared busy counter or VPB_BIT_WAITERS flag were changed between reading of busy_lock and the cas. The vm_page_sbusy(), which is the only user of vm_page_trysbusy() in the tree, panics on the failure, which in these cases is transient and do not mean that the current page state prevents sbusying. Retry the operation inside vm_page_trysbusy() if cas failed, only return a failure when VPB_BIT_SHARED is cleared. Reported and tested by: pho Reviewed by: attilio Sponsored by: The FreeBSD Foundation	2013-09-05 12:54:40 +00:00
pjd	029a6f5d92	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
mckusick	57ee6d3c5d	Fix bug introduced in rewrite of keg_free_slab in -r251894. The consequence of the bug is that fini calls are not done when a slab is freed by a call-back from the page daemon. It went unnoticed for two months because fini is little used. I spotted the bug while reading the code to learn how it works so I could write it up for the next edition of the Design and Implementation of FreeBSD book. No MFC needed as this code exists only in HEAD. Reviewed by: kib, jeff Tested by: pho	2013-08-31 15:40:15 +00:00
alc	aa9a7bb9e6	Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2). Since our malloc(3) makes heavy use of madvise(2), this change can have a measurable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%. Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise(). Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division	2013-08-29 15:49:05 +00:00
glebius	088bcbe3ed	Remove comment that is no longer relevant since r254182.	2013-08-26 14:14:25 +00:00
alc	1a535523cd	Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can reclaim the last preexisting cached page in the object, resulting in a call to vdrop(). Detect this scenario so that the vnode's hold count is correctly maintained. Otherwise, we panic. Reported by: scottl Tested by: pho Discussed with: attilio, jeff, kib	2013-08-23 17:27:12 +00:00
kib	05a9dff802	Revert r254501. Instead, reuse the type stability of the struct pmap which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE zone. Initialize the pmap lock in the vmspace zone init function, and remove pmap lock initialization and destruction from pmap_pinit() and pmap_release(). Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:12:24 +00:00
kib	ba12eedccd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
jeff	bef38f5afd	- Eliminate the vm object lock from the active queue scan. It is not necessary since we do not free or cache the page from active anymore. Document the one possible race that is harmless. Sponsored by: EMC / Isilon Storage Division Discussed with: alc	2013-08-21 22:39:19 +00:00
alc	42d76a02b5	Addendum to r254141: Allow recursion on the free pages queues lock in vm_page_alloc_freelist(). Reported and tested by: sbruno Sponsored by: EMC / Isilon Storage Division	2013-08-21 15:31:43 +00:00
jeff	0b78e7c4d9	- Increase the active lru refresh interval to 10 minutes. This has been shown to negatively impact some workloads and the goal is only to eliminate worst case behaviors for very long periods of paging inactivity. Eventually we should determine a more complex scaling factor for this feature. - Rate limit low memory callback handlers to limit thrashing. Set the default to 10 seconds. Sponsored by: EMC / Isilon Storage Division	2013-08-19 23:54:24 +00:00
jeff	ed90d4ba3f	- Use an arbitrary but reasonably large import size for kva on architectures that don't support superpages. This keeps the number of spans and internal fragmentation lower. - When the user asks for alignment from vmem_xalloc adjust the imported size by 2*align to be certain we can satisfy the allocation. This comes at the expense of potential failures when the backend can't supply enough memory but could supply the requested size and alignment. Sponsored by: EMC / Isilon Storage Division	2013-08-19 23:02:39 +00:00
kib	3c951b7b9d	Remove the arbitrary binding of the pagedaemon threads to the domains, update the comment accordingly and make it more precise. Requested and reviewed by: jeff (previous version)	2013-08-17 07:10:01 +00:00
jhb	3bfcb89de4	Add new mmap(2) flags to permit applications to request specific virtual address alignment of mappings. - MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n). Requests for n >= number of bits in a pointer or less than the size of a page fail with EINVAL. This matches the API provided by NetBSD. - MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used to optimize the chances of using large pages. By default it will align the mapping on a large page boundary (the system is free to choose any large page size to align to that seems best for the mapping request). However, if the object being mapped is already using large pages, then it will align the virtual mapping to match the existing large pages in the object instead. - Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment. MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE. - mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than explicitly using VMFS_SUPER_SPACE. All device objects are forced to use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively equivalent. Reviewed by: alc MFC after: 1 month	2013-08-16 21:13:55 +00:00
jeff	478dc3171b	- Fix bug in r254304. Use the ACTIVE pq count for the active list processing, not inactive. This was the result of a bad merge. Reported by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-15 22:29:49 +00:00
attilio	ae49aeaba6	On the recovery path for vm_page_alloc(), if a page had been requested wired, unwind back the wiring bits otherwise we can end up freeing a page that is considered wired. Sponsored by: EMC / Isilon storage division Reported by: alc	2013-08-15 11:01:25 +00:00
jeff	d330a11545	- Add a statically allocated memguard arena since it is needed very early on. - Pass the appropriate flags to vmem_xalloc() when allocating space for the arena from kmem_arena. Sponsored by: EMC / Isilon Storage Division	2013-08-13 22:40:43 +00:00
jeff	bc00d6df57	Improve pageout flow control to wakeup more frequently and do less work while maintaining better LRU of active pages. - Change v_free_target to include the quantity previously represented by v_cache_min so we don't need to add them together everywhere we use them. - Add a pageout_wakeup_thresh that sets the free page count trigger for waking the page daemon. Set this 10% above v_free_min so we wakeup before any phase transitions in vm users. - Adjust down v_free_target now that we're willing to accept more pagedaemon wakeups. This means we process fewer pages in one iteration as well, leading to shorter lock hold times and less overall disruption. - Eliminate vm_pageout_page_stats(). This was a minor variation on the PQ_ACTIVE segment of the normal pageout daemon. Instead we now process 1 / vm_pageout_update_period pages every second. This causes us to visit the whole active list every 60 seconds. Previously we would only maintain the active LRU when we were short on pages which would mean it could be woefully out of date. Reviewed by: alc (slight variant of this) Discussed with: alc, kib, jhb Sponsored by: EMC / Isilon Storage Division	2013-08-13 21:56:16 +00:00
attilio	f2a180739c	Correct the recovery logic in vm_page_alloc_contig: what is really needed on this code snippet is that all the pages that are already fully inserted get fully freed, while for the others the object removal itself might be skipped, hence the object might be set to NULL. Sponsored by: EMC / Isilon storage division Reported by: alc, kib Reviewed by: alc	2013-08-11 21:15:04 +00:00
kib	4675fcfce0	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-10 17:36:42 +00:00
zont	340906f426	Remove unused definition for CTL_VM_NAMES. Suggested by: bde	2013-08-09 23:47:43 +00:00
jhb	8f3909e991	Revert the addition of VPO_BUSY and instead update vm_page_replace() to properly unbusy the page. Submitted by: alc	2013-08-09 21:14:55 +00:00
obrien	8b37b80e65	Add missing 'VPO_BUSY' from r254141 to fix kernel build break.	2013-08-09 16:43:50 +00:00
attilio	e9f37cac74	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
attilio	16c7563cf4	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the 2 concepts into a real, minimal, sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in either read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag. The KPI is heavily changed, which is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
kib	8de1718b60	Split the pagequeues per NUMA domains, and split pagedaemon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread's inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-07 16:36:38 +00:00
jeff	de4ecca213	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
markj	44ee260831	Fill in the description fields for M_FICT_PAGES. Reviewed by: kib MFC after: 3 days	2013-08-07 00:20:30 +00:00
attilio	899ab64514	Revert r253939: We cannot busy a page before doing pagefaults. In fact, it can deadlock against vnode lock, as it tries to vget(). Other functions, right now, have an opposite lock ordering, like vm_object_sync(), which acquires the vnode lock first and then sleeps on the busy mechanism. Before this patch is reinserted we need to break this ordering. Sponsored by: EMC / Isilon storage division Reported by: kib	2013-08-05 08:55:35 +00:00
attilio	19b2ea9f81	The page hold mechanism is fast but it has a couple of fallouts: - It does not let pages respect the LRU policy - It bloats the active/inactive queues of few pages. Try to avoid it as much as possible with the long-term target to completely remove it. Use the soft-busy mechanism to protect page content accesses during short-term operations (like uiomove_fromphys()). After this change only vm_fault_quick_hold_pages() is still using the hold mechanism for page content access. There is an additional complexity there as the quick path cannot immediately access the page object to busy the page and the slow path cannot however busy more than one page at a time (to avoid deadlocks). Fixing this primitive could lead to complete removal of the page hold mechanism. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff Tested by: pho	2013-08-04 21:07:24 +00:00
zont	f6b004c36a	Unbreak sysctl ABI changes introduced in r253662 Requested by: bde	2013-07-29 18:48:51 +00:00
jeff	8076cabebb	Improve page LRU quality and simplify the logic. - Don't short-circuit aging tests for unmapped objects. This biases against unmapped file pages and transient mappings. - Always honor PGA_REFERENCED. We can now use this after soft busying to lazily restart the LRU. - Don't transition directly from active to cached bypassing the inactive queue. This frees recently used data much too early. - Rename actcount to act_delta to be more consistent with use and meaning. Reviewed by: kib, alc Sponsored by: EMC / Isilon Storage Division	2013-07-26 23:22:05 +00:00
zont	d47da97be7	Remove define and documentation for vm_pageout_algorithm missed in r253587	2013-07-26 02:00:06 +00:00
kientzle	8ef3c0b12c	Clear entire map structure including locks so that the locks don't accidentally appear to have been already initialized. In particular, this fixes a consistent kernel crash on armv6 with: panic: lock "vm map (user)" 0xc09cc050 already initialized that appeared with r251709. PR: arm/180820	2013-07-25 03:48:37 +00:00
avg	9e6374b6a9	rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST Also directly call swapper() at the end of mi_startup instead of relying on swapper being the last thing in sysinits order. Rationale: - "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage - "scheduler" was misleading, the function swaps in the swapped out processes - another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be invoked depending on its relative order with scheduler; this was not obvious and the bug actually used to exist Reviewed by: kib (earlier version) MFC after: 14 days	2013-07-24 09:45:31 +00:00
glebius	e0b9f3e4d8	Since r251709 a slab no longer uses 8-bit indices to manage items, thus remove a stale comment. Reviewed by: jeff	2013-07-24 06:13:00 +00:00
jeff	e6992c2985	- Remove the long obsolete 'vm_pageout_algorithm' experiment. Discussed with: alc Sponsored by: EMC / Isilon Storage Division	2013-07-24 01:25:56 +00:00
jeff	f5ce18bd6e	- Correct a stale comment. We don't have vclean() anymore. The work is done by vgonel() and destroy_vobject() should only be called once from VOP_INACTIVE(). Sponsored by: EMC / Isilon Storage Division	2013-07-23 22:52:38 +00:00
glebius	8a9169a4ba	Revert r249590 and in case if mp_ncpus isn't initialized use MAXCPU. This allows us to init counter zone at early stage of boot. Reviewed by: kib Tested by: Lytochkin Boris <lytboris gmail.com>	2013-07-23 11:16:40 +00:00
jlh	a7248da0b8	Fix previous commit when option RACCT is not used. MFC after: 7 days	2013-07-22 22:16:47 +00:00
jlh	40069c94e8	Fix a panic in the racct code when munlock(2) is called with incorrect values. The racct code in sys_munlock() assumed that the boundaries provided by the userland were correct as long as vm_map_unwire() returned successfully. However the latter contains its own logic and sometimes manages to do something out of those boundaries, even if they are buggy. This change makes the racct code to use the accounting done by the vm layer, as it is done in other places such as vm_mlock(). Despite fixing the panic, Alan Cox pointed that this code is still race-y though: two simultaneous callers will produce incorrect values. Reviewed by: alc MFC after: 7 days	2013-07-22 21:47:14 +00:00
jhb	d67e7a1cc9	Be more aggressive in using superpages in all mappings of objects: - Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for vm_map_find() that will try to alter the alignment of a mapping to match any existing superpage mappings of the object being mapped. If no suitable address range is found with the necessary alignment, vm_map_find() will fall back to using the simple first-fit strategy (VMFS_ANY_SPACE). - Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE. Reviewed by: alc (earlier version) MFC after: 2 weeks	2013-07-19 19:06:15 +00:00
kib	ff1a2e73b1	When swap pager allocates metadata in the pagedaemon context, allow it to drain the reserve. This was broken in r243040, causing deadlock. Note that VM_WAIT call in case of uma_zalloc() failure from pagedaemon would only wait for the v_pageout_free_min anyway. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-07-11 20:33:57 +00:00
kib	bea7bbed5f	The vm_fault() should not be allowed to proceed on the map entry which is being wired now. The entry wired count is changed to non-zero in advance, before the map lock is dropped. This makes the vm_fault() to perceive the entry as wired, and breaks the fragment which moves the wire count from the shadowed page, to the upper page, making the code unwiring non-wired page. On the other hand, the vm_fault() calls from vm_fault_wire() should be allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from vm_fault() when wiring_thread is not current. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:58:28 +00:00
kib	04554f0bf4	The mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with parallel creation of the map entries, e.g. by mmap() or stack growing. It also breaks when other entry is wired in parallel. The vm_map_wire() iterates over the map entries in the region, and assumes that map entries it finds are marked as in transition before, also that any entry marked as in transition, are marked by the current invocation of vm_map_wire(). This is not true for new entries in the holes. Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct vm_map_entry. In vm_map_wire() and vm_map_unwire(), only process the entries which transition owner is the current thread. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:55:08 +00:00
kib	da0e8446db	Never remove user-wired pages from an object when doing msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object range which corresponds to the user-wired vm_map_entry, is always fully populated. Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the preserving behaviour, use it when calling vm_object_page_remove() from vm_object_sync(). Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:47:26 +00:00
kib	c2cfac4ffc	In the vm_page_set_invalid() function, do not assert that the page is not busy, since its only caller brelse() can legitimately call it on busy page. This happens for VOP_PUTPAGES() on filesystems that use buffers and which VOP_WRITE() method marked the buffer containing page as non-cacheable. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:38:39 +00:00
kib	70e5ada2a0	Fix typo in comment. MFC after: 3 days	2013-07-09 13:22:30 +00:00
neel	44e6cbd7c8	vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be reset by pmap_page_init() right after being initialized in vm_page_initfake(). The statement above is with reference to the amd64 implementation of pmap_page_init(). Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing the 'memattr'. Reviewed by: kib MFC after: 2 weeks	2013-07-03 23:38:37 +00:00
davide	26a7b21456	Remove a spurious keg lock acquisition.	2013-06-28 21:13:19 +00:00
jeff	e725dd5c1e	- Add a general purpose resource allocator, vmem, from NetBSD. It was originally inspired by the Solaris vmem detailed in the proceedings of usenix 2001. The NetBSD version was heavily refactored for bugs and simplicity. - Use this resource allocator to allocate the buffer and transient maps. Buffer cache defrags are reduced by 25% when used by filesystems with mixed block sizes. Ultimately this may permit dynamic buffer cache sizing on low KVA machines. Discussed with: alc, kib, attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-06-28 03:51:20 +00:00
jeff	4201cd7bd1	- Resolve bucket recursion issues by passing a cookie with zone flags through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg(). - Make some smaller buckets for large zones to further reduce memory waste. - Implement uma_zone_reserve(). This holds aside a number of items only for callers who specify M_USE_RESERVE. buckets will never be filled from reserve allocations. Sponsored by: EMC / Isilon Storage Division	2013-06-26 00:57:38 +00:00
glebius	fd69e3f2df	Typo in comment.	2013-06-24 13:36:16 +00:00
jeff	a6b6e4783c	- Add a per-zone lock for zones without kegs. - Be more explicit about zone vs keg locking. This functionally changes almost nothing. - Add a size parameter to uma_zcache_create() so we can size the buckets. - Pass the zone to bucket_alloc() so it can modify allocation flags as appropriate. - Fix a bug in zone_alloc_bucket() where I missed an address of operator in a failure case. (Found by pho) Sponsored by: EMC / Isilon Storage Division	2013-06-20 19:08:12 +00:00
jeff	b81bfe8f58	- Persist the caller's flags in the bucket allocation flags so we don't lose a M_NOVM when we recurse into a bucket allocation. Sponsored by: EMC / Isilon Storage Division	2013-06-19 02:30:32 +00:00
des	f5b61fedc2	Fix a bug that allowed a tracing process (e.g. gdb) to write to a memory-mapped file in the traced process's address space even if neither the traced process nor the tracing process had write access to that file. Security: CVE-2013-2171 Security: FreeBSD-SA-13:06.mmap Approved by: so	2013-06-18 07:02:35 +00:00
jeff	cca9ad5b94	Refine UMA bucket allocation to reduce space consumption and improve performance. - Always free to the alloc bucket if there is space. This gives LIFO allocation order to improve hot-cache performance. This also allows for zones with a single bucket per-cpu rather than a pair if the entire working set fits in one bucket. - Enable per-cpu caches of buckets. To prevent recursive bucket allocation one bucket zone still has per-cpu caches disabled. - Pick the initial bucket size based on a table driven maximum size per-bucket rather than the number of items per-page. This gives more sane initial sizes. - Only grow the bucket size when we face contention on the zone lock, this causes bucket sizes to grow more slowly. - Adjust the number of items per-bucket to account for the header space. This packs the buckets more efficiently per-page while making them not quite powers of two. - Eliminate the per-zone free bucket list. Always return buckets back to the bucket zone. This ensures that as zones grow into larger bucket sizes they eventually discard the smaller sizes. It persists fewer buckets in the system. The locking is slightly trickier. - Only switch buckets in zalloc, not zfree, this eliminates pathological cases where we ping-pong between two buckets. - Ensure that the thread that fills a new bucket gets to allocate from it to give a better upper bound on allocation time. Sponsored by: EMC / Isilon Storage Division	2013-06-18 04:50:20 +00:00
jeff	1980616f65	- Add a new UMA API: uma_zcache_create(). This makes a zone without any backing memory that is only a container for per-cpu caches of arbitrary pointer items. These zones have no kegs. - Convert the regular keg based allocator to use the new import/release functions. - Move some stats to be atomics since they would require excessive zone locking/unlocking with the new import/release paradigm. Make zone_free_item simpler now that callers can manage more stats. - Check for these cache-only zones in the public APIs and debugging code by checking zone_first_keg() against NULL. Sponsored by: EMC / Isilon Storage Division	2013-06-17 03:43:47 +00:00
jeff	84a32e0176	- Convert the slab free item list from a linked array of indices to a bitmap using sys/bitset. This is much simpler, has lower space overhead and is cheaper in most cases. - Use a second bitmap for invariants asserts and improve the quality of the asserts as well as the number of erroneous conditions that we will catch. - Drastically simplify sizing code. Special case refcnt zones since they will be going away. - Update stale comments. Sponsored by: EMC / Isilon Storage Division	2013-06-13 21:05:38 +00:00
alc	53ffec3a56	Revise the interface between vm_object_madvise() and vm_page_dontneed() so that pointless calls to pmap_is_modified() can be easily avoided when performing madvise(..., MADV_FREE). Sponsored by: EMC / Isilon Storage Division	2013-06-10 01:48:21 +00:00
glebius	163379d62d	Make sys_mlock() function just a wrapper around vm_mlock() function that does all the job. Reviewed by: kib, jilles Sponsored by: Nginx, Inc.	2013-06-08 13:13:40 +00:00
attilio	3b60ec551b	Complete r251452: Avoid to busy/unbusy a page in cases where there is no need to drop the vm_obj lock, more nominally when the page is full valid after vm_page_grab(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-06-06 18:19:26 +00:00
attilio	7816c7bc23	In vm_object_split(), busy and consequently unbusy the pages only when swap_pager_copy() is invoked, otherwise there is no reason to do so. This will eliminate the necessity to busy pages most of the times. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-06-04 22:47:01 +00:00
alc	7c42edf9b8	Update a comment.	2013-06-04 05:44:52 +00:00
alc	7aa1cb1ddb	Relax the object locking in vm_pageout_map_deactivate_pages() and vm_pageout_object_deactivate_pages(). A read lock suffices. Sponsored by: EMC / Isilon Storage Division	2013-06-04 02:28:47 +00:00
kib	dba0637421	Remove irrelevant comments. Discussed with: alc MFC after: 3 days	2013-06-03 17:30:40 +00:00
alc	17993ced1b	Require that the page lock is held, instead of the object lock, when clearing the page's PGA_REFERENCED flag. Since we are typically manipulating the page's act_count field when we are clearing its PGA_REFERENCED flag, the page lock is already held everywhere that we clear the PGA_REFERENCED flag. So, in fact, this revision only changes some comments and an assertion. Nonetheless, it will enable later changes to object locking in the pageout code. Introduce vm_page_assert_locked(), which completely hides the implementation details of the page lock from the caller, and use it in vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll either eliminate or replace the various uses of vm_page_lock_assert() with vm_page_assert_locked(). Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-06-03 01:22:54 +00:00
alc	ece39b8d03	Now that access to the page's "act_count" field is synchronized by the page lock instead of the object lock, there is no reason for vm_page_activate() to assert that the object is locked for either read or write access. (The "VPO_UNMANAGED" flag never changes after page allocation.) Sponsored by: EMC / Isilon Storage Division	2013-06-01 20:32:34 +00:00
alc	5c982b2ff7	Simplify the definition of vm_page_lock_assert(). There is no compelling reason to inline the implementation of vm_page_lock_assert() in the !KLD_MODULES case. Use the same implementation for both KLD_MODULES and !KLD_MODULES. Reviewed by: kib	2013-05-31 16:00:42 +00:00
kib	0c381861b0	After the object lock was dropped, the object's reference count could change. Retest the ref_count and return from the function to not execute the further code which assumes that ref_count == 1 if it is not. Also, do not leak vnode lock if other thread cleared OBJ_TMPFS flag meantime. Reported by: bdrewery Tested by: bdrewery, pho Sponsored by: The FreeBSD Foundation	2013-05-30 20:00:19 +00:00
kib	b77a98bb0b	Remove the capitalization in the assertion message. Print the address of the object to get useful information from optimized kernel dumps.	2013-05-30 19:53:31 +00:00
attilio	4f8aa13b4a	o Change the locking scheme for swp_bcount. It can now be accessed with a write lock on the object containing it OR with a read lock on the object containing it along with the swhash_mtx. o Remove some duplicate assertions for swap_pager_freespace() and swap_pager_unswapped() but keep the object locking references for documentation. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-28 22:07:23 +00:00
attilio	2df95f4b3a	Acquire read lock on the src object for vm_fault_copy_entry(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-22 15:11:00 +00:00
attilio	fdf82ef9cf	o Relax locking assertions for vm_page_find_least() o Relax locking assertions for pmap_enter_object() and add them also to architectures that currently don't have any o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade operation on the per-object rwlock o Use all the mechanisms above to make vm_map_pmap_enter() work most of the time only with readlocks. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-21 20:38:19 +00:00
kib	f8c66c9055	Add ddb command 'show pginfo' which provides useful information about a vm page, denoted either by an address of the struct vm_page, or, if the '/p' modifier is specified, by a physical address of the corresponding frame. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-21 11:04:00 +00:00
alc	d5c05f4a92	Relax the object locking in vm_fault_prefault(). A read lock suffices. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-05-17 19:02:36 +00:00
alc	585b1bf4b4	Relax the object locking assertion in vm_page_lookup(). Now that a radix tree is used to maintain the object's collection of resident pages, vm_page_lookup() no longer needs an exclusive lock. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-05-17 18:49:43 +00:00
attilio	291f413ed8	o Add accessor functions to add and remove pages from a specific freelist. o Split the pool of free pages queues really by domain and not rely on definition of VM_RAW_NFREELIST. o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific function that is called when calculating the allocation domain. The RR counter is kept, currently, per-thread. In the future it is expected that such function evolves in a real policy decision referee, based on specific information retrieved from per-thread and per-vm_object attributes. o Add the concept of "probed domains" under the form of vm_ndomains. It is the responsibility of every architecture willing to support multiple memory domains to correctly probe vm_ndomains along with mem_affinity segments attributes. Those two values are supposed to remain always consistent. Please also note that vm_ndomains and td_dom_rr_idx are both int because segments already store domains as int. Ideally u_int would make much more sense. Probably this should be cleaned up in the future. o Apply RR domain selection also to vm_phys_zero_pages_idle(). Sponsored by: EMC / Isilon storage division Partly obtained from: jeff Reviewed by: alc Tested by: jeff	2013-05-13 15:40:51 +00:00
peter	46ecea3da0	Bandaid for compiling with gcc, which happens to be the default compiler for a number of platforms still.	2013-05-13 07:09:31 +00:00
alc	7d20e37fb6	Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the free page queues lock is held and (2) vm_radix_lookup_le() is called at most once. This change reduces the average time that the free page queues lock is held by vm_page_alloc() as well as vm_page_alloc()'s average overall running time. Sponsored by: EMC / Isilon Storage Division	2013-05-12 16:50:18 +00:00
alc	2435fdb8ad	To reduce the amount of arithmetic performed in the various radix tree functions, reverse the numbering scheme for the levels. The highest numbered level in the tree now appears near the root instead of the leaves. Sponsored by: EMC / Isilon Storage Division	2013-05-11 18:01:41 +00:00
attilio	c549d43cb1	Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of MAXMEMDOM. This unbreaks builds. Sponsored by: EMC / Isilon storage division Reported by: adrian, jeli	2013-05-08 10:55:39 +00:00
attilio	b24a52ec9e	Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency. Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc	2013-05-07 22:46:24 +00:00
alc	85384d7eba	Remove a redundant call to panic() from vm_radix_keydiff(). The assertion before the loop accomplishes the same thing. Sponsored by: EMC / Isilon Storage Division	2013-05-07 18:45:34 +00:00
alc	1136bac82b	Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically, change the way that these functions ascend the tree when the search for a matching leaf fails at an interior node. Rather than returning to the root of the tree and repeating the lookup with an updated key, maintain a stack of interior nodes that were visited during the descent and use that stack to resume the lookup at the closest ancestor that might have a matching descendant. Sponsored by: EMC / Isilon Storage Division Reviewed by: attilio Tested by: pho	2013-05-04 22:50:15 +00:00
jhb	383aea5677	Fix two bugs in the current NUMA-aware allocation code: - vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist() to allocate a page from a specific freelist. In the NUMA case it did not properly map the public VM_FREELIST_* constants to the correct backing freelists, nor did it try all NUMA domains for allocations from VM_FREELIST_DEFAULT. - vm_phys_alloc_pages() did not pin the thread and each call to vm_phys_alloc_freelist_pages() fetched the current domain to choose which freelist to use. If a thread migrated domains during the loop in vm_phys_alloc_pages() it could skip one of the freelists. If the other freelists were out of memory then it is possible that vm_phys_alloc_pages() would fail to allocate a page even though pages were available resulting in a panic in vm_page_alloc(). Reviewed by: alc MFC after: 1 week	2013-05-03 18:58:37 +00:00
kib	fc1170cbc9	Add a hint suggesting why tmpfs does not need a special case there.	2013-05-02 18:35:12 +00:00
kib	2f2c1edec8	Rework the handling of the tmpfs node backing swap object and tmpfs vnode v_object to avoid double-buffering. Use the same object both as the backing store for tmpfs node and as the v_object. Besides reducing memory use by up to 2x when mapping files from tmpfs, it also makes tmpfs read and write operations copy half as many bytes. VM subsystem was already slightly adapted to tolerate OBJT_SWAP object as v_object. Now the vm_object_deallocate() is modified to not reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle VV_TEXT flag on the last dereference of the tmpfs backing object. Reviewed by: alc Tested by: pho, bf MFC after: 1 month	2013-04-28 19:38:59 +00:00
kib	d4d37d6d88	Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode's v_object of non OBJT_VNODE type. For vm_object_page_clean(), simply do not assert that object type must be OBJT_VNODE, and add a comment explaining how the check for OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such objects. For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it to be for swap pager (or default), handle the bypass filesystems, and correctly acquire the object reference in this case. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:25:09 +00:00
kib	dae3935768	Assert that the object type for the vnode's non-NULL v_object, passed to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was already reclaimed, OBJT_DEAD. Note that the latter is only possible because some filesystems, in particular, nfsiods from nfs clients, call vnode_pager_setsize() with unlocked vnode. Moreover, if the object is terminated, do not perform the resizing operation. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:19:26 +00:00
kib	0e1bea778f	Convert panic() into KASSERT(). Reviewed by: alc MFC after: 1 week	2013-04-28 18:40:55 +00:00
alc	a75dfbd08b	Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le(). This call is clearing bits from the key that will be set again by the next line. Sponsored by: EMC / Isilon Storage Division	2013-04-28 08:29:00 +00:00
alc	046db6cecd	Avoid some lookup restarts in vm_radix_lookup_{ge,le}(). Sponsored by: EMC / Isilon Storage Division	2013-04-27 16:44:59 +00:00
glebius	18dd370b59	Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus isn't yet initialized. Otherwise we will panic at first allocation later. Sponsored by: Nginx, Inc.	2013-04-22 09:02:23 +00:00
alc	78339bf7f3	Simplify vm_radix_{add,dec}lev(). Sponsored by: EMC / Isilon Storage Division	2013-04-22 01:26:13 +00:00
alc	aaf865752d	When calculating the number of reserved nodes, discount the pages that will be used to store the nodes. Sponsored by: EMC / Isilon Storage Division	2013-04-18 05:34:33 +00:00
alc	2ce0362e96	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we have previously created a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the second (and final) step in eliminating those unnecessary level zero interior nodes. Specifically, it updates the deletion and insertion functions so that they do not require a level zero interior node at the root of the trie. For a "buildworld" workload, this change results in a 16.8% reduction in the number of interior nodes allocated and a similar reduction in the average execution time for lookup functions. For example, the average execution time for a call to vm_radix_lookup_ge() is reduced by 22.9%. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-15 06:12:00 +00:00
alc	565184245d	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we always create a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the first step in eliminating those unnecessary level zero interior nodes. Specifically, it updates all of the lookup functions so that they do not require a level zero interior node at the root. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-12 20:21:28 +00:00
glebius	204e3efd77	Convert UMA code to C99 uintXX_t types.	2013-04-09 17:43:48 +00:00
glebius	486eba7ad7	Swap us_freecount and us_flags, achieving same structure size as before previous commit. Submitted by: alc	2013-04-09 17:25:15 +00:00
glebius	d0006df0df	Since now we support 256 items per slab, we need more bits for us_freecount. This grows uma_slab_head on 32-bit arches, but growth isn't significant. Taking kmem zones as example, only the 32 byte zone is affected, ipers is reduced from 113 to 112. In collaboration with: kib	2013-04-09 15:15:52 +00:00
glebius	3206771906	Fix KASSERTs: maximum number of items per slab is 256.	2013-04-09 12:20:44 +00:00
kib	d5061cb1cd	Fix the assertions for the state of the object under the map entry with the MAP_ENTRY_VN_WRITECNT flag: - Move the assertion that verifies the state of the v_writecount and vnp.writecount, under the block where the object is locked. - Check that the object type is OBJT_VNODE before asserting. Reported by: avg Reviewed by: alc MFC after: 1 week	2013-04-09 10:04:10 +00:00
attilio	3975276634	The per-page act_count can be made very-easily protected by the per-page lock rather than vm_object lock, without any further overhead. Make the formal switch. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-04-08 20:02:27 +00:00
glebius	7f9db020a2	Merge from projects/counters: UMA_ZONE_PCPU zones. These zones have slab size == sizeof(struct pcpu), but request from VM enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such zone would have a separate twin for each CPU in the system, and these twins are at a distance of sizeof(struct pcpu) from each other. This magic value of distance would allow us to make some optimizations later. To address private item from a CPU simple arithmetics should be used: item = (type *)((char *)base + sizeof(struct pcpu) * curcpu) These arithmetics are available as zpcpu_get() macro in pcpu.h. To introduce non-page size slabs a new field had been added to uma_keg uk_slabsize. This shifted some frequently used fields of uma_keg to the fourth cache line on amd64. To mitigate this pessimization, uma_keg fields were a bit rearranged and least frequently used uk_name and uk_link moved down to the fourth cache line. All other fields, that are dereferenced frequently fit into first three cache lines. Sponsored by: Nginx, Inc.	2013-04-08 19:10:45 +00:00
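The per-CPU addressing described in commit 7f9db020a2 can be illustrated with a small user-space sketch. This is not the kernel's zpcpu_get() macro: `NCPU`, `PCPU_STRIDE`, `backing`, `pcpu_twin()` and `counter_sum()` are hypothetical names, and the fixed stride merely stands in for sizeof(struct pcpu). It only shows the arithmetic: each CPU's twin of an item sits one stride further from the base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPU        4    /* hypothetical CPU count, stands in for mp_ncpus */
#define PCPU_STRIDE 256  /* hypothetical stand-in for sizeof(struct pcpu) */

/* Backing store holding one stride per CPU; zero-initialized. */
static uint64_t backing[PCPU_STRIDE * NCPU / sizeof(uint64_t)];

/* Return CPU `cpu`'s twin of the item whose CPU-0 copy lives at `base`. */
static void *
pcpu_twin(void *base, unsigned cpu)
{
	assert(cpu < NCPU);
	return ((char *)base + (size_t)PCPU_STRIDE * cpu);
}

/* Sum a per-CPU 64-bit counter over all CPUs. */
static uint64_t
counter_sum(void *base)
{
	uint64_t total = 0;

	for (unsigned cpu = 0; cpu < NCPU; cpu++)
		total += *(uint64_t *)pcpu_twin(base, cpu);
	return (total);
}
```

A caller would update only its own twin, for example `*(uint64_t *)pcpu_twin(backing, cpu) += 1;`, and a reader would call `counter_sum(backing)` to fold the twins together. Keeping the stride constant is what lets one base pointer address every CPU's copy.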
alc	a9ceed102a	Micro-optimize the order of struct vm_radix_node's fields. Specifically, arrange for all of the fields to start at a short offset from the beginning of the structure. Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in vm_radix_getroot(). Sponsored by: EMC / Isilon Storage Division	2013-04-07 01:30:51 +00:00
jeff	fa887dba7b	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
alc	916607c009	Simplify vm_radix_keybarr(). Sponsored by: EMC / Isilon Storage Division	2013-04-06 18:04:35 +00:00
alc	c348483a3d	Simplify vm_radix_insert(). Reviewed by: attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-06 06:02:55 +00:00
alc	631b72b276	Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and vm_radix_topage(). This transformation eliminates some unnecessary conditional branches from the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove(). Simplify the control flow of vm_radix_lookup_{ge,le}(). Reviewed by: attilio (an earlier version) Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-03 06:37:25 +00:00
kib	7b210bf144	Release the v_writecount reference on the vnode in case of error, before the vnode is vput() in vm_mmap_vnode(). Error return means that there is no use reference on the vnode from the vm object reference, and failing to restore v_writecount breaks the invariant that v_writecount is less or equal to the usecount. The situation observed when nfs client returns ESTALE for VOP_GETATTR() after the open. In collaboration with: pho MFC after: 1 week	2013-03-28 06:39:27 +00:00
alc	f90174984d	Introduce vm_radix_isleaf() and use it in a couple places. As compared to using vm_radix_node_page() == NULL, the compiler is able to generate one less conditional branch when vm_radix_isleaf() is used. More use cases involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove() will follow. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-26 17:30:40 +00:00
alc	97049bb4f6	Micro-optimize the control flow in a few places. Eliminate a panic call that could never be reached in vm_radix_insert(). (If the pointer being checked by the panic call were ever NULL, the immmediately preceding loop would have already crashed on a NULL pointer dereference.) Reviewed by: attilio (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-03-24 16:43:07 +00:00
kib	9382f70781	Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550. Sponsored by: The FreeBSD Foundation	2013-03-21 07:28:15 +00:00
kib	fde3650fd8	Fix the logic inversion in the r248512. Noted by: mckay	2013-03-20 09:44:23 +00:00
kib	2ace051956	Do not map the swap i/o pbufs if the geom provider for the swap partition accepts unmapped requests. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:39:27 +00:00
kib	a43491886a	Pass unmapped buffers for page in requests if the filesystem indicated support for the unmapped i/o. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:36:28 +00:00
kib	7c26a038f9	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminate the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitely requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitely specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested. 
In the rework, filesystem metadata is not the subject to maxbufspace limit anymore. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, is accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is forced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not worked, because buffer_map fragmentation does not allow the limit to be reached. By Jeff Roberson request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks	2013-03-19 14:13:12 +00:00
attilio	919afa77e4	Commit new file FreeBSD tags. Sponsored by: EMC / Isilon storage division	2013-03-17 23:53:06 +00:00
attilio	d500d6361a	MFC	2013-03-17 23:39:52 +00:00
alc	a69d85af8b	Fix a couple typos. Sponsored by: EMC / Isilon Storage Division	2013-03-17 20:44:09 +00:00
alc	6cbd8f24b9	The calls to vm_radix_lookup_ge() by vm_reserv_alloc_{contig,page}() can be eliminated. If the calls to vm_radix_lookup_le() return NULL, then the page at the head of the object's memq must be the page with the least pindex greater than the specified pindex. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 20:40:31 +00:00
alc	8a01505f5e	The M_ZERO can be eliminated from the uma_zalloc() call in vm_radix_node_get() with a small change to vm_radix_reclaim_allnodes_int(). This change further reduced the average number of cycles per vm_page_insert() call from 532 to 519. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 16:49:37 +00:00
alc	9e48bd7ba9	Most allocation of pages to objects proceeds from lower to higher indices. Consequentially, vm_page_insert() should use vm_radix_lookup_le() instead of vm_radix_lookup_ge(). Here's why. In the expected case, vm_radix_lookup_le() will quickly find a page less than the specified key at the same radix node. In contrast, vm_radix_lookup_ge() is expected to return NULL, but to do that it must examine every slot in the radix tree that is greater than the key. Prior to this change, the average cost of a vm_page_insert() call on my test machine was 992 cycles. After this change, the average cost is only 532 cycles, a reduction of 46%. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 16:23:19 +00:00
alc	b346e448af	Simplify the interface to vm_radix_insert() by eliminating the parameter "index". The content of a radix tree leaf, or at least its "key", is not opaque to the other radix tree operations. Specifically, they know how to extract the "key" from a leaf. So, eliminating the parameter "index" isn't breaking the abstraction. Moreover, eliminating the parameter "index" effectively prevents the caller from passing an inconsistent "index" and leaf to vm_radix_insert(). Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 16:06:03 +00:00
attilio	a2e67affe3	Expand ambiguous comments some more. Requested by: alc	2013-03-17 15:27:26 +00:00
kib	7ca94eca24	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00
kib	63efc821c3	Add pmap function pmap_copy_pages(), which copies the content of the pages around, taking array of vm_page_t both for source and destination. Starting offsets and total transfer size are specified. The function implements optimal algorithm for copying using the platform-specific optimizations. For instance, on the architectures were the direct map is available, no transient mappings are created, for i386 the per-cpu ephemeral page frame is used. The code was typically borrowed from the pmap_copy_page() for the same architecture. Only i386/amd64, powerpc aim and arm/arm-v6 implementations were tested at the time of commit. High-level code, not committed yet to the tree, ensures that the use of the function is only allowed after explicit enablement. For sparc64, the existing code has known issues and a stab is added instead, to allow the kernel linking. Sponsored by: The FreeBSD Foundation Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6) MFC after: 2 weeks	2013-03-14 20:18:12 +00:00
kib	51407f194b	Remove excessive and inconsistent initializers for the various kernel maps and submaps. MFC after: 2 weeks	2013-03-14 19:50:09 +00:00
attilio	07b5846fc9	Fix compilation. Sponsored by: EMC / Isilon storage division	2013-03-13 01:38:32 +00:00
attilio	3b0a5f0419	Use the _KERNEL protectors. Sponsored by: EMC / Isilon storage division Requested by: alc	2013-03-13 01:02:11 +00:00
attilio	02cf10e6db	Add a further safety belt to prevent inconsistencies. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-03-13 01:00:34 +00:00
attilio	ba43ac477b	For uniformity, use the user provided index. Sponsored by: EMC / Isilon storage division Reviewed and reported by: alc	2013-03-13 00:41:37 +00:00
attilio	3c52979cb4	MFC	2013-03-12 13:26:12 +00:00
attilio	45af7dd4e7	Simplify vm_page_is_valid(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-03-12 12:20:49 +00:00
alc	07fc599921	When transferring the page from one object to another, don't insert the page into its new object until the page's pindex has been updated. Otherwise, one code path within vm_radix_insert() may use the wrong pindex value. Sponsored by: EMC / Isilon Storage Division	2013-03-12 06:14:31 +00:00
attilio	32a3275e77	MFC	2013-03-11 10:49:02 +00:00
alc	2c9c761886	Introduce vm_radix_is_empty(), and use it in place of vm_object_cache_is_empty() where the caller is aware of the page cache's implementation as a radix trie. Sponsored by: EMC / Isilon Storage Division	2013-03-10 17:30:57 +00:00
alc	854d4fd5e6	Update a comment: The object lock is no longer a mutex.	2013-03-09 21:32:24 +00:00
attilio	76954ad68a	Merge from vmcontention.	2013-03-09 03:19:53 +00:00
attilio	16a80466e5	MFC	2013-03-09 02:51:51 +00:00
attilio	72f7f3e528	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
attilio	754f3790b8	Merge from vmc-playground: Introduce a new KPI that verifies if the page cache is empty for a specified vm_object. This KPI does not make assumptions about the locking in order to be used also for building assertions at init and destroy time. It is mostly used to hide implementation details of the page cache. Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: alc (vm_radix based version) Tested by: flo, pho, jhb, davide	2013-03-09 02:05:29 +00:00
attilio	7fd2627275	MFC	2013-03-09 01:39:42 +00:00
andre	adea04bda7	Move the callout subsystem initialization to its own SYSINIT() from being indirectly called via cpu_startup()+vm_ksubmap_init(). The boot order position remains the same at SI_SUB_CPU. Allocation of the callout array is changed to stardard kernel malloc from a slightly obscure direct kernel_map allocation. kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init() to better describe its purpose. kern_timeout_callwheel_init() is removed simplifying the per-cpu initialization. Reviewed by: davide	2013-03-08 10:37:17 +00:00
attilio	bf1dc90446	MFC	2013-03-08 00:03:07 +00:00
attilio	82aa86d64f	Improve comments. Sponsored by: EMC / Isilon storage division Submitted by: mdf	2013-03-07 23:37:10 +00:00
attilio	1be810ec73	MFC	2013-03-04 13:14:59 +00:00
attilio	e5bdd2f06e	Merge from vmcontention: As vm objects are type-stable there is no need to initialize the resident splay tree pointer and the cache splay tree pointer in _vm_object_allocate() but this could be done in the init UMA zone handler. The destructor UMA zone handler, will further check if the condition is retained at every destruction and catch for bugs. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-03-04 13:10:59 +00:00
attilio	709ad55889	Evaluations on the likelyhood of empty object cache cannot be made in general way but must be evaluated case by case. Embedd the decision in the caller themselves rather than in a general purpose KPI. Sponsored by: EMC / Isilon storage division Reported by: alc Reviewed by: alc	2013-03-04 12:33:40 +00:00
alc	a8671df14b	Fix a typo. Sponsored by: EMC / Isilon Storage Division	2013-03-04 07:25:11 +00:00
alc	a855741cf1	A Boolean is more appropriate than an int here. Use what I think is a slightly better variable name. Sponsored by: EMC / Isilon Storage Division	2013-03-04 07:20:59 +00:00
alc	c3be5353b8	Make a pass over most of the comments.	2013-03-04 07:11:10 +00:00
alc	475367da61	Simplify Boolean expressions. Sponsored by: EMC / Isilon Storage Division	2013-03-04 06:26:25 +00:00
alc	5094368613	Fix spelling. Sponsored by: EMC / Isilon Storage Division	2013-03-04 06:13:26 +00:00
attilio	60e39c95b8	Remove the boot-time cache support and rely on UMA boot-time slab cache for allocating the nodes before to have the possibility to carve directly from the UMA subsystem. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-03-04 00:07:23 +00:00
alc	f661d6e522	We don't need to reinitialize the root of the page cache trie on every vm object allocation. We can, instead, rely on the type stability of the vm object zone. (Note that we already assert that the page cache trie is empty in the vm object zone destructor.) Sponsored by: EMC / Isilon Storage Division	2013-03-03 20:37:27 +00:00
alc	317a9584fb	Two out of three times that vm_page_find_least() is called, it's going to return the vm object's first page. In those cases, there is no need to traverse the trie. Sponsored by: EMC / Isilon Storage Division	2013-03-03 01:36:31 +00:00
attilio	a345907061	Merge from vmcontention	2013-03-03 01:10:49 +00:00
attilio	c53a782d3a	MFC	2013-03-03 01:06:24 +00:00
alc	2322e91e7c	Revert white space change in the previous commit. Requested by: attilio	2013-03-02 18:27:51 +00:00
alc	c5b028cc14	Assert that the trie is empty when a vm object is destroyed. Since vm objects are allocated from type-stable memory, we don't need to initialize the trie's root in _vm_object_allocate() on every vm object allocation. We can instead do it once in vm_object_zinit(). We don't need to call vm_radix_reclaim_allnodes() in vm_object_terminate() unless the resident page count is non-zero. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-02 18:18:30 +00:00
alc	90d4aeb975	The value held by the vm object's field pg_color is only considered valid if the flag OBJ_COLORED is set. Since _vm_object_allocate() doesn't set this flag, it needn't initialize pg_color. Sponsored by: EMC / Isilon Storage Division	2013-03-02 18:07:29 +00:00
attilio	e98f58faf6	MFC	2013-03-02 14:48:41 +00:00
attilio	89979cd218	Merge from vmcontention	2013-03-02 14:35:15 +00:00
attilio	17028bb6ae	MFC	2013-03-02 14:28:31 +00:00
pjd	f07ebb8888	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. 
- Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). 
Added convinient defines: #define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
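The CAP_READ/CAP_SEEK change in commit f07ebb8888 is easier to follow with a small sketch of how composite rights behave as bit masks. The bit values below are hypothetical and chosen only for illustration, and `rights_allow()` is an assumed helper, not a kernel function. Only the composition `CAP_PREAD = CAP_SEEK | CAP_READ` and its CAP_PWRITE counterpart are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define CAP_READ   0x0000000000000001ULL
#define CAP_WRITE  0x0000000000000002ULL
#define CAP_SEEK   0x0000000000000004ULL

/* Composite rights as listed in the commit message. */
#define CAP_PREAD  (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)

/* Assumed check: an operation is allowed only if every required bit is held. */
static bool
rights_allow(uint64_t have, uint64_t need)
{
	return ((have & need) == need);
}
```

Under this model a descriptor limited to CAP_READ passes a check for read(2) but fails one for pread(2), which needs CAP_PREAD, matching the "new behaviour" described above.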
attilio	31dffb7b33	Merge from vmcontention	2013-02-27 18:25:57 +00:00
attilio	6ff1954532	MFC	2013-02-27 18:23:12 +00:00
attilio	8d28f94790	Merge from vmobj-rwlock: VM_OBJECT_LOCKED() macro is only used to implement a custom version of lock assertions right now (which likely spread out thanks to copy and paste). Remove it and implement actual assertions. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-27 18:12:13 +00:00
attilio	c74a3afc6a	Fix compiling.	2013-02-26 23:54:17 +00:00
attilio	74f58faa15	MFC	2013-02-26 23:52:23 +00:00
attilio	cd86838830	Merge from vmcontention	2013-02-26 23:46:19 +00:00
attilio	cbe7c0e167	MFC	2013-02-26 23:43:28 +00:00
attilio	cc89d0bd92	Merge from vmc-playground branch: Replace the sub-optimal uma_zone_set_obj() primitive with more modern uma_zone_reserve_kva(). The new primitive reserves before hand the necessary KVA space to cater the zone allocations and allocates pages with ALLOC_NOOBJ. More specifically: - uma_zone_reserve_kva() does not need an object to cater the backend allocator. - uma_zone_reserve_kva() can cater M_WAITOK requests, in order to serve zones which need to do uma_prealloc() too. - When possible, uma_zone_reserve_kva() uses directly the direct-mapping by uma_small_alloc() rather than relying on the KVA / offset combination. The removal of the object attribute allows 2 further changes: 1) _vm_object_allocate() becomes static within vm_object.c 2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by direct calls to mtx_init() as there is no need to export it anymore and the calls aren't either homogeneous anymore: there are now small differences between arguments passed to mtx_init(). Sponsored by: EMC / Isilon storage division Reviewed by: alc (which also offered almost all the comments) Tested by: pho, jhb, davide	2013-02-26 23:35:27 +00:00
attilio	726aa55a61	Merge from vmcontention	2013-02-26 21:17:38 +00:00
attilio	9d00dd1afe	MFC	2013-02-26 21:13:09 +00:00
attilio	820ab571ec	MFC	2013-02-26 21:09:35 +00:00
attilio	5a60eaa26c	Remove white spaces. Sponsored by: EMC / Isilon storage division	2013-02-26 20:35:40 +00:00
attilio	43aa55b4cd	Revert the moving of vm_object objects initialization: the objects zone ensures type-stability and thus we want to execute actual lock initialization only when the objects are brought into the zone otherwise there could be races between lock threads doing re-initilization and other threads that want to acquire the lock without a reference. Sponsored by: EMC / Isilon storage division Reported by: alc	2013-02-26 20:18:25 +00:00
attilio	210a93e7f7	Merge from vmcontention	2013-02-26 18:18:39 +00:00
attilio	134623836d	MFC	2013-02-26 18:11:43 +00:00
attilio	afe5ce0c13	MFC	2013-02-26 17:33:18 +00:00
attilio	49f99b7251	Wrap the sleeps synchronized by the vm_object lock into the specific macro VM_OBJECT_SLEEP(). This hides some implementation details like the usage of the msleep() primitive and the necessity to access to the lock address directly. For this reason VM_OBJECT_MTX() macro is now retired. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-26 17:22:08 +00:00
alc	c5315c03fb	Update a comment: noobj_alloc() has replaced obj_alloc(), but it doesn't really make sense for this comment to name specific backend allocators, instead simply refer to backend allocators. Sponsored by: EMC / Isilon Storage Division	2013-02-26 06:38:00 +00:00
attilio	e590e8091e	As VM_OBJECT_SLEEP() is a vm_object_t specific function, make the passed object as the first argument of the function for consistency. Sponsored by: EMC / Isilon storage revision	2013-02-26 01:38:12 +00:00
attilio	fc0ecac2f8	Revert wrongly added asserts: lookup and remove from the collection of cached pages doesn't require the object lock to be held. Sponsored by: EMC / Isilon storage division	2013-02-26 00:34:52 +00:00
alc	26f238c055	Revise the comment describing uma_zone_reserve_kva(). Sponsored by: EMC / Isilon Storage Division Reviewed by: attilio	2013-02-26 00:18:50 +00:00
attilio	343c9f6f19	Missing semicolon. Sponsored by: EMC / Isilon storage division Submitted by: alc Pointy hat to: me	2013-02-24 19:10:16 +00:00
attilio	1a753217f3	Simplify return logic. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-02-24 19:05:11 +00:00
attilio	69d25b60d5	Merge from vmcontention	2013-02-24 17:11:10 +00:00
attilio	cff31deb1a	MFC	2013-02-24 16:50:53 +00:00
attilio	12289fcebc	Retire the old UMA primitive uma_zone_set_obj() and replace it with the more modern uma_zone_reserve_kva(). The difference is that it doesn't rely anymore on an obj to allocate pages and the slab allocator doesn't use any more any specific locking but atomic operations to complete the operation. Where possible, the uma_small_alloc() is instead used and the uk_kva member becomes unused. The subsequent cleanups also brings along the removal of VM_OBJECT_LOCK_INIT() macro which is not used anymore as the code can be easilly cleaned up to perform a single mtx_init(), private to vm_object.c. For the same reason, _vm_object_allocate() becomes private as well. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-02-24 16:41:36 +00:00
attilio	6b1291b4d1	Do not call vm_radix_lookup_ge() in the reservation system unless it is absolutely necessary. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-02-24 16:10:43 +00:00
attilio	f6d331e804	Fix an inverted check that was reporting indexes wrongly detected as wrapped. Sponsored by: EMC / Isilon storage divison Reported by: alc	2013-02-24 16:08:37 +00:00
alc	96feae12e9	Correctly assert that no page already exists at the offset within the object that is currently being allocated. Sponsored by: EMC / Isilon Storage Division	2013-02-23 19:28:31 +00:00
attilio	8702b26c68	Complete the asserts by definining also assertions for RA_RLOCKED and RA_LOCKED cases. Sponsored by: EMC / Isilon storage division Requested by: alc	2013-02-21 21:56:51 +00:00
attilio	905e648d42	Hide the details for the assertion for VM_OBJECT_LOCK operations. Rename current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into VM_OBJECT_ASSERT_WLOCKED(foo) Sponsored by: EMC / Isilon storage division Requested by: alc	2013-02-21 21:54:53 +00:00
attilio	b2afca4987	Add read mode operations to VM_OBJECT_LOCK* class of functions. Sponsored by: EMC / Isilon storage division	2013-02-20 12:06:33 +00:00

3598 Commits