Commit Graph

3088 Commits

Author SHA1 Message Date
Jeff Roberson
6fd34d6f67 - Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
 - Make some smaller buckets for large zones to further reduce memory
   waste.
 - Implement uma_zone_reserve().  This holds aside a number of items only
   for callers who specify M_USE_RESERVE.  Buckets will never be filled
   from reserve allocations.  (A usage sketch follows this entry.)

Sponsored by:	EMC / Isilon Storage Division
2013-06-26 00:57:38 +00:00
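The reserve interface above is small, so here is a minimal usage sketch. The zone name, struct foo, and the reserve depth of 32 are invented for the example; uma_zcreate(), uma_zone_reserve() and the M_USE_RESERVE flag are the pieces the commit describes.

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    struct foo { int dummy; };          /* hypothetical item type */
    static uma_zone_t foo_zone;

    static void
    foo_zone_init(void)
    {
        foo_zone = uma_zcreate("foo items", sizeof(struct foo),
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
        /* Hold 32 items aside for callers that must not fail. */
        uma_zone_reserve(foo_zone, 32);
    }

    static struct foo *
    foo_alloc_critical(void)
    {
        /*
         * Only M_USE_RESERVE allocations may dip into the reserved
         * items, and per the commit above, buckets are never filled
         * from the reserve.
         */
        return (uma_zalloc(foo_zone, M_NOWAIT | M_USE_RESERVE));
    }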
Gleb Smirnoff
4aa4cd8e92 Typo in comment. 2013-06-24 13:36:16 +00:00
Jeff Roberson
af5263743c - Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking.  This functionally changes
   almost nothing.
 - Add a size parameter to uma_zcache_create() so we can size the buckets.
 - Pass the zone to bucket_alloc() so it can modify allocation flags
   as appropriate.
 - Fix a bug in zone_alloc_bucket() where I missed an address-of operator
   in a failure case.  (Found by pho)

Sponsored by:	EMC / Isilon Storage Division
2013-06-20 19:08:12 +00:00
Jeff Roberson
8aaf680e90 - Persist the caller's flags in the bucket allocation flags so we don't
lose an M_NOVM flag when we recurse into a bucket allocation.

Sponsored by:	EMC / Isilon Storage Division
2013-06-19 02:30:32 +00:00
Dag-Erling Smørgrav
5b3e02570a Fix a bug that allowed a tracing process (e.g. gdb) to write
to a memory-mapped file in the traced process's address space
even if neither the traced process nor the tracing process had
write access to that file.

Security:	CVE-2013-2171
Security:	FreeBSD-SA-13:06.mmap
Approved by:	so
2013-06-18 07:02:35 +00:00
Jeff Roberson
fc03d22b17 Refine UMA bucket allocation to reduce space consumption and improve
performance.

 - Always free to the alloc bucket if there is space.  This gives LIFO
   allocation order to improve hot-cache performance.  This also allows
   for zones with a single bucket per-cpu rather than a pair if the entire
   working set fits in one bucket.
 - Enable per-cpu caches of buckets.  To prevent recursive bucket
   allocation one bucket zone still has per-cpu caches disabled.
 - Pick the initial bucket size based on a table driven maximum size
   per-bucket rather than the number of items per-page.  This gives
   more sane initial sizes.
 - Only grow the bucket size when we face contention on the zone lock, this
   causes bucket sizes to grow more slowly.
 - Adjust the number of items per-bucket to account for the header space.
   This packs the buckets more efficiently per-page while making them
   not quite powers of two.
 - Eliminate the per-zone free bucket list.  Always return buckets back
   to the bucket zone.  This ensures that as zones grow into larger
   bucket sizes they eventually discard the smaller sizes.  It persists
   fewer buckets in the system.  The locking is slightly trickier.
 - Only switch buckets in zalloc, not zfree, this eliminates pathological
   cases where we ping-pong between two buckets.
 - Ensure that the thread that fills a new bucket gets to allocate from
   it to give a better upper bound on allocation time.

Sponsored by:	EMC / Isilon Storage Division
2013-06-18 04:50:20 +00:00
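The first bullet above (free to the alloc bucket, LIFO order) is easiest to picture as a tiny push/pop pair. This is a paraphrase for illustration only, not the uma_bucket definition from the tree: the field names mimic UMA's, but the structure and helpers are invented.

    struct example_bucket {
        int     ub_cnt;          /* items currently held */
        int     ub_entries;      /* capacity */
        void    *ub_bucket[];    /* the cached items */
    };

    /* Free into the alloc bucket whenever there is room. */
    static int
    example_bucket_free(struct example_bucket *b, void *item)
    {
        if (b->ub_cnt >= b->ub_entries)
            return (0);          /* caller falls back to the zone layer */
        b->ub_bucket[b->ub_cnt++] = item;
        return (1);
    }

    /* Allocation pops the most recently freed (cache-hot) item. */
    static void *
    example_bucket_alloc(struct example_bucket *b)
    {
        if (b->ub_cnt == 0)
            return (NULL);
        return (b->ub_bucket[--b->ub_cnt]);
    }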
Jeff Roberson
0095a78419 - Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
   pointer items.  These zones have no kegs.
 - Convert the regular keg based allocator to use the new import/release
   functions.
 - Move some stats to be atomics since they would require excessive zone
   locking/unlocking with the new import/release paradigm.  Make
   zone_free_item simpler now that callers can manage more stats.
 - Check for these cache-only zones in the public APIs and debugging
   code by checking zone_first_keg() against NULL.

Sponsored by:	EMC / Isilon Storage Division
2013-06-17 03:43:47 +00:00
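A rough sketch of how a cache-only zone might be wired up through the new import/release callbacks. foo_backend_get()/foo_backend_put() are hypothetical, and the callback and uma_zcache_create() prototypes (including the size argument added by af5263743c above) are paraphrased from the commit messages, so they may not match the tree exactly.

    /* Hypothetical backing store accessors. */
    void *foo_backend_get(void *arg, int flags);
    void  foo_backend_put(void *arg, void *item);

    static int
    foo_import(void *arg, void **store, int count, int flags)
    {
        int i;

        /* Pull up to 'count' pointer items from the backing store. */
        for (i = 0; i < count; i++) {
            store[i] = foo_backend_get(arg, flags);
            if (store[i] == NULL)
                break;
        }
        return (i);
    }

    static void
    foo_release(void *arg, void **store, int count)
    {
        int i;

        /* Hand the items back; there is no keg or slab behind them. */
        for (i = 0; i < count; i++)
            foo_backend_put(arg, store[i]);
    }

    static uma_zone_t foo_cache;

    static void
    foo_cache_init(void *backend)
    {
        foo_cache = uma_zcache_create("foo cache", sizeof(void *),
            NULL, NULL, NULL, NULL, foo_import, foo_release, backend, 0);
    }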
Jeff Roberson
ef72505e6d - Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset.  This is much simpler, has lower space
   overhead and is cheaper in most cases.
 - Use a second bitmap for invariants asserts and improve the quality of
   the asserts as well as the number of erroneous conditions that we will
   catch.
 - Drastically simplify sizing code.  Special case refcnt zones since they
   will be going away.
 - Update stale comments.

Sponsored by:	EMC / Isilon Storage Division
2013-06-13 21:05:38 +00:00
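A small illustration of the bitset idiom that replaces the linked index array. The structure and bounds below are invented for the example (they are not the real uma_slab layout), while BITSET_DEFINE/BIT_FFS/BIT_CLR/BIT_SET are the sys/bitset.h macros the commit refers to.

    #include <sys/param.h>
    #include <sys/bitset.h>

    #define EX_SLAB_ITEMS   256                     /* invented bound */

    BITSET_DEFINE(ex_itemset, EX_SLAB_ITEMS);

    struct example_slab {
        struct ex_itemset es_free;                  /* one bit per free item */
        uint16_t          es_freecount;
    };

    static int
    example_slab_alloc(struct example_slab *slab)
    {
        int item;

        /* BIT_FFS() returns a 1-based index, or 0 when no bit is set. */
        item = BIT_FFS(EX_SLAB_ITEMS, &slab->es_free);
        if (item == 0)
            return (-1);
        item--;
        BIT_CLR(EX_SLAB_ITEMS, item, &slab->es_free);
        slab->es_freecount--;
        return (item);
    }

    static void
    example_slab_free(struct example_slab *slab, int item)
    {
        BIT_SET(EX_SLAB_ITEMS, item, &slab->es_free);
        slab->es_freecount++;
    }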
Alan Cox
2051980f97 Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by:	EMC / Isilon Storage Division
2013-06-10 01:48:21 +00:00
Gleb Smirnoff
995d706909 Make the sys_mlock() function just a wrapper around the vm_mlock()
function, which does all the work.

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:13:40 +00:00
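Roughly what the wrapper ends up looking like; the vm_mlock() prototype here is a guess based on the commit message, not a quote from the tree.

    int
    sys_mlock(struct thread *td, struct mlock_args *uap)
    {

        /* All limit checking and wiring work now lives in vm_mlock(). */
        return (vm_mlock(td->td_proc, td->td_ucred, uap->addr, uap->len));
    }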
Attilio Rao
002f377ab2 Complete r251452:
Avoid busying/unbusying a page in cases where there is no need to drop the
vm_obj lock, most notably when the page is fully valid after
vm_page_grab().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-06 18:19:26 +00:00
Attilio Rao
dfd55c0c7b In vm_object_split(), busy and subsequently unbusy the pages only when
swap_pager_copy() is invoked; otherwise there is no reason to do so.
This eliminates the need to busy pages most of the time.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-04 22:47:01 +00:00
Alan Cox
da38420832 Update a comment. 2013-06-04 05:44:52 +00:00
Alan Cox
e23b0a193e Relax the object locking in vm_pageout_map_deactivate_pages() and
vm_pageout_object_deactivate_pages().  A read lock suffices.

Sponsored by:	EMC / Isilon Storage Division
2013-06-04 02:28:47 +00:00
Konstantin Belousov
be6ec55376 Remove irrelevant comments.
Discussed with:	alc
MFC after:	3 days
2013-06-03 17:30:40 +00:00
Alan Cox
b417181250 Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag.  Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag.  So, in fact, this revision only changes some
comments and an assertion.  Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear().  (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().)  Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-03 01:22:54 +00:00
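The point of the new assertion is that callers no longer spell out the lock class themselves. A conceptual sketch only, not the literal vm_page.h change; the names prefixed with example_ are invented stand-ins.

    /*
     * One plausible shape for the wrapper, with the MA_OWNED detail
     * kept out of the caller's sight.
     */
    static __inline void
    example_page_assert_locked(vm_page_t m)
    {

        vm_page_lock_assert(m, MA_OWNED);
    }

    /* A caller such as vm_page_aflag_clear() then only writes: */
    static void
    example_aflag_clear(vm_page_t m)
    {

        example_page_assert_locked(m);
        /* ... clear the flag bits under the page lock ... */
    }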
Alan Cox
b4e498071d Now that access to the page's "act_count" field is synchronized by the page
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)

Sponsored by:	EMC / Isilon Storage Division
2013-06-01 20:32:34 +00:00
Alan Cox
ef5ba5a31d Simplify the definition of vm_page_lock_assert(). There is no compelling
reason to inline the implementation of vm_page_lock_assert() in the
!KLD_MODULES case.  Use the same implementation for both KLD_MODULES and
!KLD_MODULES.

Reviewed by:	kib
2013-05-31 16:00:42 +00:00
Konstantin Belousov
7560005c41 After the object lock was dropped, the object's reference count could
change.  Retest the ref_count and return from the function, so as not to
execute the further code, which assumes that ref_count == 1, when it is
not.  Also, do not leak the vnode lock if another thread cleared the
OBJ_TMPFS flag in the meantime.

Reported by:	bdrewery
Tested by:	bdrewery, pho
Sponsored by:	The FreeBSD Foundation
2013-05-30 20:00:19 +00:00
Konstantin Belousov
782d4a636b Remove the capitalization in the assertion message.  Print the address
of the object to get useful information from dumps of optimized kernels.
2013-05-30 19:53:31 +00:00
Attilio Rao
c25673ffd6 o Change the locking scheme for swp_bcount.
It can now be accessed with a write lock on the object containing it OR
  with a read lock on the object containing it along with the swhash_mtx.
o Remove some duplicate assertions for swap_pager_freespace() and
  swap_pager_unswapped() but keep the object locking references for
  documentation.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-28 22:07:23 +00:00
Attilio Rao
83b375ea16 Acquire read lock on the src object for vm_fault_copy_entry().
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-22 15:11:00 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() work, most
  of the time, with only read locks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
4fab678be2 Add ddb command 'show pginfo' which provides useful information about
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:04:00 +00:00
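Usage from the in-kernel debugger looks roughly like this; the addresses are invented.

    db> show pginfo 0xfffff80045a3c180      (address of a struct vm_page)
    db> show pginfo/p 0x12345000            (/p: physical address of the frame)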
Alan Cox
c141ae7f49 Relax the object locking in vm_fault_prefault(). A read lock suffices.
Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-05-17 19:02:36 +00:00
Alan Cox
767a6420bc Relax the object locking assertion in vm_page_lookup(). Now that a radix
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.

Reviewed by:    attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-17 18:49:43 +00:00
Attilio Rao
7e226537c7 o Add accessor functions to add and remove pages from a specific
freelist.
o Split the pool of free page queues truly by domain, instead of relying
  on the definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic in a dedicated
  function that is called when calculating the allocation domain
  (a sketch follows this entry).
  The RR counter is currently kept per-thread.
  In the future this function is expected to evolve into a real
  policy arbiter, based on specific information retrieved from
  per-thread and per-vm_object attributes.
o Add the concept of "probed domains" in the form of vm_ndomains.
  It is the responsibility of every architecture that wants to support
  multiple memory domains to correctly probe vm_ndomains along with the
  mem_affinity segment attributes.  Those two values are supposed to
  always remain consistent.
  Please also note that vm_ndomains and td_dom_rr_idx are both int
  because the segments already store domains as int.  Ideally u_int
  would make more sense; this should probably be cleaned up in the
  future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().

Sponsored by:	EMC / Isilon storage division
Partly obtained from:	jeff
Reviewed by:	alc
Tested by:	jeff
2013-05-13 15:40:51 +00:00
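A hedged sketch of the per-thread round-robin selection described above. Only vm_ndomains and td_dom_rr_idx are names taken from the commit message; the helper itself and its placement are invented.

    static int
    example_rr_alloc_domain(void)
    {
        struct thread *td;

        td = curthread;
        if (vm_ndomains == 1)
            return (0);
        /* Advance this thread's private round-robin cursor. */
        td->td_dom_rr_idx++;
        return (td->td_dom_rr_idx % vm_ndomains);
    }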
Peter Wemm
df839389c5 Bandaid for compiling with gcc, which happens to be the default compiler
for a number of platforms still.
2013-05-13 07:09:31 +00:00
Alan Cox
404eb1b3fd Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once.  This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.

Sponsored by:	EMC / Isilon Storage Division
2013-05-12 16:50:18 +00:00
Alan Cox
9f2e600890 To reduce the amount of arithmetic performed in the various radix tree
functions, reverse the numbering scheme for the levels.  The highest
numbered level in the tree now appears near the root instead of the leaves.

Sponsored by:	EMC / Isilon Storage Division
2013-05-11 18:01:41 +00:00
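The benefit of numbering the root as the highest level is that the slot for a key becomes a single shift-and-mask in terms of the level itself, rather than arithmetic relative to the leaf level. A sketch with invented constants (the real code uses VM_RADIX_WIDTH and friends):

    #define EX_WIDTH        4                       /* bits per level (invented) */
    #define EX_MASK         ((1 << EX_WIDTH) - 1)

    /* Root-as-highest-level: no "distance from the leaves" term needed. */
    static __inline int
    example_radix_slot(uint64_t key, int level)
    {

        return ((int)((key >> (level * EX_WIDTH)) & EX_MASK));
    }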
Attilio Rao
d0b5855eb2 Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of
MAXMEMDOM.
This unbreaks the build.

Sponsored by:	EMC / Isilon storage division
Reported by:	adrian, jeli
2013-05-08 10:55:39 +00:00
Attilio Rao
941646f5ec Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept.  The change should also be useful
for consolidation and consistency.

Sponsored by:	EMC / Isilon storage division
Obtained from:	jeff
Reviewed by:	alc
2013-05-07 22:46:24 +00:00
Alan Cox
bb0e1de4ab Remove a redundant call to panic() from vm_radix_keydiff(). The assertion
before the loop accomplishes the same thing.

Sponsored by:	EMC / Isilon Storage Division
2013-05-07 18:45:34 +00:00
Alan Cox
2d4b9a6438 Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically,
change the way that these functions ascend the tree when the search for a
matching leaf fails at an interior node.  Rather than returning to the root
of the tree and repeating the lookup with an updated key, maintain a stack
of interior nodes that were visited during the descent and use that stack
to resume the lookup at the closest ancestor that might have a matching
descendant.

Sponsored by:	EMC / Isilon Storage Division
Reviewed by:	attilio
Tested by:	pho
2013-05-04 22:50:15 +00:00
John Baldwin
f5c4b077be Fix two bugs in the current NUMA-aware allocation code:
- vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist()
  to allocate a page from a specific freelist.  In the NUMA case it did not
  properly map the public VM_FREELIST_* constants to the correct backing
  freelists, nor did it try all NUMA domains for allocations from
  VM_FREELIST_DEFAULT.
- vm_phys_alloc_pages() did not pin the thread and each call to
  vm_phys_alloc_freelist_pages() fetched the current domain to choose
  which freelist to use.  If a thread migrated domains during the loop
  in vm_phys_alloc_pages() it could skip one of the freelists.  If the
  other freelists were out of memory then it is possible that
  vm_phys_alloc_pages() would fail to allocate a page even though pages
  were available, resulting in a panic in vm_page_alloc().

Reviewed by:	alc
MFC after:	1 week
2013-05-03 18:58:37 +00:00
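A schematic version of the second fix: keep the thread pinned so the domain cannot change while every freelist is tried. sched_pin()/sched_unpin() are the standard KPIs; the loop shape, the helper name and the vm_phys_alloc_freelist_pages() prototype are paraphrased from the commit message.

    static vm_page_t
    example_alloc_from_any_freelist(int pool, int order)
    {
        vm_page_t m;
        int flind;

        m = NULL;
        sched_pin();        /* the domain must stay fixed for the whole loop */
        for (flind = 0; flind < vm_nfreelists; flind++) {
            m = vm_phys_alloc_freelist_pages(flind, pool, order);
            if (m != NULL)
                break;
        }
        sched_unpin();
        return (m);
    }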
Konstantin Belousov
53f5f8a0e1 Add a hint suggesting why tmpfs does not need a special case there. 2013-05-02 18:35:12 +00:00
Konstantin Belousov
6f2af3fcf3 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use by up to 2x when mapping files from tmpfs,
it also halves the number of bytes copied by tmpfs read and write
operations.

The VM subsystem was already slightly adapted to tolerate an OBJT_SWAP
object as v_object.  Now vm_object_deallocate() is modified to not
reinstantiate the OBJ_ONEMAPPING flag and to help the VFS correctly
handle the VV_TEXT flag on the last dereference of the tmpfs backing
object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
Konstantin Belousov
e5f299ff76 Make vm_object_page_clean() and vm_mmap_vnode() tolerate a vnode's
v_object of non-OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:25:09 +00:00
Konstantin Belousov
9b8851faae Assert that the object type for the vnode's non-NULL v_object, passed
to vnode_pager_setsize(), is either OBJT_VNODE or, if the vnode was
already reclaimed, OBJT_DEAD.  Note that the latter is only possible
because some filesystems, in particular nfsiods from nfs clients, call
vnode_pager_setsize() with the vnode unlocked.

Moreover, if the object is terminated, do not perform the resizing
operation.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:19:26 +00:00
Konstantin Belousov
6ded84276d Convert panic() into KASSERT().
Reviewed by:	alc
MFC after:	1 week
2013-04-28 18:40:55 +00:00
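The conversion follows the usual idiom; the condition below is a placeholder, not the one actually touched by the commit.

    /* Before: always compiled in, fires whenever the branch is hit. */
    if (object->type != OBJT_VNODE)
        panic("bad object type %d", object->type);

    /* After: only checked in kernels built with INVARIANTS. */
    KASSERT(object->type == OBJT_VNODE,
        ("bad object type %d", object->type));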
Alan Cox
82af926a57 Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le().
This call is clearing bits from the key that will be set again by the next
line.

Sponsored by:	EMC / Isilon Storage Division
2013-04-28 08:29:00 +00:00
Alan Cox
40076ebc5c Avoid some lookup restarts in vm_radix_lookup_{ge,le}().
Sponsored by:	EMC / Isilon Storage Division
2013-04-27 16:44:59 +00:00
Gleb Smirnoff
08a3102c0b Panic if a UMA_ZONE_PCPU zone is created at an early stage of boot, when
mp_ncpus isn't yet initialized.  Otherwise we would panic at the first
allocation later.

Sponsored by:	Nginx, Inc.
2013-04-22 09:02:23 +00:00
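The shape of the guard, paraphrased; the actual check in the UMA zone constructor may test a different condition.

    /*
     * Paraphrased guard: a per-CPU zone sized before mp_ncpus is known
     * would be sized wrongly and, per the commit above, the kernel would
     * panic at the first allocation later, so fail loudly at creation
     * time instead.
     */
    if ((flags & UMA_ZONE_PCPU) != 0 && mp_ncpus == 0)
        panic("uma_zcreate: %s: UMA_ZONE_PCPU zone created too early", name);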
Alan Cox
384875a3a6 Simplify vm_radix_{add,dec}lev().
Sponsored by:	EMC / Isilon Storage Division
2013-04-22 01:26:13 +00:00
Alan Cox
880659fe81 When calculating the number of reserved nodes, discount the pages that will
be used to store the nodes.

Sponsored by:	EMC / Isilon Storage Division
2013-04-18 05:34:33 +00:00
Alan Cox
a08f2cf69e Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child.  This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes.  Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions.  For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.

Reviewed by:	attilio, jeff (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-04-15 06:12:00 +00:00
Alan Cox
6f9c0b15bb Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child.  This change is the first step in
eliminating those unnecessary level zero interior nodes.  Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.

Reviewed by:	attilio, jeff (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-04-12 20:21:28 +00:00
Gleb Smirnoff
85dcf349c1 Convert UMA code to C99 uintXX_t types. 2013-04-09 17:43:48 +00:00
Gleb Smirnoff
04fc5741e0 Swap us_freecount and us_flags, achieving the same structure size
as before the previous commit.

Submitted by:	alc
2013-04-09 17:25:15 +00:00
Gleb Smirnoff
8cf455b8d9 Now that we support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but the growth isn't
significant.  Taking the kmem zones as an example, only the 32 byte
zone is affected: ipers is reduced from 113 to 112.

In collaboration with:	kib
2013-04-09 15:15:52 +00:00
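The arithmetic behind the pair of commits above: with 256 items per slab, an entirely free slab has a free count of 256, which no longer fits in an 8-bit field (maximum 255), so us_freecount has to widen; the follow-up commit then reorders the fields so padding absorbs the growth. The structure below is a simplified stand-in, not the real uma_slab_head.

    struct example_slab_head {
        uint16_t    us_freecount;   /* was 8 bits: 255 < 256 possible free items */
        uint8_t     us_flags;       /* reordered so the size stays the same */
    };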