freebsd-skq

Author	SHA1	Message	Date
attilio	b24a52ec9e	Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency. Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc	2013-05-07 22:46:24 +00:00
alc	85384d7eba	Remove a redundant call to panic() from vm_radix_keydiff(). The assertion before the loop accomplishes the same thing. Sponsored by: EMC / Isilon Storage Division	2013-05-07 18:45:34 +00:00
alc	1136bac82b	Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically, change the way that these functions ascend the tree when the search for a matching leaf fails at an interior node. Rather than returning to the root of the tree and repeating the lookup with an updated key, maintain a stack of interior nodes that were visited during the descent and use that stack to resume the lookup at the closest ancestor that might have a matching descendant. Sponsored by: EMC / Isilon Storage Division Reviewed by: attilio Tested by: pho	2013-05-04 22:50:15 +00:00
jhb	383aea5677	Fix two bugs in the current NUMA-aware allocation code: - vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist() to allocate a page from a specific freelist. In the NUMA case it did not properly map the public VM_FREELIST_* constants to the correct backing freelists, nor did it try all NUMA domains for allocations from VM_FREELIST_DEFAULT. - vm_phys_alloc_pages() did not pin the thread and each call to vm_phys_alloc_freelist_pages() fetched the current domain to choose which freelist to use. If a thread migrated domains during the loop in vm_phys_alloc_pages() it could skip one of the freelists. If the other freelists were out of memory then it is possible that vm_phys_alloc_pages() would fail to allocate a page even though pages were available resulting in a panic in vm_page_alloc(). Reviewed by: alc MFC after: 1 week	2013-05-03 18:58:37 +00:00
kib	fc1170cbc9	Add a hint suggesting why tmpfs does not need a special case there.	2013-05-02 18:35:12 +00:00
kib	2f2c1edec8	Rework the handling of the tmpfs node backing swap object and tmpfs vnode v_object to avoid double-buffering. Use the same object both as the backing store for the tmpfs node and as the v_object. Besides reducing memory use by up to 2x when files from tmpfs are mapped, it also makes tmpfs read and write operations copy half as many bytes. The VM subsystem was already slightly adapted to tolerate an OBJT_SWAP object as v_object. Now vm_object_deallocate() is modified to not reinstantiate the OBJ_ONEMAPPING flag and to help the VFS correctly handle the VV_TEXT flag on the last dereference of the tmpfs backing object. Reviewed by: alc Tested by: pho, bf MFC after: 1 month	2013-04-28 19:38:59 +00:00
kib	d4d37d6d88	Make vm_object_page_clean() and vm_mmap_vnode() tolerate a vnode's v_object of non-OBJT_VNODE type. For vm_object_page_clean(), simply do not assert that the object type must be OBJT_VNODE, and add a comment explaining how the check for OBJ_MIGHTBEDIRTY prevents the rest of the function from operating on such objects. For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it to be for the swap pager (or default), handle the bypass filesystems, and correctly acquire the object reference in this case. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:25:09 +00:00
kib	dae3935768	Assert that the object type for the vnode's non-NULL v_object, passed to vnode_pager_setsize(), is either OBJT_VNODE or, if the vnode was already reclaimed, OBJT_DEAD. Note that the latter is only possible because some filesystems, in particular nfsiods from NFS clients, call vnode_pager_setsize() with an unlocked vnode. Moreover, if the object is terminated, do not perform the resizing operation. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:19:26 +00:00
kib	0e1bea778f	Convert panic() into KASSERT(). Reviewed by: alc MFC after: 1 week	2013-04-28 18:40:55 +00:00
alc	a75dfbd08b	Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le(). This call is clearing bits from the key that will be set again by the next line. Sponsored by: EMC / Isilon Storage Division	2013-04-28 08:29:00 +00:00
alc	046db6cecd	Avoid some lookup restarts in vm_radix_lookup_{ge,le}(). Sponsored by: EMC / Isilon Storage Division	2013-04-27 16:44:59 +00:00
glebius	18dd370b59	Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus isn't yet initialized. Otherwise we will panic at first allocation later. Sponsored by: Nginx, Inc.	2013-04-22 09:02:23 +00:00
alc	78339bf7f3	Simplify vm_radix_{add,dec}lev(). Sponsored by: EMC / Isilon Storage Division	2013-04-22 01:26:13 +00:00
alc	aaf865752d	When calculating the number of reserved nodes, discount the pages that will be used to store the nodes. Sponsored by: EMC / Isilon Storage Division	2013-04-18 05:34:33 +00:00
alc	2ce0362e96	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we have previously created a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the second (and final) step in eliminating those unnecessary level zero interior nodes. Specifically, it updates the deletion and insertion functions so that they do not require a level zero interior node at the root of the trie. For a "buildworld" workload, this change results in a 16.8% reduction in the number of interior nodes allocated and a similar reduction in the average execution time for lookup functions. For example, the average execution time for a call to vm_radix_lookup_ge() is reduced by 22.9%. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-15 06:12:00 +00:00
alc	565184245d	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we always create a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the first step in eliminating those unnecessary level zero interior nodes. Specifically, it updates all of the lookup functions so that they do not require a level zero interior node at the root. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-12 20:21:28 +00:00
glebius	204e3efd77	Convert UMA code to C99 uintXX_t types.	2013-04-09 17:43:48 +00:00
glebius	486eba7ad7	Swap us_freecount and us_flags, achieving same structure size as before previous commit. Submitted by: alc	2013-04-09 17:25:15 +00:00
glebius	d0006df0df	Since now we support 256 items per slab, we need more bits for us_freecount. This grows uma_slab_head on 32-bit arches, but growth isn't significant. Taking kmem zones as example, only the 32 byte zone is affected, ipers is reduced from 113 to 112. In collaboration with: kib	2013-04-09 15:15:52 +00:00
glebius	3206771906	Fix KASSERTs: maximum number of items per slab is 256.	2013-04-09 12:20:44 +00:00
kib	d5061cb1cd	Fix the assertions for the state of the object under the map entry with the MAP_ENTRY_VN_WRITECNT flag: - Move the assertion that verifies the state of the v_writecount and vnp.writecount, under the block where the object is locked. - Check that the object type is OBJT_VNODE before asserting. Reported by: avg Reviewed by: alc MFC after: 1 week	2013-04-09 10:04:10 +00:00
attilio	3975276634	The per-page act_count can be made very-easily protected by the per-page lock rather than vm_object lock, without any further overhead. Make the formal switch. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-04-08 20:02:27 +00:00
glebius	7f9db020a2	Merge from projects/counters: UMA_ZONE_PCPU zones. These zones have slab size == sizeof(struct pcpu), but request from VM enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such a zone has a separate twin for each CPU in the system, and these twins are at a distance of sizeof(struct pcpu) from each other. This magic distance value would allow us to make some optimizations later. To address the private item of a CPU, simple arithmetic should be used: item = (type *)((char *)base + sizeof(struct pcpu) * curcpu) This arithmetic is available as the zpcpu_get() macro in pcpu.h. To introduce non-page size slabs, a new field, uk_slabsize, has been added to uma_keg. This shifted some frequently used fields of uma_keg to the fourth cache line on amd64. To mitigate this pessimization, the uma_keg fields were rearranged a bit, and the least frequently used uk_name and uk_link moved down to the fourth cache line. All other fields that are dereferenced frequently fit into the first three cache lines. Sponsored by: Nginx, Inc.	2013-04-08 19:10:45 +00:00
alc	a9ceed102a	Micro-optimize the order of struct vm_radix_node's fields. Specifically, arrange for all of the fields to start at a short offset from the beginning of the structure. Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in vm_radix_getroot(). Sponsored by: EMC / Isilon Storage Division	2013-04-07 01:30:51 +00:00
jeff	fa887dba7b	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
alc	916607c009	Simplify vm_radix_keybarr(). Sponsored by: EMC / Isilon Storage Division	2013-04-06 18:04:35 +00:00
alc	c348483a3d	Simplify vm_radix_insert(). Reviewed by: attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-06 06:02:55 +00:00
alc	631b72b276	Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and vm_radix_topage(). This transformation eliminates some unnecessary conditional branches from the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove(). Simplify the control flow of vm_radix_lookup_{ge,le}(). Reviewed by: attilio (an earlier version) Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-03 06:37:25 +00:00
kib	7b210bf144	Release the v_writecount reference on the vnode in case of error, before the vnode is vput() in vm_mmap_vnode(). An error return means that there is no use reference on the vnode from the vm object reference, and failing to restore v_writecount breaks the invariant that v_writecount is less than or equal to the usecount. The situation was observed when the NFS client returns ESTALE for VOP_GETATTR() after the open. In collaboration with: pho MFC after: 1 week	2013-03-28 06:39:27 +00:00
alc	f90174984d	Introduce vm_radix_isleaf() and use it in a couple places. As compared to using vm_radix_node_page() == NULL, the compiler is able to generate one less conditional branch when vm_radix_isleaf() is used. More use cases involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove() will follow. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-26 17:30:40 +00:00
alc	97049bb4f6	Micro-optimize the control flow in a few places. Eliminate a panic call that could never be reached in vm_radix_insert(). (If the pointer being checked by the panic call were ever NULL, the immediately preceding loop would have already crashed on a NULL pointer dereference.) Reviewed by: attilio (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-03-24 16:43:07 +00:00
kib	9382f70781	Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550. Sponsored by: The FreeBSD Foundation	2013-03-21 07:28:15 +00:00
kib	fde3650fd8	Fix the logic inversion in the r248512. Noted by: mckay	2013-03-20 09:44:23 +00:00
kib	2ace051956	Do not map the swap i/o pbufs if the geom provider for the swap partition accepts unmapped requests. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:39:27 +00:00
kib	a43491886a	Pass unmapped buffers for page in requests if the filesystem indicated support for the unmapped i/o. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:36:28 +00:00
kib	7c26a038f9	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminates the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitly requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitly specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace limit. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, are accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is enforced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not work, because buffer_map fragmentation does not allow the limit to be reached. At Jeff Roberson's request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks	2013-03-19 14:13:12 +00:00
attilio	919afa77e4	Commit new file FreeBSD tags. Sponsored by: EMC / Isilon storage division	2013-03-17 23:53:06 +00:00
attilio	d500d6361a	MFC	2013-03-17 23:39:52 +00:00
alc	a69d85af8b	Fix a couple typos. Sponsored by: EMC / Isilon Storage Division	2013-03-17 20:44:09 +00:00
alc	6cbd8f24b9	The calls to vm_radix_lookup_ge() by vm_reserv_alloc_{contig,page}() can be eliminated. If the calls to vm_radix_lookup_le() return NULL, then the page at the head of the object's memq must be the page with the least pindex greater than the specified pindex. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 20:40:31 +00:00
alc	8a01505f5e	The M_ZERO can be eliminated from the uma_zalloc() call in vm_radix_node_get() with a small change to vm_radix_reclaim_allnodes_int(). This change further reduced the average number of cycles per vm_page_insert() call from 532 to 519. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 16:49:37 +00:00
alc	9e48bd7ba9	Most allocation of pages to objects proceeds from lower to higher indices. Consequently, vm_page_insert() should use vm_radix_lookup_le() instead of vm_radix_lookup_ge(). Here's why. In the expected case, vm_radix_lookup_le() will quickly find a page less than the specified key at the same radix node. In contrast, vm_radix_lookup_ge() is expected to return NULL, but to do that it must examine every slot in the radix tree that is greater than the key. Prior to this change, the average cost of a vm_page_insert() call on my test machine was 992 cycles. After this change, the average cost is only 532 cycles, a reduction of 46%. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 16:23:19 +00:00
alc	b346e448af	Simplify the interface to vm_radix_insert() by eliminating the parameter "index". The content of a radix tree leaf, or at least its "key", is not opaque to the other radix tree operations. Specifically, they know how to extract the "key" from a leaf. So, eliminating the parameter "index" isn't breaking the abstraction. Moreover, eliminating the parameter "index" effectively prevents the caller from passing an inconsistent "index" and leaf to vm_radix_insert(). Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-17 16:06:03 +00:00
attilio	a2e67affe3	Expand ambiguous comments some more. Requested by: alc	2013-03-17 15:27:26 +00:00
kib	7ca94eca24	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00
kib	63efc821c3	Add pmap function pmap_copy_pages(), which copies the content of the pages around, taking an array of vm_page_t both for source and destination. Starting offsets and total transfer size are specified. The function implements an optimal copying algorithm using platform-specific optimizations. For instance, on the architectures where the direct map is available, no transient mappings are created; for i386 the per-cpu ephemeral page frame is used. The code was typically borrowed from the pmap_copy_page() for the same architecture. Only i386/amd64, powerpc aim and arm/arm-v6 implementations were tested at the time of commit. High-level code, not committed yet to the tree, ensures that the use of the function is only allowed after explicit enablement. For sparc64, the existing code has known issues and a stub is added instead, to allow the kernel to link. Sponsored by: The FreeBSD Foundation Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6) MFC after: 2 weeks	2013-03-14 20:18:12 +00:00
kib	51407f194b	Remove excessive and inconsistent initializers for the various kernel maps and submaps. MFC after: 2 weeks	2013-03-14 19:50:09 +00:00
attilio	07b5846fc9	Fix compilation. Sponsored by: EMC / Isilon storage division	2013-03-13 01:38:32 +00:00
attilio	3b0a5f0419	Use the _KERNEL protectors. Sponsored by: EMC / Isilon storage division Requested by: alc	2013-03-13 01:02:11 +00:00
attilio	02cf10e6db	Add a further safety belt to prevent inconsistencies. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-03-13 01:00:34 +00:00
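
The per-CPU addressing arithmetic quoted in the UMA_ZONE_PCPU commit (7f9db020a2) can be illustrated with a small userland sketch. This is not kernel code: `struct fake_pcpu`, `FAKE_NCPUS`, `pcpu_item()` and `counter_sum()` are hypothetical names invented for the illustration, and the 256-byte stride is an assumed stand-in for sizeof(struct pcpu). Only the pointer arithmetic itself comes from the commit message.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct pcpu; only its size matters. */
struct fake_pcpu {
	char pad[256];
};

#define FAKE_NCPUS 4	/* assumed CPU count for this sketch */

/*
 * Same arithmetic as in the commit message: the copy belonging to a
 * given CPU lives sizeof(struct pcpu) * cpu bytes past the base item.
 */
static void *
pcpu_item(void *base, int cpu)
{
	return ((char *)base + sizeof(struct fake_pcpu) * (size_t)cpu);
}

/* Fold all per-CPU copies of an unsigned long counter into one total. */
static unsigned long
counter_sum(void *base)
{
	unsigned long total = 0;

	for (int cpu = 0; cpu < FAKE_NCPUS; cpu++)
		total += *(unsigned long *)pcpu_item(base, cpu);
	return (total);
}
```

A caller would allocate one backing region of FAKE_NCPUS * sizeof(struct fake_pcpu) bytes (for example with calloc()), have each CPU update only the copy returned by pcpu_item() for its own index, and call counter_sum() when a global value is needed. Keeping the stride equal to the per-CPU structure size is what the commit message refers to as the "magic value of distance".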


3265 Commits