freebsd-nq

Author	SHA1	Message	Date
Alan Cox	d5efa0a475	Switching from a global hash table to per-vm_object radix tries for mapping vm_object page indices to on-disk swap space (r322913) has changed the synchronization requirements for a couple swap pager functions. Whereas before a read lock on the vm object sufficed because of the global mutex on the hash table, a write lock on the vm object may now be required. In particular, calls to vm_pager_page_unswapped() now require a write lock on the vm_object. Consequently, vm_fault()'s fast path cannot call vm_pager_page_unswapped(). The swap space will have to be released at a later point. Reviewed by: kib, markj X-MFC with: r322913 Differential Revision: https://reviews.freebsd.org/D12134	2017-08-28 16:55:43 +00:00
Alan Cox	90ea34bf97	Address a compilation warning on some architectures that was introduced by the previous change, r321386. Reported by: ian MFC after: 10 days X-MFC after: r321386	2017-07-23 19:35:14 +00:00
Alan Cox	8b5e1472d2	Utilize pmap_enter(..., psind=1) in vm_fault_soft_fast() on amd64. (The Differential Revision discusses the benefits of this change.) Add a function, vm_reserv_to_superpage(), that returns the superpage containing the specified base page. Reviewed by: kib, markj Tested by: pho MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D11556	2017-07-23 16:28:13 +00:00
Konstantin Belousov	19bd0d9c85	Implement address space guards. Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of the allocated address space, but does not allow instantiation of the pages in the range. It is useful for more explicit support for usual two-stage reserve then commit allocators, since it prevents accidental instantiation of the mapping, e.g. by mprotect(2). Use guards to reimplement stack grow code. Explicitely track stack grow area with the guard, including the stack guard page. On stack grow, trivial shift of the guard map entry and stack map entry limits makes the stack expansion. Move the code to detect stack grow and call vm_map_growstack(), from vm_fault() into vm_map_lookup(). As result, it is impossible to get random mapping to occur in the stack grow area, or to overlap the stack guard page. Enable stack guard page by default. Reviewed by: alc, markj Man page update reviewed by: alc, bjk, emaste, markj, pho Tested by: pho, Qualys Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11306 (man pages)	2017-06-24 17:01:11 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Alan Cox	8956418832	Two changes to vm_fault_populate(): Simplify the logic for clipping the range returned by the pager to fit within the map entry. Use atop() rather than OFF_TO_IDX() on addresses. Reviewed by: kib MFC after: 1 week	2017-03-19 19:52:47 +00:00
Konstantin Belousov	bc27810671	Fix off-by-one in the vm_fault_populate() code. When re-calculating the last inclusive page index after the pager call, -1 was erronously ommitted. If the pager extended the run (unlikely), the result would be insertion of the valid page mapping outside the current map entry range. Found by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-19 14:42:16 +00:00
Konstantin Belousov	d1780e8dac	Use atop() instead of OFF_TO_IDX() for convertion of addresses or addresses offsets, as intended. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-03-14 19:39:17 +00:00
Konstantin Belousov	63cdcaaead	Properly handle possible underflow in vm_fault_prefault(). In vm_fault_prefault(), if backward count causes underflow in calculation of starta = addra - backward * PAGE_SIZE; then starta must be clipped to entry->start, instead of zero. Clipping to zero allowed mapping outside of the map entries address ranges, in particular, map at zero. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: alc MFC after: 1 week	2017-02-24 08:09:16 +00:00
Bjoern A. Zeeb	05d58177e8	Use %s __func__ to print the actual function name (been looking at the wrong one for too often lately at first), and also use %#lx to get the 0x prefix for the address. MFC after: 1 week	2017-02-14 01:20:03 +00:00
Konstantin Belousov	7a432b84e8	Fix two similar bugs in the populate vm_fault() code. If pager' populate method succeeded, but other thread raced with us and modified vm_map, we must unbusy all pages busied by the pager, before we retry the whole fault handling. If pager instantiated more pages than fit into the current map entry, we must unbusy the pages which are clipped. Also do some refactoring, clarify comments and use more clear local variable names. Reported and tested by: kargl, subbsd@gmail.com (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-30 18:55:33 +00:00
Konstantin Belousov	c42b43a054	Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2016-12-08 11:26:11 +00:00
Konstantin Belousov	dc5401d240	Move map_generation snapshot value into struct faultstate. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-08 10:29:41 +00:00
Konstantin Belousov	41ddec83c1	Move the fast fault path into the separate function. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-16 16:34:17 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Alan Cox	ebcddc7217	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302	2016-11-09 18:48:37 +00:00
Alan Cox	857025056f	In vm_fault()'s loop over the shadow chain, move a comment describing our invariants to a better place. Also, add two comments concerning the relationship between the map and vnode locks. Reviewed by: kib MFC after: 3 days	2016-11-03 16:44:55 +00:00
Alan Cox	dda4d36957	Move and revise a comment about the relation between the object's paging- in-progress count and the vnode. Prior to r188331, we always acquired the vnode lock before incrementing the object's paging-in-progress count. Now, we increment it before attempting to acquire the vnode lock with LK_NOWAIT, but we never sleep acquiring the vnode lock while we have the count incremented. Reviewed by: kib MFC after: 3 days	2016-11-01 17:11:10 +00:00
Konstantin Belousov	e26236e9f3	Change remained internal uses of boolean_t to bool in vm/vm_fault.c. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 20:39:38 +00:00
Alan Cox	f994b2077b	Merge and sort vm_fault_hold()'s "int" variable definitions. Reviewed by: kib MFC after: 7 days	2016-10-30 19:15:59 +00:00
Konstantin Belousov	022dfd690c	Remove vnode_locked label and goto, by collapsing vp calculation into the conditional. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 18:05:18 +00:00
Alan Cox	cd8a6fe8e9	The "lookup_is_valid" field is used as a "bool". Make it one. Convert vm_fault_hold()'s Boolean variables that are only used internally to "bool". Add a comment describing why the one remaining "boolean_t" was not converted. Reviewed by: kib MFC after: 8 days	2016-10-29 21:01:49 +00:00
Alan Cox	320023e286	With one exception, "hardfault" is used like a "bool". Change that exception and make it a "bool". Reviewed by: kib MFC after: 7 days	2016-10-29 19:22:38 +00:00
Mark Johnston	a9ee028d04	Add one more use of unlock_vp(). Discussed with: kib X-MFC With: r308094	2016-10-29 18:47:28 +00:00
Konstantin Belousov	cfabea3d3a	Add unlock_vp() helper. Trim space. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-29 18:03:29 +00:00
Konstantin Belousov	230afe0be6	If vm_fault_hold(9) finds that fs.m is wired, do not free it after a pager error, leave the page to the wire owner. E.g. the page might be a part of the invalidated buffer. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8197	2016-10-17 08:17:06 +00:00
Mark Johnston	eb17fb15b3	Plug a potential vnode lock leak in vm_fault_hold(). Reviewed by: alc, kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8242	2016-10-13 20:39:34 +00:00
Alan Cox	8d67b8c863	Add a comment describing the 'fast path' that was introduced in r270011. Reviewed by: kib MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-07-20 17:20:22 +00:00
Alan Cox	0c3a489325	Break up vm_fault()'s implementation of the read-ahead and delete-behind optimizations into two distinct pieces. The first piece consists of the code that should only be performed once per page fault and requires the map to be locked. The second piece consists of the code that should be performed each time a pager is called on an object in the shadow chain. (This second piece expects the map to be unlocked.) Previously, the entire implementation could be executed multiple times. Moreover, the second and subsequent executions would occur with the map unlocked. Usually, the ensuing unsynchronized accesses to the map were harmless because the map was not changing. Nonetheless, it was possible for a use-after-free error to occur, where vm_fault() wrote to a freed map entry. This change corrects that problem. Reported by: avg Reviewed by: kib MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-07-18 04:20:26 +00:00
Alan Cox	381b724280	Change the type of the map entry's next_read field from a vm_pindex_t to a vm_offset_t. (This field is used to detect sequential access to the virtual address range represented by the map entry.) There are three reasons to make this change. First, a vm_offset_t is smaller on 32-bit architectures. Consequently, a struct vm_map_entry is now smaller on 32-bit architectures. Second, a vm_offset_t can be written atomically, whereas it may not be possible to write a vm_pindex_t atomically on a 32-bit architecture. Third, using a vm_pindex_t makes the next_read field dependent on which object in the shadow chain is being read from. Replace an "XXX" comment. Reviewed by: kib Approved by: re (gjb) Sponsored by: EMC / Isilon Storage Division	2016-07-07 20:58:16 +00:00
Konstantin Belousov	3f1c66b8d2	Change type of the 'dead' variable to boolean. Requested by: alc MFC after: 1 week Approved by: re (gjb)	2016-07-03 00:08:17 +00:00
Konstantin Belousov	725441f69b	If the vm_fault() handler raced with the vm_object_collapse() sleepable scan, iteration over the shadow chain looking for a page could find an OBJ_DEAD object. Such state of the mapping is only transient, the dead object will be terminated and removed from the chain shortly. We must not return KERN_PROTECTION_FAILURE unless the object type is changed to OBJT_DEAD in the chain, indicating that paging on this address is really impossible. Returning KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered to processes, or kernel accesses to UVA spuriously failing with EFAULT. If the object with OBJ_DEAD flag is found, only return KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD. Otherwise, sleep a tick and retry the fault handling. Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by waiting until the paging on this object is finished. But to do so, we need to reference the dead object, while vm_object_collapse() insists on owning the final reference on the collapsed object. This could be fixed by e.g. changing the assert to shared reference release between vm_fault() and vm_object_collapse(), but it seems to be too much complications for rare boundary condition. PR: 204426 Tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation X-Differential revision: https://reviews.freebsd.org/D6085 MFC after: 2 weeks Approved by: re (gjb)	2016-06-27 21:54:19 +00:00
Alan Cox	bccdea450b	Use vm_page_replace_checked() instead of vm_page_rename() for implementing optimized copy-on-write faults. This has two advantages: (1) one less radix tree operation is performed and (2) vm_page_replace_checked() cannot fail, making the code simpler. Submitted by: Ryan Libby Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4478	2016-05-27 06:05:12 +00:00
Alan Cox	10b4196bd0	Correct an error in a comment: One of the conditions for page allocation is actually the opposite of that stated in the comment. Remove an unnecessary assignment. Use an assertion to document the fact that no assignment is needed. Rewrite another comment to clarify that the page is not completely valid. Reviewed by: kib	2016-05-23 16:59:05 +00:00
Alan Cox	6753423ccb	When descending a shadow chain of objects, it makes no sense to update the current offset (spelled: "fs.pindex") until it is known whether a backing object exists. In fact, if not for the fact that the backing object offset is zero when there is no backing object, this update would produce a broken offset. Reviewed by: kib	2016-05-21 23:18:23 +00:00
Alan Cox	521ddf39cb	Clean up the handling of errors from vm_pager_get_pages(). Mostly, this cleanup consists of fixes to comments. However, there is one change to code: Remove special-case handling of errors involving the kernel map. We do not perform I/O on the kernel map, so there is no need for this special case. Reviewed by: kib (an earlier version)	2016-05-19 19:27:33 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Gleb Smirnoff	b0cd20172d	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceeded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strenghtening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-12-16 21:30:45 +00:00
Conrad Meyer	6fee422ed5	vm_fault_hold: handle vm_page_rename failure On vm_page_rename failure, fix a missing object unlock and a double free of a page. First remove the old page, then rename into other page into first_object, then free the old page. This avoids the problem on rename failure. This is a little ugly but seems to be the most straightforward solution. Tested with: $ sysctl debug.fail_point.uma_zalloc_arg="1%return" $ kyua test -k /usr/tests/sys/Kyuafile Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: kib Seen by: alc Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4326	2015-12-06 17:46:12 +00:00
Alan Cox	d8015db3b5	Refinements to r281079's sequential access optimization: Prefetched pages, which constitute the majority of the pages that are processed by vm_fault_dontneed(), are already near the tail of the inactive queue. Only the pages at faulting virtual addresses are actually moved by vm_page_advise(..., MADV_DONTNEED). However, vm_page_advise(..., MADV_DONTNEED) is simultaneously too aggressive and passive for the moved pages. It makes most of these pages too easily reclaimable, and at the same time it leaves enough pages in the active queue to trigger pageouts by the page daemon. Instead, with this change, the pages at faulting virtual addresses are moved to the tail of the inactive queue, where they are relatively close to the pages prefetched by the same page fault. Discussed with: jeff Sponsored by: EMC / Isilon Storage Division	2015-08-03 20:30:27 +00:00
Konstantin Belousov	6a875bf929	Do not pretend that vm_fault(9) supports unwiring the address. Rename the VM_FAULT_CHANGE_WIRING flag to VM_FAULT_WIRE. Assert that the flag is only passed when faulting on the wired map entry. Remove the vm_page_unwire() call, which should be never reachable. Since VM_FAULT_WIRE flag implies wired map entry, the TRYPAGER() macro is reduced to the testing of the fs.object having a default pager. Inline the check. Suggested and reviewed by: alc Tested by: pho (previous version) MFC after: 1 week	2015-07-30 18:28:34 +00:00
Gleb Smirnoff	093c7f396d	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-06-12 11:32:20 +00:00
Konstantin Belousov	85af31a464	Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with the vnode locked. Review: https://reviews.freebsd.org/D2381 Submitted by: Conrad Meyer, Attilio Rao MFC after: 1 week	2015-04-28 08:20:23 +00:00
Alan Cox	b5ab20c066	Until the lock assertions in vm_page_advise() are properly reevaluated, vm_fault_dontneed() should acquire a write lock on the first object in the shadow chain. Reported by: gleb, David Wolfskill	2015-04-05 20:07:33 +00:00
Alan Cox	a8b0f1009d	Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division	2015-04-04 19:10:22 +00:00
Alan Cox	3d653db063	Introduce vm_object_color() and use it in mmap(2) to set the color of named objects to zero before the virtual address is selected. Previously, the color setting was delayed until after the virtual address was selected. In rtld, this delay effectively prevented the mapping of a shared library's code section using superpages. Now, for example, we see the first 1 MB of libc's code on armv6 mapped by a superpage after we've gotten through the initial cold misses that bring the first 1 MB of code into memory. (With the page clustering that we perform on read faults, this happens quickly.) Differential Revision: https://reviews.freebsd.org/D2013 Reviewed by: jhb, kib Tested by: Svatopluk Kraus (armv6) MFC after: 6 weeks	2015-03-21 17:56:55 +00:00
Alan Cox	dfdf9abd94	Fix the root cause of the "vm_reserv_populate: reserv <address> is already promoted" panics. The sequence of events that leads to a panic is rather long and circuitous. First, suppose that process P has a promoted superpage S within vm object O that it can write to. Then, suppose that P forks, which leads to S being write protected. Now, before P's child exits, suppose that P writes to another virtual page within O. Since the pages within O are copy on write, a shadow object for O is created to house the new physical copy of the faulted on virtual page. Then, before P can fault on S, P's child exists. Now, when P faults on S, it will follow the "optimized" path for copy-on-write faults in vm_fault(), wherein the underlying physical page is moved from O to its shadow object rather than allocating a new page and copying the new page's contents from the old page. Moreover, suppose that every 4 KB physical page making up S is moved to the shadow object in this way. However, the optimized path does not move the underlying superpage reservation, which is the root cause of the panics! Ultimately, P performs vm_object_collapse() on O's shadow object, which destroys O and in doing so breaks any reservations still belonging to O. This leaves the reservation underlying S in an inconsistent state: It's simultaneously not in use and promoted. Breaking a reservation does not demote it because I never intended for a promoted reservation to be broken. It makes little sense. Finally, this inconsistency leads to an assertion failure the next time that the reservation is used. The failing assertion does not (currently) exist in FreeBSD 10.x or earlier. There, we will quietly break the promoted reservation. While illogical and unintended, breaking the reservation is essentially harmless. PR: 198163 Reviewed by: kib Tested by: pho X-MFC after: r267213 Sponsored by: EMC / Isilon Storage Division	2015-03-19 01:40:43 +00:00
Konstantin Belousov	f40cb1c645	Update mtime for tmpfs files modified through memory mapping. Similar to UFS, perform updates during syncer scans, which in particular means that tmpfs now performs scan on sync. Also, this means that a mtime update may be delayed up to 30 seconds after the write. The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that object could have been dirtied. Adapt fast page fault handler and vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as OBJT_VNODE. Reported by: Ronald Klop <ronald-lists@klop.ws> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-28 10:37:23 +00:00
Alan Cox	5268042bbd	Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks	2015-01-16 18:17:09 +00:00
Konstantin Belousov	18cc2ff047	Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem does not access kernel mappings directly. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 08:58:07 +00:00

1 2 3 4 5 ...

380 Commits