freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	12487941f4	Noted by: alc Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 days	2019-05-06 08:46:11 +00:00
Konstantin Belousov	78022527bb	Switch to use shared vnode locks for text files during image activation. kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition. The change removes VV_TEXT, replacing it with the condition v_writecount <= -1, and puts v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference when the segment is unmapped, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access. Atomic check for writeability and increment of v_writecount is performed by the VOP. vn_truncate() now increments v_writecount around VOP_SETATTR() call, lack of which is arguably a bug on its own. nullfs bypasses v_writecount to the lower vnode always, so nullfs vnode has its own v_writecount correct, and lower vnode gets all references, since object->handle is always lower vnode. On the text vnode' vm object dealloc, the v_writecount value is reset to zero, and deadfs vop_unset_text short-circuit the operation. Reclamation of lowervp always reclaims all nullfs vnodes referencing lowervp first, so no stray references are left. Reviewed by: markj, trasz Tested by: mjg, pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D19923	2019-05-05 11:20:43 +00:00
Konstantin Belousov	7f1446052f	Do not collapse objects with OBJ_NOSPLIT backing swap object. NOSPLIT swap objects are not anonymous, they are used by tmpfs regular files and POSIX shared memory. For such objects, collapse is not permitted. Reported by: mjg Reviewed by: markj, trasz Tested by: mjg, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19923	2019-05-05 11:06:19 +00:00
Eric van Gyzen	5e38e3f5eb	Include path for tmpfs objects in vm.objects sysctl This applies the fix in r283924 to the vm.objects sysctl added by r283624 so the output will include the vnode information (i.e. path) for tmpfs objects. Reviewed by: kib, dab MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D2724	2018-11-30 04:59:43 +00:00
Eric van Gyzen	0951bd362c	Add assertions and comment to vm_object_vnode() Reviewed by: kib MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D2724	2018-11-30 04:18:31 +00:00
Mark Johnston	0e48e06807	Re-apply r336984, reverting r339934. r336984 exposed the bug fixed in r340241, leading to the initial revert while the bug was being hunted down. Now that the bug is fixed, we can revert the revert. Discussed with: alc MFC after: 3 days	2018-11-10 20:33:08 +00:00
Mark Johnston	5d277a85ad	Revert r336984. It appears to be responsible for random segfaults observed when lots of paging activity is taking place, but the root cause is not yet understood. Requested by: alc MFC after: now	2018-10-30 22:40:40 +00:00
Mark Johnston	4c29d2de67	Refactor domainset iterators for use by malloc(9) and UMA. Before this change we had two flavours of vm_domainset iterators: "page" and "malloc". The latter was only used for kmem_() and hard-coded its behaviour based on kernel_object's policy. Moreover, its use contained a race similar to that fixed by r338755 since the kernel_object's iterator was being run without the object lock. In some cases it is useful to be able to explicitly specify a policy (domainset) or policy+iterator (domainset_ref) when performing memory allocations. To that end, refactor the vm_dominset_ KPI to permit this, and get rid of the "malloc" domainset_iter KPI in the process. Reviewed by: jeff (previous version) Tested by: pho (part of a larger patch) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17417	2018-10-23 16:35:58 +00:00
Alan Cox	005783a0a6	Allow vm object coalescing to occur in the midst of a vm object when the OBJ_ONEMAPPING flag is set. In other words, allow recycling of existing but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set. Such situations are increasingly common with jemalloc >= 5.0. This change has the expected effect of reducing the number of vm map entry and object allocations and increasing the number of superpage promotions. Reviewed by: kib, markj Tested by: pho MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D16501	2018-07-31 17:41:48 +00:00
Mark Johnston	1b5c869d64	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280	2018-05-04 17:17:30 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Mark Johnston	1d3a1bcfac	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Konstantin Belousov	20e4afbfbb	On munlock(), unwire correct page. It is possible, for complex fork()/collapse situations, to have sibling address spaces to partially share shadow chains. If one sibling performs wiring, it can happen that a transient page, invalid and busy, is installed into a shadow object which is visible to other sibling for the duration of vm_fault_hold(). When the backing object contains the valid page, and the wiring is performed on read-only entry, the transient page is eventually removed. But the sibling which observed the transient page might perform the unwire, executing vm_object_unwire(). There, the first page found in the shadow chain is considered as the page that was wired for the mapping. It is really the page below it which is wired. So we unwire the wrong page, either triggering the asserts of breaking the page' wire counter. As the fix, wait for the busy state to finish if we find such page during unwire, and restart the shadow chain walk after the sleep. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14184	2018-02-05 12:49:20 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Alan Cox	115423761e	Make the vm object bypass and collapse counters per CPU. Requested by: mjg Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13611	2017-12-25 19:36:04 +00:00
Pedro F. Giffuni	796df753f4	SPDX: Consider code from Carnegie-Mellon University. Interesting cases, most likely from CMU Mach sources.	2017-11-30 15:48:35 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Jeff Roberson	8d6fbbb867	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Alan Cox	4074d642d2	Batch atomic updates to the number of active, inactive, and laundry pages by vm_object_terminate_pages(). For example, for a "buildworld" workload, this batching reduces vm_object_terminate_pages()'s average execution time by 12%. (The total savings were about 11.7 billion processor cycles.) Reviewed by: kib MFC after: 1 week	2017-10-19 04:13:47 +00:00
Alan Cox	cf060942db	Optimize vm_object_page_remove() by eliminating pointless calls to pmap_remove_all(). If the object to which a page belongs has no references, then that page cannot possibly be mapped. Reviewed by: kib MFC after: 1 week	2017-09-28 17:55:41 +00:00
Konstantin Belousov	5bf949377e	For unlinked files, do not msync(2) or sync on the vnode deactivation. One consequence of the patch is that msyncing unlinked file mappings no longer reduces the amount of the dirty memory in the system, but I do not think that there are users of msync(2) that utilize it for such side-effect. Reported and tested by: tjil PR: 222356 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12411	2017-09-19 16:46:37 +00:00
Konstantin Belousov	bba52ecadd	Batch freeing of the pages in vm_object_page_remove() under the same free queue mutex lock owning session, same as it was done for the object termination in r323561. Reported and tested by: mjg Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-15 16:07:09 +00:00
Konstantin Belousov	2fcd1ff68f	Do not relock free queue mutex for each page, free whole terminating object' page queue under the single mutex lock. First, all pages on the queue are prepared for free by calls to vm_page_free_prep(), and pages which should not be returned to the physical allocator (e.g. wired or fictitious) are simply removed from the queue. On the second pass, vm_page_free_phys_pglist() inserts all pages from the queue without relocking the mutex. The change improves the object termination, e.g. on the process exit where large anonymous memory objects otherwise cause relocks the free queue mutex for each page. More, if several such processes are exiting or execing in parallel, the mutex was highly contended on the address space demolition. Diagnosed and tested by: mjg (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:22:07 +00:00
Konstantin Belousov	93c5d3a46a	Add a vm_page_change_lock() helper, the common code to not relock page lock if both old and new pages use the same underlying lock. Convert existing places to use the helper instead of inlining it. Use the optimization in vm_object_page_remove(). Suggested and reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-09 17:35:19 +00:00
Konstantin Belousov	f425ab8e50	Replace global swhash in swap pager with per-object trie to track swap blocks assigned to the object pages. - The global swhash_mtx is removed, trie is synchronized by the corresponding object lock. - The swp_pager_meta_free_all() function used during object termination is optimized by only looking at the trie instead of having to search whole hash for the swap blocks owned by the object. - On swap_pager_swapoff(), instead of iterating over the swhash, global object list have to be inspected. There, we have to ensure that we do see valid trie content if we see that the object type is swap. Sizing of the swblk zone is same as for swblock zone, each swblk maps SWAP_META_PAGES pages. Proposed by: alc Reviewed by: alc, markj (previous version) Tested by: alc, pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D11435	2017-08-25 23:13:21 +00:00
Ruslan Bukin	7bbdb843b6	Add OBJ_PG_DTOR flag to VM object. Setting this flag allows us to skip pages removal from VM object queue during object termination and to leave that for cdev_pg_dtor function. Move pages removal code to separate function vm_object_terminate_pages() as comments does not survive indentation. This will be required for Intel SGX support where we will have to remove pages from VM object manually. Reviewed by: kib, alc Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11688	2017-08-16 08:49:11 +00:00
Konstantin Belousov	0ecee546c5	Do not allocate struct kinfo_vmobject on stack. Its size is 1184 bytes. Noted by: eugen Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-07-22 13:33:06 +00:00
Konstantin Belousov	cd1241fbd0	Add pctrie_init() and vm_radix_init() to initialize generic pctrie and vm_radix trie. Existing vm_radix_init() function is renamed to vm_radix_zinit(). Inlines moved out of the _ headers. Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11661	2017-07-19 20:52:47 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Alan Cox	52d1addaa1	Relax the locking requirements for vm_object_page_noreuse(). While reviewing all uses of OFF_TO_IDX(), I observed that vm_object_page_noreuse() is requiring an exclusive lock on the object when, in fact, a shared lock suffices. Reviewed by: kib, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D10011	2017-03-15 17:43:45 +00:00
Konstantin Belousov	d1780e8dac	Use atop() instead of OFF_TO_IDX() for convertion of addresses or addresses offsets, as intended. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-03-14 19:39:17 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Mark Johnston	aa3650ea36	Avoid page lookups in the top-level object in vm_object_madvise(). We can iterate over consecutive resident pages in the top-level object using the object's page list rather than by performing lookups in the object radix tree. This extends one of the optimizations in r312208 to the case where a shadow chain is present. Suggested by: alc Reviewed by: alc, kib (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9282	2017-01-30 18:51:43 +00:00
Mark Johnston	c2655a40a7	Avoid unnecessary page lookups in vm_object_madvise(). vm_object_madvise() is frequently used to apply advice to a contiguous set of pages in an object with no backing object. Optimize this case by skipping non-resident subranges in constant time, and by iterating over resident pages using the object memq, thus avoiding radix tree lookups on each page index in the specified range. While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the "advise" parameter to vm_object_madvise() to "advice." Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9098	2017-01-15 03:50:08 +00:00
Konstantin Belousov	77d6fd97ef	Improve vm_object_scan_all_shadowed() to also check swap backing objects. As noted in the removed comment, it is possible and not prohibitively costly to look up the swap blocks for the given page index. Implement a swap_pager_find_least() function to do that, and use it to iterate simultaneously over both backing object page queue and swap allocations when looking for shadowed pages. Testing shows that number of new succesful scans, enabled by this addition, is small but non-zero. When worked out, the change both further reduces the depth of the shadow object chain, and frees unused but allocated swap and memory. Suggested and reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-18 20:56:14 +00:00
Alan Cox	3453bca864	Eliminate every mention of PG_CACHED pages from the comments in the machine- independent layer of the virtual memory system. Update some of the nearby comments to eliminate redundancy and improve clarity. In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per The Chicago Manual of Style. Update the comment in vm/vm_page.h defining the four types of page queues to reflect the elimination of PG_CACHED pages and the introduction of the laundry queue. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8752	2016-12-12 17:47:09 +00:00
Alan Cox	563a19d546	During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was called to allocate a new page of radix trie nodes, there could be a call to vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress vm_radix_insert(). With the removal of PG_CACHED pages, we can simplify vm_radix_insert() and vm_radix_remove() by removing the flags on the root of the trie that were used to detect this case and the code for restarting vm_radix_insert() when it happened. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8664	2016-12-01 17:26:37 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Alan Cox	ebcddc7217	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302	2016-11-09 18:48:37 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there is still no waiters recorded for the busy state, the xbusy owner did not acquired the page lock, so it proceeded. More, suppose that some other thread happen to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more that one sbusy owner is right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Edward Tomasz Napierala	411455a8fb	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month	2016-08-10 16:12:31 +00:00
Konstantin Belousov	19efd8a5a8	In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called, if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate(). The vnode_destroy_object(), when calling into vm_object_terminate(), must be able to flush buffers. BO_DEAD purpose is to quickly destroy buffers on write when the underlying vnode is not operable any more (one example is the devfs node after geom is gone). Setting BO_DEAD for reclaiming vnode before object is terminated is premature, and results in unability to flush buffers with live SU dependencies from vinvalbuf() in vm_object_terminate(). Reported by: David Cross <dcrosstech@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-11 14:19:09 +00:00
Konstantin Belousov	9f790a1756	Do not leak the vm object lock when swap reservation failed, in vm_object_coalesce(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-29 15:46:19 +00:00
Konstantin Belousov	aa9bc3b171	Prevent parallel object collapses. Both vm_object_collapse_scan() and swap_pager_copy() might unlock the object, which allows the parallel collapse to execute. Besides destroying the object, it also might move the reference from parent to the backing object, firing the assertion ref_count == 1. Collapses are prevented by bumping paging_in_progress counters on both the object and its backing object. Reported by: cem Tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D6085	2016-05-26 16:59:29 +00:00
Konstantin Belousov	98f139daef	Style changes to some most outrageous violations in vm_object_collapse(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-26 16:51:38 +00:00
Konstantin Belousov	2a339d9e3d	Add implementation of robust mutexes, hopefully close enough to the intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013. A robust mutex is guaranteed to be cleared by the system upon either thread or process owner termination while the mutex is held. The next mutex locker is then notified about inconsistent mutex state and can execute (or abandon) corrective actions. The patch mostly consists of small changes here and there, adding neccessary checks for the inconsistent and abandoned conditions into existing paths. Additionally, the thread exit handler was extended to iterate over the userspace-maintained list of owned robust mutexes, unlocking and marking as terminated each of them. The list of owned robust mutexes cannot be maintained atomically synchronous with the mutex lock state (it is possible in kernel, but is too expensive). Instead, for the duration of lock or unlock operation, the current mutex is remembered in a special slot that is also checked by the kernel at thread termination. Kernel must be aware about the per-thread location of the heads of robust mutex lists and the current active mutex slot. When a thread touches a robust mutex for the first time, a new umtx op syscall is issued which informs about location of lists heads. The umtx sleep queues for PP and PI mutexes are split between non-robust and robust. Somewhat unrelated changes in the patch: 1. Style. 2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared pi mutexes. 3. Removal of the userspace struct pthread_mutex m_owner field. 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls the lifetime of the shared mutex associated with a vnode' page. Reviewed by: jilles (previous version, supposedly the objection was fixed) Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects) Tested by: pho Sponsored by: The FreeBSD Foundation	2016-05-17 09:56:22 +00:00
Pedro F. Giffuni	763df3ec55	sys/vm: minor spelling fixes in comments. No functional change.	2016-05-02 20:16:29 +00:00

1 2 3 4 5 ...

554 Commits