freebsd-nq

Author	SHA1	Message	Date
Xin LI	78d7964b46	Implement INHERIT_ZERO for minherit(2). INHERIT_ZERO is an OpenBSD feature. When a page is marked as such, it would be zeroed upon fork(). This would be used in new arc4random(3) functions. PR: 182610 Reviewed by: kib (earlier version) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D427	2017-03-14 17:10:42 +00:00
Mark Johnston	e20ff1a4d4	Update a comment to reflect reality. MFC after: 1 week	2017-03-13 18:45:25 +00:00
Konstantin Belousov	2b6d1a639b	Follow-up to r313690. Fix two missed places where vm_object offset to index calculation should use unsigned shift, to allow handling of full range of unsigned offsets used to create device mappings. Reported and tested by: royger (previous version) Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-12 13:53:13 +00:00
Andriy Gapon	57223e9994	uma: fix pages <-> items conversions at several places Those places were not taking into account uk_ppera. At present one allocation is always used by one slab, so uk_ppera must be used to convert between pages and slabs. uk_ipers is used to convert between slabs and items. MFC after: 1 month (if ever)	2017-03-11 16:43:38 +00:00
Andriy Gapon	a55ebb7cd5	uma: eliminate uk_slabsize field The field was not used beyond the initial keg setup stage anyway. MFC after: 1 month (if ever)	2017-03-11 16:35:36 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Andriy Gapon	9b43bc27c4	call vm_lowmem hook in uma_reclaim_worker A comment near kmem_reclaim() implies that we already did that. Calling the hook is useful, because some handlers, e.g. ARC, might be able to release significant amounts of KVA. Now that we have more than one place where vm_lowmem hook is called, use this change as an opportunity to introduce flags that describe a reason for calling the hook. No handler makes use of the flags yet. Reviewed by: markj, kib MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9764	2017-02-25 16:39:21 +00:00
Konstantin Belousov	63cdcaaead	Properly handle possible underflow in vm_fault_prefault(). In vm_fault_prefault(), if backward count causes underflow in calculation of starta = addra - backward * PAGE_SIZE; then starta must be clipped to entry->start, instead of zero. Clipping to zero allowed mapping outside of the map entries address ranges, in particular, map at zero. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: alc MFC after: 1 week	2017-02-24 08:09:16 +00:00
Andriy Gapon	937c1b0757	try to fix RACCT_RSS accounting There could be a race between the vm daemon setting RACCT_RSS based on the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to zero. In that case we can get a zombie process with non-zero RACCT_RSS. If the process is jailed, that may break accounting for the jail. There could be other consequences. Fix this race in the vm daemon by updating RACCT_RSS only when a process is in the normal state. Also, make accounting a little bit more accurate by refreshing the page resident count after calling vm_pageout_map_deactivate_pages(). Finally, add an assert that the RSS is zero when a process is reaped. PR: 210315 Reviewed by: trasz Differential Revision: https://reviews.freebsd.org/D9464	2017-02-14 13:54:05 +00:00
Bjoern A. Zeeb	05d58177e8	Use %s __func__ to print the actual function name (been looking at the wrong one for too often lately at first), and also use %#lx to get the 0x prefix for the address. MFC after: 1 week	2017-02-14 01:20:03 +00:00
Konstantin Belousov	496ab0532d	Rework r313352. Rename kern_vm_* functions to kern_*. Move the prototypes to syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as needed, to avoid headers pollution. Requested by: alc, jhb Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D9535	2017-02-13 09:04:38 +00:00
Konstantin Belousov	04e89ffba8	Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-02-13 00:40:55 +00:00
Konstantin Belousov	987ff18184	Consistently handle negative or wrapping offsets in the mmap(2) syscalls. For regular files and posix shared memory, POSIX requires that [offset, offset + size) range is legitimate. At the mapping time, check that offset is not negative. Allowing negative offsets might expose the data that filesystem put into vm_object for internal use, esp. due to OFF_TO_IDX() signedness treatment. Fault handler verifies that the mapped range is valid, assuming that mmap(2) checked that arithmetic gives no undefined results. For device mappings, leave the semantic of negative offsets to the driver. Correct object page index calculation to not erroneously propagate sign. In either case, disallow overflow of offset + size. Update mmap(2) man page to explain the requirement of the range validity, and behaviour when the range becomes invalid after mapping. Reported and tested by: royger (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-12 21:05:44 +00:00
Konstantin Belousov	1c2ad3e962	Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int. This makes the code pass the whole word of the mmap(2) syscall argument prot to the syscall helper kern_vm_mmap(), which can validate all bits. The change provides a temporary fix for sys/vm/mmap_test mmap__bad_arguments, which was broken after r313352. PR: 216976 Reported and tested by: ngie Sponsored by: The FreeBSD Foundation	2017-02-11 20:27:39 +00:00
Edward Tomasz Napierala	69cdfcef2e	Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(), kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats instead of their sys_*() counterparts. Reviewed by: ed, dchagin, kib MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9378	2017-02-06 20:57:12 +00:00
Konstantin Belousov	5fca242374	Style, use tab after #define. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-02-04 19:16:19 +00:00
Alan Cox	8a99f1cc59	Over the years, the code and comments in vm_page_startup() have diverged in one respect. When determining how many page structures to allocate, contrary to what the comments say, the code does not account for the overhead of a page structure per page of physical memory. This revision changes the code to match the comments. Reviewed by: kib, markj MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D9081	2017-02-04 05:23:10 +00:00
Edward Tomasz Napierala	a6b15641d6	Ifdef out the unused vm_rr_selectdomain(). MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-02-02 17:44:55 +00:00
Mark Johnston	aa3650ea36	Avoid page lookups in the top-level object in vm_object_madvise(). We can iterate over consecutive resident pages in the top-level object using the object's page list rather than by performing lookups in the object radix tree. This extends one of the optimizations in r312208 to the case where a shadow chain is present. Suggested by: alc Reviewed by: alc, kib (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9282	2017-01-30 18:51:43 +00:00
Mateusz Guzik	736ff8c396	hwpmc: partially depessimize munmap handling if the module is not loaded HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the common (module not loaded) case. In particular this avoids permission checks + lock downgrade singlethreaded and in cases where an executable mapping is found the pmc sx lock is no longer bounced. Note this is a band aid. MFC after: 1 week	2017-01-24 22:00:16 +00:00
Mark Johnston	c2655a40a7	Avoid unnecessary page lookups in vm_object_madvise(). vm_object_madvise() is frequently used to apply advice to a contiguous set of pages in an object with no backing object. Optimize this case by skipping non-resident subranges in constant time, and by iterating over resident pages using the object memq, thus avoiding radix tree lookups on each page index in the specified range. While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the "advise" parameter to vm_object_madvise() to "advice." Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9098	2017-01-15 03:50:08 +00:00
Gleb Smirnoff	4f56243aad	Fix the contiguity once more.	2017-01-12 20:26:02 +00:00
Mark Johnston	8d65cba217	Remove a redundant use of min(). Reported by: rpokala X-MFC With: r311346	2017-01-05 03:13:45 +00:00
Mark Johnston	ec492b13f1	Add a small allocator for exec_map entries. Upon each execve, we allocate a KVA range for use in copying data to the new image. Pages must be faulted into the range, and when the range is freed, the backing pages are freed and their mappings are destroyed. This is a lot of needless overhead, and the exec_map management becomes a bottleneck when many CPUs are executing execve concurrently. Moreover, the number of available ranges is fixed at 16, which is insufficient on large systems and potentially excessive on 32-bit systems. The new allocator reduces overhead by making exec_map allocations persistent. When a range is freed, pages backing the range are marked clean and made easy to reclaim. With this change, the exec_map is sized based on the number of CPUs. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8921	2017-01-05 01:44:12 +00:00
Gleb Smirnoff	1e0c121f3a	Fix assertion that checks that pages are consecutive to properly handle bogus_page insertion(s).	2017-01-04 22:31:09 +00:00
Gleb Smirnoff	bfc8c24c73	Move bogus_page declaration to vm_page.h and initialization to vm_page.c. Reviewed by: kib	2017-01-04 22:27:19 +00:00
Mark Johnston	b1fd102ee7	Add a page queue for holding dirty anonymous unswappable pages. On systems without a configured swap device, an attempt to launder pages from a swap object will always fail and result in the page being reactivated. This means that the page daemon will continuously scan pages that can never be evicted. With this change, anonymous pages are instead moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device is configured, so unreferenced unswappable pages are excluded from the page daemon's workload. Reviewed by: alc	2017-01-03 00:05:44 +00:00
Justin Hibbits	b5345ef10e	Print flags in hex instead of decimal. Hex is easier to grok for flags, and consistent with other prints.	2017-01-02 16:50:52 +00:00
Konstantin Belousov	1569205f0a	Style fixes for vm_map_insert(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-01 18:49:46 +00:00
Konstantin Belousov	03302b1380	Ansify vm/vm_pager.c. Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-31 19:30:22 +00:00
Mateusz Guzik	6ff51a3685	Use vrefact in vnode_pager_alloc.	2016-12-31 10:37:56 +00:00
Konstantin Belousov	7a432b84e8	Fix two similar bugs in the populate vm_fault() code. If the pager's populate method succeeded, but another thread raced with us and modified vm_map, we must unbusy all pages busied by the pager, before we retry the whole fault handling. If the pager instantiated more pages than fit into the current map entry, we must unbusy the pages which are clipped. Also do some refactoring, clarify comments and use clearer local variable names. Reported and tested by: kargl, subbsd@gmail.com (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-30 18:55:33 +00:00
Konstantin Belousov	0c8bd6a7d8	Assert that the pages found on the object queue by vm_page_next() and vm_page_prev() have correct ownership. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2016-12-30 17:37:06 +00:00
Konstantin Belousov	9a4ee196dd	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-30 13:04:43 +00:00
Mateusz Guzik	0b3b55a0f2	Remove cpu_spinwait after seq_consistent. It does not add any benefit as the read routine will do it as necessary.	2016-12-30 06:26:17 +00:00
Alan Cox	920da7e4d2	Relax the object type restrictions on vm_page_alloc_contig(). Specifically, add support for object types that were previously prohibited because they could contain PG_CACHED pages. Roughly halve the number of radix trie operations performed by vm_page_alloc_contig() using the same approach that is employed by vm_page_alloc(). Also, eliminate the radix trie lookup performed with the free page queues lock held. Tidy up the handling of radix trie insert failures in vm_page_alloc() and vm_page_alloc_contig(). Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8878	2016-12-28 18:32:13 +00:00
Konstantin Belousov	8b590e9506	Remove redundancy in vmtotal(). There are two instances of inlined unlocks + continue in vmtotal() switch statements, which are ordinarily expressed with break from the switch case and code after the switch. Also, the combination of continue and break statement is redundant. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-26 19:29:04 +00:00
Konstantin Belousov	2e56b64fa4	Fix argument type and microoptimize swp_pager_meta_free(). The count argument natural type is vm_pindex_t, but due to the loop organization, it has to be signed type to detect the termination condition. Replace this logic by using distinguished counter for the processed pages, and terminate loop when the counter exceeds the argument. Completely process one swblock for all relevant indexes instead of doing relookup in hash when incrementing page index on the loop step. Do not drop hash mutex around iterations. Noted and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-24 09:57:31 +00:00
Konstantin Belousov	77d6fd97ef	Improve vm_object_scan_all_shadowed() to also check swap backing objects. As noted in the removed comment, it is possible and not prohibitively costly to look up the swap blocks for the given page index. Implement a swap_pager_find_least() function to do that, and use it to iterate simultaneously over both backing object page queue and swap allocations when looking for shadowed pages. Testing shows that number of new successful scans, enabled by this addition, is small but non-zero. When worked out, the change both further reduces the depth of the shadow object chain, and frees unused but allocated swap and memory. Suggested and reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-18 20:56:14 +00:00
Konstantin Belousov	71057cd207	In swp_pager_meta_free_all(), fix type of the index variable. Style. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-16 23:33:37 +00:00
Konstantin Belousov	a1e9a3bba3	Provide introductory description of the default pager. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-14 23:36:32 +00:00
Konstantin Belousov	a41ece0840	Remove locking around accounting initialization of the default object. The object is not yet fully constructed and must not be available to other threads. This makes default_pager_alloc() almost identical to swap_pager_alloc_init(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-14 23:34:25 +00:00
Alan Cox	3d026d871f	Tidy up. Mostly, remove or replace stale comments. Most of the comments in this file actually described the operation of the swap pager, not the default pager. Given that this is the wrong place to discuss the implementation of the swap pager, it shouldn't come as a surprise that as the swap pager evolved these comments became increasingly stale. In addition, apply some style fixes, like modernizing a few remaining old- style function definitions. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D8781	2016-12-14 17:28:55 +00:00
John Baldwin	a9546a6b17	Use db_lookup_proc() in the DDB 'show procvm' command. This allows processes to be identified by PID as well as a pointer address. MFC after: 2 weeks Sponsored by: DARPA / AFRL	2016-12-13 19:22:43 +00:00
Alan Cox	3453bca864	Eliminate every mention of PG_CACHED pages from the comments in the machine- independent layer of the virtual memory system. Update some of the nearby comments to eliminate redundancy and improve clarity. In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per The Chicago Manual of Style. Update the comment in vm/vm_page.h defining the four types of page queues to reflect the elimination of PG_CACHED pages and the introduction of the laundry queue. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8752	2016-12-12 17:47:09 +00:00
Gleb Smirnoff	255003da42	Allow bogus_page to be passed to pager(s).	2016-12-09 21:21:24 +00:00
Mark Johnston	90458813cd	Conditionalize PG_CACHE sysctls on COMPAT_FREEBSD11. Reviewed by: glebius, imp, jhb Differential Revision: https://reviews.freebsd.org/D8736	2016-12-09 18:55:27 +00:00
Konstantin Belousov	ed01d9894e	Implement the populate() pager method for phys pager. It allows providing configurable aggressive prefaulting and useful hints to page daemon about memory allocations, on faults for pages managed by phys pager. In fact, this implementation is superior to the MAP_SHARED_PHYS hack from my Postgresql paper, while giving similar benefits of reducing the page faults numbers on SysV shared memory mappings. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2016-12-08 11:35:53 +00:00
Konstantin Belousov	c42b43a054	Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object's type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager's vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there are no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2016-12-08 11:26:11 +00:00
Konstantin Belousov	dc5401d240	Move map_generation snapshot value into struct faultstate. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-08 10:29:41 +00:00
Konstantin Belousov	272cc3c4d0	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-08 10:28:51 +00:00
Alan Cox	e94965d82e	Previously, vm_radix_remove() would panic if the radix trie didn't contain a vm_page_t at the specified index. However, with this change, vm_radix_remove() no longer panics. Instead, it returns NULL if there is no vm_page_t at the specified index. Otherwise, it returns the vm_page_t. The motivation for this change is that it simplifies the use of radix tries in the amd64, arm64, and i386 pmap implementations. Instead of performing a lookup before every remove, the pmap can simply perform the remove. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D8708	2016-12-08 04:29:29 +00:00
Mark Johnston	43482f897b	Use the official spelling for NULL arguments to typed sysctl handlers. Reported by: bde	2016-12-07 01:15:10 +00:00
Mark Johnston	77edd8fa00	Provide dummy sysctls for v_cache_count and v_tcached. Some utilities (notably top(1)) exit if any of their input sysctls don't exist, and the removal of the above-mentioned PG_CACHE-related sysctls makes it difficult to run such utilities on different versions of the kernel without recompiling. Requested by: bde	2016-12-06 22:52:45 +00:00
Alan Cox	8804a2b030	Eliminate a stale comment; vm_radix_prealloc() was replaced in r254141. MFC after: 3 days	2016-12-02 16:29:30 +00:00
Alan Cox	563a19d546	During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was called to allocate a new page of radix trie nodes, there could be a call to vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress vm_radix_insert(). With the removal of PG_CACHED pages, we can simplify vm_radix_insert() and vm_radix_remove() by removing the flags on the root of the trie that were used to detect this case and the code for restarting vm_radix_insert() when it happened. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8664	2016-12-01 17:26:37 +00:00
Alan Cox	ba67369628	Recursion on the free page queue mutex occurred when UMA needed to allocate a new page of radix trie nodes to complete a vm_radix_insert() operation that was requested by vm_page_cache(). Specifically, vm_page_cache() already held the free page queue lock when UMA tried to acquire it through a call to vm_page_alloc(). This code path no longer exists, so there is no longer any reason to allow recursion on the free page queue mutex. Improve nearby comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8628	2016-11-27 01:42:53 +00:00
Mark Johnston	99e6e1930c	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589	2016-11-23 17:53:07 +00:00
Alan Cox	bba39b9ae3	Remove PG_CACHED-related fields from struct vmmeter, because they are no longer used. More precisely, they are always zero because the code that decremented and incremented them no longer exists. Bump __FreeBSD_version to mark this change. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8583	2016-11-22 18:13:46 +00:00
Gleb Smirnoff	e48b82bd83	- If caller specifies readbehind and readahead that together with count doesn't fit into a buf, then trim readbehind and readahead evenly. If rbehind was limited by the previous BMAP, then roundup its trim to block size. - Add KASSERT to check that b_blkno has proper offset from original blkno returned by BMAP. [1] - Add KASSERT to check that pages in buf are consecutive. Reviewed by: kib Submitted by: kib [1]	2016-11-17 20:32:32 +00:00
Konstantin Belousov	41ddec83c1	Move the fast fault path into the separate function. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-16 16:34:17 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Alan Cox	ebcddc7217	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specifically, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work.
When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302	2016-11-09 18:48:37 +00:00
Bryan Drewery	28323add09	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
Konstantin Belousov	1771e987ca	Do not sleep in vm_wait() if the pagedaemon has not yet started. Panic instead. Requests which cannot be satisfied by allocators at boot time often have unrealizable parameters. Waiting for the pagedaemon's start would hang the boot if done in the thread0 context and just never succeed if executed from another thread. In fact, for very early stages, sleep attempt panics with obscure diagnostic about the scheduler state, and explicit panic in vm_wait() makes the investigation much shorter by cutting off the examination of the thread and scheduler. Theoretically, some subsystem might grab a resource to exhaustion, and free it later in the boot process. If this unlikely scenario does appear for real, the way to diagnose the trouble can be revisited. Reported by: emaste Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8421	2016-11-04 12:58:50 +00:00
Alan Cox	857025056f	In vm_fault()'s loop over the shadow chain, move a comment describing our invariants to a better place. Also, add two comments concerning the relationship between the map and vnode locks. Reviewed by: kib MFC after: 3 days	2016-11-03 16:44:55 +00:00
Alan Cox	dda4d36957	Move and revise a comment about the relation between the object's paging- in-progress count and the vnode. Prior to r188331, we always acquired the vnode lock before incrementing the object's paging-in-progress count. Now, we increment it before attempting to acquire the vnode lock with LK_NOWAIT, but we never sleep acquiring the vnode lock while we have the count incremented. Reviewed by: kib MFC after: 3 days	2016-11-01 17:11:10 +00:00
Conrad Meyer	8532d381a9	Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code. This can be handy in tracking down what code touched hung bios and bufs last. The full history is especially useful, but adds enough bloat that it shouldn't be enabled in release builds. Function names (or arbitrary string constants) are tracked in a fixed-size ring in bufs. Bios gain a pointer to the upper buf for tracking. SCSI CCBs gain a pointer to the upper bio for tracking. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8366	2016-10-31 23:09:52 +00:00
Konstantin Belousov	e26236e9f3	Change remaining internal uses of boolean_t to bool in vm/vm_fault.c. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 20:39:38 +00:00
Konstantin Belousov	1dcadc022f	Remove vm_pager_has_page() declaration. It is not too useful since static inline definition appears later in the file. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 20:38:57 +00:00
Alan Cox	f994b2077b	Merge and sort vm_fault_hold()'s "int" variable definitions. Reviewed by: kib MFC after: 7 days	2016-10-30 19:15:59 +00:00
Konstantin Belousov	022dfd690c	Remove vnode_locked label and goto, by collapsing vp calculation into the conditional. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 18:05:18 +00:00
Konstantin Belousov	1be02479be	Split long line instead of unindenting it. Add KASSERT() verifying that a device object with the same handle has the same ops vector. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 18:04:11 +00:00
Alan Cox	cd8a6fe8e9	The "lookup_is_valid" field is used as a "bool". Make it one. Convert vm_fault_hold()'s Boolean variables that are only used internally to "bool". Add a comment describing why the one remaining "boolean_t" was not converted. Reviewed by: kib MFC after: 8 days	2016-10-29 21:01:49 +00:00
Alan Cox	320023e286	With one exception, "hardfault" is used like a "bool". Change that exception and make it a "bool". Reviewed by: kib MFC after: 7 days	2016-10-29 19:22:38 +00:00
Mark Johnston	a9ee028d04	Add one more use of unlock_vp(). Discussed with: kib X-MFC With: r308094	2016-10-29 18:47:28 +00:00
Konstantin Belousov	cfabea3d3a	Add unlock_vp() helper. Trim space. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-29 18:03:29 +00:00
Mark Johnston	829be5168d	Simplify keg_drain() a bit by using LIST_FOREACH_SAFE. MFC after: 1 week	2016-10-20 23:10:27 +00:00
Gleb Smirnoff	dcc0ff5a52	Fix incorrect assertion that could miss overflows. Reviewed by: kib	2016-10-19 19:50:09 +00:00
Konstantin Belousov	230afe0be6	If vm_fault_hold(9) finds that fs.m is wired, do not free it after a pager error, leave the page to the wire owner. E.g. the page might be a part of the invalidated buffer. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8197	2016-10-17 08:17:06 +00:00
Konstantin Belousov	bd9546a21c	Export vm_page_xunbusy_maybelocked(). Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8197	2016-10-17 08:14:23 +00:00
Mark Johnston	eb17fb15b3	Plug a potential vnode lock leak in vm_fault_hold(). Reviewed by: alc, kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8242	2016-10-13 20:39:34 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there are still no waiters recorded for the busy state, the xbusy owner did not acquire the page lock, so it proceeded. Moreover, suppose that some other thread happens to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more than one sbusy owner right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Konstantin Belousov	267ed8e2f7	When downgrading exclusively busied page to shared-busy state, wakeup waiters. Otherwise, owners of the shared-busy state are left blocked and might get into a deadlock. Note that the vm_page_busy_downgrade() function is not used in the tree right now. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8195	2016-10-11 18:09:37 +00:00
Alan Cox	70cf3ced3c	Make the page daemon's notion of what kind of pass is being performed by vm_pageout_scan() local to vm_pageout_worker(). There is no reason to store the pass in the NUMA domain structure. Reviewed by: kib MFC after: 3 weeks	2016-10-05 17:32:06 +00:00
Alan Cox	e57dd910e6	Change vm_pageout_scan() to return a value indicating whether the free page target was met. Previously, vm_pageout_worker() itself checked the length of the free page queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan freed enough pages to meet the free page target. Specifically, vm_pageout_worker() used vm_paging_needed(). The trouble with vm_paging_needed() is that it compares the length of the free page queues to the wakeup threshold for the page daemon, which is much lower than the free page target. Consequently, vm_pageout_worker() could conclude that the inactive queue scan succeeded in meeting its free page target when in fact it did not; and rather than immediately triggering an all-out laundering pass over the inactive queue, vm_pageout_worker() would go back to sleep waiting for the free page count to fall below the page daemon wakeup threshold again, at which point it will perform another limited (pass == 1) scan over the inactive queue. Changing vm_pageout_worker() to use vm_page_count_target() instead of vm_paging_needed() won't work because any page allocations that happen concurrently with the inactive queue scan will result in the free page count being below the target at the end of a successful scan. Instead, having vm_pageout_scan() return a value indicating success or failure is the most straightforward fix. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8111	2016-10-05 16:15:26 +00:00
Andrew Gallatin	edb2994a62	Conditionally move initial vfs bio alloc above 4G On machines with just the wrong amount of physical memory (enough to have a lot of bufs, but not enough to use VM_FREELIST_DMA32) it is possible for 32-bit address limited devices to have little to no memory left when attaching, due to potentially large vfs bio configs consuming all memory below 4GB not protected by VM_FREELIST_ISADMA. This causes the 32-bit devices to allocate from VM_FREELIST_ISADMA, leaving that freelist empty when ISA devices need DMAable memory. Rather than decrease VM_DMA32_NPAGES_THRESHOLD, use the time honored technique of putting initially allocated kernel data structs at the end (or at least not the beginning) of memory. Since this allocation is done at boot and is wired, it is not freed, so the system is low on 32-bit (and ISA) dma'ble memory forever. So it is a good candidate to move above 4GB. While here, remove an unneeded round_page() from kmem_malloc's size argument as suggested by alc. The first thing kmem_malloc() does is a round_page(size), so there is no need to do it before the call. Reviewed by: alc Sponsored by: Netflix	2016-10-03 13:23:43 +00:00
Alan Cox	8cb0c1029d	Various changes to pmap_ts_referenced() Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and into vm/pmap.h, and describe what its purpose is. Eliminate the archaic "XXX" comment about its value. I don't believe that its exact value, e.g., 5 versus 6, matters. Update the arm64 and riscv pmap implementations of pmap_ts_referenced() to opportunistically update the page's dirty field. On amd64, use the PDE value already cached in a local variable rather than dereferencing a pointer again and again. Reviewed by: kib, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D7836	2016-09-10 16:49:25 +00:00
Mark Johnston	dd9cb6da0b	Respect the caller's hints when performing swap readahead. The pager getpages interface allows the caller to bound the number of readahead and readbehind pages, and vm_fault_hold() makes use of this feature. These bounds were ignored after r305056, causing the swap pager to potentially page in more than the specified number of pages. Reported and reviewed by: alc X-MFC with: r305056	2016-09-04 00:25:49 +00:00
Mark Johnston	dbbaf04f1e	Remove support for idle page zeroing. Idle page zeroing has been disabled by default on all architectures since r170816 and has some bugs that make it seemingly unusable. Specifically, the idle-priority pagezero thread exacerbates contention for the free page lock, and yields the CPU without releasing it in non-preemptive kernels. The pagezero thread also does not behave correctly when superpage reservations are enabled: its target is a function of v_free_count, which includes reserved-but-free pages, but it is only able to zero pages belonging to the physical memory allocator. Reviewed by: alc, imp, kib Differential Revision: https://reviews.freebsd.org/D7714	2016-09-03 20:38:13 +00:00
Konstantin Belousov	9815066425	Make swapoff reliable. The swap_pager_swapoff() function uses trylock for the object lock before pagein, which means that either i/o to md(4) over swap, or intensive page faults over swap pager objects might prevent swapoff() from making any progress. Then the retry < 100 check fails and machine panics. If trylock fails, acquire the object lock in the blockable way and restart the hash bucket walk. Keep retries logic for now. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7688	2016-08-31 14:49:58 +00:00
Mark Johnston	915d1b71cd	Restore swap pager readahead after r292373. The removal of vm_fault_additional_pages() meant that a hard fault on a swap-backed page would result in only that page being read in. This change implements readahead and readbehind for the swap pager in swap_pager_getpages(). swap_pager_haspage() is modified to return the largest contiguous non-resident range of pages containing the requested range. Reviewed by: alc, kib Tested by: pho MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7677	2016-08-30 05:56:21 +00:00
Alan Cox	ce3ee09b53	Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when neither vm_pager_has_page() nor vm_pager_get_pages() is called. Reviewed by: kib, markj MFC after: 3 weeks	2016-08-14 22:00:45 +00:00
Mark Johnston	842ee21e20	Strengthen assertions about the busy state of newly-allocated pages. Reviewed by: alc MFC after: 1 week	2016-08-13 19:49:32 +00:00
Mark Johnston	fc85a6f0c4	Initialize page busy lock state in vm_phys_add_page(). MFC after: 1 week	2016-08-13 19:48:43 +00:00
Alan Cox	791444089f	Correct errors and clean up the comments on the active queue scan. Eliminate some unnecessary blank lines. Reviewed by: kib, markj MFC after: 1 week	2016-08-12 03:22:58 +00:00
Edward Tomasz Napierala	411455a8fb	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month	2016-08-10 16:12:31 +00:00
Alan Cox	f0edf3f806	Correct a spelling error.	2016-08-05 16:44:11 +00:00
Alan Cox	248fe642a7	Clean up the comments and code style in and around vm_pageout_cluster(). In particular, fix factual, grammatical, and spelling errors in various comments, and remove comments that are out of place in this function. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7410	2016-08-04 16:20:12 +00:00
Konstantin Belousov	0c657d22eb	Explain why swapgeom_close_ev() is delegated. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-03 07:11:19 +00:00
Alan Cox	87ff568c26	Restore the historical behavior of "sysctl vm.swap_idle_enabled=1". Prior to r254304, we had separate functions for reclamation and laundering (vm_pageout_scan) versus updating usage information, i.e., "reference bits", on active pages (vm_pageout_page_stats), and we only performed vm_req_vmdaemon(VM_SWAP_IDLE) if vm_pages_needed was true. However, since r254303, if vm_swap_idle_enabled was "1", we have performed vm_req_vmdaemon(VM_SWAP_IDLE) regardless of whether we are short of free pages. This was unintended and too aggressive, so I suspect no one uses this feature. With this change, we restore the historical behavior and only perform vm_req_vmdaemon(VM_SWAP_IDLE) when we are short of free pages. Reviewed by: kib, markj	2016-08-01 17:25:07 +00:00
Mark Johnston	897d0c6617	Use vm_page_undirty() instead of manually setting a page field. Reviewed by: alc MFC after: 3 days	2016-07-29 21:05:37 +00:00
Alan Cox	793172ea88	Remove a probe declaration that has been unused since r292469, when vm_pageout_grow_cache() was replaced. MFC after: 3 days	2016-07-29 16:43:51 +00:00
Alan Cox	f095d1bbc7	Remove any mention of cache (PG_CACHE) pages from the comments in vm_pageout_scan(). That function has not cached pages since r284376. MFC after: 3 days	2016-07-28 22:30:48 +00:00
Konstantin Belousov	88ad2d7b47	Do not delegate work to the geom event thread which can be done inline. In particular, swapongeom_ev() needed event thread context when swap pager configuration was performed under Giant and geom asserted that Giant is not owned. Now both of the reasons have gone away. On the other hand, note that swapgeom_release() is called from the bio_done context, and possible close cannot be performed inline. Also fix some minor issues. The swapgeom() function does not use the td argument, remove it. Recheck that the vnode passed is still VCHR and not reclaimed after the lock. Reviewed by: mav Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-28 15:57:01 +00:00
Konstantin Belousov	2174a0c607	Fix style and typo. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-28 15:49:51 +00:00
Mark Johnston	3ac8f842ea	De-pluralize "queues" where appropriate in the pagedaemon code. MFC after: 1 week	2016-07-27 17:11:03 +00:00
Alan Cox	a766ffd061	Update a comment to reflect r284376. MFC after: 3 days	2016-07-27 03:49:00 +00:00
Mark Johnston	44be0a8ea5	Correct a comment - each page queue has its own lock. Reviewed by: alc MFC after: 3 days	2016-07-23 21:03:25 +00:00
Mark Johnston	efe1ff4cf0	Update a comment in vm_page_advise() to match behaviour after r290529. Reviewed by: alc MFC after: 3 days	2016-07-23 21:02:36 +00:00
Alan Cox	8d67b8c863	Add a comment describing the 'fast path' that was introduced in r270011. Reviewed by: kib MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-07-20 17:20:22 +00:00
Mark Johnston	afa5d70339	Release the second critical section in uma_zfree_arg() slightly earlier. It is only needed when removing a full bucket from the per-CPU cache. The bucket cache (uz_buckets) is protected by the zone mutex and thus the critical section can be released before inserting into that list. MFC after: 1 week	2016-07-20 01:01:50 +00:00
Mark Johnston	20c58db95a	Make vm_pageout_wakeup_thresh a u_int rather than an int. It's a threshold for v_free_count, which is of type u_int. This also lets us get rid of a cast in vm_paging_needed(). Reviewed by: alc MFC after: 1 week	2016-07-20 00:09:22 +00:00
Alan Cox	0c3a489325	Break up vm_fault()'s implementation of the read-ahead and delete-behind optimizations into two distinct pieces. The first piece consists of the code that should only be performed once per page fault and requires the map to be locked. The second piece consists of the code that should be performed each time a pager is called on an object in the shadow chain. (This second piece expects the map to be unlocked.) Previously, the entire implementation could be executed multiple times. Moreover, the second and subsequent executions would occur with the map unlocked. Usually, the ensuing unsynchronized accesses to the map were harmless because the map was not changing. Nonetheless, it was possible for a use-after-free error to occur, where vm_fault() wrote to a freed map entry. This change corrects that problem. Reported by: avg Reviewed by: kib MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-07-18 04:20:26 +00:00
Konstantin Belousov	19efd8a5a8	In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called, if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate(). The vnode_destroy_object(), when calling into vm_object_terminate(), must be able to flush buffers. BO_DEAD purpose is to quickly destroy buffers on write when the underlying vnode is not operable any more (one example is the devfs node after geom is gone). Setting BO_DEAD for reclaiming vnode before object is terminated is premature, and results in inability to flush buffers with live SU dependencies from vinvalbuf() in vm_object_terminate(). Reported by: David Cross <dcrosstech@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-11 14:19:09 +00:00
Robert Watson	0df4264748	When mmap(2) is used with a vnode, capture vnode attributes in the audit trail. This was not required for Common Criteria auditing (which requires only that the intent to read or write be audited at the time of open(2)), but is useful for contemporary live analysis and forensics. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 11:49:10 +00:00
Robert Watson	51d1f69069	Audit file-descriptor arguments to I/O system calls such as read(2), write(2), dup(2), and mmap(2). This auditing is not required by the Common Criteria (and hence was not being performed), but is valuable in both contemporary live analysis and forensic use cases. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 08:04:02 +00:00
Alan Cox	381b724280	Change the type of the map entry's next_read field from a vm_pindex_t to a vm_offset_t. (This field is used to detect sequential access to the virtual address range represented by the map entry.) There are three reasons to make this change. First, a vm_offset_t is smaller on 32-bit architectures. Consequently, a struct vm_map_entry is now smaller on 32-bit architectures. Second, a vm_offset_t can be written atomically, whereas it may not be possible to write a vm_pindex_t atomically on a 32-bit architecture. Third, using a vm_pindex_t makes the next_read field dependent on which object in the shadow chain is being read from. Replace an "XXX" comment. Reviewed by: kib Approved by: re (gjb) Sponsored by: EMC / Isilon Storage Division	2016-07-07 20:58:16 +00:00
Colin Percival	34caa842a4	Autotune the number of pages set aside for UMA startup based on the number of CPUs present. On amd64 this unbreaks the boot for systems with 92 or more CPUs; the limit will vary on other systems depending on the size of their uma_zone and uma_cache structures. The major consumer of pages during UMA startup is the 19 zone structures which are set up before UMA has bootstrapped itself sufficiently to use the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 / 16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP, KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy more than one page, they will not share pages and the number of pages currently needed for startup is 19 * pages_per_zone + N, where N is the number of pages used for allocating other structures; on amd64 N = 3 at present (2 pages are allocated for UMA Kegs, and one page for UMA Hash). This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32, and if a zone structure does not fit into a single page sets boot_pages to UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which remains at 64). Consequently this patch has no effect on systems where the zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but increases boot_pages sufficiently on systems where the large number of CPUs makes this structure larger. It seems safe to assume that systems with 60+ CPUs can afford to set aside an additional 128kB of memory per 32 CPUs. The vm.boot_pages tunable continues to override this computation, but is unlikely to be necessary in the future. Tested on: EC2 x1.32xlarge Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring vm.boot_pages to be manually adjusted. Reviewed by: jeff, alc, adrian Approved by: re (kib)	2016-07-07 18:37:12 +00:00
Nathan Whitehorn	96c85efb4b	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
Konstantin Belousov	90880a1b29	Clarify the vnode_destroy_vobject() logic handling for already terminated objects. Assert that there is no new waiters for the already terminated objects. Old waiters should have been notified by the termination calling vnode_pager_dealloc() (old/new are with regard of the lock acquisition interval). Only clear the vp->v_object for the case of already terminated object, since other branches call vnode_pager_dealloc(), which should clear the pointer. Assert this. Tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-07-05 11:21:02 +00:00
Konstantin Belousov	3f1c66b8d2	Change type of the 'dead' variable to boolean. Requested by: alc MFC after: 1 week Approved by: re (gjb)	2016-07-03 00:08:17 +00:00
Konstantin Belousov	725441f69b	If the vm_fault() handler raced with the vm_object_collapse() sleepable scan, iteration over the shadow chain looking for a page could find an OBJ_DEAD object. Such state of the mapping is only transient, the dead object will be terminated and removed from the chain shortly. We must not return KERN_PROTECTION_FAILURE unless the object type is changed to OBJT_DEAD in the chain, indicating that paging on this address is really impossible. Returning KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered to processes, or kernel accesses to UVA spuriously failing with EFAULT. If the object with OBJ_DEAD flag is found, only return KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD. Otherwise, sleep a tick and retry the fault handling. Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by waiting until the paging on this object is finished. But to do so, we need to reference the dead object, while vm_object_collapse() insists on owning the final reference on the collapsed object. This could be fixed by e.g. changing the assert to shared reference release between vm_fault() and vm_object_collapse(), but it seems to be too much complications for rare boundary condition. PR: 204426 Tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation X-Differential revision: https://reviews.freebsd.org/D6085 MFC after: 2 weeks Approved by: re (gjb)	2016-06-27 21:54:19 +00:00
Konstantin Belousov	35e8002c58	In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no waiters exist, same as for vm_page_xunbusy(). If previous value of busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is not needed. Move common code from vm_page_xunbusy_maybelocked() and vm_page_xunbusy_hard() to vm_page_xunbusy_locked(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-23 08:28:13 +00:00
Konstantin Belousov	505cd5d13b	Add a comment noting locking regime for vm_page_xunbusy(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-23 08:27:38 +00:00
Konstantin Belousov	95e2409a33	Fix a LOR between vnode locks and allproc_lock. There is an order between covered vnode lock and allproc_lock, which is established by calling mountcheckdirs() while owning the covered vnode lock. mountcheckdirs() iterates over the processes, protected by allproc_lock. This order is needed and seems to be not avoidable. On the other hand, various VM daemons also need to iterate over all processes, and they lock and unlock user maps. Since unlock of the user map may trigger processing of the deferred map entries, it causes vnode locking to occur. Or, when vmspace is freed, dropping references on the vnode-backed object also locks vnodes. We get the reverse order compared with the mount/unmount order. For VM daemons, there is no need to own allproc_lock while we operate on vmspaces. If the process is held, it serves as the marker for allproc list, which allows to continue the iteration. Add _PHOLD_LITE() macro, similar to _PHOLD(), but not causing swap-in of the kernel stacks. It is used instead of _PHOLD() in vm code, since e.g. calling faultin() in OOM conditions only exaggerates the problem. Modernize comment describing PHOLD. Reported by: lists@yamagi.org Tested by: pho (previous version) Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6679	2016-06-22 20:15:37 +00:00
Konstantin Belousov	d3b9828d0d	The vmtotal sysctl handler marks active vm objects to calculate statistics. Marking is done by setting the OBJ_ACTIVE flag. The flags change is locked, but the problem is that many parts of system assume that vm object initialization ensures that no other code could change the object, and thus performed lockless. The end result is corrupted flags in vm objects, most visible is spurious OBJ_DEAD flag, causing random hangs. Avoid the active object marking, instead provide equally inexact but immutable is_object_alive() definition for the object mapped state. Avoid iterating over the processes mappings altogether by using arguably improved definition of the paging thread as one which sleeps on the v_free_count. PR: 204764 Diagnosed by: pho Tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2016-06-21 17:49:33 +00:00
Konstantin Belousov	eb4d6a1b3b	Fix inconsistent locking of the swap pager named objects list. Right now, all modifications of the list are locked by sw_alloc_mtx. But initial lookup of the object by the handle in swap_pager_alloc() is not protected by sw_alloc_mtx, which means that vm_pager_object_lookup() could follow freed pointer. Create a new named swap object with the OBJT_SWAP type, instead of OBJT_DEFAULT. With this change, swp_pager_meta_build() never needs to upgrade named OBJT_DEFAULT to OBJT_SWAP (in the other place, we do not forbid for client code to create named OBJT_DEFAULT objects at all). That change allows to remove sw_alloc_mtx and make the list locked by sw_alloc_sx lock. Update swap_pager_copy() to new locking mode. Create helper swap_pager_alloc_init() to consolidate named and anonymous swap objects creation, while a caller ensures that the necessary locks are held around the helper. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (hrs)	2016-06-13 03:42:46 +00:00
Konstantin Belousov	1571927369	Explicitly initialize sw_alloc_sx. Currently it is not initialized but works due to zeroed out bss on startup. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs)	2016-06-13 03:39:16 +00:00
Mark Johnston	0a1dc6e23c	Reset the page busy lock state after failing to insert into the object. Freeing a shared-busy page is not permitted. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6670	2016-06-02 17:11:24 +00:00
Mark Johnston	e705296958	Don't preserve the page's object linkage in vm_page_insert_after(). Per the KASSERT at the beginning of the function, we expect that the page does not belong to any object, so its object and pindex fields are meaningless. Reset them in the rare case that vm_radix_insert() fails. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6669	2016-06-02 16:58:47 +00:00
Mark Johnston	bc9d08e1cf	Fix memguard(9) in kernels with INVARIANTS enabled. With r284861, UMA zones use the trash ctor and dtor by default. This is incompatible with memguard, which frees the backing page when the item is freed. Modify the UMA debug functions to be no-ops if the item was allocated from memguard. This also fixes constructors such as mb_ctor_pack(), which invokes the trash ctor in addition to performing some initialization. Reviewed by: glebius MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D6562	2016-06-01 22:31:35 +00:00
Konstantin Belousov	e5f0191f20	If the fast path unbusy in vm_page_replace() fails, slow path needs to acquire the page lock, which recurses. Avoid the recursion by reusing the code from vm_page_remove() in a new helper vm_page_xunbusy_maybelocked(). Reviewed by: alc Sponsored by: The FreeBSD Foundation	2016-06-01 20:39:00 +00:00
Konstantin Belousov	9f790a1756	Do not leak the vm object lock when swap reservation failed, in vm_object_coalesce(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-29 15:46:19 +00:00
Alan Cox	56ce06907c	The flag "vm_pages_needed" has long served two distinct purposes: (1) to indicate that threads are waiting for free pages to become available and (2) to indicate whether a wakeup call has been sent to the page daemon. The trouble is that a single flag cannot really serve both purposes, because we have two distinct targets for when to wakeup threads waiting for free pages versus when the page daemon has completed its work. In particular, the flag will be cleared by vm_page_free() before the page daemon has met its target, and this can lead to the OOM killer being invoked prematurely. To address this problem, a new flag "vm_pageout_wanted" is introduced. Discussed with: jeff Reviewed by: kib, markj Tested by: markj Sponsored by: EMC / Isilon Storage Division	2016-05-27 19:15:45 +00:00
Alan Cox	bccdea450b	Use vm_page_replace_checked() instead of vm_page_rename() for implementing optimized copy-on-write faults. This has two advantages: (1) one less radix tree operation is performed and (2) vm_page_replace_checked() cannot fail, making the code simpler. Submitted by: Ryan Libby Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4478	2016-05-27 06:05:12 +00:00
Konstantin Belousov	aa9bc3b171	Prevent parallel object collapses. Both vm_object_collapse_scan() and swap_pager_copy() might unlock the object, which allows the parallel collapse to execute. Besides destroying the object, it also might move the reference from parent to the backing object, firing the assertion ref_count == 1. Collapses are prevented by bumping paging_in_progress counters on both the object and its backing object. Reported by: cem Tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D6085	2016-05-26 16:59:29 +00:00
Konstantin Belousov	98f139daef	Style changes to some most outrageous violations in vm_object_collapse(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-26 16:51:38 +00:00
Konstantin Belousov	0e38422096	In vm_page_cache(), only drop the vnode after radix insert failure for empty page cache when the object type is OBJT_VNODE. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-24 19:20:30 +00:00
Konstantin Belousov	30a8a5f7a6	In vm_page_alloc_contig(), on vm_page_insert() failure, mark each freed page as VPO_UNMANAGED. Otherwise vm_page_free_toq() insists on owning the page lock. Previously, VPO_UNMANAGED was only set up to the last processed page. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-24 10:21:39 +00:00
Konstantin Belousov	9a2047083f	Remove Giant around allocation of the swap pager with non-NULL handle. Existing issue of not protecting pager_object_list iteration in vm_pager_object_lookup() by sw_alloc_mtx is not affected by Giant removal. Reviewed by: alc Sponsored by: The FreeBSD Foundation	2016-05-24 10:16:03 +00:00
Alan Cox	10b4196bd0	Correct an error in a comment: One of the conditions for page allocation is actually the opposite of that stated in the comment. Remove an unnecessary assignment. Use an assertion to document the fact that no assignment is needed. Rewrite another comment to clarify that the page is not completely valid. Reviewed by: kib	2016-05-23 16:59:05 +00:00
Konstantin Belousov	4c36e917b2	Mark swap-related proc sysctls as not requiring Giant. Reviewed by: alc (as part of larger patch) Sponsored by: The FreeBSD Foundation	2016-05-22 23:28:23 +00:00
Konstantin Belousov	04533e1ef7	Replace hand-made exclusive lock, protecting against parallel swapon/swapoff invocations, with sx. Reviewed by: alc (as part of larger patch) Sponsored by: The FreeBSD Foundation	2016-05-22 23:25:01 +00:00
Konstantin Belousov	a525fd17cd	Remove false claim. Giant is dropped by mi_startup() before passing the control to swapper. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-05-22 19:25:53 +00:00
Alan Cox	6753423ccb	When descending a shadow chain of objects, it makes no sense to update the current offset (spelled: "fs.pindex") until it is known whether a backing object exists. In fact, if not for the fact that the backing object offset is zero when there is no backing object, this update would produce a broken offset. Reviewed by: kib	2016-05-21 23:18:23 +00:00
John Baldwin	cc981af204	Add new bus methods for mapping resources. Add a pair of bus methods that can be used to "map" resources for direct CPU access using bus_space(9). bus_map_resource() creates a mapping and bus_unmap_resource() releases a previously created mapping. Mappings are described by 'struct resource_map' object. Pointers to these objects can be passed as the first argument to the bus_space wrapper API used for bus resources. Drivers that wish to map all of a resource using default settings (for example, using uncacheable memory attributes) do not need to change. However, drivers that wish to use non-default settings can now do so without jumping through hoops. First, an RF_UNMAPPED flag is added to request that a resource is not implicitly mapped with the default settings when it is activated. This permits other activation steps (such as enabling I/O or memory decoding in a device's PCI command register) to be taken without creating a mapping. Right now the AGP drivers don't set RF_ACTIVE to avoid using up a large amount of KVA to map the AGP aperture on 32-bit platforms. Once RF_UNMAPPED is supported on all platforms that support AGP this can be changed to using RF_UNMAPPED with RF_ACTIVE instead. Second, bus_map_resource accepts an optional structure that defines additional settings for a given mapping. For example, a driver can now request to map only a subset of a resource instead of the entire range. The AGP driver could also use this to only map the first page of the aperture (IIRC, it calls pmap_mapdev() directly to map the first page currently). I will also eventually change the PCI-PCI bridge driver to request mappings of the subset of the I/O window resource on its parent side to create mappings for child devices rather than passing child resources directly up to nexus to be mapped. 
This also permits bridges that do address translation to request suitable mappings from a resource on the "upper" side of the bus when mapping resources on the "lower" side of the bus. Another attribute that can be specified is an alternate memory attribute for memory-mapped resources. This can be used to request a Write-Combining mapping of a PCI BAR in an MI fashion. (Currently the drivers that do this call pmap_change_attr() directly for x86 only.) Note that this commit only adds the MI framework. Each platform needs to add support for handling RF_UNMAPPED and the new bus_map/unmap_resource methods. Generally speaking, any drivers that are calling rman_set_bustag() and rman_set_bushandle() need to be updated. Discussed on: arch Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D5237	2016-05-20 17:57:47 +00:00
Alan Cox	521ddf39cb	Clean up the handling of errors from vm_pager_get_pages(). Mostly, this cleanup consists of fixes to comments. However, there is one change to code: Remove special-case handling of errors involving the kernel map. We do not perform I/O on the kernel map, so there is no need for this special case. Reviewed by: kib (an earlier version)	2016-05-19 19:27:33 +00:00
Conrad Meyer	5a2e650a36	vm/vm_page.h: Fix trivial '-Wpointer-sign' warning pq_vcnt, as a count of real things, has no business being negative. It is only ever initialized by a u_int counter. The warning came from the atomic_add_int() in vm_pagequeue_cnt_add(). Rectify the warning by changing the variable to u_int. No functional change. Suggested by: Clang 3.3 Sponsored by: EMC / Isilon Storage Division	2016-05-19 17:54:14 +00:00
Konstantin Belousov	2a339d9e3d	Add implementation of robust mutexes, hopefully close enough to the intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013. A robust mutex is guaranteed to be cleared by the system upon either thread or process owner termination while the mutex is held. The next mutex locker is then notified about inconsistent mutex state and can execute (or abandon) corrective actions. The patch mostly consists of small changes here and there, adding necessary checks for the inconsistent and abandoned conditions into existing paths. Additionally, the thread exit handler was extended to iterate over the userspace-maintained list of owned robust mutexes, unlocking and marking as terminated each of them. The list of owned robust mutexes cannot be maintained atomically synchronous with the mutex lock state (it is possible in kernel, but is too expensive). Instead, for the duration of lock or unlock operation, the current mutex is remembered in a special slot that is also checked by the kernel at thread termination. Kernel must be aware about the per-thread location of the heads of robust mutex lists and the current active mutex slot. When a thread touches a robust mutex for the first time, a new umtx op syscall is issued which informs about location of lists heads. The umtx sleep queues for PP and PI mutexes are split between non-robust and robust. Somewhat unrelated changes in the patch: 1. Style. 2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared pi mutexes. 3. Removal of the userspace struct pthread_mutex m_owner field. 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls the lifetime of the shared mutex associated with a vnode's page. Reviewed by: jilles (previous version, supposedly the objection was fixed) Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects) Tested by: pho Sponsored by: The FreeBSD Foundation	2016-05-17 09:56:22 +00:00
John Baldwin	cf8cdda37c	Move vm_domain_rr_selectdomain() under #ifdef VM_NUMA_ALLOC. The function had a null function body in the !VM_NUMA_ALLOC case but also wasn't called in the !VM_NUMA_ALLOC case. Suggested by: ngie	2016-05-10 22:25:55 +00:00
Pedro F. Giffuni	763df3ec55	sys/vm: minor spelling fixes in comments. No functional change.	2016-05-02 20:16:29 +00:00
Konstantin Belousov	e9d37c9f8d	Avoid duplicated calls to pmap_page_get_memattr(). Avoid logging inconsistency for the /dev/mem device at all. The driver leaves memattr intact, and the corrective action in the device pager handles it right. In the logged warning, name the driver we blame, and show memory attributes values. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D6149	2016-05-01 17:48:43 +00:00
John Baldwin	0ef149024f	Don't require write locks on the VM object for vm_page_prev/next. Reviewed by: kib Sponsored by: Chelsio Communications	2016-04-29 17:35:28 +00:00
John Baldwin	164a37a55a	Trim redundant message. WITNESS_WARN() appends "with non-sleepable lock" to the caller's message. Sponsored by: Chelsio Communications	2016-04-27 21:51:24 +00:00
Pedro F. Giffuni	b66bb393f2	Cleanup redundant parenthesis from existing howmany()/roundup() macro uses.	2016-04-22 16:57:42 +00:00
Pedro F. Giffuni	d9c9c81c08	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.	2016-04-21 19:57:40 +00:00
Pedro F. Giffuni	8dfea46460	Remove slightly used const values that can be replaced with nitems(). Suggested by: jhb	2016-04-21 15:38:28 +00:00
John Baldwin	62d70a8174	Add more fine-grained kernel options for NUMA support. VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in the virtual memory system. DEVICE_NUMA is used to enable affinity reporting for devices such as bus_get_domain(). MAXMEMDOM must still be set to a value greater than 1 for any NUMA support to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is enabled and the system supports NUMA. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5782	2016-04-09 13:58:04 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Gleb Smirnoff	cfcae3f86f	Remove UMA_ZONE_REFCNT feature, now unused. Blessed by: jeff	2016-03-01 00:33:32 +00:00
Konstantin Belousov	1bdbd70599	Implement process-shared locks support for libthr.so.3, without breaking the ABI. Special value is stored in the lock pointer to indicate shared lock, and offline page in the shared memory is allocated to store the actual lock. Reviewed by: vangyzen (previous version) Discussed with: deischen, emaste, jhb, rwatson, Martin Simmons <martin@lispworks.com> Tested by: pho Sponsored by: The FreeBSD Foundation	2016-02-28 17:52:33 +00:00
Gleb Smirnoff	b28cc462ad	Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a requirement for uma_int.h. Suggested by: jhb	2016-02-09 20:22:35 +00:00
Mark Johnston	3f38e13073	Plug a vm_page leak introduced in r292373. Reported by: vangyzen	2016-02-05 19:35:53 +00:00
Gleb Smirnoff	e60b2fcbeb	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl	2016-02-03 23:30:17 +00:00
Gleb Smirnoff	9542ea7b80	Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows to make uma_dbg.h not depend on uma_int.h, which allows to uninclude uma_int.h from the mbuf(9) allocator.	2016-02-03 22:02:36 +00:00
Konstantin Belousov	ba7c64d17b	Typo in comment.	2016-01-24 13:38:41 +00:00
John Baldwin	8a4dc40ff4	Various cleanups to the main function for AIO kernel processes: - Pull the vmspace logic out into helper functions and reduce duplication. Operations on the vmspace are all isolated to vm_map.c, but it now exports a new 'vmspace_switch_aio' for use by AIO kernel processes. - When an AIO kernel process wants to exit, break out of the main loop and perform cleanup after the loop end. This reduces a lot of indentation and allows cleanup to more closely mirror setup actions before the loop starts. - Convert a DIAGNOSTIC to KASSERT(). - Replace mycp with more typical 'p'. Reviewed by: kib Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D4990	2016-01-19 21:37:51 +00:00
Alan Cox	477bffbe4d	A fix to r292469: Iterate over the physical segments in descending rather than ascending order in vm_phys_alloc_contig() so that, for example, a sequence of contigmalloc(low=0, high=4GB) calls doesn't exhaust the supply of low physical memory resulting in a later contigmalloc(low=0, high=1MB) failure. Reported by: cy Tested by: cy Sponsored by: EMC / Isilon Storage Division	2016-01-16 04:41:40 +00:00
Adrian Chadd	54de56f3b2	Fix the domain iterator to not try the first-touch / fixed domain more than once when doing round-robin. This led to a panic because the iterator was trying the same domain twice and not trying one of the other domains. Reported by: pho Tested by: pho	2016-01-10 17:53:43 +00:00
Konstantin Belousov	ff64a90ed9	Add missed relpbuf() for a smallfs page-in. Reported by: Shawn Webb Tested by: pho Sponsored by: The FreeBSD Foundation	2015-12-27 14:42:39 +00:00
Jonathan T. Looney	54503a13d8	Add a safety net to reclaim mbufs when one of the mbuf zones becomes exhausted. It is possible for a bug in the code (or, theoretically, even unusual network conditions) to exhaust all possible mbufs or mbuf clusters. When this occurs, things can grind to a halt fairly quickly. However, we currently do not call mb_reclaim() unless the entire system is experiencing a low-memory condition. While it is best to try to prevent exhaustion of one of the mbuf zones, it would also be useful to have a mechanism to attempt to recover from these situations by freeing "expendable" mbufs. This patch makes two changes: a) The patch adds a generic API to the UMA zone allocator to set a function that should be called when an allocation fails because the zone limit has been reached. Because of the way this function can be called, it really should do minimal work. b) The patch uses this API to try to free mbufs when an allocation fails from one of the mbuf zones because the zone limit has been reached. The function schedules a callout to run mb_reclaim(). Differential Revision: https://reviews.freebsd.org/D3864 Reviewed by: gnn Comments by: rrs, glebius MFC after: 2 weeks Sponsored by: Juniper Networks	2015-12-20 02:05:33 +00:00
Alan Cox	c869e67208	Introduce a new mechanism for relocating virtual pages to a new physical address and use this mechanism when: 1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical memory allocator's free page lists. This replaces the long-standing approach of scanning the active and inactive queues, converting clean pages into PG_CACHED pages and laundering dirty pages. In contrast, the new mechanism does not use PG_CACHED pages nor does it trigger a large number of I/O operations. 2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find free pages in the physical memory allocator's free page lists that are covered by the direct map. Tested by: adrian 3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable free pages in the physical memory allocator's free page lists. In the coming months, I expect that this new mechanism will be applied in other places. For example, balloon drivers should use relocation to minimize fragmentation of the guest physical address space. Make vm_phys_alloc_contig() a little smarter (and more efficient in some cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free page lists that can't possibly contain suitable pages. Reviewed by: kib, markj Glanced at: jhb Discussed with: jeff Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4444	2015-12-19 18:42:50 +00:00
Conrad Meyer	8170d6e52b	vm_page_replace: add wrapper to KASSERT about old page It turns out the callers of vm_page_replace know exactly which page they are replacing and would like to assert about it. Change those from hard panics to KASSERTs, and provide them with a wrapper so they don't have to deal with warnings from an INVARIANTS-dependent dead store of the return value of vm_page_replace. Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: alc, kib (earlier version) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4497	2015-12-17 17:48:57 +00:00
Conrad Meyer	dc62d55929	vm_page.h: page busy macro fixups Minor changes to: - delete extraneous trailing semicolons from macro definitions, and - correct spelling of "busying" in panic messages Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4577	2015-12-16 23:23:12 +00:00
Gleb Smirnoff	b0cd20172d	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strengthening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-12-16 21:30:45 +00:00
Mark Johnston	d9e2e68d38	Don't make assertions about td_critnest when the scheduler is stopped. A panicking thread always executes with a critical section held, so any attempt to allocate or free memory while dumping will otherwise cause a second panic. This can occur, for example, if xpt_polled_action() completes non-dump I/O that was pending at the time of the panic. The fact that this can occur is itself a bug, but asserting in this case does little but reduce the reliability of kernel dumps. Suggested by: kib Reported by: pho	2015-12-11 20:05:07 +00:00
Conrad Meyer	5e09bdc821	vm_page_replace: remove redundant radix lookup Remove redundant lookup of the old page from vm_page_replace. Verification that the old page exists is already done by vm_radix_replace. Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Follow-up to: https://reviews.freebsd.org/D4326 Differential Revision: https://reviews.freebsd.org/D4471	2015-12-10 22:57:27 +00:00
Conrad Meyer	6fee422ed5	vm_fault_hold: handle vm_page_rename failure On vm_page_rename failure, fix a missing object unlock and a double free of a page. First remove the old page, then rename into other page into first_object, then free the old page. This avoids the problem on rename failure. This is a little ugly but seems to be the most straightforward solution. Tested with: $ sysctl debug.fail_point.uma_zalloc_arg="1%return" $ kyua test -k /usr/tests/sys/Kyuafile Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: kib Seen by: alc Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4326	2015-12-06 17:46:12 +00:00
Conrad Meyer	4cc8daf782	Pull vm_object_scan_all_shadowed out of vm_object_backing_scan These two functions were largely unrelated, they just used the same loop logic to walk through a backing object's memq. Pull out the all_shadowed test as its own function and eliminate OBSC_TEST_ALL_SHADOWED. Rename vm_object_backing_scan to vm_object_collapse_scan. No functional change. Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4335	2015-12-03 17:21:10 +00:00
Konstantin Belousov	99a1570a25	r221714 fixed the situation when the collapse scan improperly handled invalid (busy) page supposedly inserted by the vm_fault(), in the OBSC_COLLAPSE_NOWAIT case. As a continuation to r221714, fix a case when invalid page is found by the object scan in OBSC_COLLAPSE_WAIT case as well. But, since this is waitable scan, we should wait for the termination of the busy state and restart from the beginning of the backing object's page queue. [] Do not free the shadow page swap space when the parent page is invalid, otherwise this action potentially corrupts user data. Combine all instances of the collapse scan sleep code fragments into the new helper vm_object_backing_scan_wait(). Improve style compliance and comments. Change the return type of vm_object_backing_scan() to bool. Initial submission by: cem, https://reviews.freebsd.org/D4103 [] Reviewed by: alc, cem Tested by: cem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D4146	2015-12-01 09:06:09 +00:00
Konstantin Belousov	b89def80f5	Minor cleanup. Systematically use ANSI C functions definitions. Correct type of the flags argument to the dev_pager_putpages() function. Use vm_pager_free_nonreq(). Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-11-29 11:37:25 +00:00
Konstantin Belousov	2eb2f0d5e3	In vm_pageout_grow_cache(), do not re-try the inactive queue when active queue scan initiated write. Re-trying from the inactive queue when doing active scan makes the loop never end if number of domains is greater than 1 and inactive or active scan cannot reach the target. Reported and tested by: Andrew Gallatin <gallatin@netflix.com> Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-11-27 19:43:36 +00:00
Alan Cox	67b7e4345e	Correct an error in vm_reserv_reclaim_contig(). In the highly unusual case that the reservation contained "low", the starting position in the popmap for the free page search was incorrectly calculated. The most likely (and visible) symptom of this error was the assertion failure, "vm_reserv_reclaim_contig: pa is too low".	2015-11-26 19:12:18 +00:00
Konstantin Belousov	9af50b0126	Record proper commit message for r291157. The r289895 revision did not account for the block containing the requested page, when calculating the run of pages. Include the pages before/after the requested page, that fit into the reqblock, into the calculation. Noted by: glebius Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-11-22 09:50:13 +00:00
Konstantin Belousov	4586820a07	Noted by: glebius Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-11-22 09:48:03 +00:00
Mark Johnston	7672ca059a	Remove unneeded includes of opt_kdtrace.h. As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h is not needed when defining SDT(9) probes.	2015-11-22 02:01:01 +00:00
Gleb Smirnoff	09c837b897	Remove remnants of the old NFS from vnode pager. Reviewed by: kib Sponsored by: Netflix	2015-11-20 23:52:27 +00:00
Jonathan T. Looney	1067a2ba68	Consistently enforce the restriction against calling malloc/free when in a critical section. uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the zone. The malloc() family of functions may call uma_zalloc_arg() or uma_zalloc_free(). The malloc(9) man page currently claims that free() will never sleep. It also implies that the malloc() family of functions will not sleep when called with M_NOWAIT. However, it is more correct to say that these functions will not sleep indefinitely. Indeed, they may acquire a sleepable lock. However, a developer may overlook this restriction because the WITNESS check that catches attempts to call the malloc() family of functions within a critical section is inconsistently applied. This change clarifies the language of the malloc(9) man page to clarify the restriction against calling the malloc() family of functions while in a critical section or holding a spin lock. It also adds KASSERTs at appropriate points to make the enforcement of this restriction more consistent. PR: 204633 Differential Revision: https://reviews.freebsd.org/D4197 Reviewed by: markj Approved by: gnn (mentor) Sponsored by: Juniper Networks	2015-11-19 14:04:53 +00:00
Konstantin Belousov	76386c7ecd	Rework the test which raises OOM condition. Right now, the code checks for the swap space consumption plus checks that the amount of the free pages exceeds some limit, in case pagedaemon did not cope with the page shortage in one of the late passes. This is wrong because it does not account for the presence of the reclaimable pages in the queues which are not selectable for reclaim immediately. E.g., on the swap-less systems, large active queue easily triggered OOM. Instead, only raise OOM when pagedaemon is unable to produce a free page in several back-to-back passes. Track the failed passes per pagedaemon thread. The number of passes to trigger OOM was selected empirically and tested both on small (32M-64M i386 VM) and large (32G amd64) configurations. If the specifics of the load require tuning, sysctl vm.pageout_oom_seq sets the number of back-to-back passes which must fail before OOM is raised. Each pass takes 1/2 of a second. The lower the value, the more sensitive the pagedaemon is to the page shortage. In future, some heuristic to calculate the value of the tunable might be designed based on the system configuration and load. But before it can be done, the i/o system must be fixed to reliably time-out pagedaemon writes, even if waiting for the memory to proceed. Then, code can account for the in-flight page-outs and postpone OOM until all of them finished, which should reduce the need in tuning. Right now, ignoring the in-flight writes and the counter allows to break deadlocks due to write path doing sleepable memory allocations. Reported by: Dmitry Sivachenko, bde, many others Tested by: pho, bde, tuexen (arm) Reviewed by: alc Discussed with: bde, imp Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-11-16 06:26:26 +00:00
Konstantin Belousov	3949873f7a	Do not use vmspace_resident_count() for the OOM process selection. Residency count tracks the number of pte entries installed into the current pmap, which does not reflect the consumption of the physical memory by the address map. Due to several mechanisms like pv entries reclamation, copy on write etc. the resident pte entries count may be much less than the amount of physical memory kept by the process. Provide the OOM-specific vm_pageout_oom_pagecount() function which estimates the amount of reclaimable memory which could be stolen if the process is killed. Reported and tested by: pho Reviewed by: alc Comments text by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-11-16 06:02:11 +00:00
Konstantin Belousov	b98acc0a1b	VM daemon works in parallel with the pagedaemon threads, and, among other actions, swaps out kernel stacks of the processes. On the other hand, current OOM logic which selects a process to kill in the critical condition, skips process with swapped-out thread. Under some loads, this results in the big(gest) process being ignored by OOM. Do not skip a process which has inhibited thread due to the swap-out, in the OOM selection loop. Note that killing such process requires the thread stack page-in, but sometimes this is the only way to recover. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-11-16 05:52:04 +00:00
John Baldwin	645743ea99	Export various helper variables describing the layout and size of certain kernel structures for use by debuggers. This mostly aids in examining cores from a kernel without debug symbols as a debugger can infer these values if debug symbols are available. One set of variables describes the layout of 'struct linker_file' to walk the list of loaded kernel modules. A second set of variables describes the layout of 'struct proc' and 'struct thread' to walk the list of processes in the kernel and the threads in each process. The 'pcb_size' variable is used to index into the stoppcbs[] array. The 'vm_maxuser_address' is used to distinguish kernel virtual addresses from user addresses. This doesn't have to be perfect, and 'vm_maxuser_address' is a cheap and simple way to differentiate kernel pointers from simple values like TIDs and PIDs. While here, annotate the fields in struct pcb used by kgdb on amd64 and i386 to note that their ABI should be preserved. Annotations for other platforms will be added in the future. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3773	2015-11-12 22:00:59 +00:00
Mark Johnston	7e78597f04	Ensure that deactivated pages that are not expected to be reused are reclaimed in FIFO order by the pagedaemon. Previously we would enqueue such pages at the head of the inactive queue, yielding a LIFO reclaim order. Reviewed by: alc MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2015-11-08 01:36:18 +00:00
Konstantin Belousov	eac91e326a	Reduce the amount of calls to VOP_BMAP() made from the local vnode pager. It is enough to execute VOP_BMAP() once to obtain both the disk block address for the requested page, and the before/after limits for the contiguous run. The clipping of the vm_page_t array passed to the vnode_pager_generic_getpages() and the disk address for the first page in the clipped array can be deduced from the call results. While there, remove some noise (like if (1) {...}) and adjust nearby code. Reviewed by: alc Discussed with: glebius Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-10-24 21:59:22 +00:00
Jason A. Harmening	7c989c156f	Fix capitalization	2015-10-23 12:06:06 +00:00
Jason A. Harmening	a50730587b	Remove unclear comment about address truncation in busdma. Add (hopefully much clearer) comment at declaration of PHYS_TO_VM_PAGE(). Noted by: avg	2015-10-23 12:03:25 +00:00
Konstantin Belousov	69b8585e79	Only marker is guaranteed to be present on the queue after the relock in vm_pageout_fallback_object_lock() and vm_pageout_page_lock(). The check for the m->queue == queue assumes that the page does belong to a queue. Modify the 'unchanged' calculation by dereferencing the marker tailq pointers, which is known to belong to the queue. Since for a page m linked to the queue, m->queue must be equal to the queue index, assert this instead of checking. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 2 weeks	2015-10-18 09:33:28 +00:00
Konstantin Belousov	8748f58cde	Revert r289302, invalid pages can be queued, e.g. by vfs_vmio_unwire(). Found by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2015-10-15 19:07:38 +00:00
Konstantin Belousov	12a73f207a	Invalid pages should not appear on the inactive queue. Change the check into an assertion. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2015-10-14 09:03:32 +00:00
Jeff Roberson	21fae96123	Parallelize the buffer cache and rewrite getnewbuf(). This results in a 8x performance improvement in a micro benchmark on a 4 socket machine. - Get buffer headers from a per-cpu uma cache that sits in front of the free queue. - Use a per-cpu quantum cache in vmem to eliminate contention for kva. - Use multiple clean queues according to buffer cache size to eliminate clean queue lock contention. - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers from blocking or doing direct recycling. - Close some bufspace allocation races that could lead to endless recycling. - Further the transition to a more modern style of small functions grouped by prefix in order to improve growing complexity. Sponsored by: EMC / Isilon Reviewed by: kib Tested by: pho	2015-10-14 02:10:07 +00:00
Alan Cox	e595970add	Exploit r288122 to avoid pointlessly enqueueing a page that is about to be freed. Submitted by: kmacy Differential Revision: https://reviews.freebsd.org/D1674	2015-10-09 03:38:58 +00:00
Alan Cox	27e9ed8a5a	Exploit r288122 to address a cosmetic issue. Pages belonging to either the kernel or kmem object can't be paged out. Since they can't be paged out, they are never enqueued in a paging queue. Nonetheless, passing PQ_INACTIVE to vm_page_unwire() in kmem_unback() creates the appearance that these pages are being enqueued in the inactive queue. As of r288122, we can avoid giving this false impression by passing PQ_NONE. Submitted by: kmacy Differential Revision: https://reviews.freebsd.org/D1674	2015-10-06 05:49:00 +00:00
Warner Losh	d635a37ffa	Mark swap_pager_putpages static at its definition. It was already static at its declaration. Remove needless swapdev_strategy forward declaration. MFC After: 3 days	2015-10-05 21:29:17 +00:00
Alan Cox	bc7275964c	Reduce the scope of a variable to the only file where it is used.	2015-10-03 19:27:52 +00:00
Mark Johnston	3138cd3670	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726	2015-09-30 23:06:29 +00:00
Alan Cox	9e829b2272	The conversion of kmem_alloc_attr() from operating on a vm map to a vmem arena in r254025 introduced a bug in the case when an allocation is only partially successful. Specifically, the vm object lock was not being acquired before freeing the allocated pages. To address this bug, replace the existing code by a call to kmem_unback(). Change the type of a variable in kmem_alloc_attr() so that an allocation of two or more gigabytes won't fail. Replace the error handling code in kmem_back() by a call to kmem_unback(). Reviewed by: kib (an earlier version) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-09-26 22:57:10 +00:00
Alan Cox	087a613247	Exploit r288122 to address a cosmetic issue. Since the pages allocated by noobj_alloc() don't belong to a vm object, they can't be paged out. Since they can't be paged out, they are never enqueued in a paging queue. Nonetheless, passing PQ_INACTIVE to vm_page_unwire() creates the appearance that these pages are being enqueued in the inactive queue. As of r288122, we can avoid giving this false impression by passing PQ_NONE. Submitted by: kmacy Differential Revision: https://reviews.freebsd.org/D1674	2015-09-26 17:45:10 +00:00
Alan Cox	15aaea7892	Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified queue and (2) returns a Boolean indicating whether the page's wire count transitioned to zero. Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing a page that is about to be freed. (An earlier version of this change was developed by attilio@ and kmacy@. Any errors in this version are my own.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-22 18:16:52 +00:00
Alan Cox	d9347bca9a	Correct a non-fatal error in vm_pageout_worker(). vm_pageout_worker() should not assume that vm_pages_needed will remain set while it sleeps. Other threads can clear vm_pages_needed by performing a sufficient number of vm_page_free() calls, e.g., process termination. The effect of this error was that vm_pageout_worker() would free and/or launder pages when, in fact, there was no shortage of free pages. Rewrite a nearby comment to describe all of the possible cases and not just the most common case. The problem being that the comment made the most common case seem like the only case. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-09-20 19:20:03 +00:00
Alan Cox	c9af644e5c	Eliminate (many) unnecessary calls to pmap_remove_all(). Pages from objects with a reference count of zero can't possibly be mapped, so there is never a need for vm_page_set_invalid() to call pmap_remove_all() on them. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-09-17 22:28:38 +00:00
Mark Johnston	d73ce4c698	Remove the v_cache_min and v_cache_max sysctls. They are unused and have no effect. Reviewed by: alc Sponsored by: EMC / Isilon Storage Division	2015-09-11 03:00:20 +00:00
Konstantin Belousov	b8db977617	Remove a check which caused spurious SIGSEGV on usermode access to the mapped address without valid pte installed, when parallel wiring of the entry happens. The entry must be copy on write. If entry is COW but was already copied, and parallel wiring set MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the MAP_ENTRY_IN_TRANSITION flag to clear. After that, the fault handler is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over the check. Note that this is a race, if the address is accessed after the wiring is done, the entry does not fault at all. There is no reason in the current kernel to disallow write access to the COW wired entry if the entry permissions allow it. Initially this was done in r24666, since that kernel did not support proper copy-on-write for wired text, which was fixed in r199869. The r251901 revision re-introduced the r24666 fix for the current VM. Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by performing COW. In reverse, when MAP_ENTRY_NEEDS_COPY is set in vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared. Put the assert stating the invariant, instead of returning the error. Reported and debugging help by: peter Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-09 06:19:33 +00:00
Warner Losh	9e3e3fe5b3	The swap pager is compatible with direct dispatch. It does its own locking and doesn't sleep. Flag the consumer we create as such. In addition, decrement the in flight index when we have an out of memory error after having incremented it previously. This would have prevented swapoff from working if the swap pager ever hit a resource shortage trying to swap out something (the swap in path always waits for a bio, so won't have this issue). Simplify the close logic by abandoning the use of private and initializing the index to 1 and dropping that reference when we previously set private. Also, set sw_id only while sw_dev_mtx is held. This should only affect swapping to a vnode, as opposed to a geom whose close always sets it to NULL with sw_dev_mtx held. Differential Review: https://reviews.freebsd.org/D3547	2015-09-08 17:47:56 +00:00
Alan Cox	27a9fb2fc2	To simplify upcoming changes to the inactive queue scan, change the code so that there is only one place where pages are freed and only one place where pages are moved to the tail of the queue. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-08 04:18:57 +00:00
Alan Cox	960810ccea	Eliminate pointless requeueing of pages from terminated objects. These pages will have left the inactive queue before the page daemon performs its next scan. Also, ignore references to pages from terminated objects. This allows the clean pages to be freed a little sooner. Move some comments to their proper place, i.e., next to the code that they describe, and update other nearby comments. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-05 17:34:49 +00:00
Mateusz Guzik	19c591bfe2	Don't trash memory from UMA_ZONE_NOFREE zones. Objects obtained from such zones are supposed to retain type stability, which was violated by aforementioned trashing. This is a follow-up to r284861. Discussed with: kib	2015-09-02 23:09:01 +00:00
Alan Cox	a3aeedabb4	Handle held pages earlier in the inactive queue scan. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-01 06:21:12 +00:00
Mark Johnston	c25fabea97	Remove weighted page handling from vm_page_advise(). This was added in r51337 as part of the implementation of madvise(MADV_DONTNEED). Its objective was to ensure that the page daemon would eventually reclaim other unreferenced pages (i.e., unreferenced pages not touched by madvise()) from the active queue. Now that the pagedaemon performs steady scanning of the active page queue, this weighted handling is unnecessary. Instead, always "cache" clean pages by moving them to the head of the inactive page queue. This simplifies the implementation of vm_page_advise() and eliminates the fragmentation that resulted from the distribution of pages among multiple queues. Suggested by: alc Reviewed by: alc Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3401	2015-08-28 00:44:17 +00:00
Alan Cox	40aa80a7c2	In vm_pageout_scan(), simplify the logic for determining if a page can be paged out and apply some nearby style fixes. In collaboration with: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation, EMC / Isilon Storage Division	2015-08-27 20:38:45 +00:00
Alan Cox	eb5d39694e	Testing whether a page is dirty does not require the page lock. Moreover, it may involve a pmap operation that iterates over the page's PV list, so unnecessarily holding the page lock is undesirable. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-08-25 01:01:25 +00:00
Mark Murray	e866d8f05b	Make the UMA harvesting go away completely if not wanted. Default to "not wanted". Provide and document the RANDOM_ENABLE_UMA option. Change RANDOM_FAST to RANDOM_UMA to clarify the harvesting. Remove RANDOM_DEBUG option, replace with SDT probes. These will be of use to folks measuring the harvesting effect when deciding whether to use RANDOM_ENABLE_UMA. Requested by: scottl and others. Approved by: so (/dev/random blanket) Differential Revision: https://reviews.freebsd.org/D3197	2015-08-22 12:59:05 +00:00
Alan Cox	77923df2c1	Eliminate pointless assignments to rtvals[] in swap_pager_putpages(). Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-08-21 17:00:39 +00:00
Ryan Stone	a6bf3a9ef6	Prevent ticks rollover from preventing vm_lowmem event Currently vm_pageout_scan() uses a ticks-based scheme to rate-limit the number of times that the vm_lowmem event will happen. However if no events happen for long enough for ticks to roll over, this leaves us in a long window in which vm_lowmem events will not happen. Replace the use of ticks with time_t to prevent rollover from ever being an issue. Reviewed by: ian MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3439	2015-08-20 20:28:51 +00:00
Andrew Turner	52afd687c3	Add the kernel support for minidumps on arm64. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3318	2015-08-20 12:49:56 +00:00
Alan Cox	f9b11500c2	As another piece of PG_CACHE page elimination, remove an LRU-defeating call to vm_page_try_to_cache() from vm_pageout_flush(). Other changes, most recently r286814, have made this call unnecessary. Reviewed by: kib Discussed with: jeff Tested by: pho Sponsored by: EMC / Isilon Storage Division	2015-08-16 17:07:53 +00:00
Konstantin Belousov	edc8222303	Make kstack_pages a tunable on arm, x86, and powerpc. On i386, the initial thread stack is not adjusted by the tunable, the stack is allocated too early to get access to the kernel environment. See TD0_KSTACK_PAGES for the thread0 stack sizing on i386. The tunable was tested on x86 only. From the visual inspection, it seems that it might work on arm and powerpc. The arm USPACE_SVC_STACK_TOP and powerpc USPACE macros seem to be already incorrect for the threads with non-default kstack size. I only changed the macros to use variable instead of constant, since I cannot test. On arm64, mips and sparc64, some static data structures are sized by KSTACK_PAGES, so the tunable is disabled. Sponsored by: The FreeBSD Foundation MFC after: 2 week	2015-08-10 17:18:21 +00:00
Zbigniew Bodek	9ba30bcb42	Avoid sign extension of value passed to kva_alloc from uma_zone_reserve_kva Fixes "panic: vm_radix_reserve_kva: unable to reserve KVA" caused by sign extension of "pages * UMA_SLAB_SIZE" value passed to kva_alloc() which takes unsigned long argument. In the erroneous case that triggered this bug, the number of pages to allocate in uma_zone_reserve_kva() was 0x8ebe6, that gave the total number of bytes to allocate equal to 0x8ebe6000 (int). This was then sign extended in kva_alloc() to 0xffffffff8ebe6000 (unsigned long). Reviewed by: alc, kib Submitted by: Zbigniew Bodek <zbb@semihalf.com> Obtained from: Semihalf Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3346	2015-08-10 17:16:49 +00:00
Alan Cox	e0a63baae4	Introduce a sysctl for reporting the number of fully populated reservations.	2015-08-06 21:27:50 +00:00
Jason A. Harmening	0a3e154709	Properly sort the function declarations added in r286296 Submitted by: alc Approved by: kib (mentor)	2015-08-05 10:48:32 +00:00
Jason A. Harmening	713841afb2	Add two new pmap functions: vm_offset_t pmap_quick_enter_page(vm_page_t m) void pmap_quick_remove_page(vm_offset_t kva) These will create and destroy a temporary, CPU-local KVA mapping of a specified page. Guarantees: --Will not sleep and will not fail. --Safe to call under a non-sleepable lock or from an ithread Restrictions: --Not guaranteed to be safe to call from an interrupt filter or under a spin mutex on all platforms --Current implementation does not guarantee more than one page of mapping space across all platforms. MI code should not make nested calls to pmap_quick_enter_page. --MI code should not perform locking while holding onto a mapping created by pmap_quick_enter_page The idea is to use this in busdma, for bounce buffer copies as well as virtually-indexed cache maintenance on mips and arm. NOTE: the non-i386, non-amd64 implementations of these functions still need review and testing. Reviewed by: kib Approved by: kib (mentor) Differential Revision: http://reviews.freebsd.org/D3013	2015-08-04 19:46:13 +00:00
Alan Cox	d8015db3b5	Refinements to r281079's sequential access optimization: Prefetched pages, which constitute the majority of the pages that are processed by vm_fault_dontneed(), are already near the tail of the inactive queue. Only the pages at faulting virtual addresses are actually moved by vm_page_advise(..., MADV_DONTNEED). However, vm_page_advise(..., MADV_DONTNEED) is simultaneously too aggressive and passive for the moved pages. It makes most of these pages too easily reclaimable, and at the same time it leaves enough pages in the active queue to trigger pageouts by the page daemon. Instead, with this change, the pages at faulting virtual addresses are moved to the tail of the inactive queue, where they are relatively close to the pages prefetched by the same page fault. Discussed with: jeff Sponsored by: EMC / Isilon Storage Division	2015-08-03 20:30:27 +00:00
Konstantin Belousov	6a875bf929	Do not pretend that vm_fault(9) supports unwiring the address. Rename the VM_FAULT_CHANGE_WIRING flag to VM_FAULT_WIRE. Assert that the flag is only passed when faulting on the wired map entry. Remove the vm_page_unwire() call, which should be never reachable. Since VM_FAULT_WIRE flag implies wired map entry, the TRYPAGER() macro is reduced to the testing of the fs.object having a default pager. Inline the check. Suggested and reviewed by: alc Tested by: pho (previous version) MFC after: 1 week	2015-07-30 18:28:34 +00:00
Jeff Roberson	98082691bb	- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has led me to some irritating-to-discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-29 02:26:57 +00:00
Konstantin Belousov	6195b24a79	Revert r173708's modifications to vm_object_page_remove(). Assume that a vnode is mapped shared and mlocked(), and then the vnode is truncated, or truncated and then again extended past the mapping point EOF. Truncation removes the pages past the truncation point, and if pages are later created at this range, they are not properly mapped into the mlocked region, and their wiring count is wrong. The revert leaves the invalidated but wired pages on the object queue, which means that the pages are found by vm_object_unwire() when the mapped range is munlock()ed, and reused by the buffer cache when the vnode is extended again. The changes in r173708 were required since then vm_map_unwire() looked at the page tables to find the page to unwire. This is no longer needed with the vm_object_unwire() introduction, which follows the object's shadow chain. Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which is now redundant, we do not remove wired pages. Reported by: trasz, Dmitry Sivachenko <trtrmitya@gmail.com> Suggested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-25 18:29:06 +00:00
Jeff Roberson	fade8dd714	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division	2015-07-23 19:13:41 +00:00
Adrian Chadd	6520495abc	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policities. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. Ie, if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development! 
Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell	2015-07-11 15:21:37 +00:00
Alan Cox	22cf98d1f3	The intention of r254304 was to scan the active queue continuously. However, I've observed the active queue scan stopping when there are frequent free page shortages and the inactive queue is steadily refilled by other mechanisms, such as the sequential access heuristic in vm_fault() or madvise(2). To remedy this problem, record the time of the last active queue scan, and always scan a number of pages proportional to the time since the last scan, regardless of whether that last scan was a timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass > 0") scan. Also, on a timeout-triggered scan, allow a full scan of the active queue when the system is short of inactive pages. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2015-07-08 17:45:59 +00:00
Mark Johnston	010ba3842c	Add a local variable initialization needed in the OBJT_DEFAULT case. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D2992	2015-07-05 22:26:19 +00:00
Mateusz Guzik	cd336bad26	vm: don't lock proc around accesses to vm_{t,d}addr and RLIMIT_DATA in sys_mmap vm_{t,d}addr are constant and we can use thread's copy of resource limits	2015-07-02 18:30:12 +00:00
Konstantin Belousov	be930a2021	Account for the main process stack being one page below the highest user address when ABI uses shared page. Note that the change is no-op for correctness, since shared page does not fault. The mapping for the shared page is installed at the address space creation, the page is unmanaged and its pte/pv entry cannot be reclaimed. Submitted by: Oliver Pinter Review: https://reviews.freebsd.org/D2954 MFC after: 1 week	2015-07-02 15:22:13 +00:00
Mark Murray	d1b06863fb	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. 
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'staic struct start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. 
Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
John-Mark Gurney	afc6dc3669	If INVARIANTS is specified, add ctor/dtor to junk memory if they are unspecified... Submitted by: Suresh Gumpula at Netapp Differential Revision: https://reviews.freebsd.org/D2725	2015-06-25 20:44:46 +00:00
Alan Cox	aa04413540	Avoid pmap_is_modified() on pages that can't be mapped. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-06-21 01:22:35 +00:00
Gleb Smirnoff	093ebe1d28	o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async(). o Provide an extensive set of assertions for input array of pages. o Remove now duplicate assertions from different pagers. Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-06-17 22:44:27 +00:00
Konstantin Belousov	776f729c86	Invalid pages do not need an update of the activation count, nor could they be dirty. Move the handling of the invalid pages in the inactive scan earlier. Remove some code duplication in the scan by introducing the 'drop_page' label, which centralizes the object and the page unlock. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-14 20:23:41 +00:00
Alan Cox	78afdce6af	As the next step in eliminating PG_CACHE pages, free rather than cache pages in vm_pageout_scan(). The reactivation rate of cache pages created by vm_pageout_scan() is extremely low; typically no more than 0.5% to 2.25% of the pages are ever reactivated. At the same time, caching pages is more expensive than freeing them. For example, in a test with PostgreSQL, this change reduced the amount of time spent in the inactive queue scan by 1/6. Differential Revision: https://reviews.freebsd.org/D2805 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-06-14 05:23:39 +00:00
Gleb Smirnoff	093c7f396d	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-06-12 11:32:20 +00:00
Mateusz Guzik	f6f6d24062	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thread's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.	2015-06-10 10:48:12 +00:00
Alan Cox	5cd7a4f76c	Correct a type error in kmem_unback(). Previously, kmem_unback() did not correctly handle deallocation requests of two or more gigabytes in size. Eventually, this would lead to a panic elsewhere in the kernel, such as "vm_radix_insert: key <vm_pindex_t> is already present". Reported by: Ilias Marinos MFC after: 1 week	2015-06-10 05:17:14 +00:00

3805 Commits