freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	00de677313	Assert that the protection of a new map entry is a subset of the max protection. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-21 18:51:30 +00:00
Alan Cox	3a5d839ebc	Eliminate an unused macro. MFC after: 3 days	2017-06-21 03:55:45 +00:00
Konstantin Belousov	212e02c836	Ignore the P_SYSTEM process flag, and do not request VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack. System maps do not create auto-grown stacks. Any stack we handle, even for P_SYSTEM, must be for user address space. A P_SYSTEM process with mapped user space is either init(8) or an aio worker attached to another user process with an aio buffer pointing into the stack area. In either case, VM_MAP_WIRE_USER mode should be used. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-19 20:40:59 +00:00
Alan Cox	87b0ab69a9	Pages that are passed to swap_pager_putpages() should already be fully dirty. Assert that they are fully dirty rather than redundantly calling vm_page_dirty() on them. Reviewed by: kib, markj MFC after: 1 week X-MFC after: r319932	2017-06-17 03:05:25 +00:00
Konstantin Belousov	e6c44f65d4	Some minor improvements to vnode_pager_generic_putpages(). - Add asserts that the pages to write are dirty. The last page, if partially written, is only required to be dirty, while completely written pages should have all dirty bit set. - Use uintmax_t to print vm_page pindexes. - Use NULL instead of casted zero. - Remove if () test which duplicated the loop ending condition. - Miscellaneous style fixes. Reviewed by: alc, markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-15 14:34:33 +00:00
Gleb Smirnoff	77e1943785	When we are in UMA_STARTUP use startup_alloc() for any zone, not for internal zones only. This allows to create new zones at early stages of boot, without need to mark them as internal to UMA, which isn't always true. Reviewed by: alc	2017-06-08 21:33:19 +00:00
John Baldwin	4bd7e351f1	Fix an off-by-one error in the VM page array on some systems. r31386 changed how the size of the VM page array was calculated to be less wasteful. For most systems, the amount of memory is divided by the overhead required by each page (a page of data plus a struct vm_page) to determine the maximum number of available pages. However, if the remainder for the first non-available page was at least a page of data (so that the only memory missing was a struct vm_page), this last page was left in phys_avail[] but was not allocated an entry in the VM page array. Handle this case by explicitly excluding the page from phys_avail[]. Reviewed by: alc Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D11000	2017-06-08 16:18:41 +00:00
Alan Cox	761097c85e	Starting in r118390, swaponsomething() began to reserve the blocks at the beginning of a swap area for a disk label. However, neither r118390 nor r118544, which increased the reservation from one to two blocks, correctly accounted for these blocks when updating the variable "swap_pager_avail". This change corrects that error. Reviewed by: kib MFC after: 5 days	2017-06-06 16:52:07 +00:00
Alan Cox	03bdd65f18	When the function blist_fill() was added to the kernel in r107913, the swap pager used a different scheme for striping the allocation of swap space across multiple devices. And, although blist_fill() was intended to support fill operations with large counts, the old striping scheme never performed a fill larger than the stripe size. Consequently, the misplacement of a sanity check in blst_meta_fill() went undetected. Now, moving forward in time to r118390, a new scheme for striping was introduced that maintained a blist allocator per device, but as noted in r318995, swapoff_one() was not fully and correctly converted to the new scheme. This change completes what was started in r318995 by fixing the underlying bug in blst_meta_fill() that stops swapoff_one() from simply performing a single blist_fill() operation. Reviewed by: kib MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D11043	2017-06-06 03:32:17 +00:00
Alan Cox	3e78e98337	The variable "breakout" is used like a Boolean, so actually define it as one. Reviewed by: kib MFC after: 5 days	2017-06-05 18:07:56 +00:00
Alan Cox	064650c180	Halve the memory being internally allocated by the blist allocator. In short, half of the memory that is allocated to implement the radix tree is wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and r96851.) Reviewed by: kib, markj Tested by: pho MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D11028	2017-06-05 17:14:16 +00:00
Gleb Smirnoff	1431a74845	As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs.	2017-06-01 18:36:52 +00:00
Gleb Smirnoff	ac0a6fd015	Simplify boot pages management in UMA. It is simply a contiguous virtual memory pointer and number of pages. There is no need to build a linked list here. Just increment pointer and decrement counter. The only functional difference to old allocator is that before we gave pages from topmost and down to lowest, and now we give them in normal ascending order. While here remove padalign from a mutex that is unused at runtime. Reviewed by: alc	2017-06-01 18:26:57 +00:00
Alan Cox	07c348ea7b	After r118390, the variable "dmmax" was neither the correct stripe size nor the correct maximum block size. Moreover, after r318995, it serves no purpose except to provide information to user space through a read-only sysctl. This change eliminates the variable "dmmax" but retains the sysctl. It also corrects the value returned by the sysctl. Reviewed by: kib, markj MFC after: 3 days	2017-05-27 21:46:00 +00:00
Alan Cox	fe71561af2	In r118390, the swap pager's approach to striping swap allocation over multiple devices was changed. However, swapoff_one() was not fully and correctly converted. In particular, with r118390's introduction of a per-device blist, the maximum swap block size, "dmmax", became irrelevant to swapoff_one()'s operation. Moreover, swapoff_one() was performing out-of-range operations on the per-device blist that were silently ignored by blist_fill(). This change corrects both of these problems with swapoff_one(), which will allow us to potentially increase MAX_PAGEOUT_CLUSTER. Previously, swapoff_one() would panic inside of blist_fill() if you increased MAX_PAGEOUT_CLUSTER. Reviewed by: kib, markj MFC after: 3 days	2017-05-27 16:40:00 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion was done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Konstantin Belousov	be10b9d5d7	Emulate pre-r317061 ABI. This restores 32bit-sized accesses to vmcnt sysctls, making old binaries like top(1), systat(8) and reboot(8) mostly functional on newer kernel. Reviewed by: bde Sponsored by: The FreeBSD Foundation	2017-05-02 18:40:41 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Gleb Smirnoff	9ed01c32e0	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
Mark Johnston	e1cb9d3747	Busy the map in vm_map_protect(). We are otherwise susceptible to a race with a concurrent vm_map_wire(), which may drop the map lock to fault pages into the object chain. In particular, vm_map_protect() will only copy newly writable wired pages into the top-level object when MAP_ENTRY_USER_WIRED is set, but vm_map_wire() only sets this flag after its fault loop. We may thus end up with a writable wired entry whose top-level object does not contain the entire range of pages. Reported and tested by: pho Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10349	2017-04-10 21:01:42 +00:00
Mark Johnston	1c2d20a1d0	Consistently use for-loops in vm_map_protect(). No functional change. Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon X-Differential Revision: https://reviews.freebsd.org/D10349	2017-04-10 20:57:16 +00:00
Mark Johnston	ed11e4d701	Add some bounds assertions to the vm_map_entry clip functions. Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon X-Differential Revision: https://reviews.freebsd.org/D10349	2017-04-10 20:55:42 +00:00
Konstantin Belousov	65b9599a76	Extract calculation of ioflags from the vm_pager_putpages flags into a helper. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 16:56:04 +00:00
Konstantin Belousov	3dbb0ca646	Some style fixes for vnode_pager_generic_putpages(), in the local declaration block. Reviewed by: markj (as part of the larger patch) Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 16:45:00 +00:00
Konstantin Belousov	53b6404819	Use int instead of boolean_t for flags argument type in vnode_pager_generic_putpages() prototype; change the argument name to reflect that it is flags. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 16:30:41 +00:00
John Baldwin	a5a355788e	Assert that the align parameter to uma_zcreate() is valid. Reviewed by: kib MFC after: 1 week Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D10100	2017-04-04 16:26:46 +00:00
Dmitry Chagin	46dc8e9d6a	Add kern_mincore() helper for mincore() syscall. Suggested by: kib@ Reviewed by: kib@ MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D10143	2017-03-30 19:42:49 +00:00
Alan Cox	8956418832	Two changes to vm_fault_populate(): Simplify the logic for clipping the range returned by the pager to fit within the map entry. Use atop() rather than OFF_TO_IDX() on addresses. Reviewed by: kib MFC after: 1 week	2017-03-19 19:52:47 +00:00
Konstantin Belousov	bc27810671	Fix off-by-one in the vm_fault_populate() code. When re-calculating the last inclusive page index after the pager call, -1 was erroneously omitted. If the pager extended the run (unlikely), the result would be insertion of the valid page mapping outside the current map entry range. Found by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-19 14:42:16 +00:00
Xin LI	83d37aaf01	The adj_free and max_free values of new_entry will be calculated and assigned by subsequent vm_map_entry_link(), therefore, remove the pointless copying. Submitted by: alc MFC after: 3 days	2017-03-16 05:44:16 +00:00
Alan Cox	52d1addaa1	Relax the locking requirements for vm_object_page_noreuse(). While reviewing all uses of OFF_TO_IDX(), I observed that vm_object_page_noreuse() is requiring an exclusive lock on the object when, in fact, a shared lock suffices. Reviewed by: kib, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D10011	2017-03-15 17:43:45 +00:00
Konstantin Belousov	d1780e8dac	Use atop() instead of OFF_TO_IDX() for conversion of addresses or address offsets, as intended. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-03-14 19:39:17 +00:00
Xin LI	78d7964b46	Implement INHERIT_ZERO for minherit(2). INHERIT_ZERO is an OpenBSD feature. When a page is marked as such, it would be zeroed upon fork(). This would be used in new arc4random(3) functions. PR: 182610 Reviewed by: kib (earlier version) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D427	2017-03-14 17:10:42 +00:00
Mark Johnston	e20ff1a4d4	Update a comment to reflect reality. MFC after: 1 week	2017-03-13 18:45:25 +00:00
Konstantin Belousov	2b6d1a639b	Follow-up to r313690. Fix two missed places where vm_object offset to index calculation should use unsigned shift, to allow handling of full range of unsigned offsets used to create device mappings. Reported and tested by: royger (previous version) Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-12 13:53:13 +00:00
Andriy Gapon	57223e9994	uma: fix pages <-> items conversions at several places Those places were not taking into account uk_ppera. At present one allocation is always used by one slab, so uk_ppera must be used to convert between pages and slabs. uk_ipers is used to convert between slabs and items. MFC after: 1 month (if ever)	2017-03-11 16:43:38 +00:00
Andriy Gapon	a55ebb7cd5	uma: eliminate uk_slabsize field The field was not used beyond the initial keg setup stage anyway. MFC after: 1 month (if ever)	2017-03-11 16:35:36 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Andriy Gapon	9b43bc27c4	call vm_lowmem hook in uma_reclaim_worker A comment near kmem_reclaim() implies that we already did that. Calling the hook is useful, because some handlers, e.g. ARC, might be able to release significant amounts of KVA. Now that we have more than one place where vm_lowmem hook is called, use this change as an opportunity to introduce flags that describe a reason for calling the hook. No handler makes use of the flags yet. Reviewed by: markj, kib MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9764	2017-02-25 16:39:21 +00:00
Konstantin Belousov	63cdcaaead	Properly handle possible underflow in vm_fault_prefault(). In vm_fault_prefault(), if backward count causes underflow in calculation of starta = addra - backward * PAGE_SIZE; then starta must be clipped to entry->start, instead of zero. Clipping to zero allowed mapping outside of the map entries address ranges, in particular, map at zero. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: alc MFC after: 1 week	2017-02-24 08:09:16 +00:00
Andriy Gapon	937c1b0757	try to fix RACCT_RSS accounting There could be a race between the vm daemon setting RACCT_RSS based on the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to zero. In that case we can get a zombie process with non-zero RACCT_RSS. If the process is jailed, that may break accounting for the jail. There could be other consequences. Fix this race in the vm daemon by updating RACCT_RSS only when a process is in the normal state. Also, make accounting a little bit more accurate by refreshing the page resident count after calling vm_pageout_map_deactivate_pages(). Finally, add an assert that the RSS is zero when a process is reaped. PR: 210315 Reviewed by: trasz Differential Revision: https://reviews.freebsd.org/D9464	2017-02-14 13:54:05 +00:00
Bjoern A. Zeeb	05d58177e8	Use %s __func__ to print the actual function name (been looking at the wrong one for too often lately at first), and also use %#lx to get the 0x prefix for the address. MFC after: 1 week	2017-02-14 01:20:03 +00:00
Konstantin Belousov	496ab0532d	Rework r313352. Rename kern_vm_* functions to kern_*. Move the prototypes to syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as needed, to avoid headers pollution. Requested by: alc, jhb Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D9535	2017-02-13 09:04:38 +00:00
Konstantin Belousov	04e89ffba8	Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-02-13 00:40:55 +00:00
Konstantin Belousov	987ff18184	Consistently handle negative or wrapping offsets in the mmap(2) syscalls. For regular files and posix shared memory, POSIX requires that [offset, offset + size) range is legitimate. At the mapping time, check that offset is not negative. Allowing negative offsets might expose the data that filesystem put into vm_object for internal use, esp. due to OFF_TO_IDX() signedness treatment. Fault handler verifies that the mapped range is valid, assuming that mmap(2) checked that arithmetic gives no undefined results. For device mappings, leave the semantic of negative offsets to the driver. Correct object page index calculation to not erroneously propagate sign. In either case, disallow overflow of offset + size. Update mmap(2) man page to explain the requirement of the range validity, and behaviour when the range becomes invalid after mapping. Reported and tested by: royger (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-12 21:05:44 +00:00
Konstantin Belousov	1c2ad3e962	Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int. This makes the code pass the whole word of the mmap(2) syscall argument prot to the syscall helper kern_vm_mmap(), which can validate all bits. The change provides a temporary fix for sys/vm/mmap_test mmap__bad_arguments, which was broken after r313352. PR: 216976 Reported and tested by: ngie Sponsored by: The FreeBSD Foundation	2017-02-11 20:27:39 +00:00
Edward Tomasz Napierala	69cdfcef2e	Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(), kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats instead of their sys_*() counterparts. Reviewed by: ed, dchagin, kib MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9378	2017-02-06 20:57:12 +00:00
Konstantin Belousov	5fca242374	Style, use tab after #define. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-02-04 19:16:19 +00:00
Alan Cox	8a99f1cc59	Over the years, the code and comments in vm_page_startup() have diverged in one respect. When determining how many page structures to allocate, contrary to what the comments say, the code does not account for the overhead of a page structure per page of physical memory. This revision changes the code to match the comments. Reviewed by: kib, markj MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D9081	2017-02-04 05:23:10 +00:00
Edward Tomasz Napierala	a6b15641d6	Ifdef out the unused vm_rr_selectdomain(). MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-02-02 17:44:55 +00:00
Mark Johnston	aa3650ea36	Avoid page lookups in the top-level object in vm_object_madvise(). We can iterate over consecutive resident pages in the top-level object using the object's page list rather than by performing lookups in the object radix tree. This extends one of the optimizations in r312208 to the case where a shadow chain is present. Suggested by: alc Reviewed by: alc, kib (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9282	2017-01-30 18:51:43 +00:00
Mateusz Guzik	736ff8c396	hwpmc: partially depessimize munmap handling if the module is not loaded HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the common (module not loaded) case. In particular this avoids permission checks + lock downgrade singlethreaded and in cases where an executable mapping is found the pmc sx lock is no longer bounced. Note this is a band aid. MFC after: 1 week	2017-01-24 22:00:16 +00:00
Mark Johnston	c2655a40a7	Avoid unnecessary page lookups in vm_object_madvise(). vm_object_madvise() is frequently used to apply advice to a contiguous set of pages in an object with no backing object. Optimize this case by skipping non-resident subranges in constant time, and by iterating over resident pages using the object memq, thus avoiding radix tree lookups on each page index in the specified range. While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the "advise" parameter to vm_object_madvise() to "advice." Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9098	2017-01-15 03:50:08 +00:00
Gleb Smirnoff	4f56243aad	Fix the contiguity once more.	2017-01-12 20:26:02 +00:00
Mark Johnston	8d65cba217	Remove a redundant use of min(). Reported by: rpokala X-MFC With: r311346	2017-01-05 03:13:45 +00:00
Mark Johnston	ec492b13f1	Add a small allocator for exec_map entries. Upon each execve, we allocate a KVA range for use in copying data to the new image. Pages must be faulted into the range, and when the range is freed, the backing pages are freed and their mappings are destroyed. This is a lot of needless overhead, and the exec_map management becomes a bottleneck when many CPUs are executing execve concurrently. Moreover, the number of available ranges is fixed at 16, which is insufficient on large systems and potentially excessive on 32-bit systems. The new allocator reduces overhead by making exec_map allocations persistent. When a range is freed, pages backing the range are marked clean and made easy to reclaim. With this change, the exec_map is sized based on the number of CPUs. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8921	2017-01-05 01:44:12 +00:00
Gleb Smirnoff	1e0c121f3a	Fix assertion that checks that pages are consecutive to properly handle bogus_page insertion(s).	2017-01-04 22:31:09 +00:00
Gleb Smirnoff	bfc8c24c73	Move bogus_page declaration to vm_page.h and initialization to vm_page.c. Reviewed by: kib	2017-01-04 22:27:19 +00:00
Mark Johnston	b1fd102ee7	Add a page queue for holding dirty anonymous unswappable pages. On systems without a configured swap device, an attempt to launder pages from a swap object will always fail and result in the page being reactivated. This means that the page daemon will continuously scan pages that can never be evicted. With this change, anonymous pages are instead moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device is configured, so unreferenced unswappable pages are excluded from the page daemon's workload. Reviewed by: alc	2017-01-03 00:05:44 +00:00
Justin Hibbits	b5345ef10e	Print flags in hex instead of decimal. Hex is easier to grok for flags, and consistent with other prints.	2017-01-02 16:50:52 +00:00
Konstantin Belousov	1569205f0a	Style fixes for vm_map_insert(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-01 18:49:46 +00:00
Konstantin Belousov	03302b1380	Ansify vm/vm_pager.c. Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-31 19:30:22 +00:00
Mateusz Guzik	6ff51a3685	Use vrefact in vnode_pager_alloc.	2016-12-31 10:37:56 +00:00
Konstantin Belousov	7a432b84e8	Fix two similar bugs in the populate vm_fault() code. If pager's populate method succeeded, but other thread raced with us and modified vm_map, we must unbusy all pages busied by the pager, before we retry the whole fault handling. If pager instantiated more pages than fit into the current map entry, we must unbusy the pages which are clipped. Also do some refactoring, clarify comments and use more clear local variable names. Reported and tested by: kargl, subbsd@gmail.com (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-30 18:55:33 +00:00
Konstantin Belousov	0c8bd6a7d8	Assert that the pages found on the object queue by vm_page_next() and vm_page_prev() have correct ownership. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2016-12-30 17:37:06 +00:00
Konstantin Belousov	9a4ee196dd	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-30 13:04:43 +00:00
Mateusz Guzik	0b3b55a0f2	Remove cpu_spinwait after seq_consistent. It does not add any benefit as the read routine will do it as necessary.	2016-12-30 06:26:17 +00:00
Alan Cox	920da7e4d2	Relax the object type restrictions on vm_page_alloc_contig(). Specifically, add support for object types that were previously prohibited because they could contain PG_CACHED pages. Roughly halve the number of radix trie operations performed by vm_page_alloc_contig() using the same approach that is employed by vm_page_alloc(). Also, eliminate the radix trie lookup performed with the free page queues lock held. Tidy up the handling of radix trie insert failures in vm_page_alloc() and vm_page_alloc_contig(). Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8878	2016-12-28 18:32:13 +00:00
Konstantin Belousov	8b590e9506	Remove redundancy in vmtotal(). There are two instances of inlined unlocks + continue in vmtotal() switch statements, which are ordinarily expressed with break from the switch case and code after the switch. Also, the combination of continue and break statement is redundant. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-26 19:29:04 +00:00
Konstantin Belousov	2e56b64fa4	Fix argument type and microoptimize swp_pager_meta_free(). The count argument natural type is vm_pindex_t, but due to the loop organization, it has to be signed type to detect the termination condition. Replace this logic by using distinguished counter for the processed pages, and terminate loop when the counter exceeds the argument. Completely process one swblock for all relevant indexes instead of doing relookup in hash when incrementing page index on the loop step. Do not drop hash mutex around iterations. Noted and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-24 09:57:31 +00:00
Konstantin Belousov	77d6fd97ef	Improve vm_object_scan_all_shadowed() to also check swap backing objects. As noted in the removed comment, it is possible and not prohibitively costly to look up the swap blocks for the given page index. Implement a swap_pager_find_least() function to do that, and use it to iterate simultaneously over both backing object page queue and swap allocations when looking for shadowed pages. Testing shows that number of new successful scans, enabled by this addition, is small but non-zero. When worked out, the change both further reduces the depth of the shadow object chain, and frees unused but allocated swap and memory. Suggested and reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-18 20:56:14 +00:00
Konstantin Belousov	71057cd207	In swp_pager_meta_free_all(), fix type of the index variable. Style. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-16 23:33:37 +00:00
Konstantin Belousov	a1e9a3bba3	Provide introductory description of the default pager. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-14 23:36:32 +00:00
Konstantin Belousov	a41ece0840	Remove locking around accounting initialization of the default object. The object is not yet fully constructed and must not be available to other threads. This makes default_pager_alloc() almost identical to swap_pager_alloc_init(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-14 23:34:25 +00:00
Alan Cox	3d026d871f	Tidy up. Mostly, remove or replace stale comments. Most of the comments in this file actually described the operation of the swap pager, not the default pager. Given that this is the wrong place to discuss the implementation of the swap pager, it shouldn't come as a surprise that as the swap pager evolved these comments became increasingly stale. In addition, apply some style fixes, like modernizing a few remaining old-style function definitions. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D8781	2016-12-14 17:28:55 +00:00
John Baldwin	a9546a6b17	Use db_lookup_proc() in the DDB 'show procvm' command. This allows processes to be identified by PID as well as a pointer address. MFC after: 2 weeks Sponsored by: DARPA / AFRL	2016-12-13 19:22:43 +00:00
Alan Cox	3453bca864	Eliminate every mention of PG_CACHED pages from the comments in the machine-independent layer of the virtual memory system. Update some of the nearby comments to eliminate redundancy and improve clarity. In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per The Chicago Manual of Style. Update the comment in vm/vm_page.h defining the four types of page queues to reflect the elimination of PG_CACHED pages and the introduction of the laundry queue. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8752	2016-12-12 17:47:09 +00:00
Gleb Smirnoff	255003da42	Allow bogus_page to be passed to pager(s).	2016-12-09 21:21:24 +00:00
Mark Johnston	90458813cd	Conditionalize PG_CACHE sysctls on COMPAT_FREEBSD11. Reviewed by: glebius, imp, jhb Differential Revision: https://reviews.freebsd.org/D8736	2016-12-09 18:55:27 +00:00
Konstantin Belousov	ed01d9894e	Implement the populate() pager method for phys pager. It makes it possible to provide configurable aggressive prefaulting and useful hints to the page daemon about memory allocations, on faults for pages managed by the phys pager. In fact, this implementation is superior to the MAP_SHARED_PHYS hack from my PostgreSQL paper, while giving similar benefits of reducing the number of page faults on SysV shared memory mappings. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2016-12-08 11:35:53 +00:00
Konstantin Belousov	c42b43a054	Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of page ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some number of pages around it, at the pager's discretion. VM provides the pager with hints about the current range of the object mapping, to avoid instantiation of immediately unused pages, if the pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object's type. After populate() has successfully returned, the VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantics for COW and vm map locking. The method is opt-in; the pager sets the OBJ_POPULATE flag to indicate that the method can be called. If the pager's vm objects can be shadowed, the pager must implement the traditional getpages() method in addition to populate(). Populate() might fall back to getpages() on a per-call basis as well, by returning the VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there are no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2016-12-08 11:26:11 +00:00
Konstantin Belousov	dc5401d240	Move map_generation snapshot value into struct faultstate. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-08 10:29:41 +00:00
Konstantin Belousov	272cc3c4d0	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-08 10:28:51 +00:00
Alan Cox	e94965d82e	Previously, vm_radix_remove() would panic if the radix trie didn't contain a vm_page_t at the specified index. However, with this change, vm_radix_remove() no longer panics. Instead, it returns NULL if there is no vm_page_t at the specified index. Otherwise, it returns the vm_page_t. The motivation for this change is that it simplifies the use of radix tries in the amd64, arm64, and i386 pmap implementations. Instead of performing a lookup before every remove, the pmap can simply perform the remove. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D8708	2016-12-08 04:29:29 +00:00
Mark Johnston	43482f897b	Use the official spelling for NULL arguments to typed sysctl handlers. Reported by: bde	2016-12-07 01:15:10 +00:00
Mark Johnston	77edd8fa00	Provide dummy sysctls for v_cache_count and v_tcached. Some utilities (notably top(1)) exit if any of their input sysctls don't exist, and the removal of the above-mentioned PG_CACHE-related sysctls makes it difficult to run such utilities on different versions of the kernel without recompiling. Requested by: bde	2016-12-06 22:52:45 +00:00
Alan Cox	8804a2b030	Eliminate a stale comment; vm_radix_prealloc() was replaced in r254141. MFC after: 3 days	2016-12-02 16:29:30 +00:00
Alan Cox	563a19d546	During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was called to allocate a new page of radix trie nodes, there could be a call to vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress vm_radix_insert(). With the removal of PG_CACHED pages, we can simplify vm_radix_insert() and vm_radix_remove() by removing the flags on the root of the trie that were used to detect this case and the code for restarting vm_radix_insert() when it happened. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8664	2016-12-01 17:26:37 +00:00
Alan Cox	ba67369628	Recursion on the free page queue mutex occurred when UMA needed to allocate a new page of radix trie nodes to complete a vm_radix_insert() operation that was requested by vm_page_cache(). Specifically, vm_page_cache() already held the free page queue lock when UMA tried to acquire it through a call to vm_page_alloc(). This code path no longer exists, so there is no longer any reason to allow recursion on the free page queue mutex. Improve nearby comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8628	2016-11-27 01:42:53 +00:00
Mark Johnston	99e6e1930c	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589	2016-11-23 17:53:07 +00:00
Alan Cox	bba39b9ae3	Remove PG_CACHED-related fields from struct vmmeter, because they are no longer used. More precisely, they are always zero because the code that decremented and incremented them no longer exists. Bump __FreeBSD_version to mark this change. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8583	2016-11-22 18:13:46 +00:00
Gleb Smirnoff	e48b82bd83	- If caller specifies readbehind and readahead that together with count doesn't fit into a buf, then trim readbehind and readahead evenly. If rbehind was limited by the previous BMAP, then roundup its trim to block size. - Add KASSERT to check that b_blkno has proper offset from original blkno returned by BMAP. [1] - Add KASSERT to check that pages in buf are consecutive. Reviewed by: kib Submitted by: kib [1]	2016-11-17 20:32:32 +00:00
Konstantin Belousov	41ddec83c1	Move the fast fault path into the separate function. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-16 16:34:17 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Alan Cox	ebcddc7217	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specifically, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. 
When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302	2016-11-09 18:48:37 +00:00
Bryan Drewery	28323add09	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
Konstantin Belousov	1771e987ca	Do not sleep in vm_wait() if the pagedaemon has not yet started. Panic instead. Requests which cannot be satisfied by allocators at boot time often have unrealizable parameters. Waiting for the pagedaemon's start would hang the boot if done in the thread0 context and just never succeed if executed from another thread. In fact, for very early stages, a sleep attempt panics with an obscure diagnostic about the scheduler state, and an explicit panic in vm_wait() makes the investigation much shorter by cutting off the examination of the thread and scheduler. Theoretically, some subsystem might grab a resource to exhaustion, and free it later in the boot process. If this unlikely scenario does appear for real, the way to diagnose the trouble can be revisited. Reported by: emaste Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8421	2016-11-04 12:58:50 +00:00
Alan Cox	857025056f	In vm_fault()'s loop over the shadow chain, move a comment describing our invariants to a better place. Also, add two comments concerning the relationship between the map and vnode locks. Reviewed by: kib MFC after: 3 days	2016-11-03 16:44:55 +00:00
Alan Cox	dda4d36957	Move and revise a comment about the relation between the object's paging-in-progress count and the vnode. Prior to r188331, we always acquired the vnode lock before incrementing the object's paging-in-progress count. Now, we increment it before attempting to acquire the vnode lock with LK_NOWAIT, but we never sleep acquiring the vnode lock while we have the count incremented. Reviewed by: kib MFC after: 3 days	2016-11-01 17:11:10 +00:00
Conrad Meyer	8532d381a9	Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code. This can be handy in tracking down what code touched hung bios and bufs last. The full history is especially useful, but adds enough bloat that it shouldn't be enabled in release builds. Function names (or arbitrary string constants) are tracked in a fixed-size ring in bufs. Bios gain a pointer to the upper buf for tracking. SCSI CCBs gain a pointer to the upper bio for tracking. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8366	2016-10-31 23:09:52 +00:00
Konstantin Belousov	e26236e9f3	Change remaining internal uses of boolean_t to bool in vm/vm_fault.c. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 20:39:38 +00:00
Konstantin Belousov	1dcadc022f	Remove vm_pager_has_page() declaration. It is not too useful since static inline definition appears later in the file. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 20:38:57 +00:00
Alan Cox	f994b2077b	Merge and sort vm_fault_hold()'s "int" variable definitions. Reviewed by: kib MFC after: 7 days	2016-10-30 19:15:59 +00:00
Konstantin Belousov	022dfd690c	Remove vnode_locked label and goto, by collapsing vp calculation into the conditional. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 18:05:18 +00:00
Konstantin Belousov	1be02479be	Split long line instead of unindenting it. Add KASSERT() verifying that a device object with the same handle has the same ops vector. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-30 18:04:11 +00:00
Alan Cox	cd8a6fe8e9	The "lookup_is_valid" field is used as a "bool". Make it one. Convert vm_fault_hold()'s Boolean variables that are only used internally to "bool". Add a comment describing why the one remaining "boolean_t" was not converted. Reviewed by: kib MFC after: 8 days	2016-10-29 21:01:49 +00:00
Alan Cox	320023e286	With one exception, "hardfault" is used like a "bool". Change that exception and make it a "bool". Reviewed by: kib MFC after: 7 days	2016-10-29 19:22:38 +00:00
Mark Johnston	a9ee028d04	Add one more use of unlock_vp(). Discussed with: kib X-MFC With: r308094	2016-10-29 18:47:28 +00:00
Konstantin Belousov	cfabea3d3a	Add unlock_vp() helper. Trim space. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-10-29 18:03:29 +00:00
Mark Johnston	829be5168d	Simplify keg_drain() a bit by using LIST_FOREACH_SAFE. MFC after: 1 week	2016-10-20 23:10:27 +00:00
Gleb Smirnoff	dcc0ff5a52	Fix incorrect assertion that could miss overflows. Reviewed by: kib	2016-10-19 19:50:09 +00:00
Konstantin Belousov	230afe0be6	If vm_fault_hold(9) finds that fs.m is wired, do not free it after a pager error, leave the page to the wire owner. E.g. the page might be a part of the invalidated buffer. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8197	2016-10-17 08:17:06 +00:00
Konstantin Belousov	bd9546a21c	Export vm_page_xunbusy_maybelocked(). Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8197	2016-10-17 08:14:23 +00:00
Mark Johnston	eb17fb15b3	Plug a potential vnode lock leak in vm_fault_hold(). Reviewed by: alc, kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8242	2016-10-13 20:39:34 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there are still no waiters recorded for the busy state, the xbusy owner did not acquire the page lock, so it proceeded. Moreover, suppose that some other thread happens to share-busy the page after the xbusy state was relinquished but before m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs the vm_object lock to proceed. Then, vm_page_busy_sleep() reads a busy_lock value equal to VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads; the only way to have more than one sbusy owner right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Konstantin Belousov	267ed8e2f7	When downgrading exclusively busied page to shared-busy state, wakeup waiters. Otherwise, owners of the shared-busy state are left blocked and might get into a deadlock. Note that the vm_page_busy_downgrade() function is not used in the tree right now. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8195	2016-10-11 18:09:37 +00:00
Alan Cox	70cf3ced3c	Make the page daemon's notion of what kind of pass is being performed by vm_pageout_scan() local to vm_pageout_worker(). There is no reason to store the pass in the NUMA domain structure. Reviewed by: kib MFC after: 3 weeks	2016-10-05 17:32:06 +00:00
Alan Cox	e57dd910e6	Change vm_pageout_scan() to return a value indicating whether the free page target was met. Previously, vm_pageout_worker() itself checked the length of the free page queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan freed enough pages to meet the free page target. Specifically, vm_pageout_worker() used vm_paging_needed(). The trouble with vm_paging_needed() is that it compares the length of the free page queues to the wakeup threshold for the page daemon, which is much lower than the free page target. Consequently, vm_pageout_worker() could conclude that the inactive queue scan succeeded in meeting its free page target when in fact it did not; and rather than immediately triggering an all-out laundering pass over the inactive queue, vm_pageout_worker() would go back to sleep waiting for the free page count to fall below the page daemon wakeup threshold again, at which point it will perform another limited (pass == 1) scan over the inactive queue. Changing vm_pageout_worker() to use vm_page_count_target() instead of vm_paging_needed() won't work because any page allocations that happen concurrently with the inactive queue scan will result in the free page count being below the target at the end of a successful scan. Instead, having vm_pageout_scan() return a value indicating success or failure is the most straightforward fix. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8111	2016-10-05 16:15:26 +00:00
Andrew Gallatin	edb2994a62	Conditionally move initial vfs bio alloc above 4G On machines with just the wrong amount of physical memory (enough to have a lot of bufs, but not enough to use VM_FREELIST_DMA32) it is possible for 32-bit address limited devices to have little to no memory left when attaching, due to potentially large vfs bio configs consuming all memory below 4GB not protected by VM_FREELIST_ISADMA. This causes the 32-bit devices to allocate from VM_FREELIST_ISADMA, leaving that freelist empty when ISA devices need DMAable memory. Rather than decrease VM_DMA32_NPAGES_THRESHOLD, use the time honored technique of putting initially allocated kernel data structs at the end (or at least not the beginning) of memory. Since this allocation is done at boot, is wired, and is never freed, the system is low on 32-bit (and ISA) dma'ble memory forever. So it is a good candidate to move above 4GB. While here, remove an unneeded round_page() from kmem_malloc's size argument as suggested by alc. The first thing kmem_malloc() does is a round_page(size), so there is no need to do it before the call. Reviewed by: alc Sponsored by: Netflix	2016-10-03 13:23:43 +00:00
Alan Cox	8cb0c1029d	Various changes to pmap_ts_referenced() Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and into vm/pmap.h, and describe what its purpose is. Eliminate the archaic "XXX" comment about its value. I don't believe that its exact value, e.g., 5 versus 6, matters. Update the arm64 and riscv pmap implementations of pmap_ts_referenced() to opportunistically update the page's dirty field. On amd64, use the PDE value already cached in a local variable rather than dereferencing a pointer again and again. Reviewed by: kib, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D7836	2016-09-10 16:49:25 +00:00
Mark Johnston	dd9cb6da0b	Respect the caller's hints when performing swap readahead. The pager getpages interface allows the caller to bound the number of readahead and readbehind pages, and vm_fault_hold() makes use of this feature. These bounds were ignored after r305056, causing the swap pager to potentially page in more than the specified number of pages. Reported and reviewed by: alc X-MFC with: r305056	2016-09-04 00:25:49 +00:00
Mark Johnston	dbbaf04f1e	Remove support for idle page zeroing. Idle page zeroing has been disabled by default on all architectures since r170816 and has some bugs that make it seemingly unusable. Specifically, the idle-priority pagezero thread exacerbates contention for the free page lock, and yields the CPU without releasing it in non-preemptive kernels. The pagezero thread also does not behave correctly when superpage reservations are enabled: its target is a function of v_free_count, which includes reserved-but-free pages, but it is only able to zero pages belonging to the physical memory allocator. Reviewed by: alc, imp, kib Differential Revision: https://reviews.freebsd.org/D7714	2016-09-03 20:38:13 +00:00
Konstantin Belousov	9815066425	Make swapoff reliable. The swap_pager_swapoff() function uses trylock for the object lock before pagein, which means that either i/o to md(4) over swap, or intensive page faults over swap pager objects might prevent swapoff() from making any progress. Then the retry < 100 check fails and machine panics. If trylock fails, acquire the object lock in the blockable way and restart the hash bucket walk. Keep retries logic for now. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7688	2016-08-31 14:49:58 +00:00
Mark Johnston	915d1b71cd	Restore swap pager readahead after r292373. The removal of vm_fault_additional_pages() meant that a hard fault on a swap-backed page would result in only that page being read in. This change implements readahead and readbehind for the swap pager in swap_pager_getpages(). swap_pager_haspage() is modified to return the largest contiguous non-resident range of pages containing the requested range. Reviewed by: alc, kib Tested by: pho MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7677	2016-08-30 05:56:21 +00:00
Alan Cox	ce3ee09b53	Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when neither vm_pager_has_page() nor vm_pager_get_pages() is called. Reviewed by: kib, markj MFC after: 3 weeks	2016-08-14 22:00:45 +00:00
Mark Johnston	842ee21e20	Strengthen assertions about the busy state of newly-allocated pages. Reviewed by: alc MFC after: 1 week	2016-08-13 19:49:32 +00:00
Mark Johnston	fc85a6f0c4	Initialize page busy lock state in vm_phys_add_page(). MFC after: 1 week	2016-08-13 19:48:43 +00:00
Alan Cox	791444089f	Correct errors and clean up the comments on the active queue scan. Eliminate some unnecessary blank lines. Reviewed by: kib, markj MFC after: 1 week	2016-08-12 03:22:58 +00:00
Edward Tomasz Napierala	411455a8fb	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month	2016-08-10 16:12:31 +00:00
Alan Cox	f0edf3f806	Correct a spelling error.	2016-08-05 16:44:11 +00:00
Alan Cox	248fe642a7	Clean up the comments and code style in and around vm_pageout_cluster(). In particular, fix factual, grammatical, and spelling errors in various comments, and remove comments that are out of place in this function. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7410	2016-08-04 16:20:12 +00:00
Konstantin Belousov	0c657d22eb	Explain why swapgeom_close_ev() is delegated. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-03 07:11:19 +00:00
Alan Cox	87ff568c26	Restore the historical behavior of "sysctl vm.swap_idle_enabled=1". Prior to r254304, we had separate functions for reclamation and laundering (vm_pageout_scan) versus updating usage information, i.e., "reference bits", on active pages (vm_pageout_page_stats), and we only performed vm_req_vmdaemon(VM_SWAP_IDLE) if vm_pages_needed was true. However, since r254303, if vm_swap_idle_enabled was "1", we have performed vm_req_vmdaemon(VM_SWAP_IDLE) regardless of whether we are short of free pages. This was unintended and too aggressive, so I suspect no one uses this feature. With this change, we restore the historical behavior and only perform vm_req_vmdaemon(VM_SWAP_IDLE) when we are short of free pages. Reviewed by: kib, markj	2016-08-01 17:25:07 +00:00
Mark Johnston	897d0c6617	Use vm_page_undirty() instead of manually setting a page field. Reviewed by: alc MFC after: 3 days	2016-07-29 21:05:37 +00:00
Alan Cox	793172ea88	Remove a probe declaration that has been unused since r292469, when vm_pageout_grow_cache() was replaced. MFC after: 3 days	2016-07-29 16:43:51 +00:00
Alan Cox	f095d1bbc7	Remove any mention of cache (PG_CACHE) pages from the comments in vm_pageout_scan(). That function has not cached pages since r284376. MFC after: 3 days	2016-07-28 22:30:48 +00:00
Konstantin Belousov	88ad2d7b47	Do not delegate work to the geom event thread which can be done inline. In particular, swapongeom_ev() needed event thread context when swap pager configuration was performed under Giant and geom asserted that Giant is not owned. Now both of the reasons have gone away. On the other hand, note that swapgeom_release() is called from the bio_done context, and a possible close cannot be performed inline. Also fix some minor issues. The swapgeom() function does not use the td argument, remove it. Recheck that the vnode passed is still VCHR and not reclaimed after the lock. Reviewed by: mav Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-28 15:57:01 +00:00
Konstantin Belousov	2174a0c607	Fix style and typo. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-28 15:49:51 +00:00
Mark Johnston	3ac8f842ea	De-pluralize "queues" where appropriate in the pagedaemon code. MFC after: 1 week	2016-07-27 17:11:03 +00:00
Alan Cox	a766ffd061	Update a comment to reflect r284376. MFC after: 3 days	2016-07-27 03:49:00 +00:00
Mark Johnston	44be0a8ea5	Correct a comment - each page queue has its own lock. Reviewed by: alc MFC after: 3 days	2016-07-23 21:03:25 +00:00
Mark Johnston	efe1ff4cf0	Update a comment in vm_page_advise() to match behaviour after r290529. Reviewed by: alc MFC after: 3 days	2016-07-23 21:02:36 +00:00
Alan Cox	8d67b8c863	Add a comment describing the 'fast path' that was introduced in r270011. Reviewed by: kib MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-07-20 17:20:22 +00:00
Mark Johnston	afa5d70339	Release the second critical section in uma_zfree_arg() slightly earlier. It is only needed when removing a full bucket from the per-CPU cache. The bucket cache (uz_buckets) is protected by the zone mutex and thus the critical section can be released before inserting into that list. MFC after: 1 week	2016-07-20 01:01:50 +00:00
Mark Johnston	20c58db95a	Make vm_pageout_wakeup_thresh a u_int rather than an int. It's a threshold for v_free_count, which is of type u_int. This also lets us get rid of a cast in vm_paging_needed(). Reviewed by: alc MFC after: 1 week	2016-07-20 00:09:22 +00:00
Alan Cox	0c3a489325	Break up vm_fault()'s implementation of the read-ahead and delete-behind optimizations into two distinct pieces. The first piece consists of the code that should only be performed once per page fault and requires the map to be locked. The second piece consists of the code that should be performed each time a pager is called on an object in the shadow chain. (This second piece expects the map to be unlocked.) Previously, the entire implementation could be executed multiple times. Moreover, the second and subsequent executions would occur with the map unlocked. Usually, the ensuing unsynchronized accesses to the map were harmless because the map was not changing. Nonetheless, it was possible for a use-after-free error to occur, where vm_fault() wrote to a freed map entry. This change corrects that problem. Reported by: avg Reviewed by: kib MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-07-18 04:20:26 +00:00
Konstantin Belousov	19efd8a5a8	In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called, if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate(). The vnode_destroy_object(), when calling into vm_object_terminate(), must be able to flush buffers. The purpose of BO_DEAD is to quickly destroy buffers on write when the underlying vnode is not operable any more (one example is the devfs node after geom is gone). Setting BO_DEAD for a reclaiming vnode before the object is terminated is premature, and results in an inability to flush buffers with live SU dependencies from vinvalbuf() in vm_object_terminate(). Reported by: David Cross <dcrosstech@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-11 14:19:09 +00:00
Robert Watson	0df4264748	When mmap(2) is used with a vnode, capture vnode attributes in the audit trail. This was not required for Common Criteria auditing (which requires only that the intent to read or write be audited at the time of open(2)), but is useful for contemporary live analysis and forensics. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 11:49:10 +00:00
Robert Watson	51d1f69069	Audit file-descriptor arguments to I/O system calls such as read(2), write(2), dup(2), and mmap(2). This auditing is not required by the Common Criteria (and hence was not being performed), but is valuable in both contemporary live analysis and forensic use cases. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 08:04:02 +00:00
Alan Cox	381b724280	Change the type of the map entry's next_read field from a vm_pindex_t to a vm_offset_t. (This field is used to detect sequential access to the virtual address range represented by the map entry.) There are three reasons to make this change. First, a vm_offset_t is smaller on 32-bit architectures. Consequently, a struct vm_map_entry is now smaller on 32-bit architectures. Second, a vm_offset_t can be written atomically, whereas it may not be possible to write a vm_pindex_t atomically on a 32-bit architecture. Third, using a vm_pindex_t makes the next_read field dependent on which object in the shadow chain is being read from. Replace an "XXX" comment. Reviewed by: kib Approved by: re (gjb) Sponsored by: EMC / Isilon Storage Division	2016-07-07 20:58:16 +00:00


3737 Commits