Commit Graph

3759 Commits

Author SHA1 Message Date
Konstantin Belousov
bc27810671 Fix off-by-one in the vm_fault_populate() code.
When re-calculating the last inclusive page index after the pager
call, -1 was erronously ommitted.  If the pager extended the run
(unlikely), the result would be insertion of the valid page mapping
outside the current map entry range.

Found by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-19 14:42:16 +00:00
Xin LI
83d37aaf01 The adj_free and max_free values of new_entry will be calculated and
assigned by subsequent vm_map_entry_link(), therefore, remove the
pointless copying.

Submitted by:	alc
MFC after:	3 days
2017-03-16 05:44:16 +00:00
Alan Cox
52d1addaa1 Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by:	kib, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D10011
2017-03-15 17:43:45 +00:00
Konstantin Belousov
d1780e8dac Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
Xin LI
78d7964b46 Implement INHERIT_ZERO for minherit(2).
INHERIT_ZERO is an OpenBSD feature.

When a page is marked as such, it would be zeroed
upon fork().

This would be used in new arc4random(3) functions.

PR:	182610
Reviewed by:	kib (earlier version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D427
2017-03-14 17:10:42 +00:00
Mark Johnston
e20ff1a4d4 Update a comment to reflect reality.
MFC after:	1 week
2017-03-13 18:45:25 +00:00
Konstantin Belousov
2b6d1a639b Follow-up to r313690.
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.

Reported and tested by:	royger (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:53:13 +00:00
Andriy Gapon
57223e9994 uma: fix pages <-> items conversions at several places
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after:	1 month (if ever)
2017-03-11 16:43:38 +00:00
Andriy Gapon
a55ebb7cd5 uma: eliminate uk_slabsize field
The field was not used beyond the initial keg setup stage anyway.

MFC after:	1 month (if ever)
2017-03-11 16:35:36 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Andriy Gapon
9b43bc27c4 call vm_lowmem hook in uma_reclaim_worker
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook.  No handler makes use of the flags yet.

Reviewed by:	markj, kib
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9764
2017-02-25 16:39:21 +00:00
Konstantin Belousov
63cdcaaead Properly handle possible underflow in vm_fault_prefault().
In vm_fault_prefault(), if backward count causes underflow in
calculation of
	starta = addra - backward * PAGE_SIZE;
then starta must be clipped to entry->start, instead of zero.
Clipping to zero allowed mapping outside of the map entries address
ranges, in particular, map at zero.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by:	alc
MFC after:	1 week
2017-02-24 08:09:16 +00:00
Andriy Gapon
937c1b0757 try to fix RACCT_RSS accounting
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero.  In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state.  Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR:		210315
Reviewed by:	trasz
Differential Revision: https://reviews.freebsd.org/D9464
2017-02-14 13:54:05 +00:00
Bjoern A. Zeeb
05d58177e8 Use %s __func__ to print the actual function name (been looking at
the wrong one for too often lately at first), and also use %#lx to
get the 0x prefix for the address.

MFC after:	1 week
2017-02-14 01:20:03 +00:00
Konstantin Belousov
496ab0532d Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
Konstantin Belousov
04e89ffba8 Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers.
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-13 00:40:55 +00:00
Konstantin Belousov
987ff18184 Consistently handle negative or wrapping offsets in the mmap(2) syscalls.
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate.  At the maping time,
check that offset is not negative.  Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment.  Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.

For device mappings, leave the semantic of negative offsets to the
driver.  Correct object page index calculation to not erronously
propagate sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by:	royger (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 21:05:44 +00:00
Konstantin Belousov
1c2ad3e962 Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.
This makes the code to pass whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits.  The change provides temporal fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.

PR:	216976
Reported and tested by:	ngie
Sponsored by:	The FreeBSD Foundation
2017-02-11 20:27:39 +00:00
Edward Tomasz Napierala
69cdfcef2e Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by:	ed, dchagin, kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9378
2017-02-06 20:57:12 +00:00
Konstantin Belousov
5fca242374 Style, use tab after #define.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-02-04 19:16:19 +00:00
Alan Cox
8a99f1cc59 Over the years, the code and comments in vm_page_startup() have diverged in
one respect.  When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory.  This revision
changes the code to match the comments.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9081
2017-02-04 05:23:10 +00:00
Edward Tomasz Napierala
a6b15641d6 Ifdef out the unused vm_rr_selectdomain().
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-02 17:44:55 +00:00
Mark Johnston
aa3650ea36 Avoid page lookups in the top-level object in vm_object_madvise().
We can iterate over consecutive resident pages in the top-level object
using the object's page list rather than by performing lookups in the
object radix tree. This extends one of the optimizations in r312208 to the
case where a shadow chain is present.

Suggested by:	alc
Reviewed by:	alc, kib (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9282
2017-01-30 18:51:43 +00:00
Mateusz Guzik
736ff8c396 hwpmc: partially depessimize munmap handling if the module is not loaded
HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the
common (module not loaded) case.

In particular this avoids permission checks + lock downgrade
singlethreaded and in cases were an executable mapping is found the pmc
sx lock is no longer bounced.

Note this is a band aid.

MFC after:	1 week
2017-01-24 22:00:16 +00:00
Mark Johnston
c2655a40a7 Avoid unnecessary page lookups in vm_object_madvise().
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9098
2017-01-15 03:50:08 +00:00
Gleb Smirnoff
4f56243aad Fix the contiguity once more. 2017-01-12 20:26:02 +00:00
Mark Johnston
8d65cba217 Remove a redundant use of min().
Reported by:	rpokala
X-MFC With:	r311346
2017-01-05 03:13:45 +00:00
Mark Johnston
ec492b13f1 Add a small allocator for exec_map entries.
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8921
2017-01-05 01:44:12 +00:00
Gleb Smirnoff
1e0c121f3a Fix assertion that checks that pages are consecutive to properly
handle bogus_page insertion(s).
2017-01-04 22:31:09 +00:00
Gleb Smirnoff
bfc8c24c73 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
Mark Johnston
b1fd102ee7 Add a page queue for holding dirty anonymous unswappable pages.
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by:	alc
2017-01-03 00:05:44 +00:00
Justin Hibbits
b5345ef10e Print flags in hex instead of decimal.
Hex is easier to grok for flags, and consistent with other prints.
2017-01-02 16:50:52 +00:00
Konstantin Belousov
1569205f0a Style fixes for vm_map_insert().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-01 18:49:46 +00:00
Konstantin Belousov
03302b1380 Ansify vm/vm_pager.c. Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-31 19:30:22 +00:00
Mateusz Guzik
6ff51a3685 Use vrefact in vnode_pager_alloc. 2016-12-31 10:37:56 +00:00
Konstantin Belousov
7a432b84e8 Fix two similar bugs in the populate vm_fault() code.
If pager' populate method succeeded, but other thread raced with us
and modified vm_map, we must unbusy all pages busied by the pager,
before we retry the whole fault handling.  If pager instantiated more
pages than fit into the current map entry, we must unbusy the pages
which are clipped.

Also do some refactoring, clarify comments and use more clear local
variable names.

Reported and tested by:	kargl, subbsd@gmail.com (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-30 18:55:33 +00:00
Konstantin Belousov
0c8bd6a7d8 Assert that the pages found on the object queue by vm_page_next() and
vm_page_prev() have correct ownership.

In collaboration with:	alc
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2016-12-30 17:37:06 +00:00
Konstantin Belousov
9a4ee196dd Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-30 13:04:43 +00:00
Mateusz Guzik
0b3b55a0f2 Remove cpu_spinwait after seq_consistent.
It does not add any benefit as the read routine will do it as necessary.
2016-12-30 06:26:17 +00:00
Alan Cox
920da7e4d2 Relax the object type restrictions on vm_page_alloc_contig(). Specifically,
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.

Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc().  Also, eliminate the radix trie lookup performed with the
free page queues lock held.

Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8878
2016-12-28 18:32:13 +00:00
Konstantin Belousov
8b590e9506 Remove redundancy in vmtotal().
There are two instances of inlined unlocks + continue in vmtotal()
switch statements, which are ordinary expressed with break from the
switch case and code after the switch.  Also, the combination of
continue and break statement is redundand.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-26 19:29:04 +00:00
Konstantin Belousov
2e56b64fa4 Fix argument type and microoptimize swp_pager_meta_free().
The count argument natural type if vm_pindex_t, but due to the loop
organization, it has to be signed type to detect the termination
condition.  Replace this logic by using distinguished counter for the
processed pages, and terminate loop when the counter exceeds the
argument.

Completely process one swblock for all relevant indexes instead of
doing relookup in hash when incrementing page index on the loop step.

Do not drop hash mutex around iterations.

Noted and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-24 09:57:31 +00:00
Konstantin Belousov
77d6fd97ef Improve vm_object_scan_all_shadowed() to also check swap backing objects.
As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index.  Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.

Testing shows that number of new succesful scans, enabled by this
addition, is small but non-zero.  When worked out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-18 20:56:14 +00:00
Konstantin Belousov
71057cd207 In swp_pager_meta_free_all(), fix type of the index variable. Style.
Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-16 23:33:37 +00:00
Konstantin Belousov
a1e9a3bba3 Provide introductory description of the default pager.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-14 23:36:32 +00:00
Konstantin Belousov
a41ece0840 Remove locking around accounting initialization of the default object.
The object is not yet fully constructed and must not be available to
other threads.  This makes default_pager_alloc() almost identical to
swap_pager_alloc_init().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-14 23:34:25 +00:00
Alan Cox
3d026d871f Tidy up. Mostly, remove or replace stale comments. Most of the comments
in this file actually described the operation of the swap pager, not the
default pager.  Given that this is the wrong place to discuss the
implementation of the swap pager, it shouldn't come as a surprise that as
the swap pager evolved these comments became increasingly stale.  In
addition, apply some style fixes, like modernizing a few remaining old-
style function definitions.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8781
2016-12-14 17:28:55 +00:00
John Baldwin
a9546a6b17 Use db_lookup_proc() in the DDB 'show procvm' command.
This allows processes to be identified by PID as well as a pointer address.

MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
2016-12-13 19:22:43 +00:00
Alan Cox
3453bca864 Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
Gleb Smirnoff
255003da42 Allow bogus_page to be passed to pager(s). 2016-12-09 21:21:24 +00:00
Mark Johnston
90458813cd Conditionalize PG_CACHE sysctls on COMPAT_FREEBSD11.
Reviewed by:	glebius, imp, jhb
Differential Revision:	https://reviews.freebsd.org/D8736
2016-12-09 18:55:27 +00:00
Konstantin Belousov
ed01d9894e Implement the populate() pager method for phys pager.
It allows to provide configurable agressive prefaulting and useful
hints to page daemon about memory allocations, on faults for pages
managed by phys pager.  In fact, this implementation is superior to
the MAP_SHARED_PHYS hack from my Postgresql paper, while giving
similar benefits of reducing the page faults numbers on SysV shared
memory mappings.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-12-08 11:35:53 +00:00
Konstantin Belousov
c42b43a054 Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it.  It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.

The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion.  VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.

Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied.  Of course, the pages must be compatible
with the object' type.

After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.

The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called.  If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate().  Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.

For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.

KPI designed together with, and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-12-08 11:26:11 +00:00
Konstantin Belousov
dc5401d240 Move map_generation snapshot value into struct faultstate.
Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-08 10:29:41 +00:00
Konstantin Belousov
272cc3c4d0 Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-08 10:28:51 +00:00
Alan Cox
e94965d82e Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
Mark Johnston
43482f897b Use the official spelling for NULL arguments to typed sysctl handlers.
Reported by:	bde
2016-12-07 01:15:10 +00:00
Mark Johnston
77edd8fa00 Provide dummy sysctls for v_cache_count and v_tcached.
Some utilities (notably top(1)) exit if any of their input sysctls don't
exist, and the removal of the above-mentioned PG_CACHE-related sysctls
makes it difficult to run such utilities on different versions of the
kernel without recompiling.

Requested by:	bde
2016-12-06 22:52:45 +00:00
Alan Cox
8804a2b030 Eliminate a stale comment; vm_radix_prealloc() was replaced in r254141.
MFC after:	3 days
2016-12-02 16:29:30 +00:00
Alan Cox
563a19d546 During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was
called to allocate a new page of radix trie nodes, there could be a call to
vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress
vm_radix_insert().  With the removal of PG_CACHED pages, we can simplify
vm_radix_insert() and vm_radix_remove() by removing the flags on the root of
the trie that were used to detect this case and the code for restarting
vm_radix_insert() when it happened.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8664
2016-12-01 17:26:37 +00:00
Alan Cox
ba67369628 Recursion on the free page queue mutex occurred when UMA needed to allocate
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache().  Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc().  This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.

Improve nearby comments.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8628
2016-11-27 01:42:53 +00:00
Mark Johnston
99e6e1930c Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
Alan Cox
bba39b9ae3 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
Gleb Smirnoff
e48b82bd83 - If caller specifies readbehind and readahead that together with count
doesn't fit into a buf, then trim readbehind and readahead evenly.  If
  rbehind was limited by the previous BMAP, then roundup its trim to
  block size.
- Add KASSERT to check that b_blkno has proper offset from original
  blkno returned by BMAP. [1]
- Add KASSERT to check that pages in buf are consecutive.

Reviewed by:	kib
Submitted by:	kib [1]
2016-11-17 20:32:32 +00:00
Konstantin Belousov
41ddec83c1 Move the fast fault path into the separate function.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-16 16:34:17 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Alan Cox
ebcddc7217 Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue.  A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them.  The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time.  In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low.  Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system.  Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages.  This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s).  A request is raised by setting the variable
vm_laundry_request and waking the laundry thread.  We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering").  When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering.  If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s.  Otherwise, the laundry thread goes back
to sleep without doing any work.  When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target.  In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced.  In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache().  The new meaning
of vm_cnt.v_reactivated now better reflects its name.  It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with:	markj
Reviewed by:	kib
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Konstantin Belousov
1771e987ca Do not sleep in vm_wait() if pagedaemon did not yet started. Panic instead.
Requests which cannot be satisfied by allocators at boot time often
have unrealizable parameters.  Waiting for the pagedaemon' start would
hang the boot if done in the thread0 context and just never succeed if
executed from another thread.  In fact, for very early stages, sleep
attempt panics with obscure diagnostic about the scheduler state, and
explicit panic in vm_wait() makes the investigation much shorter by
cut off the examination of the thread and scheduler.

Theoretically, some subsystem might grab a resource to exhaustion, and
free it later in the boot process.  If this unlikely scenario does
appear for real, the way to diagnose the trouble can be revisited.

Reported by:	emaste
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8421
2016-11-04 12:58:50 +00:00
Alan Cox
857025056f In vm_fault()'s loop over the shadow chain, move a comment describing our
invariants to a better place.  Also, add two comments concerning the
relationship between the map and vnode locks.

Reviewed by:	kib
MFC after:	3 days
2016-11-03 16:44:55 +00:00
Alan Cox
dda4d36957 Move and revise a comment about the relation between the object's paging-
in-progress count and the vnode.  Prior to r188331, we always acquired
the vnode lock before incrementing the object's paging-in-progress count.
Now, we increment it before attempting to acquire the vnode lock with
LK_NOWAIT, but we never sleep acquiring the vnode lock while we have the
count incremented.

Reviewed by:	kib
MFC after:	3 days
2016-11-01 17:11:10 +00:00
Conrad Meyer
8532d381a9 Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8366
2016-10-31 23:09:52 +00:00
Konstantin Belousov
e26236e9f3 Change remained internal uses of boolean_t to bool in vm/vm_fault.c.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-10-30 20:39:38 +00:00
Konstantin Belousov
1dcadc022f Remove vm_pager_has_page() declaration. It is not too useful since
static inline definition appears later in the file.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-10-30 20:38:57 +00:00
Alan Cox
f994b2077b Merge and sort vm_fault_hold()'s "int" variable definitions.
Reviewed by:	kib
MFC after:	7 days
2016-10-30 19:15:59 +00:00
Konstantin Belousov
022dfd690c Remove vnode_locked label and goto, by collapsing vp calculation into
the conditional.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-10-30 18:05:18 +00:00
Konstantin Belousov
1be02479be Split long line instead of unindenting it. Add KASSERT() verifying
that a device object with the same handle has the same ops vector.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-10-30 18:04:11 +00:00
Alan Cox
cd8a6fe8e9 The "lookup_is_valid" field is used as a "bool". Make it one.
Convert vm_fault_hold()'s Boolean variables that are only used
internally to "bool".  Add a comment describing why the one
remaining "boolean_t" was not converted.

Reviewed by:	kib
MFC after:	8 days
2016-10-29 21:01:49 +00:00
Alan Cox
320023e286 With one exception, "hardfault" is used like a "bool". Change that
exception and make it a "bool".

Reviewed by:	kib
MFC after:	7 days
2016-10-29 19:22:38 +00:00
Mark Johnston
a9ee028d04 Add one more use of unlock_vp().
Discussed with:	kib
X-MFC With:	r308094
2016-10-29 18:47:28 +00:00
Konstantin Belousov
cfabea3d3a Add unlock_vp() helper.
Trim space.

Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-10-29 18:03:29 +00:00
Mark Johnston
829be5168d Simplify keg_drain() a bit by using LIST_FOREACH_SAFE.
MFC after:	1 week
2016-10-20 23:10:27 +00:00
Gleb Smirnoff
dcc0ff5a52 Fix incorrect assertion that could miss overflows.
Reviewed by:	kib
2016-10-19 19:50:09 +00:00
Konstantin Belousov
230afe0be6 If vm_fault_hold(9) finds that fs.m is wired, do not free it after a
pager error, leave the page to the wire owner.  E.g. the page might be
a part of the invalidated buffer.

Reported and tested by:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8197
2016-10-17 08:17:06 +00:00
Konstantin Belousov
bd9546a21c Export vm_page_xunbusy_maybelocked().
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D8197
2016-10-17 08:14:23 +00:00
Mark Johnston
eb17fb15b3 Plug a potential vnode lock leak in vm_fault_hold().
Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8242
2016-10-13 20:39:34 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Konstantin Belousov
267ed8e2f7 When downgrading exclusively busied page to shared-busy state, wakeup
waiters.  Otherwise, owners of the shared-busy state are left blocked
and might get into a deadlock.

Note that the vm_page_busy_downgrade() function is not used in the
tree right now.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8195
2016-10-11 18:09:37 +00:00
Alan Cox
70cf3ced3c Make the page daemon's notion of what kind of pass is being performed
by vm_pageout_scan() local to vm_pageout_worker().  There is no reason
to store the pass in the NUMA domain structure.

Reviewed by:	kib
MFC after:	3 weeks
2016-10-05 17:32:06 +00:00
Alan Cox
e57dd910e6 Change vm_pageout_scan() to return a value indicating whether the free page
target was met.

Previously, vm_pageout_worker() itself checked the length of the free page
queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan
freed enough pages to meet the free page target.  Specifically,
vm_pageout_worker() used vm_paging_needed().  The trouble with
vm_paging_needed() is that it compares the length of the free page queues to
the wakeup threshold for the page daemon, which is much lower than the free
page target.  Consequently, vm_pageout_worker() could conclude that the
inactive queue scan succeeded in meeting its free page target when in fact
it did not; and rather than immediately triggering an all-out laundering
pass over the inactive queue, vm_pageout_worker() would go back to sleep
waiting for the free page count to fall below the page daemon wakeup
threshold again, at which point it will perform another limited (pass == 1)
scan over the inactive queue.

Changing vm_pageout_worker() to use vm_page_count_target() instead of
vm_paging_needed() won't work because any page allocations that happen
concurrently with the inactive queue scan will result in the free page count
being below the target at the end of a successful scan.  Instead, having
vm_pageout_scan() return a value indicating success or failure is the most
straightforward fix.

Reviewed by:	kib, markj
MFC after:	3 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8111
2016-10-05 16:15:26 +00:00
Andrew Gallatin
edb2994a62 Conditionally move initial vfs bio alloc above 4G
On machines with just the wrong amount of physical memory (enough to
have a lot of bufs, but not enough to use VM_FREELIST_DMA32) it is
possible for 32-bit address limited devices to have little to no
memory left when attaching, due to potentially large vfs bio configs
consuming all memory below 4GB not protected by VM_FREELIST_ISADMA.
This causes the 32-bit devices to allocate from VM_FREELIST_ISADMA,
leaving that freelist emtpy when ISA devices need DMAable memory.

Rather than decrease VM_DMA32_NPAGES_THRESHOLD, use the time honored
technique of putting initially allocated kernel data structs
at the end (or at least not the beginning) of memory.

Since this allocation is done at boot and is wired, is not freed,
so the system is low on 32-bit (and ISA) dma'ble memory forever.
So it is a good candidate to move above 4GB.

While here, remove an unneeded round_page() from kmem_malloc's size
argument as suggested by alc.  The first thing kmem_malloc() does
is a round_page(size), so there is no need to do it before the call.

Reviewed by: alc
Sponsored by: Netflix
2016-10-03 13:23:43 +00:00
Alan Cox
8cb0c1029d Various changes to pmap_ts_referenced()
Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and
into vm/pmap.h, and describe what its purpose is.  Eliminate the archaic
"XXX" comment about its value.  I don't believe that its exact value, e.g.,
5 versus 6, matters.

Update the arm64 and riscv pmap implementations of pmap_ts_referenced()
to opportunistically update the page's dirty field.

On amd64, use the PDE value already cached in a local variable rather than
dereferencing a pointer again and again.

Reviewed by:	kib, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7836
2016-09-10 16:49:25 +00:00
Mark Johnston
dd9cb6da0b Respect the caller's hints when performing swap readahead.
The pager getpages interface allows the caller to bound the number of
readahead and readbehind pages, and vm_fault_hold() makes use of this
feature. These bounds were ignored after r305056, causing the swap pager
to potentially page in more than the specified number of pages.

Reported and reviewed by:	alc
X-MFC with:	r305056
2016-09-04 00:25:49 +00:00
Mark Johnston
dbbaf04f1e Remove support for idle page zeroing.
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by:	alc, imp, kib
Differential Revision:	https://reviews.freebsd.org/D7714
2016-09-03 20:38:13 +00:00
Konstantin Belousov
9815066425 Make swapoff reliable.
The swap_pager_swapoff() function uses trylock for the object lock
before pagein, which means that either i/o to md(4) over swap, or
intensive page faults over swap pager objects might prevent swapoff()
from making any progress. Then the retry < 100 check fails and machine
panics.

If trylock fails, acquire the object lock in the blockable way and
restart the hash bucket walk.  Keep retries logic for now.

Reported and tested by:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7688
2016-08-31 14:49:58 +00:00
Mark Johnston
915d1b71cd Restore swap pager readahead after r292373.
The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7677
2016-08-30 05:56:21 +00:00
Alan Cox
ce3ee09b53 Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when
neither vm_pager_has_page() nor vm_pager_get_pages() is called.

Reviewed by:	kib, markj
MFC after:	3 weeks
2016-08-14 22:00:45 +00:00
Mark Johnston
842ee21e20 Strengthen assertions about the busy state of newly-allocated pages.
Reviewed by:	alc
MFC after:	1 week
2016-08-13 19:49:32 +00:00
Mark Johnston
fc85a6f0c4 Initialize page busy lock state in vm_phys_add_page().
MFC after:	1 week
2016-08-13 19:48:43 +00:00
Alan Cox
791444089f Correct errors and clean up the comments on the active queue scan.
Eliminate some unnecessary blank lines.

Reviewed by:	kib, markj
MFC after:	1 week
2016-08-12 03:22:58 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Alan Cox
f0edf3f806 Correct a spelling error. 2016-08-05 16:44:11 +00:00
Alan Cox
248fe642a7 Clean up the comments and code style in and around vm_pageout_cluster().
In particular, fix factual, grammatical, and spelling errors in various
comments, and remove comments that are out of place in this function.

Reviewed by:	kib, markj
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7410
2016-08-04 16:20:12 +00:00
Konstantin Belousov
0c657d22eb Explain why swapgeom_close_ev() is delegated.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-03 07:11:19 +00:00
Alan Cox
87ff568c26 Restore the historical behavior of "sysctl vm.swap_idle_enabled=1". Prior
to r254304, we had separate functions for reclamation and laundering
(vm_pageout_scan) versus updating usage information, i.e., "reference
bits", on active pages (vm_pageout_page_stats), and we only performed
vm_req_vmdaemon(VM_SWAP_IDLE) if vm_pages_needed was true.  However, since
r254303, if vm_swap_idle_enabled was "1", we have performed
vm_req_vmdaemon(VM_SWAP_IDLE) regardless of whether we are short of free
pages.  This was unintended and too aggressive, so I suspect no one uses
this feature.  With this change, we restore the historical behavior and
only perform vm_req_vmdaemon(VM_SWAP_IDLE) when we are short of free
pages.

Reviewed by:	kib, markj
2016-08-01 17:25:07 +00:00
Mark Johnston
897d0c6617 Use vm_page_undirty() instead of manually setting a page field.
Reviewed by:	alc
MFC after:	3 days
2016-07-29 21:05:37 +00:00
Alan Cox
793172ea88 Remove a probe declaration that has been unused since r292469, when
vm_pageout_grow_cache() was replaced.

MFC after:	3 days
2016-07-29 16:43:51 +00:00
Alan Cox
f095d1bbc7 Remove any mention of cache (PG_CACHE) pages from the comments in
vm_pageout_scan().  That function has not cached pages since r284376.

MFC after:	3 days
2016-07-28 22:30:48 +00:00
Konstantin Belousov
88ad2d7b47 Do not delegate a work to geom event thread which can be done inline.
In particular, swapongeom_ev() needed event thread context when swap
pager configuration was performed under Giant and geom asserted that
Giant is not owned.  Now both of the reason went away.

On the other hand, note that swpageom_release() is called from the
bio_done context, and possible close cannot be performed inline.

Also fix some minor issues.  The swapgeom() function does not use the
td argument, remove it.  Recheck that the vnode passed is still VCHR
and not reclaimed after the lock.

Reviewed by:	mav
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-28 15:57:01 +00:00
Konstantin Belousov
2174a0c607 Fix style and typo.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-28 15:49:51 +00:00
Mark Johnston
3ac8f842ea De-pluralize "queues" where appropriate in the pagedaemon code.
MFC after:	1 week
2016-07-27 17:11:03 +00:00
Alan Cox
a766ffd061 Update a comment to reflect r284376.
MFC after:	3 days
2016-07-27 03:49:00 +00:00
Mark Johnston
44be0a8ea5 Correct a comment - each page queue has its own lock.
Reviewed by:	alc
MFC after:	3 days
2016-07-23 21:03:25 +00:00
Mark Johnston
efe1ff4cf0 Update a comment in vm_page_advise() to match behaviour after r290529.
Reviewed by:	alc
MFC after:	3 days
2016-07-23 21:02:36 +00:00
Alan Cox
8d67b8c863 Add a comment describing the 'fast path' that was introduced in r270011.
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 17:20:22 +00:00
Mark Johnston
afa5d70339 Release the second critical section in uma_zfree_arg() slightly earlier.
It is only needed when removing a full bucket from the per-CPU cache. The
bucket cache (uz_buckets) is protected by the zone mutex and thus the
critical section can be released before inserting into that list.

MFC after:	1 week
2016-07-20 01:01:50 +00:00
Mark Johnston
20c58db95a Make vm_pageout_wakeup_thresh a u_int rather than an int.
It's a threshold for v_free_count, which is of type u_int. This also lets
us get rid of a cast in vm_paging_needed().

Reviewed by:	alc
MFC after:	1 week
2016-07-20 00:09:22 +00:00
Alan Cox
0c3a489325 Break up vm_fault()'s implementation of the read-ahead and delete-behind
optimizations into two distinct pieces.  The first piece consists of the
code that should only be performed once per page fault and requires the map
to be locked.  The second piece consists of the code that should be
performed each time a pager is called on an object in the shadow chain.
(This second piece expects the map to be unlocked.)

Previously, the entire implementation could be executed multiple times.
Moreover, the second and subsequent executions would occur with the map
unlocked.  Usually, the ensuing unsynchronized accesses to the map were
harmless because the map was not changing.  Nonetheless, it was possible for
a use-after-free error to occur, where vm_fault() wrote to a freed map
entry.  This change corrects that problem.

Reported by:	avg
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2016-07-18 04:20:26 +00:00
Konstantin Belousov
19efd8a5a8 In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
Robert Watson
0df4264748 When mmap(2) is used with a vnode, capture vnode attributes in the
audit trail.  This was not required for Common Criteria auditing
(which requires only that the intent to read or write be audited
at the time of open(2)), but is useful for contemporary live
analysis and forensics.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 11:49:10 +00:00
Robert Watson
51d1f69069 Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2).  This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 08:04:02 +00:00
Alan Cox
381b724280 Change the type of the map entry's next_read field from a vm_pindex_t to a
vm_offset_t.  (This field is used to detect sequential access to the virtual
address range represented by the map entry.)  There are three reasons to
make this change.  First, a vm_offset_t is smaller on 32-bit architectures.
Consequently, a struct vm_map_entry is now smaller on 32-bit architectures.
Second, a vm_offset_t can be written atomically, whereas it may not be
possible to write a vm_pindex_t atomically on a 32-bit architecture.  Third,
using a vm_pindex_t makes the next_read field dependent on which object in
the shadow chain is being read from.

Replace an "XXX" comment.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	EMC / Isilon Storage Division
2016-07-07 20:58:16 +00:00
Colin Percival
34caa842a4 Autotune the number of pages set aside for UMA startup based on the number
of CPUs present.  On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory:  UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg.  If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64).  Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger.  It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on:	EC2 x1.32xlarge
Relnotes:	FreeBSD can now boot on 92+ CPU systems without requiring
		vm.boot_pages to be manually adjusted.
Reviewed by:	jeff, alc, adrian
Approved by:	re (kib)
2016-07-07 18:37:12 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Konstantin Belousov
90880a1b29 Clarify the vnode_destroy_vobject() logic handling for already terminated
objects.

Assert that there is no new waiters for the already terminated objects.
Old waiters should have been notified by the termination calling
vnode_pager_dealloc() (old/new are with regard of the lock acquisition
interval).

Only clear the vp->v_object for the case of already terminated object,
since other branches call vnode_pager_dealloc(), which should clear
the pointer.  Assert this.

Tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-07-05 11:21:02 +00:00
Konstantin Belousov
3f1c66b8d2 Change type of the 'dead' variable to boolean.
Requested by:	alc
MFC after:	1 week
Approved by:	re (gjb)
2016-07-03 00:08:17 +00:00
Konstantin Belousov
725441f69b If the vm_fault() handler raced with the vm_object_collapse()
sleepable scan, iteration over the shadow chain looking for a page
could find an OBJ_DEAD object.  Such state of the mapping is only
transient, the dead object will be terminated and removed from the
chain shortly.  We must not return KERN_PROTECTION_FAILURE unless the
object type is changed to OBJT_DEAD in the chain, indicating that
paging on this address is really impossible.  Returning
KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered
to processes, or kernel accesses to UVA spuriously failing with
EFAULT.

If the object with OBJ_DEAD flag is found, only return
KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD.
Otherwise, sleep a tick and retry the fault handling.

Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by
waiting until the paging on this object is finished.  But to do so, we
need to reference the dead object, while vm_object_collapse() insists
on owning the final reference on the collapsed object.  This could be
fixed by e.g. changing the assert to shared reference release between
vm_fault() and vm_object_collapse(), but it seems to be too much
complications for rare boundary condition.

PR:	204426
Tested by:    pho
Reviewed by:  alc
Sponsored by: The FreeBSD Foundation
X-Differential revision:	https://reviews.freebsd.org/D6085
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-27 21:54:19 +00:00
Konstantin Belousov
35e8002c58 In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no
waiters exist, same as for vm_page_xunbusy().  If previous value of
busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is
not needed.

Move common code from vm_page_xunbusy_maybelocked() and
vm_page_xunbusy_hard() to vm_page_xunbusy_locked().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-23 08:28:13 +00:00
Konstantin Belousov
505cd5d13b Add a comment noting locking regime for vm_page_xunbusy().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-23 08:27:38 +00:00
Konstantin Belousov
95e2409a33 Fix a LOR between vnode locks and allproc_lock.
There is an order between covered vnode lock and allproc_lock, which
is established by calling mountcheckdirs() while owning the covered
vnode lock. mountcheckdirs() iterates over the processes, protected by
allproc_lock.  This order is needed and seems to be not avoidable.

On the other hand, various VM daemons also need to iterate over all
processes, and they lock and unlock user maps.  Since unlock of the
user map may trigger processing of the deferred map entries, it causes
vnode locking to occur.  Or, when vmspace is freed, dropping references
on the vnode-backed object also lock vnodes.  We get reverted order
comparing with the mount/unmount order.

For VM daemons, there is no need to own allproc_lock while we operate
on vmspaces. If the process is held, it serves as the marker for
allproc list, which allows to continue the iteration.

Add _PHOLD_LITE() macro, similar to _PHOLD(), but not causing swap-in
of the kernel stacks.  It is used instead of _PHOLD() in vm code,
since e.g. calling faultin() in OOM conditions only exaggerates the
problem.

Modernize comment describing PHOLD.

Reported by:	lists@yamagi.org
Tested by:	pho (previous version)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 week
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D6679
2016-06-22 20:15:37 +00:00
Konstantin Belousov
d3b9828d0d The vmtotal sysctl handler marks active vm objects to calculate
statistics.  Marking is done by setting the OBJ_ACTIVE flag.  The
flags change is locked, but the problem is that many parts of system
assume that vm object initialization ensures that no other code could
change the object, and thus performed lockless.  The end result is
corrupted flags in vm objects, most visible is spurious OBJ_DEAD flag,
causing random hangs.

Avoid the active object marking, instead provide equally inexact but
immutable is_object_alive() definition for the object mapped state.

Avoid iterating over the processes mappings altogether by using
arguably improved definition of the paging thread as one which sleeps
on the v_free_count.

PR:	204764
Diagnosed by:	pho
Tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2016-06-21 17:49:33 +00:00
Konstantin Belousov
eb4d6a1b3b Fix inconsistent locking of the swap pager named objects list.
Right now, all modifications of the list are locked by sw_alloc_mtx.
But initial lookup of the object by the handle in swap_pager_alloc()
is not protected by sw_alloc_mtx, which means that
vm_pager_object_lookup() could follow freed pointer.

Create a new named swap object with the OBJT_SWAP type, instead
of OBJT_DEFAULT.  With this change, swp_pager_meta_build() never need
to upgrade named OBJT_DEFAULT to OBJT_SWAP (in the other place, we do
not forbid for client code to create named OBJT_DEFAULT objects at
all).

That change allows to remove sw_alloc_mtx and make the list locked by
sw_alloc_sx lock.  Update swap_pager_copy() to new locking mode.

Create helper swap_pager_alloc_init() to consolidate named and
anonymous swap objects creation, while a caller ensures that the
neccesary locks are held around the helper.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (hrs)
2016-06-13 03:42:46 +00:00
Konstantin Belousov
1571927369 Explicitely initialize sw_alloc_sx. Currently it is not initialized
but works due to zeroed out bss on startup.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
2016-06-13 03:39:16 +00:00
Mark Johnston
0a1dc6e23c Reset the page busy lock state after failing to insert into the object.
Freeing a shared-busy page is not permitted.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D6670
2016-06-02 17:11:24 +00:00
Mark Johnston
e705296958 Don't preserve the page's object linkage in vm_page_insert_after().
Per the KASSERT at the beginning of the function, we expect that the page
does not belong to any object, so its object and pindex fields are
meaningless. Reset them in the rare case that vm_radix_insert() fails.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D6669
2016-06-02 16:58:47 +00:00
Mark Johnston
bc9d08e1cf Fix memguard(9) in kernels with INVARIANTS enabled.
With r284861, UMA zones use the trash ctor and dtor by default. This is
incompatible with memguard, which frees the backing page when the item
is freed. Modify the UMA debug functions to be no-ops if the item was
allocated from memguard. This also fixes constructors such as
mb_ctor_pack(), which invokes the trash ctor in addition to performing
some initialization.

Reviewed by:	glebius
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D6562
2016-06-01 22:31:35 +00:00
Konstantin Belousov
e5f0191f20 If the fast path unbusy in vm_page_replace() fails, slow path needs to
acquire the page lock, which recurses.  Avoid the recursion by reusing
the code from vm_page_remove() in a new helper
vm_page_xunbusy_maybelocked().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2016-06-01 20:39:00 +00:00
Konstantin Belousov
9f790a1756 Do not leak the vm object lock when swap reservation failed, in
vm_object_coalesce().

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-29 15:46:19 +00:00
Alan Cox
56ce06907c The flag "vm_pages_needed" has long served two distinct purposes: (1) to
indicate that threads are waiting for free pages to become available and
(2) to indicate whether a wakeup call has been sent to the page daemon.
The trouble is that a single flag cannot really serve both purposes, because
we have two distinct targets for when to wakeup threads waiting for free
pages versus when the page daemon has completed its work.  In particular,
the flag will be cleared by vm_page_free() before the page daemon has met
its target, and this can lead to the OOM killer being invoked prematurely.
To address this problem, a new flag "vm_pageout_wanted" is introduced.

Discussed with:	jeff
Reviewed by:	kib, markj
Tested by:	markj
Sponsored by:	EMC / Isilon Storage Division
2016-05-27 19:15:45 +00:00
Alan Cox
bccdea450b Use vm_page_replace_checked() instead of vm_page_rename() for implementing
optimized copy-on-write faults.  This has two advantages: (1) one less radix
tree operation is performed and (2) vm_page_replace_checked() cannot fail,
making the code simpler.

Submitted by:	Ryan Libby
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4478
2016-05-27 06:05:12 +00:00
Konstantin Belousov
aa9bc3b171 Prevent parallel object collapses. Both vm_object_collapse_scan() and
swap_pager_copy() might unlock the object, which allows the parallel
collapse to execute.  Besides destroying the object, it also might
move the reference from parent to the backing object, firing the
assertion ref_count == 1.

Collapses are prevented by bumping paging_in_progress counters on both
the object and its backing object.

Reported by:	cem
Tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D6085
2016-05-26 16:59:29 +00:00
Konstantin Belousov
98f139daef Style changes to some most outrageous violations in vm_object_collapse().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-26 16:51:38 +00:00
Konstantin Belousov
0e38422096 In vm_page_cache(), only drop the vnode after radix insert failure
for empty page cache when the object type if OBJT_VNODE.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-24 19:20:30 +00:00
Konstantin Belousov
30a8a5f7a6 In vm_page_alloc_contig(), on vm_page_insert() failure, mark each
freed page as VPO_UNMANAGED.  Otherwise vm_pge_free_toq() insists on
owning the page lock.

Previously, VPO_UNMANAGED was only set up to the last processed page.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-24 10:21:39 +00:00
Konstantin Belousov
9a2047083f Remove Giant around allocation of the swap pager with non-NULL handle.
Existing issue of not protecting pager_object_list iteration in
vm_pager_object_lookup() by sw_alloc_mtx is not affected by Giant
removal.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2016-05-24 10:16:03 +00:00
Alan Cox
10b4196bd0 Correct an error in a comment: One of the conditions for page allocation
is actually the opposite of that stated in the comment.

Remove an unnecessary assignment.  Use an assertion to document the fact
that no assignment is needed.

Rewrite another comment to clarify that the page is not completely valid.

Reviewed by:	kib
2016-05-23 16:59:05 +00:00
Konstantin Belousov
4c36e917b2 Mark swap-related proc sysctls as not requiring Giant.
Reviewed by:	alc (as part of larger patch)
Sponsored by:	The FreeBSD Foundation
2016-05-22 23:28:23 +00:00
Konstantin Belousov
04533e1ef7 Replace hand-made exclusive lock, protecting against parallel
swapon/swapoff invocations, with sx.

Reviewed by:	alc (as part of larger patch)
Sponsored by:	The FreeBSD Foundation
2016-05-22 23:25:01 +00:00
Konstantin Belousov
a525fd17cd Remove false claim. Giant is dropped by mi_startup() before passing
the control to swapper.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-05-22 19:25:53 +00:00
Alan Cox
6753423ccb When descending a shadow chain of objects, it makes no sense to update
the current offset (spelled: "fs.pindex") until it is known whether a
backing object exists.  In fact, if not for the fact that the backing
object offset is zero when there is no backing object, this update would
produce a broken offset.

Reviewed by:	kib
2016-05-21 23:18:23 +00:00
John Baldwin
cc981af204 Add new bus methods for mapping resources.
Add a pair of bus methods that can be used to "map" resources for direct
CPU access using bus_space(9).  bus_map_resource() creates a mapping and
bus_unmap_resource() releases a previously created mapping.  Mappings are
described by 'struct resource_map' object.  Pointers to these objects can
be passed as the first argument to the bus_space wrapper API used for bus
resources.

Drivers that wish to map all of a resource using default settings
(for example, using uncacheable memory attributes) do not need to change.
However, drivers that wish to use non-default settings can now do so
without jumping through hoops.

First, an RF_UNMAPPED flag is added to request that a resource is not
implicitly mapped with the default settings when it is activated.  This
permits other activation steps (such as enabling I/O or memory decoding
in a device's PCI command register) to be taken without creating a
mapping.  Right now the AGP drivers don't set RF_ACTIVE to avoid using
up a large amount of KVA to map the AGP aperture on 32-bit platforms.
Once RF_UNMAPPED is supported on all platforms that support AGP this
can be changed to using RF_UNMAPPED with RF_ACTIVE instead.

Second, bus_map_resource accepts an optional structure that defines
additional settings for a given mapping.

For example, a driver can now request to map only a subset of a resource
instead of the entire range.  The AGP driver could also use this to only
map the first page of the aperture (IIRC, it calls pmap_mapdev() directly
to map the first page currently).  I will also eventually change the
PCI-PCI bridge driver to request mappings of the subset of the I/O window
resource on its parent side to create mappings for child devices rather
than passing child resources directly up to nexus to be mapped.  This
also permits bridges that do address translation to request suitable
mappings from a resource on the "upper" side of the bus when mapping
resources on the "lower" side of the bus.

Another attribute that can be specified is an alternate memory attribute
for memory-mapped resources.  This can be used to request a
Write-Combining mapping of a PCI BAR in an MI fashion.  (Currently the
drivers that do this call pmap_change_attr() directly for x86 only.)

Note that this commit only adds the MI framework.  Each platform needs
to add support for handling RF_UNMAPPED and thew new
bus_map/unmap_resource methods.  Generally speaking, any drivers that
are calling rman_set_bustag() and rman_set_bushandle() need to be
updated.

Discussed on:	arch
Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D5237
2016-05-20 17:57:47 +00:00
Alan Cox
521ddf39cb Clean up the handling of errors from vm_pager_get_pages(). Mostly, this
cleanup consists of fixes to comments.  However, there is one change to
code: Remove special-case handling of errors involving the kernel map.
We do not perform I/O on the kernel map, so there is no need for this
special case.

Reviewed by:	kib (an earlier version)
2016-05-19 19:27:33 +00:00
Conrad Meyer
5a2e650a36 vm/vm_page.h: Fix trivial '-Wpointer-sign' warning
pq_vcnt, as a count of real things, has no business being negative.  It is only
ever initialized by a u_int counter.

The warning came from the atomic_add_int() in vm_pagequeue_cnt_add().

Rectify the warning by changing the variable to u_int.  No functional change.

Suggested by:	Clang 3.3
Sponsored by:	EMC / Isilon Storage Division
2016-05-19 17:54:14 +00:00
Konstantin Belousov
2a339d9e3d Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held.  The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths.  Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive).  Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot.  When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
   pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
   the lifetime of the shared mutex associated with a vnode' page.

Reviewed by:	jilles (previous version, supposedly the objection was fixed)
Discussed with:	brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-05-17 09:56:22 +00:00
John Baldwin
cf8cdda37c Move vm_domain_rr_selectdomain() under #ifdef VM_NUMA_ALLOC.
The function had a null function body in the !VM_NUMA_ALLOC case but
also wasn't called in the !VM_NUMA_ALLOC case.

Suggested by:	ngie
2016-05-10 22:25:55 +00:00
Pedro F. Giffuni
763df3ec55 sys/vm: minor spelling fixes in comments.
No functional change.
2016-05-02 20:16:29 +00:00
Konstantin Belousov
e9d37c9f8d Avoid duplicated calls to pmap_page_get_memattr().
Avoid logging inconsistency for the /dev/mem device at all. The driver
leaves memattr intact, and the corrective action in the device pager
handles it right.

In the logged warning, name the driver we blame, and show memory
attributes values.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D6149
2016-05-01 17:48:43 +00:00
John Baldwin
0ef149024f Don't require write locks on the VM object for vm_page_prev/next.
Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-04-29 17:35:28 +00:00
John Baldwin
164a37a55a Trim redundant message.
WITNESS_WARN() appends "with non-sleepable lock" to the caller's message.

Sponsored by:	Chelsio Communications
2016-04-27 21:51:24 +00:00
Pedro F. Giffuni
b66bb393f2 Cleanup redundant parenthesis from existing howmany()/roundup() macro uses. 2016-04-22 16:57:42 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Pedro F. Giffuni
8dfea46460 Remove slightly used const values that can be replaced with nitems().
Suggested by:	jhb
2016-04-21 15:38:28 +00:00
John Baldwin
62d70a8174 Add more fine-grained kernel options for NUMA support.
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system.  DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective.  Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5782
2016-04-09 13:58:04 +00:00
Edward Tomasz Napierala
ae34b6ff96 Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
Gleb Smirnoff
cfcae3f86f Remove UMA_ZONE_REFCNT feature, now unused.
Blessed by:	jeff
2016-03-01 00:33:32 +00:00
Konstantin Belousov
1bdbd70599 Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
Gleb Smirnoff
b28cc462ad Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
Mark Johnston
3f38e13073 Plug a vm_page leak introduced in r292373.
Reported by:	vangyzen
2016-02-05 19:35:53 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
Konstantin Belousov
ba7c64d17b Typo in comment. 2016-01-24 13:38:41 +00:00
John Baldwin
8a4dc40ff4 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
Alan Cox
477bffbe4d A fix to r292469: Iterate over the physical segments in descending rather
than ascending order in vm_phys_alloc_contig() so that, for example, a
sequence of contigmalloc(low=0, high=4GB) calls doesn't exhaust the supply
of low physical memory resulting in a later contigmalloc(low=0, high=1MB)
failure.

Reported by:	cy
Tested by:	cy
Sponsored by:	EMC / Isilon Storage Division
2016-01-16 04:41:40 +00:00
Adrian Chadd
54de56f3b2 Fix the domain iterator to not try the first-touch / fixed domain
more than once when doing round-robin.

This lead to a panic because the iterator was trying the same domain
twice and not trying one of the other domains.

Reported by: pho
Tested by: pho
2016-01-10 17:53:43 +00:00
Konstantin Belousov
ff64a90ed9 Add missed relpbuf() for a smallfs page-in.
Reported by:	Shawn Webb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-27 14:42:39 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Alan Cox
c869e67208 Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
   memory allocator's free page lists.  This replaces the long-standing
   approach of scanning the inactive and inactive queues, converting clean
   pages into PG_CACHED pages and laundering dirty pages.  In contrast, the
   new mechanism does not use PG_CACHED pages nor does it trigger a large
   number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
   free pages in the physical memory allocator's free page lists that are
   covered by the direct map.  Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
   free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places.  For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases).  Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by:	kib, markj
Glanced at:	jhb
Discussed with:	jeff
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4444
2015-12-19 18:42:50 +00:00
Conrad Meyer
8170d6e52b vm_page_replace: add wrapper to KASSERT about old page
It turns out the callers of vm_page_replace know exactly which page they are
replacing and would like to assert about it.  Change those from hard panics to
KASSERTs, and provide them with a wrapper so they don't have to deal with
warnings from an INVARIANTS-dependent dead store of the return value of
vm_page_replace.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib (earlier version)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4497
2015-12-17 17:48:57 +00:00
Conrad Meyer
dc62d55929 vm_page.h: page busy macro fixups
Minor changes to:
  - delete extraneous trailing semicolons from macro definitions, and
  - correct spelling of "busying" in panic messages

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4577
2015-12-16 23:23:12 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Mark Johnston
d9e2e68d38 Don't make assertions about td_critnest when the scheduler is stopped.
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by:	kib
Reported by:	pho
2015-12-11 20:05:07 +00:00
Conrad Meyer
5e09bdc821 vm_page_replace: remove redundant radix lookup
Remove redundant lookup of the old page from vm_page_replace.  Verification
that the old page exists is already done by vm_radix_replace.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Follow-up to:	https://reviews.freebsd.org/D4326
Differential Revision:	https://reviews.freebsd.org/D4471
2015-12-10 22:57:27 +00:00
Conrad Meyer
6fee422ed5 vm_fault_hold: handle vm_page_rename failure
On vm_page_rename failure, fix a missing object unlock and a double free of
a page.

First remove the old page, then rename into other page into first_object,
then free the old page.  This avoids the problem on rename failure.  This is
a little ugly but seems to be the most straightforward solution.

Tested with:
  $ sysctl debug.fail_point.uma_zalloc_arg="1%return"
  $ kyua test -k /usr/tests/sys/Kyuafile

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	kib
Seen by:	alc
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4326
2015-12-06 17:46:12 +00:00
Conrad Meyer
4cc8daf782 Pull vm_object_scan_all_shadowed out of vm_object_backing_scan
These two functions were largely unrelated, they just used the same same
loop logic to walk through a backing object's memq.  Pull out the
all_shadowed test as its own function and eliminate
OBSC_TEST_ALL_SHADOWED.  Rename vm_object_backing_scan to
vm_object_collapse_scan.

No functional change.

Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4335
2015-12-03 17:21:10 +00:00
Konstantin Belousov
99a1570a25 r221714 fixed the situation when the collapse scan improperly handled
invalid (busy) page supposedly inserted by the vm_fault(), in the
OBSC_COLLAPSE_NOWAIT case.  As a continuation to r221714, fix a case
when invalid page is found by the object scan in OBSC_COLLAPSE_WAIT
case as well.  But, since this is waitable scan, we should wait for
the termination of the busy state and restart from the beginning of
the backing object' page queue. [*]

Do not free the shadow page swap space when the parent page is
invalid, otherwise this action potentially corrupts user data.

Combine all instances of the collapse scan sleep code fragments into
the new helper vm_object_backing_scan_wait().

Improve style compliance and comments.  Change the return type of
vm_object_backing_scan() to bool.

Initial submission by:	cem, https://reviews.freebsd.org/D4103 [*]
Reviewed by:	alc, cem
Tested by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D4146
2015-12-01 09:06:09 +00:00
Konstantin Belousov
b89def80f5 Minor cleanup.
Systematically use ANSI C functions definitions.
Correct type of the flags argument to the dev_pager_putpages() function.
Use vm_pager_free_nonreq().

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-29 11:37:25 +00:00
Konstantin Belousov
2eb2f0d5e3 In vm_pageout_grow_cache(), do not re-try the inactive queue when
active queue scan initiated write.

Re-trying from the inactive queue when doing active scan makes the
loop never end if number of domains is greater than 1 and inactive or
active scan cannot reach the target.

Reported and tested by:	Andrew Gallatin <gallatin@netflix.com>
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-11-27 19:43:36 +00:00
Alan Cox
67b7e4345e Correct an error in vm_reserv_reclaim_contig(). In the highly unusual
case that the reservation contained "low", the starting position in the
popmap for the free page search was incorrectly calculated.  The most
likely (and visible) symptom of this error was the assertion failure,
"vm_reserv_reclaim_contig: pa is too low".
2015-11-26 19:12:18 +00:00
Konstantin Belousov
9af50b0126 Record proper commit message for r291157.
The r289895 revision did not accounted for the block containing the
requested page, when calculating the run of pages.  Include the pages
before/after the requested page, that fit into the reqblock, into the
calculation.

Noted by:	glebius
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-22 09:50:13 +00:00
Konstantin Belousov
4586820a07 Noted by: glebius
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-22 09:48:03 +00:00
Mark Johnston
7672ca059a Remove unneeded includes of opt_kdtrace.h.
As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h
is not needed when defining SDT(9) probes.
2015-11-22 02:01:01 +00:00
Gleb Smirnoff
09c837b897 Remove remnants of the old NFS from vnode pager.
Reviewed by:	kib
Sponsored by:	Netflix
2015-11-20 23:52:27 +00:00
Jonathan T. Looney
1067a2ba68 Consistently enforce the restriction against calling malloc/free when in a
critical section.

uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zalloc_free().

The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistenly
applied.

This change clarifies the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.

PR:		204633
Differential Revision:	https://reviews.freebsd.org/D4197
Reviewed by:	markj
Approved by:	gnn (mentor)
Sponsored by:	Juniper Networks
2015-11-19 14:04:53 +00:00
Konstantin Belousov
76386c7ecd Rework the test which raises OOM condition. Right now, the code
checks for the swap space consumption plus checks that the amount of
the free pages exceeds some limit, in case pagedeamon did not coped
with the page shortage in one of the late passes.  This is wrong
because it does not account for the presence of the reclamaible pages
in the queues which are not selectable for reclaim immediately.  E.g.,
on the swap-less systems, large active queue easily triggered OOM.

Instead, only raise OOM when pagedaemon is unable to produce a free
page in several back-to-back passes.  Track the failed passes per
pagedaemon thread.

The number of passes to trigger OOM was selected empirically and
tested both on small (32M-64M i386 VM) and large (32G amd64)
configurations.  If the specifics of the load require tuning, sysctl
vm.pageout_oom_seq sets the number of back-to-back passes which must
fail before OOM is raised.  Each pass takes 1/2 of seconds.  Less the
value, more sensible the pagedaemon is to the page shortage.

In future, some heuristic to calculate the value of the tunable might
be designed based on the system configuration and load.  But before it
can be done, the i/o system must be fixed to reliably time-out
pagedaemon writes, even if waiting for the memory to proceed.  Then,
code can account for the in-flight page-outs and postpone OOM until
all of them finished, which should reduce the need in tuning.  Right
now, ignoring the in-flight writes and the counter allows to break
deadlocks due to write path doing sleepable memory allocations.

Reported by:	Dmitry Sivachenko, bde, many others
Tested by:	pho, bde, tuexen (arm)
Reviewed by:	alc
Discussed with:	bde, imp
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-11-16 06:26:26 +00:00
Konstantin Belousov
3949873f7a Do not use vmspace_resident_count() for the OOM process selection.
Residency count track the number of pte entries installed into the
current pmap, which does not reflect the consumption of the physical
memory by the address map.  Due to several mechanisms like pv entries
reclamation, copy on write etc. the resident pte entries count may be
much less than the amount of physical memory kept by the process.

Provide the OOM-specific vm_pageout_oom_pagecount() function which
estimates the amount of reclamaible memory which could be stolen if
the process is killed.

Reported and tested by:	pho
Reviewed by:	alc
Comments text by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-11-16 06:02:11 +00:00
Konstantin Belousov
b98acc0a1b VM daemon works in parallel with the pagedaemon threads, and, among
other actions, swaps out kernel stacks of the processes.  On the other
hand, currentl OOM logic which selects a process to kill in the
critical condition, skips process with swapped-out thread.  Under some
loads, this results in the big(gest) process being ignored by OOM.

Do not skip a process which has inhibited thread due to the swap-out,
in the OOM selection loop.  Note that killing such process requires
the thread stack page-in, but sometimes this is the only way to
recover.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-11-16 05:52:04 +00:00
John Baldwin
645743ea99 Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.

One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.

A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.

The 'pcb_size' variable is used to index into the stoppcbs[] array.

The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.

While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved.  Annotations for
other platforms will be added in the future.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3773
2015-11-12 22:00:59 +00:00
Mark Johnston
7e78597f04 Ensure that deactivated pages that are not expected to be reused are
reclaimed in FIFO order by the pagedaemon.  Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.

Reviewed by:	alc
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-11-08 01:36:18 +00:00
Konstantin Belousov
eac91e326a Reduce the amount of calls to VOP_BMAP() made from the local vnode
pager.  It is enough to execute VOP_BMAP() once to obtain both the
disk block address for the requested page, and the before/after limits
for the contiguous run.  The clipping of the vm_page_t array passed to
the vnode_pager_generic_getpages() and the disk address for the first
page in the clipped array can be deduced from the call results.

While there, remove some noise (like if (1) {...}) and adjust nearby
code.

Reviewed by:	alc
Discussed with:	glebius
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-10-24 21:59:22 +00:00
Jason A. Harmening
7c989c156f Fix capitalization 2015-10-23 12:06:06 +00:00