Commit Graph

3660 Commits

Author SHA1 Message Date
jhb
178ec26e82 Don't require write locks on the VM object for vm_page_prev/next.
Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-04-29 17:35:28 +00:00
jhb
d6559c9525 Trim redundant message.
WITNESS_WARN() appends "with non-sleepable lock" to the caller's message.

Sponsored by:	Chelsio Communications
2016-04-27 21:51:24 +00:00
pfg
b4106812fd Cleanup redundant parenthesis from existing howmany()/roundup() macro uses. 2016-04-22 16:57:42 +00:00
pfg
729533413f sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
pfg
fc65edc1cd Remove slightly used const values that can be replaced with nitems().
Suggested by:	jhb
2016-04-21 15:38:28 +00:00
jhb
6beb82443a Add more fine-grained kernel options for NUMA support.
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system.  DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective.  Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5782
2016-04-09 13:58:04 +00:00
trasz
825d80e01c Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
glebius
b8acb1ed4a Remove UMA_ZONE_REFCNT feature, now unused.
Blessed by:	jeff
2016-03-01 00:33:32 +00:00
kib
e76eb4255b Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
glebius
b3c4f0ddbf Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
markj
f611a65735 Plug a vm_page leak introduced in r292373.
Reported by:	vangyzen
2016-02-05 19:35:53 +00:00
glebius
953ea03018 Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
glebius
8f977b6c5c Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
kib
e840a5e833 Typo in comment. 2016-01-24 13:38:41 +00:00
jhb
897a9bc5d3 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
alc
2c013a8511 A fix to r292469: Iterate over the physical segments in descending rather
than ascending order in vm_phys_alloc_contig() so that, for example, a
sequence of contigmalloc(low=0, high=4GB) calls doesn't exhaust the supply
of low physical memory resulting in a later contigmalloc(low=0, high=1MB)
failure.

Reported by:	cy
Tested by:	cy
Sponsored by:	EMC / Isilon Storage Division
2016-01-16 04:41:40 +00:00
adrian
ef0b7e48a5 Fix the domain iterator to not try the first-touch / fixed domain
more than once when doing round-robin.

This lead to a panic because the iterator was trying the same domain
twice and not trying one of the other domains.

Reported by: pho
Tested by: pho
2016-01-10 17:53:43 +00:00
kib
6cbc48de82 Add missed relpbuf() for a smallfs page-in.
Reported by:	Shawn Webb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-27 14:42:39 +00:00
jtl
94d8d1452b Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
alc
8343c406db Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
   memory allocator's free page lists.  This replaces the long-standing
   approach of scanning the inactive and inactive queues, converting clean
   pages into PG_CACHED pages and laundering dirty pages.  In contrast, the
   new mechanism does not use PG_CACHED pages nor does it trigger a large
   number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
   free pages in the physical memory allocator's free page lists that are
   covered by the direct map.  Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
   free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places.  For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases).  Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by:	kib, markj
Glanced at:	jhb
Discussed with:	jeff
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4444
2015-12-19 18:42:50 +00:00
cem
4508a28d23 vm_page_replace: add wrapper to KASSERT about old page
It turns out the callers of vm_page_replace know exactly which page they are
replacing and would like to assert about it.  Change those from hard panics to
KASSERTs, and provide them with a wrapper so they don't have to deal with
warnings from an INVARIANTS-dependent dead store of the return value of
vm_page_replace.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib (earlier version)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4497
2015-12-17 17:48:57 +00:00
cem
f64eb0a1ef vm_page.h: page busy macro fixups
Minor changes to:
  - delete extraneous trailing semicolons from macro definitions, and
  - correct spelling of "busying" in panic messages

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4577
2015-12-16 23:23:12 +00:00
glebius
63cd1c131a A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
markj
98fd4878e0 Don't make assertions about td_critnest when the scheduler is stopped.
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by:	kib
Reported by:	pho
2015-12-11 20:05:07 +00:00
cem
9328ee7b96 vm_page_replace: remove redundant radix lookup
Remove redundant lookup of the old page from vm_page_replace.  Verification
that the old page exists is already done by vm_radix_replace.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Follow-up to:	https://reviews.freebsd.org/D4326
Differential Revision:	https://reviews.freebsd.org/D4471
2015-12-10 22:57:27 +00:00
cem
f219735807 vm_fault_hold: handle vm_page_rename failure
On vm_page_rename failure, fix a missing object unlock and a double free of
a page.

First remove the old page, then rename into other page into first_object,
then free the old page.  This avoids the problem on rename failure.  This is
a little ugly but seems to be the most straightforward solution.

Tested with:
  $ sysctl debug.fail_point.uma_zalloc_arg="1%return"
  $ kyua test -k /usr/tests/sys/Kyuafile

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	kib
Seen by:	alc
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4326
2015-12-06 17:46:12 +00:00
cem
cdfc8c63b9 Pull vm_object_scan_all_shadowed out of vm_object_backing_scan
These two functions were largely unrelated, they just used the same same
loop logic to walk through a backing object's memq.  Pull out the
all_shadowed test as its own function and eliminate
OBSC_TEST_ALL_SHADOWED.  Rename vm_object_backing_scan to
vm_object_collapse_scan.

No functional change.

Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4335
2015-12-03 17:21:10 +00:00
kib
9a3339cff9 r221714 fixed the situation when the collapse scan improperly handled
invalid (busy) page supposedly inserted by the vm_fault(), in the
OBSC_COLLAPSE_NOWAIT case.  As a continuation to r221714, fix a case
when invalid page is found by the object scan in OBSC_COLLAPSE_WAIT
case as well.  But, since this is waitable scan, we should wait for
the termination of the busy state and restart from the beginning of
the backing object' page queue. [*]

Do not free the shadow page swap space when the parent page is
invalid, otherwise this action potentially corrupts user data.

Combine all instances of the collapse scan sleep code fragments into
the new helper vm_object_backing_scan_wait().

Improve style compliance and comments.  Change the return type of
vm_object_backing_scan() to bool.

Initial submission by:	cem, https://reviews.freebsd.org/D4103 [*]
Reviewed by:	alc, cem
Tested by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D4146
2015-12-01 09:06:09 +00:00
kib
affb6f73cc Minor cleanup.
Systematically use ANSI C functions definitions.
Correct type of the flags argument to the dev_pager_putpages() function.
Use vm_pager_free_nonreq().

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-29 11:37:25 +00:00
kib
d3add9250e In vm_pageout_grow_cache(), do not re-try the inactive queue when
active queue scan initiated write.

Re-trying from the inactive queue when doing active scan makes the
loop never end if number of domains is greater than 1 and inactive or
active scan cannot reach the target.

Reported and tested by:	Andrew Gallatin <gallatin@netflix.com>
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-11-27 19:43:36 +00:00
alc
75944832ce Correct an error in vm_reserv_reclaim_contig(). In the highly unusual
case that the reservation contained "low", the starting position in the
popmap for the free page search was incorrectly calculated.  The most
likely (and visible) symptom of this error was the assertion failure,
"vm_reserv_reclaim_contig: pa is too low".
2015-11-26 19:12:18 +00:00
kib
c95b809b8d Record proper commit message for r291157.
The r289895 revision did not accounted for the block containing the
requested page, when calculating the run of pages.  Include the pages
before/after the requested page, that fit into the reqblock, into the
calculation.

Noted by:	glebius
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-22 09:50:13 +00:00
kib
0ae248d3d0 Noted by: glebius
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-22 09:48:03 +00:00
markj
3e47d7787e Remove unneeded includes of opt_kdtrace.h.
As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h
is not needed when defining SDT(9) probes.
2015-11-22 02:01:01 +00:00
glebius
6460e3db4e Remove remnants of the old NFS from vnode pager.
Reviewed by:	kib
Sponsored by:	Netflix
2015-11-20 23:52:27 +00:00
jtl
f2aa140123 Consistently enforce the restriction against calling malloc/free when in a
critical section.

uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zalloc_free().

The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistenly
applied.

This change clarifies the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.

PR:		204633
Differential Revision:	https://reviews.freebsd.org/D4197
Reviewed by:	markj
Approved by:	gnn (mentor)
Sponsored by:	Juniper Networks
2015-11-19 14:04:53 +00:00
kib
a2d411cbbc Rework the test which raises OOM condition. Right now, the code
checks for the swap space consumption plus checks that the amount of
the free pages exceeds some limit, in case pagedeamon did not coped
with the page shortage in one of the late passes.  This is wrong
because it does not account for the presence of the reclamaible pages
in the queues which are not selectable for reclaim immediately.  E.g.,
on the swap-less systems, large active queue easily triggered OOM.

Instead, only raise OOM when pagedaemon is unable to produce a free
page in several back-to-back passes.  Track the failed passes per
pagedaemon thread.

The number of passes to trigger OOM was selected empirically and
tested both on small (32M-64M i386 VM) and large (32G amd64)
configurations.  If the specifics of the load require tuning, sysctl
vm.pageout_oom_seq sets the number of back-to-back passes which must
fail before OOM is raised.  Each pass takes 1/2 of seconds.  Less the
value, more sensible the pagedaemon is to the page shortage.

In future, some heuristic to calculate the value of the tunable might
be designed based on the system configuration and load.  But before it
can be done, the i/o system must be fixed to reliably time-out
pagedaemon writes, even if waiting for the memory to proceed.  Then,
code can account for the in-flight page-outs and postpone OOM until
all of them finished, which should reduce the need in tuning.  Right
now, ignoring the in-flight writes and the counter allows to break
deadlocks due to write path doing sleepable memory allocations.

Reported by:	Dmitry Sivachenko, bde, many others
Tested by:	pho, bde, tuexen (arm)
Reviewed by:	alc
Discussed with:	bde, imp
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-11-16 06:26:26 +00:00
kib
732006a7d7 Do not use vmspace_resident_count() for the OOM process selection.
Residency count track the number of pte entries installed into the
current pmap, which does not reflect the consumption of the physical
memory by the address map.  Due to several mechanisms like pv entries
reclamation, copy on write etc. the resident pte entries count may be
much less than the amount of physical memory kept by the process.

Provide the OOM-specific vm_pageout_oom_pagecount() function which
estimates the amount of reclamaible memory which could be stolen if
the process is killed.

Reported and tested by:	pho
Reviewed by:	alc
Comments text by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-11-16 06:02:11 +00:00
kib
fae664e9cc VM daemon works in parallel with the pagedaemon threads, and, among
other actions, swaps out kernel stacks of the processes.  On the other
hand, currentl OOM logic which selects a process to kill in the
critical condition, skips process with swapped-out thread.  Under some
loads, this results in the big(gest) process being ignored by OOM.

Do not skip a process which has inhibited thread due to the swap-out,
in the OOM selection loop.  Note that killing such process requires
the thread stack page-in, but sometimes this is the only way to
recover.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-11-16 05:52:04 +00:00
jhb
c1d9f70889 Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.

One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.

A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.

The 'pcb_size' variable is used to index into the stoppcbs[] array.

The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.

While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved.  Annotations for
other platforms will be added in the future.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3773
2015-11-12 22:00:59 +00:00
markj
a110d65fe3 Ensure that deactivated pages that are not expected to be reused are
reclaimed in FIFO order by the pagedaemon.  Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.

Reviewed by:	alc
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-11-08 01:36:18 +00:00
kib
326fa7f350 Reduce the amount of calls to VOP_BMAP() made from the local vnode
pager.  It is enough to execute VOP_BMAP() once to obtain both the
disk block address for the requested page, and the before/after limits
for the contiguous run.  The clipping of the vm_page_t array passed to
the vnode_pager_generic_getpages() and the disk address for the first
page in the clipped array can be deduced from the call results.

While there, remove some noise (like if (1) {...}) and adjust nearby
code.

Reviewed by:	alc
Discussed with:	glebius
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-10-24 21:59:22 +00:00
jah
711ef25019 Fix capitalization 2015-10-23 12:06:06 +00:00
jah
075add2496 Remove unclear comment about address truncation in busdma. Add (hopefully much clearer) comment at declaration of PHYS_TO_VM_PAGE().
Noted by:	avg
2015-10-23 12:03:25 +00:00
kib
9d188e9d5c Only marker is guaranteed to be present on the queue after the relock
in vm_pageout_fallback_object_lock() and vm_pageout_page_lock().  The
check for the m->queue == queue assumes that the page does belong to a
queue.

Modify the 'unchanged' calculation bu dereferencing the marker tailq
pointers, which is known to belong to the queue.  Since for a page m
linked to the queue, m->queue must be equal to the queue index, assert
this instead of checking.

In collaboration with:	alc
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	2 weeks
2015-10-18 09:33:28 +00:00
kib
3b70fdfb1d Revert r289302, invalid pages can be queued, e.g. by vfs_vmio_unwire().
Found by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-10-15 19:07:38 +00:00
kib
176234430e Invalid pages should not appear on the inactive queue. Change the
check into an assertion.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-10-14 09:03:32 +00:00
jeff
4402204d47 Parallelize the buffer cache and rewrite getnewbuf(). This results in a
8x performance improvement in a micro benchmark on a 4 socket machine.

 - Get buffer headers from a per-cpu uma cache that sits in from of the
   free queue.
 - Use a per-cpu quantum cache in vmem to eliminate contention for kva.
 - Use multiple clean queues according to buffer cache size to eliminate
   clean queue lock contention.
 - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
   from blocking or doing direct recycling.
 - Close some bufspace allocation races that could lead to endless
   recycling.
 - Further the transition to a more modern style of small functions grouped
   by prefix in order to improve growing complexity.

Sponsored by:	EMC / Isilon
Reviewed by:	kib
Tested by:	pho
2015-10-14 02:10:07 +00:00
alc
de91cc41c3 Exploit r288122 to avoid pointlessly enqueueing a page that is about to be
freed.

Submitted by:	kmacy
Differential Revision:	https://reviews.freebsd.org/D1674
2015-10-09 03:38:58 +00:00
alc
945e362d2b Exploit r288122 to address a cosmetic issue. Pages belonging to either
the kernel or kmem object can't be paged out.  Since they can't be paged
out, they are never enqueued in a paging queue.  Nonetheless, passing
PQ_INACTIVE to vm_page_unwire() in kmem_unback() creates the appearance
that these pages are being enqueued in the inactive queue.  As of r288122,
we can avoid giving this false impression by passing PQ_NONE.

Submitted by:	kmacy
Differential Revision:	https://reviews.freebsd.org/D1674
2015-10-06 05:49:00 +00:00