Commit Graph

3400 Commits

Author SHA1 Message Date
Jung-uk Kim
fd90e2ed54 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
Konstantin Belousov
2c20bd8b99 Do grammar fix in the comment to record the right commit message for
r283162.

Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq()
with the page not completely satisfying vm_page_free() assertions.
The page is not owned by the object, since insertion failed.  But
besides m->object reset to NULL, we should also set VPO_UNMANAGED flag
for consistency.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:15:56 +00:00
Konstantin Belousov
da47499040 Remove the write-only variable phent. We currently do not check the
size of the program header's entries.

Reported by:	adrian (by using gcc 4.9)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:03:22 +00:00
Konstantin Belousov
9cddade79a Satisfy vm_object uma zone destructor requirements after r282660 when
vnode object creation raced.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2015-05-10 08:21:03 +00:00
Konstantin Belousov
44ec2b63c5 The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon.  The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still.  The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly.  The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-09 20:08:36 +00:00
John Baldwin
e735691b61 Place VM objects on the object list when created and never remove them.
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.

Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero.  This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)

Differential Revision:	https://reviews.freebsd.org/D2423
Reviewed by:	alc, kib (earlier version)
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc.
2015-05-08 19:43:37 +00:00
Adrian Chadd
cf82d8112e oops - how'd i miss this. Sorry! 2015-05-08 06:02:23 +00:00
Adrian Chadd
415d7ccab2 Add initial memory locality cost awareness to the VM, and include
a basic ACPI SLIT table parser.

For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.

* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
  relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.

Differential Revision:	https://reviews.freebsd.org/D2460
Reviewed by:	rpaulo, stas, jhb
Sponsored by:	Norse Corp, Inc (hardware, coding); Dell (hardware)
2015-05-08 00:56:56 +00:00
Gleb Smirnoff
d2596d179a Fix the KASSERT and improve wording in r282426.
Submitted by:	alc
2015-05-06 08:07:11 +00:00
Gleb Smirnoff
84d313761d Fix arithmetical bug in vnode_pager_haspage(). The check against object size
should be done not with the number of pages in the first block, but with the
overall number of pages.  While here, add KASSERT that makes sure that BMAP
doesn't return completely irrelevant blocks.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-05-04 18:49:25 +00:00
Gleb Smirnoff
89c241d1a6 Instead of reading, validating and adjusting value of the vm.swap_async_max
in the main swapper work cycle, do it in the sysctl handler.  This removes
extra mutex acquisition from the main cycle and makes the sysctl knob return
error on an invalid value, instead of accepting and fixing it.

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-05-02 20:27:37 +00:00
John Baldwin
ed95805e90 Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains.  Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead.  Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
  components for PV kernels.  These components are now standard as they
  are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
  etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision:	https://reviews.freebsd.org/D2362
Reviewed by:	royger
Tested by:	royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes:	yes
2015-04-30 15:48:48 +00:00
Scott Long
affc4a4bff Improve support for blacklisting bad memory locations. The user can supply
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf.  The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list.  This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system.  Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character.  The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.

Reviewed by:	imp, emax
Obtained from:	Netflix, Inc.
MFC after:	3 days
2015-04-29 15:57:14 +00:00
Edward Tomasz Napierala
4b5c9cf62f Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
Konstantin Belousov
85af31a464 Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with
the vnode locked.

Review:	https://reviews.freebsd.org/D2381
Submitted by:	Conrad Meyer, Attilio Rao
MFC after:	1 week
2015-04-28 08:20:23 +00:00
Scott Long
43ffa928f9 Revert r281451. It causes a panic/hang early in boot for a number of
users, myself included.  The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.

Obtained from:	Netflix, Inc.
MFC after:	immediately
2015-04-24 17:03:53 +00:00
John Baldwin
179fa75e6e Reassign copyright statements on several files from Advanced
Computing Technologies LLC to Hudson River Trading LLC.

Approved by:	Hudson River Trading LLC (who owns ACT LLC)
MFC after:	1 week
2015-04-23 14:22:20 +00:00
Alan Cox
d74e6a1d27 Eliminate an unused variable.
MFC after:	1 week
2015-04-20 16:48:21 +00:00
Alan Cox
74f944344b Eliminate an unused variable.
MFC after:	1 week
2015-04-19 00:29:02 +00:00
Konstantin Belousov
0538aafc41 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
Dmitry Chagin
51cfb0be84 Rework r281162. Indeed, the flexible array member is preferable here.
Suggested by:   Justin T. Gibbs

MFC after:	3 days
2015-04-12 06:21:58 +00:00
Alan Cox
6a93e36b5f Correct an off-by-one error in vm_reserv_reclaim_contig() that results in
an infinite loop.

Submitted by:	Svatopluk Kraus
MFC after:	1 week
2015-04-11 22:57:13 +00:00
Gleb Smirnoff
16be9f54e7 UMA zone limit can be lowered, so remove protection against from
the sysctl_handle_uma_zone_max().

Sponsored by:	Nginx, Inc.
2015-04-10 06:56:49 +00:00
Alexander Motin
0ada3afc25 Remove sleeps from geom_up thread on device destruction.
MFC after:	3 days.
2015-04-09 13:09:05 +00:00
Jeff Roberson
34d8b7ea3b - Simplify vm_pageout_scan() by introducing a new vm_pageout_clean()
function that does the locking and validation associated with cleaning
   a page.  This moves 150 lines of code into its own function.
 - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it
   really does; clustering nearby pages for pageout optimization.

Reviewd by:	alc, kib, kmacy
Tested by:	pho (earlier version)
Sponsored by:	EMC / Isilon
2015-04-07 02:18:52 +00:00
Dmitry Chagin
6723fdfe46 Properly calculate "UMA Zones" per cpu cache size. Avoid allocating
an extra struct uma_cache since the struct uma_zone already has one.

PR:		199169
Submitted by:	luke.tw gmail com
MFC after:	1 week
2015-04-06 18:45:41 +00:00
Alan Cox
b5ab20c066 Until the lock assertions in vm_page_advise() are properly reevaluated,
vm_fault_dontneed() should acquire a write lock on the first object in
the shadow chain.

Reported by:	gleb, David Wolfskill
2015-04-05 20:07:33 +00:00
Dmitry Chagin
1d2c0c460c Fix wrong kassert msg in uma.
PR:		199172
Submitted by:	luke.tw gmail com
MFC after:	1 week
2015-04-05 18:25:23 +00:00
Alan Cox
a8b0f1009d Replace vm_fault()'s heuristic for automatic cache behind with a heuristic
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case.  The infrequent case being reading a single file that
is much larger than memory using mmap(2).  And, in this case, the page
daemon isn't the bottleneck; it's the I/O.

In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused.  To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes.  In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times.  With the new heuristic, I see fewer
false positives and they have a much lower cost.

I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic.  However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise().  In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.

Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.

Reviewed by:	jeff, kib
Sponsored by:	EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
Gleb Smirnoff
f6d6b5e262 Catch up on r271387 and remove unused parameter from
VOP_GETPAGES_ASYNC().
2015-03-30 22:49:26 +00:00
Jeff Roberson
b3de46ab23 - Eliminate pagequeue locking in the dirty code in vm_pageout_scan().
- Use a more precise series of tests to see if the page changed while we
   were locking the vnode.

Reviewed by:	alc
Sponsored by:	EMC / Isilon
2015-03-28 02:36:49 +00:00
Alexander Motin
3398491b2f Make swapper release orphaned (lost) GEOM provider.
Swap device is still reported as enabled, and system still may crash later
if some swapped-out kernel pages were lost with the device, but at least
GEOM and CAM can now release the lost disk, allowing it to be reconnected.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2015-03-26 17:21:12 +00:00
Rui Paulo
b575067a21 Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH.
Requested by:	julian
2015-03-26 05:20:18 +00:00
Rui Paulo
57e5a8b184 Use TUNABLE_INT_FETCH for boot_pages.
vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialsed.  We manually fetch the
variable from the environment to work around this problem.

Tested by:	Keith White kwhite at uottawa.ca
MFC after:	1 week
2015-03-24 20:09:55 +00:00
Rui Paulo
b0bce0aef2 Remove whitespace. 2015-03-24 20:07:27 +00:00
Alan Cox
3d653db063 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
Alan Cox
dfdf9abd94 Fix the root cause of the "vm_reserv_populate: reserv <address> is already
promoted" panics.  The sequence of events that leads to a panic is rather
long and circuitous.  First, suppose that process P has a promoted
superpage S within vm object O that it can write to.  Then, suppose that P
forks, which leads to S being write protected.  Now, before P's child
exits, suppose that P writes to another virtual page within O.  Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page.  Then, before
P can fault on S, P's child exists.  Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page.  Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way.  However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics!  Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O.  This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted.  Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken.  It makes little sense.  Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.

The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier.  There, we will quietly break the promoted reservation.  While
illogical and unintended, breaking the reservation is essentially
harmless.

PR:		198163
Reviewed by:	kib
Tested by:	pho
X-MFC after:	r267213
Sponsored by:	EMC / Isilon Storage Division
2015-03-19 01:40:43 +00:00
Gleb Smirnoff
4d6481a4c9 o Enhance vm_pager_free_nonreq() function:
- Allow to call the function with vm object lock held.
  - Allow to specify reqpage that doesn't match any page in the region,
    meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.

Reviewed by:	alc, kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-03-17 19:19:19 +00:00
Gleb Smirnoff
41c895a888 Provide a comment explaining r279688.
Suggested by:	alc
2015-03-16 14:24:47 +00:00
Ian Lepore
1eafc07856 Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases.  (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.

PR:		195668
2015-03-14 17:08:28 +00:00
Ian Lepore
ed9dd64b8c Revert r279932; this is going to be fixed in the sbuf code instead.
PR:		195668
2015-03-14 13:00:37 +00:00
Ian Lepore
f3b9fcf251 Nullterminate strings returned via sysctl.
PR:		195668
2015-03-12 18:06:30 +00:00
Gleb Smirnoff
2c0cb02607 Fix function name in comment. 2015-03-10 13:06:54 +00:00
Konstantin Belousov
79d7993d98 Fix function name in the panic message.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-03-08 02:13:46 +00:00
Alan Cox
6a24058fab Correct a typo in vm_object_backing_scan() that originated in r254141.
Specifically, change a lock acquire into a lock release.

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2015-03-07 04:18:40 +00:00
Gleb Smirnoff
73e9030e61 - In vnode_pager_generic_getpages() use different free counters for
synchronous and asynchronous requests.  The latter can saturate the
  I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
  if we are low on certain kind of pbufs don't even proceed to BMAP,
  but sleep.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-03-06 14:15:30 +00:00
Alan Cox
777a36c5e3 Use RW_NEW rather than calling bzero(). 2015-03-01 05:18:02 +00:00
Alan Cox
f81b73f3aa Eliminate a variable that became unused when VFS_LOCK_GIANT() was
eliminated.

MFC after:	3 days
2015-02-28 19:11:37 +00:00
Enji Cooper
b2ecae3fec Some minor style(9) fixes (whitespace + comment)
MFC after: 3 days
2015-02-17 08:50:26 +00:00
Konstantin Belousov
f40cb1c645 Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync.  Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied.  Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-28 10:37:23 +00:00
Will Andrews
8311a2b8a4 Add vm.panic_on_oom sysctl, which enables those who would rather panic than
kill a process, when the system runs out of memory.  Defaults to off.

Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.

In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.

As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune.  However,
a panic approach is still useful in some environments.  This is primarily
intended as a development/debugging tool.

Differential Revision:	D1627
Reviewed by:		kib
MFC after:		1 week
2015-01-24 17:32:45 +00:00
Ryan Stone
423521aa33 vmspace_release() may sleep if the last reference is being released,
so add a WITNESS_WARN() to catch cases where it is called with a
non-sleepable lock held.

MFC after:	1 month
Sponsored by:	Sandvine Inc.
2015-01-24 16:59:38 +00:00
Konstantin Belousov
71943c3d35 Avoid calling vmspace_free() while owning the process lock. Freeing
of an vm space may require obtaining sleepable locks.  Hold the
process to keep the pointer valid, and change trylock to lock, since
there is no longer two process locks owned simultaneously in
vm_pageout_oom().

Note that after the process lock is dropped, process might exec, and
no longer qualify as the owner of biggest vm space.

In collaboration with:	rstone
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-24 15:33:42 +00:00
Alan Cox
5268042bbd Revamp the default page clustering strategy that is used by the page fault
handler.  For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on.  Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.

The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library.  (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.)  Suppose that we have a
code section consisting of 32 pages.  Further, suppose that we access
pages 4, 28, and 16 in that order.  Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages.  In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages.  In general, truncated
clusters are more common than full clusters.

To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size.  This results in many fewer
truncated clusters.  Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will now include page 16.  So, the access to page 16 will
no longer page fault and perform an I/O operation.

Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed.  However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed.  Moreover, the extra resident
pages allowed for many more superpage mappings.  For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%.  Finally, this leads to a small but
measureable reduction in execution time.

In collaboration with:	Emily Pettigrew <ejp1@rice.edu>
Differential Revision:	https://reviews.freebsd.org/D1500
Reviewed by:	jhb, kib
MFC after:	6 weeks
2015-01-16 18:17:09 +00:00
Konstantin Belousov
18cc2ff047 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
Alan Cox
67c44fa359 Eliminate a stale debug message. The per-CPU cache locks were replaced
by critical sections in r145686.

PR:		193254
Submitted by:	luke.tw@gmail.com
MFC after:	3 days
2014-12-31 17:44:57 +00:00
Alan Cox
d866a563d4 The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges.  Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS).  However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time.  Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory.  This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.

On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB.  This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.

PR:		185727
Differential Revision:	https://reviews.freebsd.org/D1274
Reviewed by:	jhb, kib
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
Gleb Smirnoff
e3ed82bcf7 Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and
allows the function to fail.

Reviewed by:	kib, alc
Sponsored by:	Nginx, Inc.
2014-12-22 09:02:21 +00:00
Gleb Smirnoff
6ee80f259c Do not clear flag that vm_page_alloc() doesn't support.
Submitted by:	kib
2014-12-22 09:00:47 +00:00
Gleb Smirnoff
89fc8bdbb6 Document flags of vm_page allocation functions.
Reviewed by:	alc
2014-12-22 08:59:44 +00:00
John Baldwin
01ca58b23c Always ignore the deprecated MAP_RENAME and MAP_NORESERVE flags to mmap().
Some old libraries may be used even with newer binaries (specifically the
Nvidia driver libraries).

Differential Revision:	https://reviews.freebsd.org/D1262
Reviewed by:	kib
2014-12-05 15:24:42 +00:00
Konstantin Belousov
30d57414a0 When the last reference on the vnode' vm object is dropped, read the
vp->v_vflag without taking vnode lock and without bypass.  We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle.  Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping.  In this case,
the execve() code manages the VV_TEXT flag on its own already.

It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.

Reported by:	glebius, scottl
Tested by:	glebius, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-05 15:02:30 +00:00
Konstantin Belousov
95c4bf756a Provide mutual exclusion between zone allocation/destruction and
uma_reclaim().  Reclamation code must not see half-constructed or
destructed zones.  Do this by bracing uma_zcreate() and uma_zdestroy()
into a shared-locked sx, and take the sx exclusively in uma_reclaim().

Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.

Another solution could be to only expose a new keg on uma_kegs list
after the corresponding zone is fully constructed, and similar
treatment for the destruction.  But it probably requires more risky
code rearrangement as well.

Reported and tested by:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-30 20:20:55 +00:00
Gleb Smirnoff
1bb5ad634e We already have "int i" in this scope.
Submitted by:	alc
2014-11-24 07:57:20 +00:00
Gleb Smirnoff
d932810143 \n at end of panicstr is redundant.
Submitted by:	alc
2014-11-23 18:32:21 +00:00
Gleb Smirnoff
90effb2341 Merge from projects/sendfile:
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
  doesn't sleep. It returns immediately, and will execute the I/O done handler
  function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
  method for vnode_pager.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-23 12:01:52 +00:00
Alan Cox
09e5f3c4b8 By the time that vm_reserv_init() runs, vm_phys_segs[] is initialized. Use
it instead of phys_avail[].

Discussed with:	Svatopluk Kraus
2014-11-22 17:46:30 +00:00
Gleb Smirnoff
79f0deb938 Use __func__ in KASSERTs, since the code is about to be moved to other place.
Sponsored by:	Nginx, Inc.
2014-11-19 16:29:39 +00:00
Gleb Smirnoff
2a5eef69a6 In vnode_pager_generic_getpages() vp->v_mount is dereferenced in the
beginning, thus can't be NULL.

Sponsored by:	Nginx, Inc.
2014-11-19 15:17:19 +00:00
Gleb Smirnoff
e122dfc1ce Collapse three contiguous comment blocks into one. Remove historical
note about wrong assumptions 20 years ago. Use proper casing.

Sponsored by:	Nginx, Inc.
2014-11-18 13:38:07 +00:00
Alan Cox
271f0f1219 Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default
on i386 PAE.  Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings.  To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().

Discussed with:	Svatopluk Kraus
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-11-15 23:40:44 +00:00
Gleb Smirnoff
e1a4454553 Even better indent struct pagerops. 2014-11-14 18:15:35 +00:00
Gleb Smirnoff
5536922ec0 Constantly indent struct pagerops. 2014-11-14 18:00:00 +00:00
Konstantin Belousov
e065e87c1e Fix mis-spelling of bits and types names in the
default_pager_putpages() and swap_pager_putpages().
It is the same fix as was done for vnode_pager_putpages()
in r271586.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-04 19:56:04 +00:00
Alan Cox
5e929009d2 Eliminate a stale, i386-specific comment. 2014-11-04 18:52:59 +00:00
Mark Murray
10cb24248a This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by:	trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by:	so(des)
2014-10-30 21:21:53 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
John Baldwin
5817298f31 Retire the unimplemented MAP_RENAME and MAP_NORESERVE flags to mmap(2).
Older binaries are still permitted to use these flags.

PR:		193961 (exp-run in ports)
Differential Revision:	https://reviews.freebsd.org/D848
Reviewed by:	kib
2014-10-18 12:28:51 +00:00
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
Konstantin Belousov
a36f55322c Make MAP_NOSYNC handling in the vm_fault() read-locked object path
compatible with write-locked path.  Test for MAP_ENTRY_NOSYNC and set
VPO_NOSYNC for pages with dirty mask zero (this does not exclude a
possibility that the page is dirty, e.g. due to read fault on
writeable mapping and consequent write; the same issue exists in the
slow path).

Use helper vm_fault_dirty() to unify fast and slow path handling of
VPO_NOSYNC and setting the dirty mask.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2014-10-10 19:27:36 +00:00
Bryan Venteicher
111fbcd5ed Change the UMA mutex into a rwlock
Acquire the lock in read mode when just needed to ensure the stability
of the keg list. The UMA lock may be held for a long time (relatively
speaking) in uma_reclaim() on machines with lots of zones/kegs. If the
uma_timeout() would fire during that period, subsequent callouts on that
CPU may be significantly delayed.

Reviewed by:	jhb
2014-10-05 21:34:56 +00:00
Bryan Venteicher
6e5254e0d7 Remove stray uma_mtx lock/unlock in zone_drain_wait()
Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not)
the uma_mtx, but we would attempt to unlock and relock the mutex if we
had to sleep because the zone was already draining. The M_NOWAIT callers
may hold the uma_mtx, but we do not sleep in that case.

Reviewed by:	jhb
MFC after:	3 days
2014-10-05 03:18:30 +00:00
Konstantin Belousov
b76278407d Add kernel option KSTACK_USAGE_PROF to sample the stack depth on
interrupts and report the largest value seen as sysctl
debug.max_kstack_used.  Useful to estimate how close the kernel stack
size is to overflow.

In collaboration with:	Larry Baird <lab@gta.com>
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2014-10-04 18:38:14 +00:00
Steven Hartland
14a0d74ea8 Refactor ZFS ARC reclaim checks and limits
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR:		187594
Review:		D702
Reviewed by:	avg
MFC after:	1 week
X-MFC-With:	r270759 & r270861
Sponsored by:	Multiplay
2014-10-03 20:34:55 +00:00
Steven Hartland
f721133eb9 Fix ticks wrap issue of lowmem test in vm_pageout_scan
Reviewed by:	jhb (D818)
MFC after:	3 days
Sponsored by:	Multiplay
2014-09-24 14:35:08 +00:00
Konstantin Belousov
54432196db vm_map_pmap_enter() and pmap_enter_object() are currently not aware of
the wired attribute of the mapping.  As result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost.  Avoid calling
vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the
pages must be already mapped.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-09-23 18:54:23 +00:00
Konstantin Belousov
10204535af The vm_mmap_cdev() explicitely converts absence of both MAP_SHARED and
MAP_PRIVATE flags to MAP_SHARED.  Apparently, some code in tree, in
particular, libgeom, relied on this behaviour, see r271721.  For
regular file types, the absence of the flags is interpreted as
MAP_PRIVATE, and libc nlist used this (fixed in r271723).

Allow the implicit flags for legacy binaries.  Bump __FreeBSD_version
to get the ABI note on new binaries to check for in mmap code.

Remove the test for presence of one of the MAP_ANON, MAP_SHARED or
MAP_PRIVATE flags before fget_mmap().  For MAP_ANON, we already verify
that passed fd == -1.  For fd != -1, test after fget_mmap() (for newer
binaries) covers the case.

Reported by:	bdrewery, pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
2014-09-17 21:04:50 +00:00
John Baldwin
8bafac5444 Permit MAP_RENAME and MAP_NORESERVE for now. These flags should be removed, but at least
Chromium and OpenJDK use MAP_NORESERVE.
2014-09-16 17:21:06 +00:00
John Baldwin
5fd3f8b3b6 Add stricter checking of some mmap() arguments:
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
  mappings.

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D698
2014-09-15 17:20:13 +00:00
Alan Cox
a7fecb4d3a Three improvements to vnode_pager_generic_getpages():
Eliminate an exclusive object lock acquisition and release on the expected
execution path.

Do page zeroing before the object lock is acquired rather than during the
time that the object lock is held.

Use vm_pager_free_nonreq() to eliminate duplicated code.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-09-15 17:14:09 +00:00
Gleb Smirnoff
be58a555d2 Remove redundant declaration. vnode.h should be included before vnode_pager.h. 2014-09-15 15:49:29 +00:00
Konstantin Belousov
d15b55c554 Provide the unique implementation for the VOP_GETPAGES() method used
by ffs and ext2fs.  Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages().  Use vm_pager_free_nonreq().

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	6 weeks (after r271596)
2014-09-15 12:28:29 +00:00
Alan Cox
396b3e34b4 Avoid an exclusive acquisition of the object lock on the expected execution
path through the NFS clients' getpages functions.

Introduce vm_pager_free_nonreq().  This function can be used to eliminate
code that is duplicated in many getpages functions.  Also, in contrast to
the code that currently appears in those getpages functions,
vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one
case.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-09-14 18:07:55 +00:00
Konstantin Belousov
33cad9e936 Fix mis-spelling of bits and types names in the vnode_pager_putpages().
The changes should not modify the generated code.

The pager->pgo_putpages() method takes int flags as its fourth
argument, while vnode_pager_putpages() used boolean_t (which is
typedef'ed to int).  The flags are from VM_PAGER_* namespace, while
vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(),
which both are numerically equal to VM_PAGER_PUT_SYNC.

Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-14 10:27:36 +00:00
Alan Cox
81a065058c Update a stale comment. 2014-09-11 03:16:57 +00:00
Gleb Smirnoff
27ad26d8c7 Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES(). 2014-09-10 12:36:41 +00:00
Alan Cox
64f096eeb2 Fix a boundary case error in vm_reserv_alloc_contig(): If a reservation
isn't being allocated for the last of the requested pages, because a
reservation won't fit in the gap between allocated pages, then the
reservation structure shouldn't be initialized.

While I'm here, improve the nearby comments.

Reported by:	jeff, pho
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-09-10 05:52:30 +00:00
Alan Cox
0afcd3af8b Oops. vm_map_simplify_entry() is used by mac_proc_vm_revoke_recurse(), so
it can't be static.
2014-09-08 02:25:01 +00:00
Alan Cox
077ec27cd6 Make two functions static and eliminate an unused #define. 2014-09-08 00:19:03 +00:00