Commit Graph

3725 Commits

Author SHA1 Message Date
markj
c78ab1ef94 Add a local variable initialization needed in the OBJT_DEFAULT case.
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D2992
2015-07-05 22:26:19 +00:00
mjg
93aba404ec vm: don't lock proc around accesses to vm_{t,d}addr and RLIMIT_DATA in sys_mmap
vm_{t,d}addr are constant and we can use thread's copy of resource limits
2015-07-02 18:30:12 +00:00
kib
eee3eeb0fa Account for the main process stack being one page below the highest
user address when ABI uses shared page.

Note that the change is no-op for correctness, since shared page does
not fault.  The mapping for the shared page is installed at the
address space creation, the page is unmanaged and its pte/pv entry
cannot be reclaimed.

Submitted by:	Oliver Pinter
Review:	https://reviews.freebsd.org/D2954
MFC after:	1 week
2015-07-02 15:22:13 +00:00
markm
d586165577 Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
jmg
8df3676103 If INVARIANTS is specified, add ctor/dtor to junk memory if they are
unspecified...

Submitted by:	Suresh Gumpula at Netapp
Differential Revision:	https://reviews.freebsd.org/D2725
2015-06-25 20:44:46 +00:00
alc
29dd35a1c6 Avoid pmap_is_modified() on pages that can't be mapped.
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-06-21 01:22:35 +00:00
glebius
5b81a20433 o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-06-17 22:44:27 +00:00
kib
2038792788 Invalid pages do not need neither update of the activation count nor
they coould be dirty.  Move the handling if the invalid pages in the
inactive scan earlier.

Remove some code duplication in the scan by introducing the
'drop_page' label, which centralizes the object and the page unlock.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-14 20:23:41 +00:00
alc
927e89d882 As the next step in eliminating PG_CACHE pages, free rather than cache
pages in vm_pageout_scan().  The reactivation rate of cache pages created
by vm_pageout_scan() is extremely low; typically no more than 0.5% to
2.25% of the pages are ever reactivated.  At the same time, caching pages
is more expensive than freeing them.  For example, in a test with
PostgreSQL, this change reduced the amount of time spent in the inactive
queue scan by 1/6.

Differential Revision:	https://reviews.freebsd.org/D2805
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-14 05:23:39 +00:00
glebius
519f1ccd36 Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
mjg
d7bc9285a6 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
alc
73775f21e2 Correct a type error in kmem_unback(). Previously, kmem_unback() did not
correctly handle deallocation requests of two or more gigabytes in size.
Eventually, this would lead to a panic elsewhere in the kernel, such as
"vm_radix_insert: key <vm_pindex_t> is already present".

Reported by:	Ilias Marinos
MFC after:	1 week
2015-06-10 05:17:14 +00:00
alc
263927b83e Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHE pages.
Differential Revision:	https://reviews.freebsd.org/D2712
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-08 04:59:32 +00:00
jhb
bba1e1e047 Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c.  This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions.  A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings.  For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead.  The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset.  The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional.  If it is not set, then fo_mmap() will
fail with ENODEV.  A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead.  While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision:	https://reviews.freebsd.org/D2658
Reviewed by:	alc (glanced over), kib
MFC after:	1 month
Sponsored by:	Chelsio
2015-06-04 19:41:15 +00:00
vangyzen
597cee37df Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
alc
a7a4e853c7 Document vm_page_alloc_contig()'s support for the VM_ALLOC_NODUMP option.
MFC after:	3 days
2015-05-30 23:37:47 +00:00
jhb
4a4be98eae Export a list of VM objects in the system via a sysctl. The list can be
examined via 'vmstat -o'.  It can be used to determine which files are
using physical pages of memory and how much each is using.

Differential Revision:	https://reviews.freebsd.org/D2277
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc. (forward porting to HEAD/10)
2015-05-27 18:11:05 +00:00
jkim
318c4f97e6 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
kib
3718535142 Do grammar fix in the comment to record the right commit message for
r283162.

Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq()
with the page not completely satisfying vm_page_free() assertions.
The page is not owned by the object, since insertion failed.  But
besides m->object reset to NULL, we should also set VPO_UNMANAGED flag
for consistency.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:15:56 +00:00
kib
f37cc175ef Remove the write-only variable phent. We currently do not check the
size of the program header's entries.

Reported by:	adrian (by using gcc 4.9)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:03:22 +00:00
kib
cb622d9740 Satisfy vm_object uma zone destructor requirements after r282660 when
vnode object creation raced.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2015-05-10 08:21:03 +00:00
kib
71cf7d735d The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon.  The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still.  The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly.  The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-09 20:08:36 +00:00
jhb
0bf260e595 Place VM objects on the object list when created and never remove them.
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.

Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero.  This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)

Differential Revision:	https://reviews.freebsd.org/D2423
Reviewed by:	alc, kib (earlier version)
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc.
2015-05-08 19:43:37 +00:00
adrian
d1e90611f7 oops - how'd i miss this. Sorry! 2015-05-08 06:02:23 +00:00
adrian
2cad336903 Add initial memory locality cost awareness to the VM, and include
a basic ACPI SLIT table parser.

For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.

* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
  relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.

Differential Revision:	https://reviews.freebsd.org/D2460
Reviewed by:	rpaulo, stas, jhb
Sponsored by:	Norse Corp, Inc (hardware, coding); Dell (hardware)
2015-05-08 00:56:56 +00:00
glebius
a7d547aa97 Fix the KASSERT and improve wording in r282426.
Submitted by:	alc
2015-05-06 08:07:11 +00:00
glebius
acfa186500 Fix arithmetical bug in vnode_pager_haspage(). The check against object size
should be done not with the number of pages in the first block, but with the
overall number of pages.  While here, add KASSERT that makes sure that BMAP
doesn't return completely irrelevant blocks.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-05-04 18:49:25 +00:00
glebius
0a56d25a94 Instead of reading, validating and adjusting value of the vm.swap_async_max
in the main swapper work cycle, do it in the sysctl handler.  This removes
extra mutex acquisition from the main cycle and makes the sysctl knob return
error on an invalid value, instead of accepting and fixing it.

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-05-02 20:27:37 +00:00
jhb
9c4c8b62fb Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains.  Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead.  Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
  components for PV kernels.  These components are now standard as they
  are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
  etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision:	https://reviews.freebsd.org/D2362
Reviewed by:	royger
Tested by:	royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes:	yes
2015-04-30 15:48:48 +00:00
scottl
cac1f63fc8 Improve support for blacklisting bad memory locations. The user can supply
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf.  The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list.  This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system.  Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character.  The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.

Reviewed by:	imp, emax
Obtained from:	Netflix, Inc.
MFC after:	3 days
2015-04-29 15:57:14 +00:00
trasz
802017a04b Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
kib
dfce070d88 Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with
the vnode locked.

Review:	https://reviews.freebsd.org/D2381
Submitted by:	Conrad Meyer, Attilio Rao
MFC after:	1 week
2015-04-28 08:20:23 +00:00
scottl
ede23391e8 Revert r281451. It causes a panic/hang early in boot for a number of
users, myself included.  The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.

Obtained from:	Netflix, Inc.
MFC after:	immediately
2015-04-24 17:03:53 +00:00
jhb
e4683250d1 Reassign copyright statements on several files from Advanced
Computing Technologies LLC to Hudson River Trading LLC.

Approved by:	Hudson River Trading LLC (who owns ACT LLC)
MFC after:	1 week
2015-04-23 14:22:20 +00:00
alc
3b5965fb8f Eliminate an unused variable.
MFC after:	1 week
2015-04-20 16:48:21 +00:00
alc
d6d560db51 Eliminate an unused variable.
MFC after:	1 week
2015-04-19 00:29:02 +00:00
kib
2254748ed0 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
dchagin
02e5fc3da7 Rework r281162. Indeed, the flexible array member is preferable here.
Suggested by:   Justin T. Gibbs

MFC after:	3 days
2015-04-12 06:21:58 +00:00
alc
f116ec9943 Correct an off-by-one error in vm_reserv_reclaim_contig() that results in
an infinite loop.

Submitted by:	Svatopluk Kraus
MFC after:	1 week
2015-04-11 22:57:13 +00:00
glebius
662cdec164 UMA zone limit can be lowered, so remove protection against from
the sysctl_handle_uma_zone_max().

Sponsored by:	Nginx, Inc.
2015-04-10 06:56:49 +00:00
mav
2e38078077 Remove sleeps from geom_up thread on device destruction.
MFC after:	3 days.
2015-04-09 13:09:05 +00:00
jeff
fa9eb8e1ea - Simplify vm_pageout_scan() by introducing a new vm_pageout_clean()
function that does the locking and validation associated with cleaning
   a page.  This moves 150 lines of code into its own function.
 - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it
   really does; clustering nearby pages for pageout optimization.

Reviewd by:	alc, kib, kmacy
Tested by:	pho (earlier version)
Sponsored by:	EMC / Isilon
2015-04-07 02:18:52 +00:00
dchagin
fd38dba27d Properly calculate "UMA Zones" per cpu cache size. Avoid allocating
an extra struct uma_cache since the struct uma_zone already has one.

PR:		199169
Submitted by:	luke.tw gmail com
MFC after:	1 week
2015-04-06 18:45:41 +00:00
alc
7028695829 Until the lock assertions in vm_page_advise() are properly reevaluated,
vm_fault_dontneed() should acquire a write lock on the first object in
the shadow chain.

Reported by:	gleb, David Wolfskill
2015-04-05 20:07:33 +00:00
dchagin
1e93730b8c Fix wrong kassert msg in uma.
PR:		199172
Submitted by:	luke.tw gmail com
MFC after:	1 week
2015-04-05 18:25:23 +00:00
alc
e8bbef8d0b Replace vm_fault()'s heuristic for automatic cache behind with a heuristic
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case.  The infrequent case being reading a single file that
is much larger than memory using mmap(2).  And, in this case, the page
daemon isn't the bottleneck; it's the I/O.

In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused.  To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes.  In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times.  With the new heuristic, I see fewer
false positives and they have a much lower cost.

I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic.  However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise().  In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.

Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.

Reviewed by:	jeff, kib
Sponsored by:	EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
rstone
57feb6fb43 Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
glebius
e4390a8823 Catch up on r271387 and remove unused parameter from
VOP_GETPAGES_ASYNC().
2015-03-30 22:49:26 +00:00
jeff
2ef1578319 - Eliminate pagequeue locking in the dirty code in vm_pageout_scan().
- Use a more precise series of tests to see if the page changed while we
   were locking the vnode.

Reviewed by:	alc
Sponsored by:	EMC / Isilon
2015-03-28 02:36:49 +00:00
mav
ee2fe1ad5c Make swapper release orphaned (lost) GEOM provider.
Swap device is still reported as enabled, and system still may crash later
if some swapped-out kernel pages were lost with the device, but at least
GEOM and CAM can now release the lost disk, allowing it to be reconnected.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2015-03-26 17:21:12 +00:00
rpaulo
4a05532670 Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH.
Requested by:	julian
2015-03-26 05:20:18 +00:00
rpaulo
2ef1cdce8e Use TUNABLE_INT_FETCH for boot_pages.
vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialsed.  We manually fetch the
variable from the environment to work around this problem.

Tested by:	Keith White kwhite at uottawa.ca
MFC after:	1 week
2015-03-24 20:09:55 +00:00
rpaulo
5ab5a7c167 Remove whitespace. 2015-03-24 20:07:27 +00:00
alc
b131a2abc8 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
alc
2c4b57d486 Fix the root cause of the "vm_reserv_populate: reserv <address> is already
promoted" panics.  The sequence of events that leads to a panic is rather
long and circuitous.  First, suppose that process P has a promoted
superpage S within vm object O that it can write to.  Then, suppose that P
forks, which leads to S being write protected.  Now, before P's child
exits, suppose that P writes to another virtual page within O.  Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page.  Then, before
P can fault on S, P's child exists.  Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page.  Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way.  However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics!  Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O.  This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted.  Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken.  It makes little sense.  Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.

The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier.  There, we will quietly break the promoted reservation.  While
illogical and unintended, breaking the reservation is essentially
harmless.

PR:		198163
Reviewed by:	kib
Tested by:	pho
X-MFC after:	r267213
Sponsored by:	EMC / Isilon Storage Division
2015-03-19 01:40:43 +00:00
glebius
398be53682 o Enhance vm_pager_free_nonreq() function:
- Allow to call the function with vm object lock held.
  - Allow to specify reqpage that doesn't match any page in the region,
    meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.

Reviewed by:	alc, kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-03-17 19:19:19 +00:00
glebius
df5d850742 Provide a comment explaining r279688.
Suggested by:	alc
2015-03-16 14:24:47 +00:00
ian
0dd684d23f Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases.  (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.

PR:		195668
2015-03-14 17:08:28 +00:00
ian
eae109babf Revert r279932; this is going to be fixed in the sbuf code instead.
PR:		195668
2015-03-14 13:00:37 +00:00
ian
037188bda9 Nullterminate strings returned via sysctl.
PR:		195668
2015-03-12 18:06:30 +00:00
glebius
6bbfdd570a Fix function name in comment. 2015-03-10 13:06:54 +00:00
kib
c539cecb43 Fix function name in the panic message.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-03-08 02:13:46 +00:00
alc
56c9a1b2f6 Correct a typo in vm_object_backing_scan() that originated in r254141.
Specifically, change a lock acquire into a lock release.

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2015-03-07 04:18:40 +00:00
glebius
3a52ecfa66 - In vnode_pager_generic_getpages() use different free counters for
synchronous and asynchronous requests.  The latter can saturate the
  I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
  if we are low on certain kind of pbufs don't even proceed to BMAP,
  but sleep.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-03-06 14:15:30 +00:00
alc
2ab42594ef Use RW_NEW rather than calling bzero(). 2015-03-01 05:18:02 +00:00
alc
37e48c6e3a Eliminate a variable that became unused when VFS_LOCK_GIANT() was
eliminated.

MFC after:	3 days
2015-02-28 19:11:37 +00:00
ngie
d54589fec4 Some minor style(9) fixes (whitespace + comment)
MFC after: 3 days
2015-02-17 08:50:26 +00:00
kib
19abfd4698 Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync.  Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied.  Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-28 10:37:23 +00:00
will
2e062600fe Add vm.panic_on_oom sysctl, which enables those who would rather panic than
kill a process, when the system runs out of memory.  Defaults to off.

Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.

In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.

As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune.  However,
a panic approach is still useful in some environments.  This is primarily
intended as a development/debugging tool.

Differential Revision:	D1627
Reviewed by:		kib
MFC after:		1 week
2015-01-24 17:32:45 +00:00
rstone
520ad84555 vmspace_release() may sleep if the last reference is being released,
so add a WITNESS_WARN() to catch cases where it is called with a
non-sleepable lock held.

MFC after:	1 month
Sponsored by:	Sandvine Inc.
2015-01-24 16:59:38 +00:00
kib
7cbc6347a2 Avoid calling vmspace_free() while owning the process lock. Freeing
of an vm space may require obtaining sleepable locks.  Hold the
process to keep the pointer valid, and change trylock to lock, since
there is no longer two process locks owned simultaneously in
vm_pageout_oom().

Note that after the process lock is dropped, process might exec, and
no longer qualify as the owner of biggest vm space.

In collaboration with:	rstone
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-24 15:33:42 +00:00
alc
fb32c103c7 Revamp the default page clustering strategy that is used by the page fault
handler.  For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on.  Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.

The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library.  (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.)  Suppose that we have a
code section consisting of 32 pages.  Further, suppose that we access
pages 4, 28, and 16 in that order.  Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages.  In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages.  In general, truncated
clusters are more common than full clusters.

To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size.  This results in many fewer
truncated clusters.  Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will now include page 16.  So, the access to page 16 will
no longer page fault and perform an I/O operation.

Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed.  However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed.  Moreover, the extra resident
pages allowed for many more superpage mappings.  For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%.  Finally, this leads to a small but
measureable reduction in execution time.

In collaboration with:	Emily Pettigrew <ejp1@rice.edu>
Differential Revision:	https://reviews.freebsd.org/D1500
Reviewed by:	jhb, kib
MFC after:	6 weeks
2015-01-16 18:17:09 +00:00
kib
79db3369f9 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
alc
b48d1f4410 Eliminate a stale debug message. The per-CPU cache locks were replaced
by critical sections in r145686.

PR:		193254
Submitted by:	luke.tw@gmail.com
MFC after:	3 days
2014-12-31 17:44:57 +00:00
alc
369e66acd7 The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges.  Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS).  However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time.  Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory.  This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.

On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB.  This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.

PR:		185727
Differential Revision:	https://reviews.freebsd.org/D1274
Reviewed by:	jhb, kib
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
glebius
2d07787e22 Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and
allows the function to fail.

Reviewed by:	kib, alc
Sponsored by:	Nginx, Inc.
2014-12-22 09:02:21 +00:00
glebius
159bb47471 Do not clear flag that vm_page_alloc() doesn't support.
Submitted by:	kib
2014-12-22 09:00:47 +00:00
glebius
ccd01d8847 Document flags of vm_page allocation functions.
Reviewed by:	alc
2014-12-22 08:59:44 +00:00
jhb
d3e566151f Always ignore the deprecated MAP_RENAME and MAP_NORESERVE flags to mmap().
Some old libraries may be used even with newer binaries (specifically the
Nvidia driver libraries).

Differential Revision:	https://reviews.freebsd.org/D1262
Reviewed by:	kib
2014-12-05 15:24:42 +00:00
kib
f690968933 When the last reference on the vnode' vm object is dropped, read the
vp->v_vflag without taking vnode lock and without bypass.  We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle.  Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping.  In this case,
the execve() code manages the VV_TEXT flag on its own already.

It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.

Reported by:	glebius, scottl
Tested by:	glebius, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-05 15:02:30 +00:00
kib
a114aae162 Provide mutual exclusion between zone allocation/destruction and
uma_reclaim().  Reclamation code must not see half-constructed or
destructed zones.  Do this by bracing uma_zcreate() and uma_zdestroy()
into a shared-locked sx, and take the sx exclusively in uma_reclaim().

Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.

Another solution could be to only expose a new keg on uma_kegs list
after the corresponding zone is fully constructed, and similar
treatment for the destruction.  But it probably requires more risky
code rearrangement as well.

Reported and tested by:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-30 20:20:55 +00:00
glebius
5e67a4071d We already have "int i" in this scope.
Submitted by:	alc
2014-11-24 07:57:20 +00:00
glebius
3f7fdcc87f \n at end of panicstr is redundant.
Submitted by:	alc
2014-11-23 18:32:21 +00:00
glebius
b4ef8e602d Merge from projects/sendfile:
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
  doesn't sleep. It returns immediately, and will execute the I/O done handler
  function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
  method for vnode_pager.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-23 12:01:52 +00:00
alc
f4c161ccb6 By the time that vm_reserv_init() runs, vm_phys_segs[] is initialized. Use
it instead of phys_avail[].

Discussed with:	Svatopluk Kraus
2014-11-22 17:46:30 +00:00
glebius
20a46e6241 Use __func__ in KASSERTs, since the code is about to be moved to other place.
Sponsored by:	Nginx, Inc.
2014-11-19 16:29:39 +00:00
glebius
c73b720d6a In vnode_pager_generic_getpages() vp->v_mount is dereferenced in the
beginning, thus can't be NULL.

Sponsored by:	Nginx, Inc.
2014-11-19 15:17:19 +00:00
glebius
b87d94c5df Collapse three contiguous comment blocks into one. Remove historical
note about wrong assumptions 20 years ago. Use proper casing.

Sponsored by:	Nginx, Inc.
2014-11-18 13:38:07 +00:00
alc
aeebd38e4b Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default
on i386 PAE.  Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings.  To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().

Discussed with:	Svatopluk Kraus
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-11-15 23:40:44 +00:00
glebius
1a17154ef9 Even better indent struct pagerops. 2014-11-14 18:15:35 +00:00
glebius
7d3b226a7b Constantly indent struct pagerops. 2014-11-14 18:00:00 +00:00
kib
938d2505e3 Fix mis-spelling of bits and types names in the
default_pager_putpages() and swap_pager_putpages().
It is the same fix as was done for vnode_pager_putpages()
in r271586.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-04 19:56:04 +00:00
alc
26666baef9 Eliminate a stale, i386-specific comment. 2014-11-04 18:52:59 +00:00
markm
fce6747f55 This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by:	trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by:	so(des)
2014-10-30 21:21:53 +00:00
hselasky
49c137f7be Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
jhb
a619f7cffc Retire the unimplemented MAP_RENAME and MAP_NORESERVE flags to mmap(2).
Older binaries are still permitted to use these flags.

PR:		193961 (exp-run in ports)
Differential Revision:	https://reviews.freebsd.org/D848
Reviewed by:	kib
2014-10-18 12:28:51 +00:00
davide
e88bd26b3f Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
kib
2b6232c7be Make MAP_NOSYNC handling in the vm_fault() read-locked object path
compatible with write-locked path.  Test for MAP_ENTRY_NOSYNC and set
VPO_NOSYNC for pages with dirty mask zero (this does not exclude a
possibility that the page is dirty, e.g. due to read fault on
writeable mapping and consequent write; the same issue exists in the
slow path).

Use helper vm_fault_dirty() to unify fast and slow path handling of
VPO_NOSYNC and setting the dirty mask.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2014-10-10 19:27:36 +00:00
bryanv
41e2fe5645 Change the UMA mutex into a rwlock
Acquire the lock in read mode when just needed to ensure the stability
of the keg list. The UMA lock may be held for a long time (relatively
speaking) in uma_reclaim() on machines with lots of zones/kegs. If the
uma_timeout() would fire during that period, subsequent callouts on that
CPU may be significantly delayed.

Reviewed by:	jhb
2014-10-05 21:34:56 +00:00
bryanv
0b86b14507 Remove stray uma_mtx lock/unlock in zone_drain_wait()
Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not)
the uma_mtx, but we would attempt to unlock and relock the mutex if we
had to sleep because the zone was already draining. The M_NOWAIT callers
may hold the uma_mtx, but we do not sleep in that case.

Reviewed by:	jhb
MFC after:	3 days
2014-10-05 03:18:30 +00:00
kib
feb6ca868c Add kernel option KSTACK_USAGE_PROF to sample the stack depth on
interrupts and report the largest value seen as sysctl
debug.max_kstack_used.  Useful to estimate how close the kernel stack
size is to overflow.

In collaboration with:	Larry Baird <lab@gta.com>
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2014-10-04 18:38:14 +00:00
smh
f2543cb01c Refactor ZFS ARC reclaim checks and limits
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR:		187594
Review:		D702
Reviewed by:	avg
MFC after:	1 week
X-MFC-With:	r270759 & r270861
Sponsored by:	Multiplay
2014-10-03 20:34:55 +00:00
smh
a51f20a2c3 Fix ticks wrap issue of lowmem test in vm_pageout_scan
Reviewed by:	jhb (D818)
MFC after:	3 days
Sponsored by:	Multiplay
2014-09-24 14:35:08 +00:00
kib
74ff69d0ef vm_map_pmap_enter() and pmap_enter_object() are currently not aware of
the wired attribute of the mapping.  As result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost.  Avoid calling
vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the
pages must be already mapped.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-09-23 18:54:23 +00:00
kib
bec113e57f The vm_mmap_cdev() explicitely converts absence of both MAP_SHARED and
MAP_PRIVATE flags to MAP_SHARED.  Apparently, some code in tree, in
particular, libgeom, relied on this behaviour, see r271721.  For
regular file types, the absence of the flags is interpreted as
MAP_PRIVATE, and libc nlist used this (fixed in r271723).

Allow the implicit flags for legacy binaries.  Bump __FreeBSD_version
to get the ABI note on new binaries to check for in mmap code.

Remove the test for presence of one of the MAP_ANON, MAP_SHARED or
MAP_PRIVATE flags before fget_mmap().  For MAP_ANON, we already verify
that passed fd == -1.  For fd != -1, test after fget_mmap() (for newer
binaries) covers the case.

Reported by:	bdrewery, pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
2014-09-17 21:04:50 +00:00
jhb
da9629f8a4 Permit MAP_RENAME and MAP_NORESERVE for now. These flags should be removed, but at least
Chromium and OpenJDK use MAP_NORESERVE.
2014-09-16 17:21:06 +00:00
jhb
bd469e35e2 Add stricter checking of some mmap() arguments:
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
  mappings.

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D698
2014-09-15 17:20:13 +00:00
alc
17021239d9 Three improvements to vnode_pager_generic_getpages():
Eliminate an exclusive object lock acquisition and release on the expected
execution path.

Do page zeroing before the object lock is acquired rather than during the
time that the object lock is held.

Use vm_pager_free_nonreq() to eliminate duplicated code.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-09-15 17:14:09 +00:00
glebius
fb532b9858 Remove redundant declaration. vnode.h should be included before vnode_pager.h. 2014-09-15 15:49:29 +00:00
kib
ebd8a253bb Provide the unique implementation for the VOP_GETPAGES() method used
by ffs and ext2fs.  Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages().  Use vm_pager_free_nonreq().

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	6 weeks (after r271596)
2014-09-15 12:28:29 +00:00
alc
810772ca9c Avoid an exclusive acquisition of the object lock on the expected execution
path through the NFS clients' getpages functions.

Introduce vm_pager_free_nonreq().  This function can be used to eliminate
code that is duplicated in many getpages functions.  Also, in contrast to
the code that currently appears in those getpages functions,
vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one
case.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-09-14 18:07:55 +00:00
kib
6044029377 Fix mis-spelling of bits and types names in the vnode_pager_putpages().
The changes should not modify the generated code.

The pager->pgo_putpages() method takes int flags as its fourth
argument, while vnode_pager_putpages() used boolean_t (which is
typedef'ed to int).  The flags are from VM_PAGER_* namespace, while
vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(),
which both are numerically equal to VM_PAGER_PUT_SYNC.

Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-14 10:27:36 +00:00
alc
6af056918d Update a stale comment. 2014-09-11 03:16:57 +00:00
glebius
5939c729a8 Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES(). 2014-09-10 12:36:41 +00:00
alc
996a2e8d68 Fix a boundary case error in vm_reserv_alloc_contig(): If a reservation
isn't being allocated for the last of the requested pages, because a
reservation won't fit in the gap between allocated pages, then the
reservation structure shouldn't be initialized.

While I'm here, improve the nearby comments.

Reported by:	jeff, pho
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-09-10 05:52:30 +00:00
alc
1771ced43b Oops. vm_map_simplify_entry() is used by mac_proc_vm_revoke_recurse(), so
it can't be static.
2014-09-08 02:25:01 +00:00
alc
6a1691365d Make two functions static and eliminate an unused #define. 2014-09-08 00:19:03 +00:00
jhb
74edf8c62a Fix a typo. 2014-08-29 21:20:36 +00:00
smh
502601a540 Refactor ZFS ARC reclaim logic to be more VM cooperative
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.

The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.

We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

Credit to Karl Denninger for the original patch on which this update was
based.

PR:		191510 and 187594
Tested by:	dteske
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Multiplay
2014-08-28 19:50:08 +00:00
alc
e934ec28fb Back in the days when the kernel was single threaded, testing
"vm_paging_target() > 0" was a reasonable way of determining if the
inactive queue scan met its target.  However, now that other threads
can be allocating pages while the inactive queue scan is running, it's
an unreliable method.  The effect of it being unreliable is that we
can start swapping out processes when we didn't intend to.

This issue has existed since the kernel was multithreaded, but the
changes to the inactive queue target in 10.0-RELEASE have made its
effects visible.

This change introduces a more direct method for determining if the
inactive queue scan met its target that is not affected by the actions
of other threads.

Reported by:	Steve Polyack
Tested by:	pho, Steve Polyack (an earlier version)
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-08-26 16:40:20 +00:00
alc
5690b644d4 Relax one of the conditions for mapping a page on the fast path.
Reviewed by:	kib
X-MFC with:	r270011
Sponsored by:	EMC / Isilon Storage Division
2014-08-23 05:24:31 +00:00
kib
ff74afd0bd Implement 'fast path' for the vm page fault handler. Or, it could be
called a scalable path.  When several preconditions hold, the vm
object lock for the object containing the faulted page is taken in
read mode, instead of write, which allows parallel faults processing
in the region.

Namely, the fast path is taken when the faulted page already exists
and does not need copy on write, is already fully valid, and not busy.
For technical reasons, fast path is avoided when the fault is the
first write on the vnode object, or when the fault is for wiring or
debugger read or write.

On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag,
since object lock is kept.  Pmap might fail to create the entry, in
which case the fallback to slow path is performed.

Reviewed by:	alc
Tested by:	pho (previous version)
Hardware provided and hosted by:	The FreeBSD Foundation and
	 Sentex Data Communications
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
2014-08-15 07:30:14 +00:00
alc
e11aaa26b0 Avoid pointless (but harmless) actions on unmanaged pages.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-08-14 15:46:15 +00:00
kib
065b68902b If vm_page_grab() allocates a new page, the page is not inserted into
page queue even when the allocation is not wired.  It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.

In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.

In the uiomove_object_page(), deactivate page before the object is
unlocked.  There is no leak, since the page is deactivated after
uiomove_fromphys() finished.  But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-13 05:44:08 +00:00
kib
41802e2c86 Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of
pmap_enter(PMAP_ENTER_NOSLEEP).  The PGA_WRITEABLE flag can be set
when either the page is busied, or the owner object is locked.

Update comments, move all assertions about page state when
PGA_WRITEABLE flag is set, into new helper
vm_page_assert_pga_writeable().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-08-09 05:00:34 +00:00
kib
094158b3f2 Change pmap_enter(9) interface to take flags parameter and superpage
mapping size (currently unused).  The flags includes the fault access
bits, wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep.

For powerpc aim both 32 and 64 bit, fix implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified, in particular, wait for the available memory required to
proceed.

In collaboration with:	alc
Tested by:	nwhitehorn (ppc aim32 and booke)
Sponsored by:	The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after:	2 weeks
2014-08-08 17:12:03 +00:00
kib
61cc19eab8 The vm_pager_page_unswapped() pager op is only implemented for the
swap pager.  Swap pager uses a private mutex to protect swap metadata,
and does not rely on the vm object lock to ensure integrity of it.

Weaken the requirement for the vm object lock by only asserting locked
object in vm_pager_page_unswapped(), instead of locked exclusively.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-06 19:34:03 +00:00
kib
a39918dab4 Add wrappers to assert that vm object is unlocked and for try upgrade.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-06 19:30:35 +00:00
royger
b86acea372 vm_phys: improve robustness of fictitious ranges
With the current implementation of managed fictitious ranges when
also using VM_PHYSSEG_DENSE, a user could try to register a
fictitious range that starts inside of vm_page_array, but then
overrruns it (because the end of the fictitious range is greater than
vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE
returning unallocated pages from past the end of vm_page_array. The
same could happen if a user tried to register a segment that starts
outside of vm_page_array but ends inside of it.

In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to
use a set of pages from vm_page_array, and allocate the rest.

Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc

vm/vm_phys.c:
 - Allow registering/unregistering fictitious ranges that overrun
   vm_page_array.
2014-08-05 10:29:01 +00:00
alc
38b6c535da Retire pmap_change_wiring(). We have never used it to wire virtual pages.
We continue to use pmap_enter() for that.  For unwiring virtual pages, we
now use pmap_unwire(), which unwires a range of virtual addresses instead
of a single virtual page.

Sponsored by:	EMC / Isilon Storage Division
2014-08-03 20:40:51 +00:00
alc
eacf4d0259 Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable
"rv" is uninitialized.

Reported by:	bz
2014-08-02 17:58:20 +00:00
alc
f873c17deb Handle wiring failures in vm_map_wire() with the new functions
pmap_unwire() and vm_object_unwire().

Retire vm_fault_{un,}wire(), since they are no longer used.

(See r268327 and r269134 for the motivation behind this change.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-08-02 16:10:24 +00:00
alc
2b42c1ddc8 When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap.  If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time.  Moreover, by operating on
a range, it is superpage friendly.  It doesn't waste time performing
unnecessary demotions.

Reported by:	markj
Reviewed by:	kib
Tested by:	pho, jmg (arm)
Sponsored by:	EMC / Isilon Storage Division
2014-07-26 18:10:18 +00:00
kib
b50b5e1d49 Correct assertion. The shadowing object cannot be tmpfs vm object,
and tmpfs object cannot shadow.  In other words, tmpfs vm object is
always at the bottom of the shadow chain.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-24 10:25:42 +00:00
kib
8664d64bc3 The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs
vnode for the tmpfs node owning this object.  The flag is currently
used for two purposes.  First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems.  Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.

The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.

Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:30:37 +00:00
royger
c9ce58e137 vm_phys: remove limitation on number of fictitious regions
The number of vm fictitious regions was limited to 8 by default, but
Xen will make heavy usage of those kind of regions in order to map
memory from foreign domains, so instead of increasing the default
number, change the implementation to use a red-black tree to track vm
fictitious ranges.

The public interface remains the same.

Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc
Approved by: gibbs

vm/vm_phys.c:
 - Replace the vm fictitious static array with a red-black tree.
 - Use a rwlock instead of a mutex, since now we also need to take the
   lock in vm_phys_fictitious_to_vm_page, and it can be shared.
2014-07-09 08:12:58 +00:00
marcel
9f28abd980 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
alc
d74e85dbb9 Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are
several reasons for this change:

pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page.  We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page.  The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap.  However, non-wired
mappings may be reclaimed by the pmap at any time.  (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.

pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page.  Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient.  Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time.  This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.

Implementations for arm and powerpc will follow.

Discussed with:	jeff, marcel
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-07-06 17:42:38 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
alc
50c074c333 Delay the call to crhold() in vm_map_insert() until we know that we won't
have to undo it by calling crfree().  This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.

Eliminate an unnecessary variable from vm_map_insert().

Reviewed by:	kib
Tested by:	pho
2014-06-26 16:04:03 +00:00
alc
aeda9d4c41 Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.

Reviewed by:	kib
2014-06-25 03:30:03 +00:00
kib
c2a4e94982 Use correct names for the flags. MAP_ENTRY_GROWS_* have the same
numerical values as MAP_STACK_GROWS_*, but the former is for entries'
eflags, while the later for the cow argument of vm_map_insert().

Submitted by:	alc
2014-06-23 07:03:47 +00:00
kib
bded3c3768 Assert that the new entry is inserted into the right location in the
map entries list, and that it does not overlap with the previous and
next entries.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-20 07:01:53 +00:00
alc
2cd5489f2c Eliminate a pointless call to vm_map_clip_start() from vm_map_growstack().
For this call to do anything at all we would have to have two overlapping
map entries.

Submitted by:	kib
2014-06-19 21:05:07 +00:00
alc
1fa56d3c4e When MAP_STACK_GROWS_{DOWN,UP} are passed to vm_map_insert() set the
corresponding flag(s) in the new map entry.  Previously, the caller was
responsible for setting them after vm_map_insert() returned.

Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when
extending the stack in the downward direction.

Together these changes slightly simplify the caller's task when creating a
downward growing stack.  In particular, the caller no longer needs to clip
the previous entry, because the new stack entry can't possibly coalesce
with the previous entry.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-19 16:26:16 +00:00
kib
893100aa10 Add MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED,
and prevents the request from deleting existing mappings in the
region, failing instead.

Reviewed by:	alc
Discussed with:	jhb
Tested by:	markj, pho (previous version, as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-19 05:00:39 +00:00
attilio
2802c525ad - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
alc
7b0b901024 Tidy up the early parts of vm_map_insert(), in particular, simplify one
of the assertions and eliminate a comment that has grown stale.

Reviewed by:	kib
MFC after:	1 week
2014-06-16 16:37:41 +00:00
alc
5badd21ab6 One of the intentions behind r267254 was that the global variable "sgrowsiz"
would be read once and cached in a local variable so that the resource limit
check and map entry insertion would be guaranteed to use the same value.
However, the value being passed to vm_map_insert() is still from "sgrowsiz"
and not the local variable.  Correct this oversight.

Reviewed by:	kib
2014-06-15 07:52:59 +00:00
mav
2bc26491c3 Introduce new "256 Bucket" zone to split requests and reduce congestion
on "128 Bucket" zone lock.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 11:57:07 +00:00
mav
c7987fc583 Allocating new bucket for bucket zone, never take it from the zone itself,
since it will almost certanly fail.  Take next bigger zone instead.

This situation should not happen with original bucket zones configuration:
"32 Bucket" zone uses "64 Bucket" and vice versa.  But if "64 Bucket" zone
lock is congested, zone may grow its bucket size and start biting itself.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 11:36:22 +00:00
alc
f133c95804 Correct a bug in the management of the population map on big-endian
machines.  Specifically, there was a mismatch between how the routine
allocation and deallocation operations accessed the population map
and how the aggressively optimized reservation-breaking operation
accessed it.  So, problems only occurred when reservations were broken.
This change makes the routine operations access the population map in
the same way as the reservation breaking operation.

This bug was introduced in r259999.

PR:		187080
Tested by:	jmg (on an "armeb" machine)
Sponsored by:	EMC / Isilon Storage Division
2014-06-11 16:11:12 +00:00
kib
7f8a65c9fc Make mmap(MAP_STACK) search for the available address space, similar
to !MAP_STACK mapping requests.  For MAP_STACK | MAP_FIXED, clear any
mappings which could previously exist in the used range.

For this, teach vm_map_find() and vm_map_fixed() to handle
MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new
vm_map_stack_locked() helper, which is factored out from
vm_map_stack().

The side effect of the change is that MAP_STACK started obeying
MAP_ALIGNMENT and MAP_32BIT flags.

Reported by:	rwatson
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-09 03:37:41 +00:00
alc
39548e640f Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
kib
b03014a79c Remove the assert which can be triggered by the userspace. The
situation checked by assert is verified to not take place in
vm_map_wire(), and protection permissions on the wired entry can be
revoked afterward.

Reported by:	markj
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-28 00:45:35 +00:00
alc
549a5817c0 There is no reason to perform the pmap_remove() on the kernel pmap while
the kmem object lock is held.  Do the pmap_remove() before acquiring the
kmem object lock.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-23 16:22:36 +00:00
kib
62a5a6e7ef Remove redundand loop. The inner goto restarts the whole page
handling in the situation identical to the loop condition.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-21 08:19:04 +00:00
kib
235f4b1083 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec.  Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited.  To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
alc
cf3c13370d On a fork allow read-only wired pages to be copy-on-write shared between the
parent and child processes.  Previously, we copied these pages even though
they are read only.  However, the reason for copying them is historical and
no longer exists.  In recent times, vm_map_protect() has developed the
ability to copy pages when write access is added to wired copy-on-write
pages.  So, in this case, copy-on-write sharing of wired pages is not to be
feared.  It is not going to lead to copy-on-write faults on wired memory.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-13 13:20:23 +00:00
kib
ba20870cbd Fix locking. The dst_object must remain locked on the retry of the
loop iteration.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2014-05-11 18:07:07 +00:00
alc
b716a32dc2 With the new-and-improved vm_fault_copy_entry() (r265843), we can always
avoid soft page faults when adding write access to user wired entries in
vm_map_protect().  Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write.  In other words, we avoided the
pages faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-11 17:41:29 +00:00
alc
d283b621dc About 9% of the pmap_protect() calls being performed by vm_map_copy_entry()
are unnecessary.  Eliminate the unnecessary calls.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-10 19:47:00 +00:00
kib
7798d7f7f4 For the upgrade case in vm_fault_copy_entry(), when the entry does not
need COW and is writeable (i.e. becoming writeable due to the
mprotect(2) operation), do not create a new backing object for the
entry.  The caller of the function is vm_map_protect(), the call is
made to ensure that wired entry has all pages resident and wired in
the top level object and to enable the write.  We might need to copy
read-only page from some backing objects into the top object or remap
the page with the write allowed.

This fixes the issue with mishandling of the swap accounting when
read-only wired mapping is upgraded to write-enabled after fork.  The
previous code path did not accounted the new object, but it creation
is redundand anyway and the change provides an optimization for the
non-common situation.

Reported by:	markj
Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 17:03:33 +00:00
kib
32f811c3b5 When printing the map with the ddb 'show procvm' command, do not dump
page queues for the backing objects.  The queues are huge and clutter
the display, when mostly the map entries and its backing storage is
interesting.

The page queues can be seen with ddb 'show object' command.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:36:13 +00:00
kib
90e904e481 Print the entry address in addition to the object. The variable is
typically optimized out and debuggers cannot find its value.

Sponsored by:	    The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:30:48 +00:00
pho
1108602cbc msync(2) must return ENOMEM and not EINVAL when the address is outside the
allowed range or when one or more pages are not mapped. This according to
The Open Group Base Specifications Issue 7.

Discussed with:	 attilio, Bruce Evans
Reviewed by:	 alc, Garrett Cooper
Reported by:	 ATF
MFC after:	 2 weeks
Sponsored by:	EMC / Isilon storage division
2014-05-07 08:38:02 +00:00
alc
c77af15259 Prior to r254304, a separate function, vm_pageout_page_stats(), was used to
periodically update the reference status of the active pages.  This function
was called, instead of vm_pageout_scan(), when memory was not scarce.  The
objective was to provide up to date reference status for active pages in
case memory did become scarce and active pages needed to be deactivated.

The active page queue scan performed by vm_pageout_page_stats() was
virtually identical to that performed by vm_pageout_scan(), and so r254304
eliminated vm_pageout_page_stats().  Instead, vm_pageout_scan() is
called with the parameter "pass" set to zero.  The intention was that when
pass is zero, vm_pageout_scan() would only scan the active queue.  However,
the variable page_shortage can still be greater than zero when memory is not
scarce and vm_pageout_scan() is called with pass equal to zero.
Consequently, the inactive queue may be scanned and dirty pages laundered
even though that was not intended by r254304.  This revision fixes that.

Reported by:	avg
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-06 03:42:04 +00:00
kib
0c45ba8eb0 For the VM_PHYSSEG_DENSE case, checking the requested range to fall
into the area backed by vm_page_array wrongly compared end with
vm_page_array_size.  It should be adjusted by first_page index to be
correct.

Also, the corner and incorrect case of the requested range extending
after the end of the vm_page_array was incorrectly handled by
allocating the segment.

Fix the comparision for the end of range and return EINVAL if the end
extends beyond vm_page_array.

Discussed with:	royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-29 18:42:37 +00:00
kib
c36743d23f When vm_fault_copy_entry() is called from vm_map_protect() for a wired
entry and performs the upgrade of the entry permissions from read-only
to read-write, we must allow to search for the source pages in the
backing object, like we do in the case of forking the read-only wired
entry. For the fork case, the behaviour is allowed by src_readonly
boolean, which in fact is only used to assert that read-write case
provides all source pages in the top-level object.

Eliminate the src_readonly variable.  Allow for the copy loop to look
into the backing objects, add explicit asserts to ensure that only
read-only and upgrade case actually does.

Expand comments. Change the panic call into assert.

Reported by:	markj
Tested by:	markj, pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-27 05:19:01 +00:00
des
65493094e9 Add sysctl OIDs showing the actual size and capacity of the swap zone.
MFC after:	1 week
2014-04-26 12:18:17 +00:00
bdrewery
6fcf6199a4 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
kib
24c4e4a548 Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, for accesses to direct map region should check for the limit by
which direct map is instantiated.

Second, for accesses to the kernel map, success returned from the
kernacc(9) does not guarantee that consequent attempt to read or write
to the checked address succeed, since other thread might invalidate
the address meantime.  Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return error when fault happens on the
MAP_ENTRY_NOFAULT entry, instead of panicing.  The trap handler would
then see a page fault from access, and recover in normal way, making
/dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by:	clusteradm (gjb, sbruno)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 14:25:09 +00:00
kib
07e1d8d74f Initialize vm_map_entry member wiring_thread on the map entry creation.
This was missed in r253190.

Reported by:	hps, peter
Tested by:	hps
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-03-21 13:55:57 +00:00
attilio
f19bbde667 vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock,
then threads can sleep on the pip condition.
Avoid to deadlock such threads by correctly awakening the sleeping ones
after the pip is finished.
swapoff side of the bug can likely result in shutdown deadlocks.

Sponsored by:	EMC / Isilon Storage Division
Reported by:	pho, pluknet
Tested by:	pho
2014-03-19 01:13:42 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
kib
12c58a8fb0 Initialize paddr to handle the case of zero size.
Reported and reviewed by:	Conrad Meyer <cemeyer@uw.edu>
MFC after:	1 week
2014-03-12 16:38:55 +00:00
kib
e4111a6b71 Do not vdrop() the tmpfs vnode until it is unlocked. The hold
reference might be the last, and then vdrop() would free the vnode.

Reported and tested by:	bdrewery
MFC after:	1 week
2014-03-12 15:13:57 +00:00
dim
e21b440a4c After r251709, avoid a clang 3.4 warning about an unused static const
variable (uma_max_ipers), when asserts are disabled.

Reviewed by:	glebius
MFC after:	3 days
2014-02-14 17:47:18 +00:00
attilio
6a31e25bb9 Fix-up r254141: in the process of making a failing vm_page_rename()
a call of pager_swap_freespace() was moved around, now leading to freeing
the incorrect page because of the pindex changes after vm_page_rename().

Get back to use the correct pindex when destroying the swap space.

Sponsored by:	EMC / Isilon storage division
Reported by:	avg
Tested by:	pho
MFC after:	7 days
2014-02-14 03:34:12 +00:00
glebius
b1650f2d1e Fix function name in KASSERT().
Submitted by:	hiren
2014-02-12 20:11:20 +00:00
jhb
ee403b8a8c Correct assertion to assert that the existing device VM object uses the
same type rather than asserting in the case where we just created a new
VM object.

Reviewed by:	kib
2014-02-11 22:05:21 +00:00
glebius
45bf1cc683 Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized.
Sponsored by:	Nginx, Inc.
2014-02-10 19:59:46 +00:00
glebius
613c5f4e53 Style. 2014-02-10 19:51:15 +00:00
glebius
1861286fed Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones.
Sponsored by:	Nginx, Inc.
2014-02-10 19:48:26 +00:00
alc
43e9da37b5 Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time.
Successful prefaults after a zero-fill fault are extremely rare.
2014-02-09 01:59:52 +00:00
glebius
e8c2426587 Provide macros that allow easily export uma(9) zone limits and
current usage via sysctl(9):

  SYSCTL_UMA_MAX()
  SYSCTL_ADD_UMA_MAX()
  SYSCTL_UMA_CUR()
  SYSCTL_ADD_UMA_CUR()

Sponsored by:	Nginx, Inc.
2014-02-07 14:29:03 +00:00
alc
cf63b11b17 Make prefaulting more aggressive on hard faults. Previously, we would only
map a fraction of the pages that were fetched by vm_pager_get_pages() from
secondary storage.  Now, we map them all in order to avoid future soft
faults.  This effect is most evident when a memory-mapped file is accessed
sequentially.  Previously, there were 6 soft faults for every hard fault.
Now, these soft faults are eliminated.

Sponsored by:	EMC / Isilon Storage Division
2014-02-02 20:21:53 +00:00
alc
50a7eacf05 In an effort to diagnose possible corruption of struct vm_page on some
sparc64 machines make the page queue assert in vm_page_dequeue() more
precise.  While I'm here switch the page lock assert to the newer style.
2014-01-24 19:08:42 +00:00
jhb
14337759da Fix a couple of typos. 2014-01-21 03:27:47 +00:00
glebius
681dcc3c57 ANSIfy declarations.
Ok'ed by:	alc
2014-01-20 18:47:56 +00:00
alc
d98c6ca3a1 Style changes in vm_pageout_scan():
1. Be consistent in the style of "act_delta" manipulations between the
   inactive and active queue scans.

2. Explicitly compare to zero.

3. The deactivation of a page is based is based on its recent history
   and not just the current call to vm_pageout_scan().  The variable
   "act_delta" represents the current state of the page, and not its
   history.  Avoid possible confusion by not (ab)using "act_delta" for
   the making the deactivation decision.

Submitted by:	kib [1]
Reviewed by:	kib [2,3]
2014-01-18 20:02:59 +00:00
alc
ed1e11749f Correctly update the count of stuck pages, "addl_page_shortage", in
vm_pageout_scan().  There were missing increments in two less common cases.

Don't conflate the count of stuck pages and the pageout deficit provided by
vm_page_alloc{,_contig}().  (A proposed fix to the OOM code depends on this.)

Handle held pages consistently in the inactive queue scan.  In the more
common case, we did not move the page to the tail of the queue.  Whereas, in
the less common case, we did.  There's no particular reason to move the page
in the less common case, so remove it.

Perform the calculation of the page shortage for the active queue scan a
little earlier, before the active queue lock is acquired.  The correctness
of this calculation doesn't depend on the active queue lock being held.

Eliminate a redundant variable, "pcount".  Use the more descriptive
variable, "maxscan", in its place.

Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess
parentheses.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-01-12 19:04:20 +00:00
alc
d0ccfff2c5 Since the introduction of the popmap to reservations in r259999, there is
no longer any need for the page's PG_CACHED and PG_FREE flags to be set and
cleared while the free page queues lock is held.  Thus, vm_page_alloc(),
vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after
the free page queues lock is released to clear the page's flags.  Moreover,
the PG_FREE flag can be retired.  Now that the reservation system no longer
uses it, its only uses are in a few assertions.  Eliminating these
assertions is no real loss.  Other assertions catch the same types of
misbehavior, like doubly freeing a page (see r260032) or dirtying a free
page (free pages are invalid and only valid pages can be dirtied).

Eliminate an unneeded variable from vm_page_alloc_contig().

Sponsored by:	EMC / Isilon Storage Division
2013-12-31 18:25:15 +00:00
alc
4e0447c7bb Add "popmap" assertions: The page being freed isn't already free, and the
page being allocated isn't already allocated.

Sponsored by:	EMC / Isilon Storage Division
2013-12-29 04:54:52 +00:00
alc
0850ce80b6 MFp4 alc_popmap
Change the way that reservations keep track of which pages are in use.
  Instead of using the page's PG_CACHED and PG_FREE flags, maintain a bit
  vector within the reservation.  This approach has a couple benefits.
  First, it makes breaking reservations much cheaper because there are
  fewer cache misses to identify the unused pages.  Second, it is a pre-
  requisite for supporting two or more reservation sizes.
2013-12-28 04:28:35 +00:00
kib
2e35793d0f Do not coalesce stack entry, vm_map_stack() asserts that the requested
region is claimed by a new entry.

Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to
vm_map_insert() from vm_map_stack(), to really turn off coalescing
code and call to vm_map_simplify_entry() [1].

Reported by:	avg, peter, many
Tested by:	avg, peter
Noted by:	avg [1]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-12-27 16:59:47 +00:00
marcel
df5d69cc0d For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is
that we don't have a good way (yet) to iterate over the mapped pages by
virtual address and simply try each page within the range. Given that we
call pmap_remove() over the entire 2^63 bytes of address space, it takes
a while for pmap_remove to have tried all 2^50 pages.
By using pmap_remove_pages() we use the PV list to find all mappings.

Change derived from a patch by: alc
2013-12-26 05:46:10 +00:00
dim
7f980c6758 In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as
argument, cast the incoming 0 argument to void *, to silence a warning
from clang 3.4 ("expression which evaluates to zero treated as a null
pointer constant of type 'void *' [-Wnon-literal-null-conversion]").

MFC after:	3 days
2013-12-25 22:32:34 +00:00