queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.
Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.
(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
should not assume that vm_pages_needed will remain set while it sleeps.
Other threads can clear vm_pages_needed by performing a sufficient
number of vm_page_free() calls, e.g., process termination. The effect
of this error was that vm_pageout_worker() would free and/or launder
pages when, in fact, there was no shortage of free pages.
Rewrite a nearby comment to describe all of the possible cases and not
just the most common case. The problem being that the comment made
the most common case seem like the only case.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
with a reference count of zero can't possibly be mapped, so there is never a
need for vm_page_set_invalid() to call pmap_remove_all() on them.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
mapped address without valid pte installed, when parallel wiring of
the entry happen. The entry must be copy on write. If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear. After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check. Note that this is race, if the address is accessed after
the wiring is done, the entry does not fault at all.
There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it. Initially this
was done in r24666, since that kernel did not supported proper
copy-on-write for wired text, which was fixed in r199869. The r251901
revision re-introduced the r24666 fix for the current VM.
Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW. In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared. Put the
assert stating the invariant, instead of returning the error.
Reported and debugging help by: peter
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
locking and doesn't sleep. Flag the consumer we create as such. In
addition, decrement the in flight index when we have an out of memory
error after having incremented it previously. This would have
prevented swapoff from working if the swap pager ever hit a resource
shortage trying to swap out something (the swap in path always waits
for a bio, so won't have this issue). Simplify the close logic by
abandoning the use of private and initializing the index to 1 and
dropping that reference when we previously set private.
Also, set sw_id only while sw_dev_mtx is held. This should only affect
swapping to a vnode, as opposed to a geom whose close always sets it to
NULL with sw_dev_mtx held.
Differential Review: https://reviews.freebsd.org/D3547
so that there is only one place where pages are freed and only one place
where pages are moved to the tail of the queue.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
pages will have left the inactive queue before the page daemon performs
its next scan. Also, ignore references to pages from terminated objects.
This allows the clean pages to be freed a little sooner.
Move some comments to their proper place, i.e., next to the code that
they describe, and update other nearby comments.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Objects obtained from such zones are supposed to retain type stability,
which was violated by aforementioned trashing.
This is a follow-up to r284861.
Discussed with: kib
This was added in r51337 as part of the implementation of
madvise(MADV_DONTNEED). Its objective was to ensure that the page daemon
would eventually reclaim other unreferenced pages (i.e., unreferenced pages
not touched by madvise()) from the active queue.
Now that the pagedaemon performs steady scanning of the active page queue,
this weighted handling is unnecessary. Instead, always "cache" clean pages
by moving them to the head of the inactive page queue. This simplifies the
implementation of vm_page_advise() and eliminates the fragmentation that
resulted from the distribution of pages among multiple queues.
Suggested by: alc
Reviewed by: alc
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3401
it may involve a pmap operation that iterates over the page's PV list, so
unnecessarily holding the page lock is undesirable.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Provide and document the RANDOM_ENABLE_UMA option.
Change RANDOM_FAST to RANDOM_UMA to clarify the harvesting.
Remove RANDOM_DEBUG option, replace with SDT probes. These will be of
use to folks measuring the harvesting effect when deciding whether to
use RANDOM_ENABLE_UMA.
Requested by: scottl and others.
Approved by: so (/dev/random blanket)
Differential Revision: https://reviews.freebsd.org/D3197
Currently vm_pageout_scan() uses a ticks-based scheme to rate-limit
the number of times that the vm_lowmem event will happen. However
if no events happen for long enough for ticks to roll over, this
leaves us in a long window in which vm_lowmem events will not
happen.
Replace the use of ticks with time_t to prevent rollover from ever
being an issue.
Reviewed by: ian
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3439
to vm_page_try_to_cache() from vm_pageout_flush(). Other changes, most
recently r286814, have made this call unnecessary.
Reviewed by: kib
Discussed with: jeff
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
initial thread stack is not adjusted by the tunable, the stack is
allocated too early to get access to the kernel environment. See
TD0_KSTACK_PAGES for the thread0 stack sizing on i386.
The tunable was tested on x86 only. From the visual inspection, it
seems that it might work on arm and powerpc. The arm
USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already
incorrect for the threads with non-default kstack size. I only
changed the macros to use variable instead of constant, since I cannot
test.
On arm64, mips and sparc64, some static data structures are sized by
KSTACK_PAGES, so the tunable is disabled.
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Fixes "panic: vm_radix_reserve_kva: unable to reserve KVA" caused by sign
extention of "pages * UMA_SLAB_SIZE" value passed to kva_alloc() which
takes unsigned long argument.
In the erroneus case that triggered this bug, the number of pages
to allocate in uma_zone_reserve_kva() was 0x8ebe6, that gave the
total number of bytes to allocate equal to 0x8ebe6000 (int).
This was then sign extended in kva_alloc() to 0xffffffff8ebe6000
(unsigned long).
Reviewed by: alc, kib
Submitted by: Zbigniew Bodek <zbb@semihalf.com>
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3346
vm_offset_t pmap_quick_enter_page(vm_page_t m)
void pmap_quick_remove_page(vm_offset_t kva)
These will create and destroy a temporary, CPU-local KVA mapping of a specified page.
Guarantees:
--Will not sleep and will not fail.
--Safe to call under a non-sleepable lock or from an ithread
Restrictions:
--Not guaranteed to be safe to call from an interrupt filter or under a spin mutex on all platforms
--Current implementation does not guarantee more than one page of mapping space across all platforms. MI code should not make nested calls to pmap_quick_enter_page.
--MI code should not perform locking while holding onto a mapping created by pmap_quick_enter_page
The idea is to use this in busdma, for bounce buffer copies as well as virtually-indexed cache maintenance on mips and arm.
NOTE: the non-i386, non-amd64 implementations of these functions still need review and testing.
Reviewed by: kib
Approved by: kib (mentor)
Differential Revision: http://reviews.freebsd.org/D3013
which constitute the majority of the pages that are processed by
vm_fault_dontneed(), are already near the tail of the inactive queue. Only
the pages at faulting virtual addresses are actually moved by
vm_page_advise(..., MADV_DONTNEED). However, vm_page_advise(...,
MADV_DONTNEED) is simultaneously too aggressive and passive for the moved
pages. It makes most of these pages too easily reclaimable, and at the same
time it leaves enough pages in the active queue to trigger pageouts by the
page daemon. Instead, with this change, the pages at faulting virtual
addresses are moved to the tail of the inactive queue, where they are
relatively close to the pages prefetched by the same page fault.
Discussed with: jeff
Sponsored by: EMC / Isilon Storage Division
the VM_FAULT_CHANGE_WIRING flag to VM_FAULT_WIRE. Assert that the
flag is only passed when faulting on the wired map entry. Remove the
vm_page_unwire() call, which should be never reachable.
Since VM_FAULT_WIRE flag implies wired map entry, the TRYPAGER() macro
is reduced to the testing of the fs.object having a default pager.
Inline the check.
Suggested and reviewed by: alc
Tested by: pho (previous version)
MFC after: 1 week
'buf' is inconvenient and has lead me to some irritating to discover
bugs over the years. It also makes it more challenging to refactor
the buf allocation system.
- Move swbuf and declare it as an extern in vfs_bio.c. This is still
not perfect but better than it was before.
- Eliminate the unused ffs function that relied on knowledge of the buf
array.
- Move the shutdown code that iterates over the buf array into vfs_bio.c.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF. Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.
The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.
The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire. This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.
Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundand, we do not remove wired pages.
Reported by: trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.
In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
However, I've observed the active queue scan stopping when there are
frequent free page shortages and the inactive queue is steadily refilled
by other mechanisms, such as the sequential access heuristic in vm_fault()
or madvise(2). To remedy this problem, record the time of the last active
queue scan, and always scan a number of pages proportional to the time
since the last scan, regardless of whether that last scan was a
timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass >
0") scan.
Also, on a timeout-triggered scan, allow a full scan of the active queue
when the system is short of inactive pages.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division
user address when ABI uses shared page.
Note that the change is no-op for correctness, since shared page does
not fault. The mapping for the shared page is installed at the
address space creation, the page is unmanaged and its pte/pv entry
cannot be reclaimed.
Submitted by: Oliver Pinter
Review: https://reviews.freebsd.org/D2954
MFC after: 1 week
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.
* src/share/man/...
- Update the man pages.
* src/etc/...
- The startup/shutdown work is done in D2924.
* src/UPDATING
- Add UPDATING announcement.
* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.
* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.
* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
selection is the way to go.
* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.
* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.
* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.
* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.
* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.
Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
they coould be dirty. Move the handling if the invalid pages in the
inactive scan earlier.
Remove some code duplication in the scan by introducing the
'drop_page' label, which centralizes the object and the page unlock.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
pages in vm_pageout_scan(). The reactivation rate of cache pages created
by vm_pageout_scan() is extremely low; typically no more than 0.5% to
2.25% of the pages are ever reactivated. At the same time, caching pages
is more expensive than freeing them. For example, in a test with
PostgreSQL, this change reduced the amount of time spent in the inactive
queue scan by 1/6.
Differential Revision: https://reviews.freebsd.org/D2805
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Use the same scheme implemented to manage credentials.
Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
correctly handle deallocation requests of two or more gigabytes in size.
Eventually, this would lead to a panic elsewhere in the kernel, such as
"vm_radix_insert: key <vm_pindex_t> is already present".
Reported by: Ilias Marinos
MFC after: 1 week
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.
The vm_mmap() function is now split up into two functions. A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.
The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings. For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead. The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset. The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.
The fo_mmap() hook is optional. If it is not set, then fo_mmap() will
fail with ENODEV. A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).
While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead. While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.
Differential Revision: https://reviews.freebsd.org/D2658
Reviewed by: alc (glanced over), kib
MFC after: 1 month
Sponsored by: Chelsio
When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.
This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).
Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)
examined via 'vmstat -o'. It can be used to determine which files are
using physical pages of memory and how much each is using.
Differential Revision: https://reviews.freebsd.org/D2277
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc. (forward porting to HEAD/10)
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
r283162.
Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq()
with the page not completely satisfying vm_page_free() assertions.
The page is not owned by the object, since insertion failed. But
besides m->object reset to NULL, we should also set VPO_UNMANAGED flag
for consistency.
Reported by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
fragmented conditions currently just wakes up the pagedaemon. The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still. The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.
To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly. The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.
Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero. This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)
Differential Revision: https://reviews.freebsd.org/D2423
Reviewed by: alc, kib (earlier version)
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc.
a basic ACPI SLIT table parser.
For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.
* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.
Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
should be done not with the number of pages in the first block, but with the
overall number of pages. While here, add KASSERT that makes sure that BMAP
doesn't return completely irrelevant blocks.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
in the main swapper work cycle, do it in the sysctl handler. This removes
extra mutex acquisition from the main cycle and makes the sysctl knob return
error on an invalid value, instead of accepting and fixing it.
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.
Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf. The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list. This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system. Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character. The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.
Reviewed by: imp, emax
Obtained from: Netflix, Inc.
MFC after: 3 days
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).
Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
users, myself included. The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.
Obtained from: Netflix, Inc.
MFC after: immediately
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.
Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.
Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.
Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.
Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
function that does the locking and validation associated with cleaning
a page. This moves 150 lines of code into its own function.
- Rename vm_pageout_clean() to vm_pageout_cluster() to define what it
really does; clustering nearby pages for pageout optimization.
Reviewd by: alc, kib, kmacy
Tested by: pho (earlier version)
Sponsored by: EMC / Isilon
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case. The infrequent case being reading a single file that
is much larger than memory using mmap(2). And, in this case, the page
daemon isn't the bottleneck; it's the I/O.
In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused. To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes. In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times. With the new heuristic, I see fewer
false positives and they have a much lower cost.
I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic. However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise(). In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.
Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.
Reviewed by: jeff, kib
Sponsored by: EMC / Isilon Storage Division
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.
Note to self: When this is MFCed, sparc64 needs the same fix.
Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks
Swap device is still reported as enabled, and system still may crash later
if some swapped-out kernel pages were lost with the device, but at least
GEOM and CAM can now release the lost disk, allowing it to be reconnected.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialsed. We manually fetch the
variable from the environment to work around this problem.
Tested by: Keith White kwhite at uottawa.ca
MFC after: 1 week
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)
Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks
promoted" panics. The sequence of events that leads to a panic is rather
long and circuitous. First, suppose that process P has a promoted
superpage S within vm object O that it can write to. Then, suppose that P
forks, which leads to S being write protected. Now, before P's child
exits, suppose that P writes to another virtual page within O. Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page. Then, before
P can fault on S, P's child exists. Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page. Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way. However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics! Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O. This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted. Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken. It makes little sense. Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.
The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier. There, we will quietly break the promoted reservation. While
illogical and unintended, breaking the reservation is essentially
harmless.
PR: 198163
Reviewed by: kib
Tested by: pho
X-MFC after: r267213
Sponsored by: EMC / Isilon Storage Division
- Allow to call the function with vm object lock held.
- Allow to specify reqpage that doesn't match any page in the region,
meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.
Reviewed by: alc, kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
strings returned to userland include the nulterm byte.
Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)
Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.
PR: 195668
synchronous and asynchronous requests. The latter can saturate the
I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
if we are low on certain kind of pbufs don't even proceed to BMAP,
but sleep.
Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.
The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied. Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.
Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
kill a process, when the system runs out of memory. Defaults to off.
Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.
In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.
As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune. However,
a panic approach is still useful in some environments. This is primarily
intended as a development/debugging tool.
Differential Revision: D1627
Reviewed by: kib
MFC after: 1 week