fragmented conditions currently just wakes up the pagedaemon. The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still. The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.
To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly. The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.
Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero. This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)
Differential Revision: https://reviews.freebsd.org/D2423
Reviewed by: alc, kib (earlier version)
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc.
a basic ACPI SLIT table parser.
For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.
* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.
Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
should be done not with the number of pages in the first block, but with the
overall number of pages. While here, add KASSERT that makes sure that BMAP
doesn't return completely irrelevant blocks.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
in the main swapper work cycle, do it in the sysctl handler. This removes
extra mutex acquisition from the main cycle and makes the sysctl knob return
error on an invalid value, instead of accepting and fixing it.
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.
Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf. The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list. This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system. Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character. The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.
Reviewed by: imp, emax
Obtained from: Netflix, Inc.
MFC after: 3 days
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).
Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
users, myself included. The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.
Obtained from: Netflix, Inc.
MFC after: immediately
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.
Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.
Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.
Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.
Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
function that does the locking and validation associated with cleaning
a page. This moves 150 lines of code into its own function.
- Rename vm_pageout_clean() to vm_pageout_cluster() to define what it
really does; clustering nearby pages for pageout optimization.
Reviewd by: alc, kib, kmacy
Tested by: pho (earlier version)
Sponsored by: EMC / Isilon
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case. The infrequent case being reading a single file that
is much larger than memory using mmap(2). And, in this case, the page
daemon isn't the bottleneck; it's the I/O.
In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused. To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes. In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times. With the new heuristic, I see fewer
false positives and they have a much lower cost.
I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic. However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise(). In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.
Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.
Reviewed by: jeff, kib
Sponsored by: EMC / Isilon Storage Division
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.
Note to self: When this is MFCed, sparc64 needs the same fix.
Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks
Swap device is still reported as enabled, and system still may crash later
if some swapped-out kernel pages were lost with the device, but at least
GEOM and CAM can now release the lost disk, allowing it to be reconnected.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialsed. We manually fetch the
variable from the environment to work around this problem.
Tested by: Keith White kwhite at uottawa.ca
MFC after: 1 week
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)
Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks
promoted" panics. The sequence of events that leads to a panic is rather
long and circuitous. First, suppose that process P has a promoted
superpage S within vm object O that it can write to. Then, suppose that P
forks, which leads to S being write protected. Now, before P's child
exits, suppose that P writes to another virtual page within O. Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page. Then, before
P can fault on S, P's child exists. Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page. Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way. However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics! Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O. This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted. Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken. It makes little sense. Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.
The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier. There, we will quietly break the promoted reservation. While
illogical and unintended, breaking the reservation is essentially
harmless.
PR: 198163
Reviewed by: kib
Tested by: pho
X-MFC after: r267213
Sponsored by: EMC / Isilon Storage Division
- Allow to call the function with vm object lock held.
- Allow to specify reqpage that doesn't match any page in the region,
meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.
Reviewed by: alc, kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
strings returned to userland include the nulterm byte.
Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)
Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.
PR: 195668
synchronous and asynchronous requests. The latter can saturate the
I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
if we are low on certain kind of pbufs don't even proceed to BMAP,
but sleep.
Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.
The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied. Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.
Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
kill a process, when the system runs out of memory. Defaults to off.
Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.
In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.
As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune. However,
a panic approach is still useful in some environments. This is primarily
intended as a development/debugging tool.
Differential Revision: D1627
Reviewed by: kib
MFC after: 1 week