synchronous and asynchronous requests. The latter can saturate the
I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
if we are low on certain kind of pbufs don't even proceed to BMAP,
but sleep.
Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.
The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied. Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.
Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
kill a process, when the system runs out of memory. Defaults to off.
Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.
In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.
As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune. However,
a panic approach is still useful in some environments. This is primarily
intended as a development/debugging tool.
Differential Revision: D1627
Reviewed by: kib
MFC after: 1 week
of an vm space may require obtaining sleepable locks. Hold the
process to keep the pointer valid, and change trylock to lock, since
there is no longer two process locks owned simultaneously in
vm_pageout_oom().
Note that after the process lock is dropped, process might exec, and
no longer qualify as the owner of biggest vm space.
In collaboration with: rstone
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
handler. For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on. Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.
The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library. (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.) Suppose that we have a
code section consisting of 32 pages. Further, suppose that we access
pages 4, 28, and 16 in that order. Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages. In general, truncated
clusters are more common than full clusters.
To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size. This results in many fewer
truncated clusters. Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will now include page 16. So, the access to page 16 will
no longer page fault and perform an I/O operation.
Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed. However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed. Moreover, the extra resident
pages allowed for many more superpage mappings. For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%. Finally, this leads to a small but
measureable reduction in execution time.
In collaboration with: Emily Pettigrew <ejp1@rice.edu>
Differential Revision: https://reviews.freebsd.org/D1500
Reviewed by: jhb, kib
MFC after: 6 weeks
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Some old libraries may be used even with newer binaries (specifically the
Nvidia driver libraries).
Differential Revision: https://reviews.freebsd.org/D1262
Reviewed by: kib
vp->v_vflag without taking vnode lock and without bypass. We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle. Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping. In this case,
the execve() code manages the VV_TEXT flag on its own already.
It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.
Reported by: glebius, scottl
Tested by: glebius, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
uma_reclaim(). Reclamation code must not see half-constructed or
destructed zones. Do this by bracing uma_zcreate() and uma_zdestroy()
into a shared-locked sx, and take the sx exclusively in uma_reclaim().
Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.
Another solution could be to only expose a new keg on uma_kegs list
after the corresponding zone is fully constructed, and similar
treatment for the destruction. But it probably requires more risky
code rearrangement as well.
Reported and tested by: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings. To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().
Discussed with: Svatopluk Kraus
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
default_pager_putpages() and swap_pager_putpages().
It is the same fix as was done for vnode_pager_putpages()
in r271586.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.
The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.
The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.
Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.
My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.
My Nomex pants are on. Let the feedback commence!
Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by: so(des)
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Older binaries are still permitted to use these flags.
PR: 193961 (exp-run in ports)
Differential Revision: https://reviews.freebsd.org/D848
Reviewed by: kib
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
compatible with write-locked path. Test for MAP_ENTRY_NOSYNC and set
VPO_NOSYNC for pages with dirty mask zero (this does not exclude a
possibility that the page is dirty, e.g. due to read fault on
writeable mapping and consequent write; the same issue exists in the
slow path).
Use helper vm_fault_dirty() to unify fast and slow path handling of
VPO_NOSYNC and setting the dirty mask.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Acquire the lock in read mode when just needed to ensure the stability
of the keg list. The UMA lock may be held for a long time (relatively
speaking) in uma_reclaim() on machines with lots of zones/kegs. If the
uma_timeout() would fire during that period, subsequent callouts on that
CPU may be significantly delayed.
Reviewed by: jhb
Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not)
the uma_mtx, but we would attempt to unlock and relock the mutex if we
had to sleep because the zone was already draining. The M_NOWAIT callers
may hold the uma_mtx, but we do not sleep in that case.
Reviewed by: jhb
MFC after: 3 days
interrupts and report the largest value seen as sysctl
debug.max_kstack_used. Useful to estimate how close the kernel stack
size is to overflow.
In collaboration with: Larry Baird <lab@gta.com>
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.
Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.
Bring in missing upstream only changes which move unused code to
further eliminate code differences.
Add additional DTRACE probe to aid monitoring of ARC behaviour.
Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.
Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.
PR: 187594
Review: D702
Reviewed by: avg
MFC after: 1 week
X-MFC-With: r270759 & r270861
Sponsored by: Multiplay
the wired attribute of the mapping. As result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost. Avoid calling
vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the
pages must be already mapped.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
MAP_PRIVATE flags to MAP_SHARED. Apparently, some code in tree, in
particular, libgeom, relied on this behaviour, see r271721. For
regular file types, the absence of the flags is interpreted as
MAP_PRIVATE, and libc nlist used this (fixed in r271723).
Allow the implicit flags for legacy binaries. Bump __FreeBSD_version
to get the ABI note on new binaries to check for in mmap code.
Remove the test for presence of one of the MAP_ANON, MAP_SHARED or
MAP_PRIVATE flags before fget_mmap(). For MAP_ANON, we already verify
that passed fd == -1. For fd != -1, test after fget_mmap() (for newer
binaries) covers the case.
Reported by: bdrewery, pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
mappings.
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D698
Eliminate an exclusive object lock acquisition and release on the expected
execution path.
Do page zeroing before the object lock is acquired rather than during the
time that the object lock is held.
Use vm_pager_free_nonreq() to eliminate duplicated code.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division