Swap device is still reported as enabled, and system still may crash later
if some swapped-out kernel pages were lost with the device, but at least
GEOM and CAM can now release the lost disk, allowing it to be reconnected.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialsed. We manually fetch the
variable from the environment to work around this problem.
Tested by: Keith White kwhite at uottawa.ca
MFC after: 1 week
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)
Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks
promoted" panics. The sequence of events that leads to a panic is rather
long and circuitous. First, suppose that process P has a promoted
superpage S within vm object O that it can write to. Then, suppose that P
forks, which leads to S being write protected. Now, before P's child
exits, suppose that P writes to another virtual page within O. Since the
pages within O are copy on write, a shadow object for O is created to
house the new physical copy of the faulted on virtual page. Then, before
P can fault on S, P's child exists. Now, when P faults on S, it will
follow the "optimized" path for copy-on-write faults in vm_fault(),
wherein the underlying physical page is moved from O to its shadow object
rather than allocating a new page and copying the new page's contents from
the old page. Moreover, suppose that every 4 KB physical page making up S
is moved to the shadow object in this way. However, the optimized path
does not move the underlying superpage reservation, which is the root
cause of the panics! Ultimately, P performs vm_object_collapse() on O's
shadow object, which destroys O and in doing so breaks any reservations
still belonging to O. This leaves the reservation underlying S in an
inconsistent state: It's simultaneously not in use and promoted. Breaking
a reservation does not demote it because I never intended for a promoted
reservation to be broken. It makes little sense. Finally, this
inconsistency leads to an assertion failure the next time that the
reservation is used.
The failing assertion does not (currently) exist in FreeBSD 10.x or
earlier. There, we will quietly break the promoted reservation. While
illogical and unintended, breaking the reservation is essentially
harmless.
PR: 198163
Reviewed by: kib
Tested by: pho
X-MFC after: r267213
Sponsored by: EMC / Isilon Storage Division
- Allow to call the function with vm object lock held.
- Allow to specify reqpage that doesn't match any page in the region,
meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.
Reviewed by: alc, kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
strings returned to userland include the nulterm byte.
Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)
Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.
PR: 195668
synchronous and asynchronous requests. The latter can saturate the
I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
if we are low on certain kind of pbufs don't even proceed to BMAP,
but sleep.
Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.
The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied. Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.
Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
kill a process, when the system runs out of memory. Defaults to off.
Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.
In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.
As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune. However,
a panic approach is still useful in some environments. This is primarily
intended as a development/debugging tool.
Differential Revision: D1627
Reviewed by: kib
MFC after: 1 week
of an vm space may require obtaining sleepable locks. Hold the
process to keep the pointer valid, and change trylock to lock, since
there is no longer two process locks owned simultaneously in
vm_pageout_oom().
Note that after the process lock is dropped, process might exec, and
no longer qualify as the owner of biggest vm space.
In collaboration with: rstone
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
handler. For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on. Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.
The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library. (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.) Suppose that we have a
code section consisting of 32 pages. Further, suppose that we access
pages 4, 28, and 16 in that order. Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages. In general, truncated
clusters are more common than full clusters.
To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size. This results in many fewer
truncated clusters. Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will now include page 16. So, the access to page 16 will
no longer page fault and perform an I/O operation.
Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed. However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed. Moreover, the extra resident
pages allowed for many more superpage mappings. For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%. Finally, this leads to a small but
measureable reduction in execution time.
In collaboration with: Emily Pettigrew <ejp1@rice.edu>
Differential Revision: https://reviews.freebsd.org/D1500
Reviewed by: jhb, kib
MFC after: 6 weeks
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Some old libraries may be used even with newer binaries (specifically the
Nvidia driver libraries).
Differential Revision: https://reviews.freebsd.org/D1262
Reviewed by: kib
vp->v_vflag without taking vnode lock and without bypass. We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle. Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping. In this case,
the execve() code manages the VV_TEXT flag on its own already.
It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.
Reported by: glebius, scottl
Tested by: glebius, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
uma_reclaim(). Reclamation code must not see half-constructed or
destructed zones. Do this by bracing uma_zcreate() and uma_zdestroy()
into a shared-locked sx, and take the sx exclusively in uma_reclaim().
Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.
Another solution could be to only expose a new keg on uma_kegs list
after the corresponding zone is fully constructed, and similar
treatment for the destruction. But it probably requires more risky
code rearrangement as well.
Reported and tested by: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings. To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().
Discussed with: Svatopluk Kraus
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
default_pager_putpages() and swap_pager_putpages().
It is the same fix as was done for vnode_pager_putpages()
in r271586.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.
The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.
The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.
Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.
My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.
My Nomex pants are on. Let the feedback commence!
Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by: so(des)
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Older binaries are still permitted to use these flags.
PR: 193961 (exp-run in ports)
Differential Revision: https://reviews.freebsd.org/D848
Reviewed by: kib
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
compatible with write-locked path. Test for MAP_ENTRY_NOSYNC and set
VPO_NOSYNC for pages with dirty mask zero (this does not exclude a
possibility that the page is dirty, e.g. due to read fault on
writeable mapping and consequent write; the same issue exists in the
slow path).
Use helper vm_fault_dirty() to unify fast and slow path handling of
VPO_NOSYNC and setting the dirty mask.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Acquire the lock in read mode when just needed to ensure the stability
of the keg list. The UMA lock may be held for a long time (relatively
speaking) in uma_reclaim() on machines with lots of zones/kegs. If the
uma_timeout() would fire during that period, subsequent callouts on that
CPU may be significantly delayed.
Reviewed by: jhb
Callers of zone_drain_wait(M_WAITOK) do not need to hold (and were not)
the uma_mtx, but we would attempt to unlock and relock the mutex if we
had to sleep because the zone was already draining. The M_NOWAIT callers
may hold the uma_mtx, but we do not sleep in that case.
Reviewed by: jhb
MFC after: 3 days
interrupts and report the largest value seen as sysctl
debug.max_kstack_used. Useful to estimate how close the kernel stack
size is to overflow.
In collaboration with: Larry Baird <lab@gta.com>
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.
Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.
Bring in missing upstream only changes which move unused code to
further eliminate code differences.
Add additional DTRACE probe to aid monitoring of ARC behaviour.
Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.
Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.
PR: 187594
Review: D702
Reviewed by: avg
MFC after: 1 week
X-MFC-With: r270759 & r270861
Sponsored by: Multiplay
the wired attribute of the mapping. As result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost. Avoid calling
vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the
pages must be already mapped.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
MAP_PRIVATE flags to MAP_SHARED. Apparently, some code in tree, in
particular, libgeom, relied on this behaviour, see r271721. For
regular file types, the absence of the flags is interpreted as
MAP_PRIVATE, and libc nlist used this (fixed in r271723).
Allow the implicit flags for legacy binaries. Bump __FreeBSD_version
to get the ABI note on new binaries to check for in mmap code.
Remove the test for presence of one of the MAP_ANON, MAP_SHARED or
MAP_PRIVATE flags before fget_mmap(). For MAP_ANON, we already verify
that passed fd == -1. For fd != -1, test after fget_mmap() (for newer
binaries) covers the case.
Reported by: bdrewery, pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
mappings.
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D698
Eliminate an exclusive object lock acquisition and release on the expected
execution path.
Do page zeroing before the object lock is acquired rather than during the
time that the object lock is held.
Use vm_pager_free_nonreq() to eliminate duplicated code.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division
by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq().
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 6 weeks (after r271596)
path through the NFS clients' getpages functions.
Introduce vm_pager_free_nonreq(). This function can be used to eliminate
code that is duplicated in many getpages functions. Also, in contrast to
the code that currently appears in those getpages functions,
vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one
case.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division
The changes should not modify the generated code.
The pager->pgo_putpages() method takes int flags as its fourth
argument, while vnode_pager_putpages() used boolean_t (which is
typedef'ed to int). The flags are from VM_PAGER_* namespace, while
vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(),
which both are numerically equal to VM_PAGER_PUT_SYNC.
Noted and reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
isn't being allocated for the last of the requested pages, because a
reservation won't fit in the gap between allocated pages, then the
reservation structure shouldn't be initialized.
While I'm here, improve the nearby comments.
Reported by: jeff, pho
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).
This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.
The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.
We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.
Credit to Karl Denninger for the original patch on which this update was
based.
PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay
"vm_paging_target() > 0" was a reasonable way of determining if the
inactive queue scan met its target. However, now that other threads
can be allocating pages while the inactive queue scan is running, it's
an unreliable method. The effect of it being unreliable is that we
can start swapping out processes when we didn't intend to.
This issue has existed since the kernel was multithreaded, but the
changes to the inactive queue target in 10.0-RELEASE have made its
effects visible.
This change introduces a more direct method for determining if the
inactive queue scan met its target that is not affected by the actions
of other threads.
Reported by: Steve Polyack
Tested by: pho, Steve Polyack (an earlier version)
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
called a scalable path. When several preconditions hold, the vm
object lock for the object containing the faulted page is taken in
read mode, instead of write, which allows parallel faults processing
in the region.
Namely, the fast path is taken when the faulted page already exists
and does not need copy on write, is already fully valid, and not busy.
For technical reasons, fast path is avoided when the fault is the
first write on the vnode object, or when the fault is for wiring or
debugger read or write.
On the fast path, pmap_enter(9) is passed the PMAP_ENTER_NOSLEEP flag,
since object lock is kept. Pmap might fail to create the entry, in
which case the fallback to slow path is performed.
Reviewed by: alc
Tested by: pho (previous version)
Hardware provided and hosted by: The FreeBSD Foundation and
Sentex Data Communications
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
page queue even when the allocation is not wired. It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.
In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.
In the uiomove_object_page(), deactivate page before the object is
unlocked. There is no leak, since the page is deactivated after
uiomove_fromphys() finished. But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set
when either the page is busied, or the owner object is locked.
Update comments, move all assertions about page state when
PGA_WRITEABLE flag is set, into new helper
vm_page_assert_pga_writeable().
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
mapping size (currently unused). The flags includes the fault access
bits, wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep.
For powerpc aim both 32 and 64 bit, fix implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified, in particular, wait for the available memory required to
proceed.
In collaboration with: alc
Tested by: nwhitehorn (ppc aim32 and booke)
Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after: 2 weeks
swap pager. Swap pager uses a private mutex to protect swap metadata,
and does not rely on the vm object lock to ensure integrity of it.
Weaken the requirement for the vm object lock by only asserting locked
object in vm_pager_page_unswapped(), instead of locked exclusively.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With the current implementation of managed fictitious ranges when
also using VM_PHYSSEG_DENSE, a user could try to register a
fictitious range that starts inside of vm_page_array, but then
overrruns it (because the end of the fictitious range is greater than
vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE
returning unallocated pages from past the end of vm_page_array. The
same could happen if a user tried to register a segment that starts
outside of vm_page_array but ends inside of it.
In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to
use a set of pages from vm_page_array, and allocate the rest.
Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc
vm/vm_phys.c:
- Allow registering/unregistering fictitious ranges that overrun
vm_page_array.
We continue to use pmap_enter() for that. For unwiring virtual pages, we
now use pmap_unwire(), which unwires a range of virtual addresses instead
of a single virtual page.
Sponsored by: EMC / Isilon Storage Division
pmap_unwire() and vm_object_unwire().
Retire vm_fault_{un,}wire(), since they are no longer used.
(See r268327 and r269134 for the motivation behind this change.)
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
underlying physical pages are mapped by the pmap. If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.
To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.
At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time. Moreover, by operating on
a range, it is superpage friendly. It doesn't waste time performing
unnecessary demotions.
Reported by: markj
Reviewed by: kib
Tested by: pho, jmg (arm)
Sponsored by: EMC / Isilon Storage Division
and tmpfs object cannot shadow. In other words, tmpfs vm object is
always at the bottom of the shadow chain.
Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vnode for the tmpfs node owning this object. The flag is currently
used for two purposes. First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems. Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.
The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.
Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The number of vm fictitious regions was limited to 8 by default, but
Xen will make heavy usage of those kind of regions in order to map
memory from foreign domains, so instead of increasing the default
number, change the implementation to use a red-black tree to track vm
fictitious ranges.
The public interface remains the same.
Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc
Approved by: gibbs
vm/vm_phys.c:
- Replace the vm fictitious static array with a red-black tree.
- Use a rwlock instead of a mutex, since now we also need to take the
lock in vm_phys_fictitious_to_vm_page, and it can be shared.
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
several reasons for this change:
pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page. We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page. The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap. However, non-wired
mappings may be reclaimed by the pmap at any time. (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.
pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page. Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time. This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.
Implementations for arm and powerpc will follow.
Discussed with: jeff, marcel
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
have to undo it by calling crfree(). This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.
Eliminate an unnecessary variable from vm_map_insert().
Reviewed by: kib
Tested by: pho
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.
Reviewed by: kib
numerical values as MAP_STACK_GROWS_*, but the former is for entries'
eflags, while the later for the cow argument of vm_map_insert().
Submitted by: alc
map entries list, and that it does not overlap with the previous and
next entries.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
corresponding flag(s) in the new map entry. Previously, the caller was
responsible for setting them after vm_map_insert() returned.
Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when
extending the stack in the downward direction.
Together these changes slightly simplify the caller's task when creating a
downward growing stack. In particular, the caller no longer needs to clip
the previous entry, because the new stack entry can't possibly coalesce
with the previous entry.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
and prevents the request from deleting existing mappings in the
region, failing instead.
Reviewed by: alc
Discussed with: jhb
Tested by: markj, pho (previous version, as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho