VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.
System maps do not create auto-grown stack. Any stack we handled,
even for P_SYSTEM, must be for user address space. P_SYSTEM processes
with mapped user space is either init(8) or an aio worker attached to
other user process with aio buffer pointing into stack area. In either
case, VM_MAP_WIRE_USER mode should be used.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
dirty. Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.
Reviewed by: kib, markj
MFC after: 1 week
X-MFC after: r319932
- Add asserts that the pages to write are dirty. The last page, if
partially written, is only required to be dirty, while completely
written pages should have all dirty bit set.
- Use uintmax_t to print vm_page pindexes.
- Use NULL instead of casted zero.
- Remove if () test which duplicated the loop ending condition.
- Miscellaneous style fixes.
Reviewed by: alc, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
internal zones only. This allows to create new zones at early stages
of boot, without need to mark them as internal to UMA, which isn't
always true.
Reviewed by: alc
r31386 changed how the size of the VM page array was calculated to be
less wasteful. For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages. However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array. Handle this case by explicitly excluding the page from
phys_avail[].
Reviewed by: alc
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D11000
beginning of a swap area for a disk label. However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.
Reviewed by: kib
MFC after: 5 days
pager used a different scheme for striping the allocation of swap space
across multiple devices. And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size. Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected. Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme. This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.
Reviewed by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11043
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and
r96851.)
Reviewed by: kib, markj
Tested by: pho
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11028
It is simply a contigous virtual memory pointer and number of pages.
There is no need to build a linked list here. Just increment pointer
and decrement counter. The only functional difference to old allocator
is that before we gave pages from topmost and down to lowest, and now
we give them in normal ascending order.
While here remove padalign from a mutex that is unused at runtime.
Reviewed by: alc
nor the correct maximum block size. Moreover, after r318995, it serves
no purpose except to provide information to user space through a read-
sysctl.
This change eliminates the variable "dmmax" but retains the sysctl. It
also corrects the value returned by the sysctl.
Reviewed by: kib, markj
MFC after: 3 days
multiple devices was changed. However, swapoff_one() was not fully and
correctly converted. In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation. Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().
This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER. Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.
Reviewed by: kib, markj
MFC after: 3 days
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
This restores 32bit-sized accesses to vmcnt sysctls, making old
binaries like top(1), systat(8) and reboot(8) mostly functional on
newer kernel.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.
Reported and tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10349
declaration block.
Reviewed by: markj (as part of the larger patch)
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
vnode_pager_generic_putpages() prototype; change the argument name to
reflect that it is flags.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
Simplify the logic for clipping the range returned by the pager to fit
within the map entry.
Use atop() rather than OFF_TO_IDX() on addresses.
Reviewed by: kib
MFC after: 1 week
When re-calculating the last inclusive page index after the pager
call, -1 was erronously ommitted. If the pager extended the run
(unlikely), the result would be insertion of the valid page mapping
outside the current map entry range.
Found by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.
Reviewed by: kib, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D10011
INHERIT_ZERO is an OpenBSD feature.
When a page is marked as such, it would be zeroed
upon fork().
This would be used in new arc4random(3) functions.
PR: 182610
Reviewed by: kib (earlier version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D427
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.
Reported and tested by: royger (previous version)
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.
MFC after: 1 month (if ever)
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.
Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook. No handler makes use of the flags yet.
Reviewed by: markj, kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9764
In vm_fault_prefault(), if backward count causes underflow in
calculation of
starta = addra - backward * PAGE_SIZE;
then starta must be clipped to entry->start, instead of zero.
Clipping to zero allowed mapping outside of the map entries address
ranges, in particular, map at zero.
Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by: alc
MFC after: 1 week
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero. In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.
Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state. Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.
PR: 210315
Reviewed by: trasz
Differential Revision: https://reviews.freebsd.org/D9464
Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.
Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate. At the maping time,
check that offset is not negative. Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment. Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.
For device mappings, leave the semantic of negative offsets to the
driver. Correct object page index calculation to not erronously
propagate sign.
In either case, disallow overflow of offset + size.
Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.
Reported and tested by: royger (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This makes the code to pass whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits. The change provides temporal fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.
PR: 216976
Reported and tested by: ngie
Sponsored by: The FreeBSD Foundation
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.
Reviewed by: ed, dchagin, kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9378
one respect. When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory. This revision
changes the code to match the comments.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9081
We can iterate over consecutive resident pages in the top-level object
using the object's page list rather than by performing lookups in the
object radix tree. This extends one of the optimizations in r312208 to the
case where a shadow chain is present.
Suggested by: alc
Reviewed by: alc, kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9282
HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the
common (module not loaded) case.
In particular this avoids permission checks + lock downgrade
singlethreaded and in cases were an executable mapping is found the pmc
sx lock is no longer bounced.
Note this is a band aid.
MFC after: 1 week
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.
While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9098
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.
The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8921
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.
Reviewed by: alc
If pager' populate method succeeded, but other thread raced with us
and modified vm_map, we must unbusy all pages busied by the pager,
before we retry the whole fault handling. If pager instantiated more
pages than fit into the current map entry, we must unbusy the pages
which are clipped.
Also do some refactoring, clarify comments and use more clear local
variable names.
Reported and tested by: kargl, subbsd@gmail.com (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.
Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc(). Also, eliminate the radix trie lookup performed with the
free page queues lock held.
Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8878
There are two instances of inlined unlocks + continue in vmtotal()
switch statements, which are ordinary expressed with break from the
switch case and code after the switch. Also, the combination of
continue and break statement is redundand.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The count argument natural type if vm_pindex_t, but due to the loop
organization, it has to be signed type to detect the termination
condition. Replace this logic by using distinguished counter for the
processed pages, and terminate loop when the counter exceeds the
argument.
Completely process one swblock for all relevant indexes instead of
doing relookup in hash when incrementing page index on the loop step.
Do not drop hash mutex around iterations.
Noted and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index. Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.
Testing shows that number of new succesful scans, enabled by this
addition, is small but non-zero. When worked out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.
Suggested and reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The object is not yet fully constructed and must not be available to
other threads. This makes default_pager_alloc() almost identical to
swap_pager_alloc_init().
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
in this file actually described the operation of the swap pager, not the
default pager. Given that this is the wrong place to discuss the
implementation of the swap pager, it shouldn't come as a surprise that as
the swap pager evolved these comments became increasingly stale. In
addition, apply some style fixes, like modernizing a few remaining old-
style function definitions.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D8781
independent layer of the virtual memory system. Update some of the nearby
comments to eliminate redundancy and improve clarity.
In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.
Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8752
It allows to provide configurable agressive prefaulting and useful
hints to page daemon about memory allocations, on faults for pages
managed by phys pager. In fact, this implementation is superior to
the MAP_SHARED_PHYS hack from my Postgresql paper, while giving
similar benefits of reducing the page faults numbers on SysV shared
memory mappings.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion. VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object' type.
After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.
The method is opt-in, pager sets OBJ_POPULATE flag to indicate that
the method can be called. If pager' vm objects can be shadowed, pager
must implement the traditional getpages() method in addition to the
populate(). Populate() might fall back to the getpages() on per-call
basis as well, by returning VM_PAGER_BAD error code.
For now for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there is no unmanaged fault handlers which could use it right
now.
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
contain a vm_page_t at the specified index. However, with this
change, vm_radix_remove() no longer panics. Instead, it returns NULL
if there is no vm_page_t at the specified index. Otherwise, it
returns the vm_page_t. The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations. Instead of performing a lookup before every remove,
the pmap can simply perform the remove.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D8708
Some utilities (notably top(1)) exit if any of their input sysctls don't
exist, and the removal of the above-mentioned PG_CACHE-related sysctls
makes it difficult to run such utilities on different versions of the
kernel without recompiling.
Requested by: bde
called to allocate a new page of radix trie nodes, there could be a call to
vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress
vm_radix_insert(). With the removal of PG_CACHED pages, we can simplify
vm_radix_insert() and vm_radix_remove() by removing the flags on the root of
the trie that were used to detect this case and the code for restarting
vm_radix_insert() when it happened.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8664
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache(). Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc(). This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.
Improve nearby comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8628
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Bump __FreeBSD_version to mark this change.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583
doesn't fit into a buf, then trim readbehind and readahead evenly. If
rbehind was limited by the previous BMAP, then roundup its trim to
block size.
- Add KASSERT to check that b_blkno has proper offset from original
blkno returned by BMAP. [1]
- Add KASSERT to check that pages in buf are consecutive.
Reviewed by: kib
Submitted by: kib [1]
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497
pages, specificially, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.
The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.
In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.
A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.
This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.
In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302
Requests which cannot be satisfied by allocators at boot time often
have unrealizable parameters. Waiting for the pagedaemon' start would
hang the boot if done in the thread0 context and just never succeed if
executed from another thread. In fact, for very early stages, sleep
attempt panics with obscure diagnostic about the scheduler state, and
explicit panic in vm_wait() makes the investigation much shorter by
cut off the examination of the thread and scheduler.
Theoretically, some subsystem might grab a resource to exhaustion, and
free it later in the boot process. If this unlikely scenario does
appear for real, the way to diagnose the trouble can be revisited.
Reported by: emaste
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8421
in-progress count and the vnode. Prior to r188331, we always acquired
the vnode lock before incrementing the object's paging-in-progress count.
Now, we increment it before attempting to acquire the vnode lock with
LK_NOWAIT, but we never sleep acquiring the vnode lock while we have the
count incremented.
Reviewed by: kib
MFC after: 3 days
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.
Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366
Convert vm_fault_hold()'s Boolean variables that are only used
internally to "bool". Add a comment describing why the one
remaining "boolean_t" was not converted.
Reviewed by: kib
MFC after: 8 days
pager error, leave the page to the wire owner. E.g. the page might be
a part of the invalidated buffer.
Reported and tested by: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8197
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}
Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.
More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).
In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.
Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.
Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().
Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196
waiters. Otherwise, owners of the shared-busy state are left blocked
and might get into a deadlock.
Note that the vm_page_busy_downgrade() function is not used in the
tree right now.
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8195
by vm_pageout_scan() local to vm_pageout_worker(). There is no reason
to store the pass in the NUMA domain structure.
Reviewed by: kib
MFC after: 3 weeks
target was met.
Previously, vm_pageout_worker() itself checked the length of the free page
queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan
freed enough pages to meet the free page target. Specifically,
vm_pageout_worker() used vm_paging_needed(). The trouble with
vm_paging_needed() is that it compares the length of the free page queues to
the wakeup threshold for the page daemon, which is much lower than the free
page target. Consequently, vm_pageout_worker() could conclude that the
inactive queue scan succeeded in meeting its free page target when in fact
it did not; and rather than immediately triggering an all-out laundering
pass over the inactive queue, vm_pageout_worker() would go back to sleep
waiting for the free page count to fall below the page daemon wakeup
threshold again, at which point it will perform another limited (pass == 1)
scan over the inactive queue.
Changing vm_pageout_worker() to use vm_page_count_target() instead of
vm_paging_needed() won't work because any page allocations that happen
concurrently with the inactive queue scan will result in the free page count
being below the target at the end of a successful scan. Instead, having
vm_pageout_scan() return a value indicating success or failure is the most
straightforward fix.
Reviewed by: kib, markj
MFC after: 3 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8111
On machines with just the wrong amount of physical memory (enough to
have a lot of bufs, but not enough to use VM_FREELIST_DMA32) it is
possible for 32-bit address limited devices to have little to no
memory left when attaching, due to potentially large vfs bio configs
consuming all memory below 4GB not protected by VM_FREELIST_ISADMA.
This causes the 32-bit devices to allocate from VM_FREELIST_ISADMA,
leaving that freelist emtpy when ISA devices need DMAable memory.
Rather than decrease VM_DMA32_NPAGES_THRESHOLD, use the time honored
technique of putting initially allocated kernel data structs
at the end (or at least not the beginning) of memory.
Since this allocation is done at boot and is wired, is not freed,
so the system is low on 32-bit (and ISA) dma'ble memory forever.
So it is a good candidate to move above 4GB.
While here, remove an unneeded round_page() from kmem_malloc's size
argument as suggested by alc. The first thing kmem_malloc() does
is a round_page(size), so there is no need to do it before the call.
Reviewed by: alc
Sponsored by: Netflix
Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and
into vm/pmap.h, and describe what its purpose is. Eliminate the archaic
"XXX" comment about its value. I don't believe that its exact value, e.g.,
5 versus 6, matters.
Update the arm64 and riscv pmap implementations of pmap_ts_referenced()
to opportunistically update the page's dirty field.
On amd64, use the PDE value already cached in a local variable rather than
dereferencing a pointer again and again.
Reviewed by: kib, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D7836
The pager getpages interface allows the caller to bound the number of
readahead and readbehind pages, and vm_fault_hold() makes use of this
feature. These bounds were ignored after r305056, causing the swap pager
to potentially page in more than the specified number of pages.
Reported and reviewed by: alc
X-MFC with: r305056
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.
Reviewed by: alc, imp, kib
Differential Revision: https://reviews.freebsd.org/D7714
The swap_pager_swapoff() function uses trylock for the object lock
before pagein, which means that either i/o to md(4) over swap, or
intensive page faults over swap pager objects might prevent swapoff()
from making any progress. Then the retry < 100 check fails and machine
panics.
If trylock fails, acquire the object lock in the blockable way and
restart the hash bucket walk. Keep retries logic for now.
Reported and tested by: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7688
The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7677
In particular, fix factual, grammatical, and spelling errors in various
comments, and remove comments that are out of place in this function.
Reviewed by: kib, markj
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7410
to r254304, we had separate functions for reclamation and laundering
(vm_pageout_scan) versus updating usage information, i.e., "reference
bits", on active pages (vm_pageout_page_stats), and we only performed
vm_req_vmdaemon(VM_SWAP_IDLE) if vm_pages_needed was true. However, since
r254303, if vm_swap_idle_enabled was "1", we have performed
vm_req_vmdaemon(VM_SWAP_IDLE) regardless of whether we are short of free
pages. This was unintended and too aggressive, so I suspect no one uses
this feature. With this change, we restore the historical behavior and
only perform vm_req_vmdaemon(VM_SWAP_IDLE) when we are short of free
pages.
Reviewed by: kib, markj
In particular, swapongeom_ev() needed event thread context when swap
pager configuration was performed under Giant and geom asserted that
Giant is not owned. Now both of the reason went away.
On the other hand, note that swpageom_release() is called from the
bio_done context, and possible close cannot be performed inline.
Also fix some minor issues. The swapgeom() function does not use the
td argument, remove it. Recheck that the vnode passed is still VCHR
and not reclaimed after the lock.
Reviewed by: mav
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
It is only needed when removing a full bucket from the per-CPU cache. The
bucket cache (uz_buckets) is protected by the zone mutex and thus the
critical section can be released before inserting into that list.
MFC after: 1 week
It's a threshold for v_free_count, which is of type u_int. This also lets
us get rid of a cast in vm_paging_needed().
Reviewed by: alc
MFC after: 1 week
optimizations into two distinct pieces. The first piece consists of the
code that should only be performed once per page fault and requires the map
to be locked. The second piece consists of the code that should be
performed each time a pager is called on an object in the shadow chain.
(This second piece expects the map to be unlocked.)
Previously, the entire implementation could be executed multiple times.
Moreover, the second and subsequent executions would occur with the map
unlocked. Usually, the ensuing unsynchronized accesses to the map were
harmless because the map was not changing. Nonetheless, it was possible for
a use-after-free error to occur, where vm_fault() wrote to a freed map
entry. This change corrects that problem.
Reported by: avg
Reviewed by: kib
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().
The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().
Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
audit trail. This was not required for Common Criteria auditing
(which requires only that the intent to read or write be audited
at the time of open(2)), but is useful for contemporary live
analysis and forensics.
MFC after: 3 days
Sponsored by: DARPA, AFRL
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.
MFC after: 3 days
Sponsored by: DARPA, AFRL
vm_offset_t. (This field is used to detect sequential access to the virtual
address range represented by the map entry.) There are three reasons to
make this change. First, a vm_offset_t is smaller on 32-bit architectures.
Consequently, a struct vm_map_entry is now smaller on 32-bit architectures.
Second, a vm_offset_t can be written atomically, whereas it may not be
possible to write a vm_pindex_t atomically on a 32-bit architecture. Third,
using a vm_pindex_t makes the next_read field dependent on which object in
the shadow chain is being read from.
Replace an "XXX" comment.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: EMC / Isilon Storage Division