It is possible to provide insane values for the size in a contigmalloc(9)
request, which usually do not reach the phys allocator due to the failing
KVA allocation. But with the forthcoming 4/4 i386, where the 32-bit
architecture has almost 4G of KVA, contigmalloc(1G) is not outright
unreasonable and the KVA might sometimes be available.
Then, the calculation of pa_end could wrap around, depending on the
physical address, and the checks in vm_phys_alloc_seg_contig() would
pass while the iteration in the loop after the 'done' label goes out
of the vm_page_array bounds.
Fix it by detecting the wrap.
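As an aside, a minimal user-space sketch of the kind of wrap check involved
(the helper name and fixed-width types are illustrative; the real check
lives in vm_phys_alloc_seg_contig() and operates on vm_paddr_t):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch only: a huge size added to a high starting address can wrap
     * the fixed-width physical address type, so detect the wrap explicitly.
     */
    static bool
    range_wraps(uint64_t pa_start, uint64_t size)
    {
            uint64_t pa_end = pa_start + size;

            return (pa_end < pa_start);     /* true if pa_end wrapped around */
    }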
Reported and tested by: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14767
Such pages are re-enqueued at the end of the fault handler, preserving
LRU. Rather than performing two separate operations per fault, simply
requeue the page at the end of the fault (or bump its activation count
if it resides in PQ_ACTIVE, avoiding the page queue lock entirely).
This elides some page lock and page queue lock operations in common
cases, e.g., CoW faults.
Note that we must still dequeue the source page for "optimized" CoW
faults since the page may not remain enqueued while it is moved to
another object.
Reviewed by: alc, kib
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D14625
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.
Reviewed by: alc
Tested by: pho
MFC after: 2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
The new page does not belong to a VM object, but the page daemon does
not expect to encounter such pages.
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 week
X-Differential Revision: https://reviews.freebsd.org/D14625
vmd_free_count manipulation. Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup. Only trigger
the pageout daemon on transitions between states. Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.
Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14612
There, the pages freed might be managed, but the page's lock is not
owned. For KPI correctness, the page lock is required around the call
to vm_page_free_prep(), which is asserted. The reclaim loop already did
the work that vm_page_free_prep() would do, so the lock is not needed
and the only consequence of not owning it is the triggered assert.
Instead of adding the locking to satisfy the assert, revert to the
code that calls vm_page_free_phys() directly.
Reported by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
there is a valid reservation. This can trip erroneously when memory
falls within a domain but doesn't have the reservation initialized because
it does not meet size or alignment requirements.
Reported by: pho, mjg
Sponsored by: Netflix, Dell/EMC Isilon
Page daemon threads for other domains show up in ps(1) output as
"pagedaemon/domN", so let that be the case for domain 0 as well.
Submitted by: Kevin Bowling <kevin.bowling@kev009.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14518
With r329882, in the absence of a free page shortage we would only take
len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to
aggressively scan PQ_ACTIVE. Previously we would also include the
number of free pages in this computation, ensuring that we wouldn't scan
PQ_ACTIVE with plenty of free memory available. The change in behaviour
was most noticeable immediately after booting, when PQ_INACTIVE and
PQ_LAUNDRY are nearly empty.
Reviewed by: jeff
There are no parts of vm/vm_pageout.h useful for usermode
applications, even for specific applications like fstat and lsof.
In my opinion, this protection is redundant and instead userspace
should not include the header at all. Since there are apparently
broken third-party codebases, give them a bit of slack by providing
a transitional period.
Reported by: julian
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
After r328977, a wired page m may have m->queue != PQ_NONE.
Reviewed by: kib
X-MFC with: r328977
Differential Revision: https://reviews.freebsd.org/D14485
use it to regulate page daemon output.
This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.
Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402
Make vm_wait() take a vm_object argument which specifies the domain
set on which to wait for the min condition to pass. If there is no
object associated with the wait, use curthread's policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() are supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait on for the min condition to pass.
Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.
Eliminate the VM_WAIT and VM_WAITPFAULT macros; direct function calls
are enough.
Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.
Sketched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D14384
From the submitter description:
The process is forked transitioning a map entry to COW
Thread A writes to a page on the map entry, faults, updates the pmap to
writable at a new phys addr, and starts TLB invalidations...
Thread B acquires a lock, writes to a location on the new phys addr, and
releases the lock
Thread C acquires the lock, reads from the location on the old phys addr...
Thread A ...continues the TLB invalidations which are completed
Thread C ...reads from the location on the new phys addr, and releases
the lock
In this example, Threads B and C [lock, use and unlock] properly and
neither owns the lock at the same time. Thread A was writing somewhere
else on the page and so never had/needed the lock. Thread C sees a
location that is only ever read|modified under a lock change beneath
it while it is the lock owner.
To fix this, perform a two-stage update of the copied PTE. First,
the PTE is updated with the address of the new physical page with the
copied content, but in read-only mode. The pmap locking and the page
busy state during the PTE update and TLB invalidation IPIs ensure that any
writer to the page cannot upgrade the PTE to the writable state until
all CPUs have updated their TLBs to no longer cache the old mapping.
Then, after the busy state of the page is lifted, write faults can
proceed without violating the consistency of the reads.
The change is done in vm_fault() because most architectures do need IPIs
to invalidate remote TLBs. Moreover, I think that the hardware guarantees
of atomicity of the remote TLB invalidation are not enough to prevent
inconsistent results for non-atomic reads, like multi-word accesses
protected by a lock. So instead of modifying each pmap's invalidation
code, I did it there.
Discovered and analyzed by: Elliott.Rabe@dell.com
Reviewed by: markj
PR: 225584 (appeared to have the same cause)
Tested by: Elliott.Rabe@dell.com, emaste, Mike Tancsa <mike@sentex.net>, truckman
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14347
If the map entry re-lookup was performed due to the mapping changes, we
need to ensure that there is still some access permission bit
requested which is compatible with the current vm_map_entry mode. If
not, restart the handler from scratch instead of trying to save the
current progress.
Also adjust fault_type to not include cleared permission bits.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14347
Suppose that we have an object with a mapped superpage, and that all
pages in the superpage are held (by some driver). Additionally,
suppose that the object is terminated, e.g. because the only process
mapping it is exiting. Then the reservation is broken, but the pages
cannot be freed until later, when they are unheld. In this situation,
the reservation code cannot clear psind, since no pages are freed, so
the page is later freed and then reused with a stale, invalid psind.
Clear psind in vm_reserv_break() to avoid the situation.
Reported and tested by: Slava Shwartsman
Reviewed by: markj
Sponsored by: Mellanox Technologies
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14335
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
size of the UMA zone allocation is greater than the page size. In this case
the zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone
switching this zone off startup_alloc() until the full launch of the VM.
o Always supply the number of VM zones to uma_startup_count(). On machines
with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
a page. In the latter case account the VM zones for the number of
allocations from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
itself any zone that is already capable of running the real alloc.
In the worst case scenario we may leak a single page here. See the
comment in uma_startup_count().
o Hardcode the call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init(), may sneak in before it.
o While here, remove uma_boot_pages_mtx. With the recent changes to the
boot pages calculation, we are guaranteed to use all of the boot_pages
in the early single-threaded stage.
Reported & tested by: mav
o Most of the startup zones have struct uma_slab embedded into the slab,
so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages a certain kind of allocation would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating the amount of pages for kegs. [1]
o The zone of zones and the zone of kegs have an arbitrary alignment of 32,
and this also needs to be accounted for. [2]
While here, spread more comments and improve diagnostic messages.
Reported by: pho [1], jtl [2]
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.
The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.
The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.
Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include the eight VM startup pages in the uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for the extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
allowed several other SYSINITs to sneak in before it, thus bumping the
requirement for the amount of boot pages.
for UMA startup.
o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number of
boot pages required. What is more important, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA-internal zones actually need to use
startup_alloc(); however, that may change, so vm_page_startup() provides
its need for early zones as an argument.
o Introduce the uma_startup_count() function to avoid code duplication. The
function calculates the sizes of the zones zone and kegs zone, and how
many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size, use (mp_maxid + 1) instead
of mp_ncpus. Use the resulting number not only in the size argument to
zone_ctor() but also as args.size.
Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054
It is possible, for complex fork()/collapse situations, to have
sibling address spaces that partially share shadow chains. If one
sibling performs wiring, it can happen that a transient page, invalid
and busy, is installed into a shadow object which is visible to the other
sibling for the duration of vm_fault_hold(). When the backing object
contains the valid page, and the wiring is performed on a read-only
entry, the transient page is eventually removed.
But the sibling which observed the transient page might perform the
unwire, executing vm_object_unwire(). There, the first page found in
the shadow chain is considered the page that was wired for the
mapping. It is really the page below it which is wired. So we unwire
the wrong page, either triggering the asserts or breaking the page's
wire counter.
As the fix, wait for the busy state to finish if we find such a page
during the unwire, and restart the shadow chain walk after the sleep.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14184
clear dirty bits for completely invalid blocks.
Otherwise we might not write out the last chunk that is shorter than
512 bytes, if the file end is not aligned on a disk block boundary.
This became important after r324794.
PR: 225586
Reported by: tris_vern@hotmail.com
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
In several places, the entry start and end fields are checked, after
excluding the possibility that the entry is map->header. By assigning
max and min values to the start and end fields of map->header in
vm_map_init, the explicit map->header checks become unnecessary.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, kib, markj (previous version)
Tested by: pho (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13735
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if the architecture unconditionally lacks (or has) one, PMAP_HAS_DMAP is
the constant 0 (or 1) and the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag. Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13545
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
reservations by giving each memory domain its own KVA space in vmem that
is naturally aligned on superpage boundaries.
Reviewed by: alc, markj, kib (some objections)
Sponsored by: Netflix, Dell/EMC Isilon
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D13289
Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now legacy numa_* syscalls.
Clean up some header pollution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now legacy numa_* syscalls.
Clean up some header pollution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
Consolidate the regions covered by the process lock.
Combine similar condition tests into one, e.g. all process flags can
be tested with one logical operation.
Add a check for the in-exec state, since p_vmspace is dereferenced.
Remove labels and goto by explicitly tracking state.
Update comments.
Reviewed by: alc, markj (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13693
its per-thread kernel stack pages by making them pass through the inactive
queue first. Instead, immediately place them in the laundry so that they
might be cleaned and made available for reclamation sooner.
Reviewed by: kib, markj
MFC after: 1 week
rather than kmem arena size to determine available memory.
Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.
PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494
On a load where a single anonymous object consumes almost all memory on
a large system, the swapout code executes the iteration over the
corresponding object's page queue for a long time, owning the map and
object locks. This blocks the pagedaemon, which tries to lock the object,
and blocks other threads in the process in vm_fault() waiting for the
map lock.
Handle the issue by terminating the deactivation loop if we executed for
too long and by yielding at the top level in vm_daemon.
Reported by: peterj, pho
Reviewed by: alc
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13671
introduction in r83366. (At that time, this code appeared in vm/vm_glue.c,
because vm/vm_swapout.c did not exist.) When the FOREACH_THREAD loop
completes, we know that the sleep time for every thread is above whichever
threshold is being applied.
Reviewed by: kib
X-MFC with: r327354
swp_pager_meta_ctl(), with no opportunity to recognize freeing of
consecutive blocks and free fewer block ranges. To open that opportunity,
this change removes the SWM_FREE option from swp_pager_meta_ctl(), and
compels the caller to do the freeing when a valid block address is returned.
In swap_pager_copy(), these frees are aggregated, so that a sequence of them
can be done at one time.
The only other caller to swp_pager_meta_ctl() that passed SWM_FREE,
swp_pager_unswapped(), is also modified to handle its single free
explicitly.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13290
Neither swapout_procs() nor swapout() accesses the map. Since the
process' vmspace is referenced only to obtain the pointer to the
vm_map, the reference is not needed either.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13681
Reviewed by: alc, markj (as part of the larger patch)
Tested by: pho (again, as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13671
for finding aligned free space in the given map. With this change, we
always return KERN_NO_SPACE when we fail to find free space, whereas
previously we might return KERN_INVALID_ADDRESS. Also, with this change,
we explicitly check for address wrap, rather than relying upon the map's
min and max addresses to establish sentinel-like regions.
This refactoring was inspired by the problem that we addressed in r326098.
Reviewed by: kib
Tested by: pho
Discussed with: markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D13346
Otherwise the page daemon will not reclaim pages and thus will not
wake threads sleeping in VM_WAIT.
Reported and tested by: pho
Reviewed by: alc, kib
X-MFC with: r327168
Differential Revision: https://reviews.freebsd.org/D13640
Both issues caused the page daemon to erroneously go to sleep when
applications are consuming free pages at a high rate, leaving the
application threads blocked in VM_WAIT.
1) After completing an inactive queue scan, concurrent allocations may
have prevented the page daemon from meeting the v_free_min threshold.
In this case, the page daemon was going to sleep even when the
inactive queue contained plenty of clean pages.
2) pagedaemon_wakeup() may be called without the free queues lock held.
This can lead to a lost wakeup if a call occurs after the page daemon
clears vm_pageout_wanted but before going to sleep.
Fix 1) by ensuring that we start a new inactive queue scan immediately
if v_free_count < v_free_min after a prior scan.
Fix 2) by adding a new subroutine, pagedaemon_wait(), called from
vm_wait() and vm_waitpfault(). It wakes up the page daemon if either
vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps
on v_free_count.
Reported by: jeff
Reviewed by: alc
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13424
The laundry thread keeps track of the number of inactive queue scans
performed by the page daemon, and was previously using the v_pdwakeups
counter to count them. However, in some cases the inactive queue may
be scanned multiple times after a single wakeup, so it's more accurate
to use a dedicated counter.
Reviewed by: alc, kib (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13422
r292392 modified the active queue scan to weigh clean pages differently
from dirty pages when attempting to meet the inactive queue target. When
r306706 was merged into the PQ_LAUNDRY branch, this mechanism was
broken. Fix it by scaling the correct page shortage variable.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13423
atomic_set_*() sets a bit in the target memory location, so
atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like
it does.
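For illustration, a small user-space analogue of the pitfall (plain
variables standing in for the kernel atomics):

    #include <stdio.h>

    int
    main(void)
    {
            unsigned flag = 1;

            flag |= 0;      /* what atomic_set_int(&flag, 0) amounts to: a no-op */
            printf("after set(0):   %u\n", flag);   /* still 1 */

            flag = 0;       /* what a store, e.g. atomic_store_rel_int(), does */
            printf("after store(0): %u\n", flag);   /* now 0 */
            return (0);
    }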
PR: 224080
Reviewed by: jeff, kib
Differential Revision: https://reviews.freebsd.org/D13412
Commit r326346 moved the domain iterators from the physical layer to the
vm_page layer, but it also removed the translation from freelist to flind
for the vm_page_alloc_freelist() call. Before the change it expected a
VM_FREELIST_ parameter; after it, a freelist index.
On small WiFi boxes with a few megabytes of RAM, there is only one
freelist, VM_FREELIST_LOWMEM (1), and there is no VM_FREELIST_DEFAULT (0)
(see sys/mips/include/vmparam.h). This results in freelist 1 having
flind 0.
First, this commit renames flind to freelist in vm_page_alloc_freelist()
to avoid misunderstanding about the input parameter. Then, on the physical
layer, it restores the translation for correct handling of the freelist
parameter.
Reported by: landonf
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D13351
It's theoretically possible for the vnode and object to be disassociated
while locks are dropped around the vget() call, in which case we
shouldn't proceed with laundering.
Noted and reviewed by: kib
MFC after: 1 week
The arena argument to kmem_*() is now only used in an assert. A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.
This replaces the hard limit on kmem size with a soft limit imposed by UMA.
When the soft limit is exceeded we periodically wake up the UMA reclaim
thread to attempt to shrink KVA. On 32-bit architectures this should behave
much more gracefully as we exhaust KVA. On 64-bit the limits are likely
never hit.
Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13187
blocks in a single call to blist_alloc(). However, when it frees
that space, it previously called blist_free() on each block, one at a
time. With this change, the swap pager identifies ranges of
contiguous blocks to be freed, and calls blist_free() once per
range. In one extreme case, described in the review, the time
to perform an munmap(2) was reduced by 55%.
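As an aside, a user-space sketch of the aggregation idea (range_free() is
a made-up stand-in for blist_free(blist, start, count)):

    #include <stdio.h>

    static void
    range_free(long start, long count)
    {
            printf("free [%ld, %ld)\n", start, start + count);
    }

    /* Free each maximal run of consecutive block numbers with one call. */
    static void
    free_contig_runs(const long *blk, size_t n)
    {
            size_t i, j;

            for (i = 0; i < n; i = j) {
                    j = i + 1;
                    while (j < n && blk[j] == blk[j - 1] + 1)
                            j++;
                    range_free(blk[i], (long)(j - i));
            }
    }

    int
    main(void)
    {
            long blks[] = { 10, 11, 12, 40, 41, 99 };

            free_contig_runs(blks, sizeof(blks) / sizeof(blks[0]));
            return (0);
    }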
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12397
This gives a marginal improvement in the vm_page_array initialization
time. Also garbage-collect the now-unused vm_phys_paddr_to_segind().
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13270
Mainly focus on files that use the BSD 2-Clause license; however, the tool I
was using misidentified many licenses, so this was mostly a manual (and
error-prone) task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well-known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
No functional change intended.
On KERN_NO_SPACE error, as it is returned now, vm_map_find() continues
the loop searching for a suitable range for the requested mapping
with the specific alignment. Since vm_map_findspace() successfully
finds the same place, the loop never ends.
The errors returned from vm_map_stack() completely repeat the behavior
of vm_map_insert() now, as suggested by Alan.
Reported by: Arto Pekkanen <aksyom@gmail.com>
PR: 223732
Reviewed by: alc, markj
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D13186
second scan of the address space with find_space = VMFS_ANY_SPACE is
performed. Previously, vm_map_find() released and reacquired the map lock
between the first and second scans. However, there is no compelling
reason to do so. This revision modifies vm_map_find() to retain the map
lock.
Reviewed by: jhb, kib, markj
MFC after: 1 week
X-Differential Revision: https://reviews.freebsd.org/D13155
Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to
satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter
fictitious pages. For now, allow for this possibility; such pages will be
skipped later in the scan since they are wired.
Reported by: avg
Reviewed by: kib
MFC after: 1 week
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well-known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well-known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
hardware sizes.
32-bit counters already overflow on attainable virtual memory page
counts, and would soon overflow on the physical page counts as well.
Bump the sizes to 64-bit types. Bump __FreeBSD_version.
It is impossible to provide perfect backward ABI compat for this
change. If a program requests an old structure, it can be detected by
size. But if it queries the size first by passing a NULL old req
pointer, there is almost nothing we can do to detect the desired ABI.
As a partial solution, check the p_osrel of the querying process when
selecting the size to report.
Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com>
Differential revision: https://reviews.freebsd.org/D13018
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
A fictitious page is always wired, so there is no point in trying to
remove one from the page queues.
Completely remove one inaccurate comment from vm_page_free_prep() and
correct another.
Reviewed by: kib, markj
MFC after: 1 week
one call to sysctl(2) from jemalloc startup code. (That also requires
changes to jemalloc, but I plan to push those to upstream first.)
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D12745
This catches some rare mysterious failures at the source. The check
is only performed on architectures which implement direct map, and
only enabled with option DIAGNOSTIC, similar to other costly
consistency checks.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Only upgrade it to write mode if we need to clear dirty bits of the
partially valid page after EOF.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
There is no NO_SWAPPING #ifdef left in the code.
Requested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12663
If the filesystem block size is less than the page size, it is possible
that the page-out run contains partially clean pages. E.g., a chunk
of the page might be bdwrite()-ed, or some thread performed bwrite()
on a buffer which references a chunk of the paged-out page. As a
result, the assertion added in r319975, which checked that all pages
in the run are dirty, does not hold on such filesystems.
One solution is to remove the assert, but that is undesirable, because
we would overwrite the valid on-disk content. I cannot provide a scenario
where such a write would corrupt the file data, but I do not like it on
principle. Another, in my opinion proper, solution is to only write the
parts of the pages still marked dirty. The patch implements this: it
skips clean blocks and only writes the dirty block runs.
Note that due to clustering, writing one page might clean other pages in
the run, so the next write range must be calculated only after the
current range is written out.
Moreover, due to a possible invalidation, and the fact that the object
lock is dropped and reacquired before the checks, it is possible that
the whole page-out run appears to consist of only clean pages.
For this reason, it is impossible to assert that there is some work
for the pageout method to do (i.e. assert that there is at least one
dirty page in the run). But such clearing can only occur due to
invalidation, and not due to a parallel write, because we own the
vnode lock exclusively.
Reported by: fsu
In collaboration with: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12668
pages by vm_object_terminate_pages(). For example, for a "buildworld"
workload, this batching reduces vm_object_terminate_pages()'s average
execution time by 12%. (The total savings were about 11.7 billion
processor cycles.)
Reviewed by: kib
MFC after: 1 week
The variable is modified with the highly contended page free queue lock.
It unnecessarily shares a cacheline with purely read-only fields and is
re-read after the lock is dropped in the page allocation code, making the
hold time longer.
Pad the variable just like the others and store the value as found with
the lock held instead of re-reading.
Provides a modest 1%-ish speed up in concurrent page faults.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D12665
the page is already wired or queued. Prior to the elimination of PG_CACHED
pages, vm_page_grab() might have returned a valid, previously PG_CACHED
page, in which case enqueueing the page was necessary. Now, that can't
happen. Moreover, activating the page is a dubious choice, since the page
is not being accessed.
Reviewed by: kib
MFC after: 1 week
pmap_remove_all(). If the object to which a page belongs has no
references, then that page cannot possibly be mapped.
Reviewed by: kib
MFC after: 1 week
This is a wrapper around _Alignof() that sets the alignment for a zone
to the alignment required by a given type. This allows the compiler to
determine the proper alignment rather than having the programmer try to
guess.
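A sketch of the intended usage (the structure, zone and function names
below are made up; a kernel compilation unit with the usual uma headers is
assumed):

    struct example {
            uint64_t        e_key;
            char            e_name[16];
    };

    static uma_zone_t example_zone;

    static void
    example_zone_create(void)
    {
            /* The compiler supplies the alignment via _Alignof(). */
            example_zone = uma_zcreate("example", sizeof(struct example),
                NULL, NULL, NULL, NULL, UMA_ALIGNOF(struct example), 0);
    }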
Discussed on: arch@
MFC after: 1 week
Sponsored by: DARPA / AFRL
vm_page_try_to_free() is testing conditions, like clean versus dirty,
that only vary in managed pages.
Suggested by: kib
Reviewed by: markj
X-MFC after: never
can be avoided when the page's containing object has a reference count of
zero. (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)
Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".
Reviewed by: kib, markj
MFC after: 1 week
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
a side effect.
Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411
free queue mutex lock owning session, same as it was done for the
object termination in r323561.
Reported and tested by: mjg
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
16 bits is only wide enough for kegs with an item size of up to 64KB.
At that size or larger, slab headers are typically offpage because the
item size is a multiple of the page size, but there is no requirement
that this be the case.
We can widen the field without affecting the layout of struct uma_keg
since the removal of uk_slabsize in r315077 left an adjacent hole.
PR: 218911
MFC after: 2 weeks
object's page queue under a single mutex lock.
First, all pages on the queue are prepared for freeing by calls to
vm_page_free_prep(), and pages which should not be returned to the
physical allocator (e.g. wired or fictitious) are simply removed from
the queue. On the second pass, vm_page_free_phys_pglist() inserts all
pages from the queue without relocking the mutex.
The change improves object termination, e.g. on process exit where
large anonymous memory objects otherwise cause relocking of the free
queue mutex for each page. Moreover, if several such processes are
exiting or execing in parallel, the mutex was highly contended during
the address space demolition.
Diagnosed and tested by: mjg (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.
Reviewed by: alc, markj
Tested by: mjg (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.
Reported and tested by: ae, rakuco
Reviewed by: avg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12342
Prior to the change they were subject to extreme false sharing.
In particular this change shaves about 3 seconds of real time off a
-j 80 buildkernel.
Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D12281
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks. The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers. The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.
The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.
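For illustration, a sketch of such a helper (the function name is
hypothetical; u_daddr_t is modeled as uint64_t here):

    #include <stdint.h>

    /*
     * Return the bit index of the lone set bit in "mask".  With a
     * count-trailing-zeros builtin this compiles to a single instruction;
     * otherwise a binary search over halves of the word would do.
     */
    static int
    lone_bit_index(uint64_t mask)
    {
            return (__builtin_ctzll(mask));
    }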
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11906
lock if both old and new pages use the same underlying lock. Convert
existing places to use the helper instead of inlining it. Use the
optimization in vm_object_page_remove().
Suggested and reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.
Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.
Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D12248
While these locks are guaranteed to not share their respective cache lines,
their current placement leaves unnecessary holes in the lines which precede
them. For instance the annotation of vm_page_queue_free_mtx allows two
neighbouring cachelines (previously separated by the lock) to be collapsed
into one.
The annotation is only effective on architectures which have it implemented
in their linker script (currently only amd64). Thus locks are not converted
to their non-padaligned variants so as to not affect the rest.
MFC after: 1 week
In swp_pager_meta_build(), if the requested operation results in
freeing the last swap pointer in the swblk, free the trie node. Other
swap pager code does not expect to find a completely empty swblk.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
swapblk for our index while we dropped the object lock.
Noted by: jeff
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The function return value is not used. Its argument is always
swap_total/PAGE_SIZE, so make it not take any arguments.
Submitted by: ota@j.email.ne.jp
PR: 221356
MFC after: 1 week
Before r207410, the hold count of a page in a page queue was protected
by the queue lock, and, before laundering a page, the page daemon
removed managed writeable mappings of the page before releasing the
queue lock. This ensured that other threads could not concurrently
create transient writeable mappings using pmap_extract_and_hold() on a
user map, as is done for example by vmapbuf(). With that revision,
however, a race can allow the creation of such a mapping, meaning that
the page might be modified as it is being laundered, potentially
resulting in it being marked clean when its contents do not match
those given to the pager. Close the race by using the page lock to
synchronize the hold count check in vm_pageout_cluster() with the
removal of writeable managed mappings.
Reported by: alc
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12084
new use of the vm_object's lock to synchronize updates to a radix trie
mapping per-vm object page indices to on-disk swap blocks.
Fix a typo in a nearby comment.
Reviewed by: kib, markj
X-MFC with: r322913
Differential Revision: https://reviews.freebsd.org/D12134
vm_object page indices to on-disk swap space (r322913) has changed the
synchronization requirements for a couple swap pager functions. Whereas
before a read lock on the vm object sufficed because of the global mutex
on the hash table, a write lock on the vm object may now be required. In
particular, calls to vm_pager_page_unswapped() now require a write lock on
the vm_object. Consequently, vm_fault()'s fast path cannot call
vm_pager_page_unswapped(). The swap space will have to be released at a
later point.
Reviewed by: kib, markj
X-MFC with: r322913
Differential Revision: https://reviews.freebsd.org/D12134
blocks assigned to the object pages.
- The global swhash_mtx is removed, the trie is synchronized by the
corresponding object lock.
- The swp_pager_meta_free_all() function used during object
termination is optimized by only looking at the trie instead of
having to search the whole hash for the swap blocks owned by the object.
- On swap_pager_swapoff(), instead of iterating over the swhash, the
global object list has to be inspected. There, we have to ensure
that we do see valid trie content if we see that the object type is
swap.
Sizing of the swblk zone is the same as for the swblock zone; each swblk
maps SWAP_META_PAGES pages.
Proposed by: alc
Reviewed by: alc, markj (previous version)
Tested by: alc, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D11435
Setting this flag allows us to skip the removal of pages from the VM object
queue during object termination and to leave that to the cdev_pg_dtor
function.
Move the page removal code to a separate function,
vm_object_terminate_pages(), as the comments would not survive the extra
indentation.
This will be required for Intel SGX support, where we will have to remove
pages from a VM object manually.
Reviewed by: kib, alc
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11688
This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with the largest index smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.
Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().
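A sketch of the intended calling pattern (kernel context assumed; the
variable names are illustrative):

    /*
     * In an ascending-index loop the previously allocated page is the
     * predecessor of the next index, so the radix tree lookup that plain
     * vm_page_alloc() would perform can be skipped.
     */
    m = vm_page_alloc_after(object, pindex, VM_ALLOC_NORMAL, mpred);
    if (m != NULL)
            mpred = m;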
Suggested by: alc
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11984
We can remove some unnecessary object radix tree lookups by using the
object memq to iterate over pages in the specified range. This does not,
however, eliminate the lookup needed in vm_page_free_toq() to remove each
tree entry.
Reviewed by: alc, kib (previous revision)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11945
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().
Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926
Suppose that a file on NFS has a partially filled last page, and this
page is dirty. The NFS VOP_PAGEOUT() method only marks the page clean
up to the block of the last written byte, leaving other blocks dirty.
Any page which erroneously exists in the vnode vm_object past EOF
is also left marked as dirty.
With the introduction of the buf-cache coherent pager, each pass of the
syncer over the object with such a page results in the creation of a
B_DELWRI buffer due to the VOP_WRITE() call. This buffer is noted on the
next syncer pass, which results in, e.g., a visible manifestation of
shutdown never finishing the vnode sync. Note that before the buf-cache
coherency commit, a dirty page might be left never synced to the server
if partial writes occur.
Fix this by clearing dirty bits after EOF. Only blocks of the partial
page which are completely after EOF are marked clean, to avoid
possible user data loss.
Reported by: mav
Reviewed by: alc, markj
Tested by: mav, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11697
Differential Revision discusses the benefits of this change.)
Add a function, vm_reserv_to_superpage(), that returns the superpage
containing the specified base page.
Reviewed by: kib, markj
Tested by: pho
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D11556
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping. (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.
Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).
Submitted by: Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by: kib, markj
Tested by: pho
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D11556
superpage all belong to the same object. To date, that check has not been
needed, but upcoming changes require it. (See the Differential Revision.)
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11556
vm_radix trie.
The existing vm_radix_init() function is renamed to vm_radix_zinit().
Inlines are moved out of the _ headers.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11661
The commit message for r321173 incorrectly stated that the change disables
automatic stack growth from the AIO daemons' contexts, with the explanation
that this prevents applying wrong resource limits. Fix
this by actually disabling the growth.
Noted by: alc
Reviewed by: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
check blocking the grow from other processes' accesses.
A debugger may access the stack grow area with ptrace(2). In this case,
the real state of the process is to not have the stack grown, which
provides more accurate inspection. The technical reason to avoid the grow
is to avoid applying the wrong process' (the debugger's) stack limit.
This change also has the consequence of turning aio workers' accesses past
the bottom of stacks into EFAULT; arguably, that situation is a
programmer's mistake.
Reported by: jhb
Discussed with: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reported by: antoine
Tested by: Stefan Ehmann <shoesoft@gmx.net>,
Jan Kokemueller <jan.kokemueller@gmail.com>
PR: 220493
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
gap entry in the vm map being smaller than the sysctl-derived stack guard
size. Otherwise, the value of max_grow can suffer from overflow, and the
roundup(grow_amount, sgrowsiz) will not be properly capped, resulting in
an assertion failure.
In collaboration with: kib
MFC after: 3 days
recycles the current vm space. Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).
It's pointless for vm_map_stack() to check the MEMLOCK limit. It will
never be asked to wire the stack. Moreover, it doesn't even implement
wiring of the stack.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11421
a hint.
Right now, for non-fixed mmap(2) calls, addr is de-facto interpreted
as the absolute minimal address of the range where the mapping is
created. The VA allocator only allocates in the range [addr,
VM_MAXUSER_ADDRESS]. This is too restrictive, the mmap(2) call might
unduly fail if there is no free addresses above addr but a lot of
usable space below it.
Lift this implementation limitation by allocating VA in two passes.
First, try to allocate above addr, as before. If that fails, do the
second pass with less restrictive constraints for the start of
allocation by specifying minimal allocation address at the max bss
end, if this limit is less than addr.
One important case where this change makes a difference is the
allocation of the stacks for new threads in libthr. Under some
configuration conditions, libthr tries to hint the kernel to reuse the
main thread's stack grow area for the new stacks. This cannot work by
design now that the grow area is converted to a stack, and there is no
unallocated VA above the main stack. Interpreting the requested stack
base address as a hint provides compatibility with old libthr and
with (mis-)configured current libthr.
Reviewed by: alc
Tested by: dim (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the requested protection.
The syscall returns success without changing the protection of the
guard. This is consistent with the current mprotect(2) behaviour on
the unmapped ranges. More important, the calls performed by libc and
libthr to allow execution of stacks, if requested by the loaded ELF
objects, do the expected change instead of failing on the grow space
guard.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If mmap(2) is called with the MAP_STACK flag and a size which is
less than or equal to the initial stack mapping size plus the guard,
the calculation of the mapping layout created a zero-sized guard. The
attempt to create such an entry failed in vm_map_insert(), causing the
whole mmap(2) call to fail.
Fix it by adjusting the initial mapping size to have space for a
non-empty guard. Reject MAP_STACK requests which are shorter than or
equal to the configured guard pages size.
Reported and tested by: Manfred Antar <null@pozo.com>
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Decouple the pageout cluster size from the size of the hash table entry
used by the swap pager for mapping (object, pindex) to a block on the
swap device(s), and keep the hash table entry at its current size.
Eliminate a pointless macro.
Reviewed by: kib, markj (an earlier version)
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D11305
A guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for the usual
two-stage reserve-then-commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).
Use guards to reimplement the stack grow code. Explicitly track the stack
grow area with the guard, including the stack guard page. On stack
grow, a trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack() from vm_fault() into vm_map_lookup().
As a result, it is impossible for a random mapping to occur in the
stack grow area, or to overlap the stack guard page.
Enable stack guard page by default.
Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)
The issue is caught by the "vm_map_wire: alien wire" KASSERT at the end
of vm_map_wire(). We currently check for the MAP_ENTRY_WIRE_SKIPPED
flag before ensuring that the wiring_thread is curthread. For HOLESOK
wiring, this means that we might see a WIRE_SKIPPED entry from a
different wiring.
Fix it by only checking WIRE_SKIPPED if the entry was put
IN_TRANSITION by us. Also fix a typo in the comment explaining the
situation.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
instantiated.
Calling pmap_copy() on non-faulted anonymous memory entries is useless.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.
System maps do not create auto-grown stacks. Any stack we handled,
even for P_SYSTEM, must be for the user address space. P_SYSTEM processes
with mapped user space are either init(8) or an aio worker attached to
another user process with an aio buffer pointing into the stack area. In
either case, the VM_MAP_WIRE_USER mode should be used.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
dirty. Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.
Reviewed by: kib, markj
MFC after: 1 week
X-MFC after: r319932
- Add asserts that the pages to write are dirty. The last page, if
partially written, is only required to be dirty, while completely
written pages should have all dirty bits set.
- Use uintmax_t to print vm_page pindexes.
- Use NULL instead of casted zero.
- Remove if () test which duplicated the loop ending condition.
- Miscellaneous style fixes.
Reviewed by: alc, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
internal zones only. This allows creating new zones at early stages
of boot, without the need to mark them as internal to UMA, which isn't
always true.
Reviewed by: alc
r31386 changed how the size of the VM page array was calculated to be
less wasteful. For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages. However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array. Handle this case by explicitly excluding the page from
phys_avail[].
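A toy userland illustration of the sizing rule described above; the
struct vm_page size used here is an assumed figure, not the kernel's
actual value:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE	4096
#define VM_PAGE_SZ	104	/* assumed sizeof(struct vm_page) */

int
main(void)
{
	uint64_t avail = 8ULL << 30;	/* 8 GB of managed physical memory */
	/* Each usable page costs a page of data plus one struct vm_page. */
	uint64_t npages = avail / (PAGE_SIZE + VM_PAGE_SZ);

	printf("representable pages: %ju\n", (uintmax_t)npages);
	return (0);
}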
Reviewed by: alc
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D11000
beginning of a swap area for a disk label. However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.
Reviewed by: kib
MFC after: 5 days
pager used a different scheme for striping the allocation of swap space
across multiple devices. And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size. Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected. Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme. This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.
Reviewed by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11043
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and
r96851.)
Reviewed by: kib, markj
Tested by: pho
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11028
It is simply a contiguous virtual memory pointer and a number of pages.
There is no need to build a linked list here. Just increment the pointer
and decrement the counter. The only functional difference from the old
allocator is that previously pages were given out from the topmost down
to the lowest, and now they are given out in normal ascending order.
While here remove padalign from a mutex that is unused at runtime.
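A small sketch of the ascending bump-pointer scheme described above
(names are illustrative, not the UMA code):

#include <stddef.h>

#define PAGE_SIZE	4096

/* Illustrative boot-time state: a contiguous VA range and a page count. */
static char	*boot_cur;
static size_t	 boot_pages_left;

static void *
boot_alloc_pages(size_t npages)
{
	void *p;

	if (npages > boot_pages_left)
		return (NULL);
	p = boot_cur;
	boot_cur += npages * PAGE_SIZE;		/* just increment the pointer */
	boot_pages_left -= npages;		/* and decrement the counter */
	return (p);
}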
Reviewed by: alc
nor the correct maximum block size. Moreover, after r318995, it serves
no purpose except to provide information to user space through a
read-only sysctl.
This change eliminates the variable "dmmax" but retains the sysctl. It
also corrects the value returned by the sysctl.
Reviewed by: kib, markj
MFC after: 3 days
multiple devices was changed. However, swapoff_one() was not fully and
correctly converted. In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation. Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().
This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER. Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.
Reviewed by: kib, markj
MFC after: 3 days
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
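For reference, the post-change layout looks approximately like the
following (an illustration of the description above, not a verbatim
copy of sys/dirent.h; the padding field names are assumptions):

#include <sys/types.h>
#include <stdint.h>

struct dirent {
	ino_t		d_fileno;	/* file number of entry, now 64-bit */
	off_t		d_off;		/* directory offset of the next entry */
	uint16_t	d_reclen;	/* length of this record */
	uint8_t		d_type;		/* file type */
	uint8_t		d_pad0;		/* assumed padding */
	uint16_t	d_namlen;	/* length of d_name, now 16-bit */
	uint16_t	d_pad1;		/* assumed padding */
	char		d_name[255 + 1]; /* MAXNAMLEN + 1 */
};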
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
The kinfo sysctl MIBs ABI is changed in a backward-compatible way, but
there is no general mechanism to handle other sysctl MIBs which
return structures whose layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
This restores 32bit-sized accesses to the vmcnt sysctls, making old
binaries like top(1), systat(8) and reboot(8) mostly functional on
newer kernels.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter is hidden under _KERNEL, with vmstat(1) as the only
exception.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
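A hedged kernel-style sketch of the basic counter(9) calls involved
(an illustrative statistic, not the vmmeter conversion itself):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/malloc.h>

static counter_u64_t example_stat;	/* illustrative per-CPU statistic */

static void
example_init(void)
{
	/* Possible only once UMA is up; earlier updates need the
	 * early dummy-counter trick described above. */
	example_stat = counter_u64_alloc(M_WAITOK);
}

static void
example_bump(void)
{
	counter_u64_add(example_stat, 1);	/* lockless per-CPU increment */
}

static uint64_t
example_read(void)
{
	return (counter_u64_fetch(example_stat));	/* sums per-CPU values */
}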
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.
Reported and tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10349
declaration block.
Reviewed by: markj (as part of the larger patch)
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
vnode_pager_generic_putpages() prototype; change the argument name to
reflect that it is flags.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
Simplify the logic for clipping the range returned by the pager to fit
within the map entry.
Use atop() rather than OFF_TO_IDX() on addresses.
Reviewed by: kib
MFC after: 1 week
When re-calculating the last inclusive page index after the pager
call, the -1 was erroneously omitted. If the pager extended the run
(unlikely), the result would be insertion of a valid page mapping
outside the current map entry range.
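A trivial illustration of the inclusive-index arithmetic in question
(plain userland C, assuming a page-aligned exclusive end):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12

int
main(void)
{
	uint64_t end = 0x2000;	/* exclusive end of a two-page run */
	/* Omit the -1 and this points one page past the run. */
	uint64_t last = (end >> PAGE_SHIFT) - 1;

	printf("last inclusive index: %ju\n", (uintmax_t)last);
	return (0);
}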
Found by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() requires an exclusive lock on the object
when, in fact, a shared lock suffices.
Reviewed by: kib, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D10011
INHERIT_ZERO is an OpenBSD feature.
When a page is marked as such, it is zeroed
upon fork().
This will be used by the new arc4random(3) functions.
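A minimal usage sketch of the new inheritance mode via minherit(2):

#include <sys/mman.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	size_t len = getpagesize();
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	memset(p, 0xa5, len);

	/* Children created after this call see the range zero-filled. */
	if (minherit(p, len, INHERIT_ZERO) != 0)
		err(1, "minherit");

	if (fork() == 0)
		_exit(p[0] == 0 ? 0 : 1);	/* child: page reads back as zero */
	return (0);
}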
PR: 182610
Reviewed by: kib (earlier version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D427
Fix two missed places where the vm_object offset-to-index calculation
should use an unsigned shift, to allow handling the full range of unsigned
offsets used to create device mappings.
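A userland illustration of the signedness hazard (not the kernel macro
itself): with a signed shift the sign bit propagates into the computed
page index.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12

int
main(void)
{
	uint64_t uoff = 0xffffffff00000000ULL;	/* a large device offset */
	int64_t soff = (int64_t)uoff;

	/* The signed shift sign-extends (arithmetic shift on common
	 * platforms); the unsigned shift does not. */
	printf("signed   shift: %jx\n", (uintmax_t)(soff >> PAGE_SHIFT));
	printf("unsigned shift: %jx\n", (uintmax_t)(uoff >> PAGE_SHIFT));
	return (0);
}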
Reported and tested by: royger (previous version)
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.
MFC after: 1 month (if ever)
Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.
Now that we have more than one place where the vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook. No handler makes use of the flags yet.
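A hedged kernel-style sketch of a consumer that examines the new flags
argument; the flag name used here is an assumption about the introduced
constants:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>

/* The handler now receives a flags argument describing why the hook fired. */
static void
example_lowmem(void *arg __unused, int flags)
{
	if ((flags & VM_LOW_KMEM) != 0) {	/* assumed flag name */
		/* release caches that pin large amounts of KVA */
	}
}

static eventhandler_tag example_lowmem_tag;

static void
example_lowmem_register(void)
{
	example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
	    example_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
}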
Reviewed by: markj, kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9764
In vm_fault_prefault(), if the backward count causes underflow in the
calculation of
starta = addra - backward * PAGE_SIZE;
then starta must be clipped to entry->start, instead of zero.
Clipping to zero allowed mapping outside of the map entries' address
ranges, in particular, mapping at zero.
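A self-contained sketch of the fixed clamping (variable names follow
the commit text, not the kernel source):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE	4096u

static uintptr_t
prefault_start(uintptr_t addra, unsigned backward, uintptr_t entry_start)
{
	uintptr_t starta = addra - (uintptr_t)backward * PAGE_SIZE;

	if (starta < entry_start)
		starta = entry_start;
	else if (starta > addra)	/* the subtraction wrapped below zero */
		starta = entry_start;	/* the old code clipped to 0 here */
	return (starta);
}

int
main(void)
{
	/* addra close to the entry start: the subtraction would wrap. */
	printf("%#jx\n", (uintmax_t)prefault_start(0x2000, 8, 0x1000));
	return (0);
}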
Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by: alc
MFC after: 1 week
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero. In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.
Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state. Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.
PR: 210315
Reviewed by: trasz
Differential Revision: https://reviews.freebsd.org/D9464
Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid header pollution.
Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535
For regular files and POSIX shared memory, POSIX requires that the
[offset, offset + size) range is legitimate. At mapping time,
check that the offset is not negative. Allowing negative offsets might
expose the data that the filesystem put into the vm_object for internal
use, especially due to OFF_TO_IDX()'s treatment of the sign. The fault
handler verifies that the mapped range is valid, assuming that mmap(2)
checked that the arithmetic gives no undefined results.
For device mappings, leave the semantics of negative offsets to the
driver. Correct the object page index calculation to not erroneously
propagate the sign.
In either case, disallow overflow of offset + size.
Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.
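A userland-style sketch of the checks described above (the helper and
its names are illustrative; the kernel performs equivalent validation
in its mmap path):

#include <sys/types.h>
#include <stdint.h>
#include <errno.h>

#ifndef OFF_MAX
#define OFF_MAX	INT64_MAX	/* off_t is a signed 64-bit type on FreeBSD */
#endif

static int
check_mmap_range(off_t pos, size_t size, int is_device)
{
	/* POSIX objects (regular files, shm): the offset must not be
	 * negative; negative offsets on device mappings are left to the
	 * driver. */
	if (!is_device && pos < 0)
		return (EINVAL);
	/* Disallow overflow of offset + size (shown for non-negative pos). */
	if (pos >= 0 && (uint64_t)size > (uint64_t)(OFF_MAX - pos))
		return (EINVAL);
	return (0);
}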
Reported and tested by: royger (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This makes the code pass the whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits. The change provides a temporary fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.
PR: 216976
Reported and tested by: ngie
Sponsored by: The FreeBSD Foundation
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.
Reviewed by: ed, dchagin, kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9378
one respect. When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory. This revision
changes the code to match the comments.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9081
We can iterate over consecutive resident pages in the top-level object
using the object's page list rather than by performing lookups in the
object radix tree. This extends one of the optimizations in r312208 to the
case where a shadow chain is present.
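A hedged kernel-style fragment of the iteration pattern: one radix
lookup to find the first resident page, then the object memq (kept
sorted by pindex) for the rest of the range.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <sys/queue.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>

/* Count resident pages in [start, end) without per-index radix lookups. */
static int
count_resident(vm_object_t object, vm_pindex_t start, vm_pindex_t end)
{
	vm_page_t m;
	int resident;

	VM_OBJECT_ASSERT_LOCKED(object);
	resident = 0;
	for (m = vm_page_find_least(object, start);
	    m != NULL && m->pindex < end; m = TAILQ_NEXT(m, listq))
		resident++;
	return (resident);
}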
Suggested by: alc
Reviewed by: alc, kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9282
HWPMC_HOOKS is enabled in GENERIC and triggers some work that is avoidable
in the common (module not loaded) case.
In particular this avoids the permission checks and lock downgrade in the
single-threaded case, and when an executable mapping is found the pmc
sx lock is no longer bounced.
Note this is a band aid.
MFC after: 1 week
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.
While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9098
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.
The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8921
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.
Reviewed by: alc
If the pager's populate method succeeded, but another thread raced with
us and modified the vm_map, we must unbusy all pages busied by the pager
before we retry the whole fault handling. If the pager instantiated more
pages than fit into the current map entry, we must unbusy the pages
which are clipped.
Also do some refactoring, clarify comments and use clearer local
variable names.
Reported and tested by: kargl, subbsd@gmail.com (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.
Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc(). Also, eliminate the radix trie lookup performed with the
free page queues lock held.
Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8878
There are two instances of inlined unlock + continue in the vmtotal()
switch statements, which are ordinarily expressed with a break from the
switch case and code after the switch. Also, the combination of
continue and break statements is redundant.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The count argument's natural type is vm_pindex_t, but due to the loop
organization, it has to be a signed type to detect the termination
condition. Replace this logic by using a separate counter for the
processed pages, and terminate the loop when the counter exceeds the
argument.
Completely process one swblock for all relevant indexes instead of
doing a relookup in the hash when incrementing the page index on each
loop step. Do not drop the hash mutex around iterations.
Noted and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index. Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.
Testing shows that the number of new successful scans enabled by this
addition is small but non-zero. When it works out, the change both
further reduces the depth of the shadow object chain and frees unused
but allocated swap and memory.
Suggested and reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The object is not yet fully constructed and must not be available to
other threads. This makes default_pager_alloc() almost identical to
swap_pager_alloc_init().
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
in this file actually described the operation of the swap pager, not the
default pager. Given that this is the wrong place to discuss the
implementation of the swap pager, it shouldn't come as a surprise that as
the swap pager evolved these comments became increasingly stale. In
addition, apply some style fixes, like modernizing a few remaining old-
style function definitions.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D8781
independent layer of the virtual memory system. Update some of the nearby
comments to eliminate redundancy and improve clarity.
In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.
Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8752
It allows providing configurable aggressive prefaulting and useful
hints to the page daemon about memory allocations, on faults for pages
managed by the phys pager. In fact, this implementation is superior to
the MAP_SHARED_PHYS hack from my PostgreSQL paper, while giving
similar benefits of reducing the number of page faults on SysV shared
memory mappings.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
with cdev_pg_populate() to provide device drivers access to it. It
gives drivers fine control of the pages' ownership and allows drivers
to implement arbitrary prefault policies.
The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at the pager's discretion. VM provides the
pager with hints about the current range of the object mapping, to
avoid instantiation of immediately unused pages, if the pager decides
so. Also, VM passes the fault type and map entry protection to the
pager, allowing it to force the optimal required ownership of the
mapped pages.
Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied. Of course, the pages must be compatible
with the object's type.
After populate() successfully returns, the VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantics for COW and vm
map locking.
The method is opt-in; the pager sets the OBJ_POPULATE flag to indicate
that the method can be called. If the pager's vm objects can be
shadowed, the pager must implement the traditional getpages() method in
addition to populate(). Populate() might fall back to getpages() on a
per-call basis as well, by returning the VM_PAGER_BAD error code.
For now, the populate() method is only allowed to be used by managed
device pagers, but the limitation is only made because there are no
unmanaged fault handlers which could use it right now.
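A hedged sketch of a single-page populate implementation for a managed
device pager; the driver name is hypothetical and the callback
signature is written as understood from the description above, not
copied from the tree:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>
#include <vm/vm_pager.h>

static int
mydrv_pg_populate(vm_object_t object, vm_pindex_t pidx,
    int fault_type __unused, vm_prot_t max_prot __unused,
    vm_pindex_t *first, vm_pindex_t *last)
{
	vm_page_t m;

	VM_OBJECT_ASSERT_WLOCKED(object);
	/* Provide exactly the faulting page; a real driver could prefault
	 * a larger run and widen [*first, *last] accordingly, or return
	 * VM_PAGER_BAD to fall back to getpages() for this call. */
	m = vm_page_grab(object, pidx, VM_ALLOC_NORMAL);	/* xbusied */
	/* ... fill the page from device memory here ... */
	m->valid = VM_PAGE_BITS_ALL;	/* installed pages must be fully valid */
	*first = *last = pidx;		/* a contiguous run of one page */
	return (VM_PAGER_OK);
}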
KPI designed together with, and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
contain a vm_page_t at the specified index. However, with this
change, vm_radix_remove() no longer panics. Instead, it returns NULL
if there is no vm_page_t at the specified index. Otherwise, it
returns the vm_page_t. The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations. Instead of performing a lookup before every remove,
the pmap can simply perform the remove.
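A hedged sketch of the pattern this enables on the pmap side (modeled
on, but not copied from, the pmap helpers; the function name is
illustrative):

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_page.h>
#include <vm/vm_radix.h>

/* Remove and return the page-table page tracked at "pindex", or NULL. */
static vm_page_t
example_remove_pt_page(struct vm_radix *root, vm_pindex_t pindex)
{
	/* A single remove now suffices; previously a lookup had to be
	 * performed first because removing a missing index panicked. */
	return (vm_radix_remove(root, pindex));
}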
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D8708
Some utilities (notably top(1)) exit if any of their input sysctls don't
exist, and the removal of the above-mentioned PG_CACHE-related sysctls
makes it difficult to run such utilities on different versions of the
kernel without recompiling.
Requested by: bde
called to allocate a new page of radix trie nodes, there could be a call to
vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress
vm_radix_insert(). With the removal of PG_CACHED pages, we can simplify
vm_radix_insert() and vm_radix_remove() by removing the flags on the root of
the trie that were used to detect this case and the code for restarting
vm_radix_insert() when it happened.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8664
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache(). Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc(). This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.
Improve nearby comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8628
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Bump __FreeBSD_version to mark this change.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583
doesn't fit into a buf, then trim readbehind and readahead evenly. If
rbehind was limited by the previous BMAP, then round up its trim to the
block size.
- Add KASSERT to check that b_blkno has proper offset from original
blkno returned by BMAP. [1]
- Add KASSERT to check that pages in buf are consecutive.
Reviewed by: kib
Submitted by: kib [1]