The listed rules were incomplete and outdated. There is a much more
comprehensive comment in vm_page.h.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20503
sw_flags is set to the function argument several lines later.
Reported by: danfe using PVS-studio
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
as part of a false start toward fine-grained reservation locking. In the
end, they were not needed, so eliminate them.
Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.
Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}(). They no longer require a free page
queues lock.
Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20492
Use it instead of accessing the wire_count field directly. No
functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time. The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.
The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.
One beneficiary of this change is in breaking reservations. For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.
Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
1. Change size_t to vm_size_t in some places.
2. Rename vm_map_entry_resize_free to drop the _free part.
3. Fix whitespace errors.
4. Fix screwups in patch-conflict-management that left out important
changes related to growing and shrinking objects.
Reviewed by: alc
Approved by: kib (mentor)
amount of resizing reduces the number of functions changing the vm_map
invariants regarding the max_free field of map entries.
Reviewed by: markj (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20356
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
memguard(9) wants to avoid reuse of freed addresses for as long as
possible. Previously it maintained a racily updated cursor which was
passed to vmem_xalloc(9) as the minimum address. However, vmem will
not in general return the lowest free address in the arena, so this
trick only really works until the cursor has wrapped around the first
time.
Reported by: alc
Reviewed by: alc
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D17227
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.
The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.
The choice to count virtual user-wired pages rather than physical
pages was done for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.
The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.
Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.
Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated are as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks instead, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.
I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
requested, or none, and in the latter case it is up to them to pick a
smaller request to make - which they always do by halving the failed
request. This change to swp_pager_getswapspace leaves the task of
downsizing the request to the function and not its caller. It still
does so by halving the original request.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20228
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
NOSPLIT swap objects are not anonymous, they are used by tmpfs regular
files and POSIX shared memory. For such objects, collapse is not
permitted.
Reported by: mjg
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19923
linear search can, so use it to avoid a linear search in isqrt.
Approved by: kib (mentor), markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20102
vm_map_wire() increments entry->wire_count, after that it drops the
map lock both for faulting in the entry' pages, and for marking next
entry in the requested region as IN_TRANSITION. Only after all entries
are faulted in, MAP_ENTRY_USER_WIRE flag is set.
This makes it possible for vm_map_protect() to run while other entry'
MAP_ENTRY_IN_TRANSITION flag is handled, and vm_map_busy() lock does
not prevent it. In particular, if the call to vm_map_protect() adds
VM_PROT_WRITE to CoW entry, it would fail to call
vm_fault_copy_entry(). There are at least two consequences of the
race: the top object in the shadow chain is not populated with
writeable pages, and second, the entry eventually get contradictory
flags MAP_ENTRY_NEEDS_COPY | MAP_ENTRY_USER_WIRED with VM_PROT_WRITE
set.
Handle it by waiting for all MAP_ENTRY_IN_TRANSITION flags to go away
in vm_map_protect(), which does not drop map lock afterwards. Note
that vm_map_busy_wait() is left as is.
Reported and tested by: pho (previous version)
Reviewed by: Doug Moore <dougm@rice.edu>, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20091
The checks are too expensive for a general-purpose kernel. Enable the
checks when DIAGNOSTIC is defined and provide a sysctl to enable the
checks in a non-DIAGNOSTIC INVARIANTS kernel.
Reviewed by: kib
Discussed with: Doug Moore <dougm@rice.edu>
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19999
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.
Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj
Tested by: pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17794
Otherwise resulting address from vm_map_find() migh not satisfy the
upper limit. For instance, it could affect MAP_32BIT flag from 64bit
processes.
Found by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, Doug Moore <dougm@rice.edu>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19688
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped. If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages. However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails. The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.
For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.
Reviewed by: kib
Reported by: syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19670
Either msync(MS_INVALIDATE) or the object unlock during vnode
truncation can expose invalid pages backing wired entries. Accept
them, but do not install them into destrination pmap. We must create
copied pages in the copy case, because e.g. vm_object_unwire() expects
that the entry is fully backed.
Reported by: syzkaller, via emaste
Reported by: syzbot+514d40ce757a3f8b15bc@syzkaller.appspotmail.com
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19615
On platforms without a direct map (i.e., platforms without
UMA_MD_SMALL_ALLOC defined), the boundary tag allocator reserves a
number of tags for use when allocating a new slab of boundary tags,
as such platforms require free boundary tags in order to allocate
boundary tags. r327899 increased the number of boundary tags required
for a KVA allocation in the worst case, and the aforementioned
reservation was not updated accordingly. In some cases, this could
lead to a system hang. Fix the problem by increasing this reservation.
Also reduce KVA_QUANTUM on systems lacking superpage support.
The previous import quantum (4MB with a 4KB page size) was quite large
for systems with limited KVA, and fragmentation in kernel_arena could
cause kernel memory allocation failures even with a substantial amount
of free KVA.
Reported and tested by: jhibbits
Reviewed by: alc, kib
No objections: jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19337
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
Use the object pointer itself to determine whether the object is locked.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19215
back to the lever before r343030. For 64-bit machines reduce it slightly,
too. Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.
Provide a loader tunable to change vnode pager pbufs count. Document it.
Make the clustering enabling knob more fine-grained by providing a
setting where the allocation with hint is not clustered. This is aimed
to be somewhat more compatible with e.g. go 1.4 which expects that
hinted mmap without MAP_FIXED does not change the allocation address.
Now the vm.cluster_anon can be set to 1 to only cluster when no hints,
and to 2 to always cluster. Default value is 1.
Requested by: peter
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19194
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.
The changes are largely modelled after amd64. arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page. RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED. Instead, pmap_remove_l2() always invalidates the entire
2MB address range.
pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits. Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.
Reviewed by: kib (earlier version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18863
Differential Revision: https://reviews.freebsd.org/D18864
Differential Revision: https://reviews.freebsd.org/D18865
Differential Revision: https://reviews.freebsd.org/D18866
Differential Revision: https://reviews.freebsd.org/D18867
Differential Revision: https://reviews.freebsd.org/D18868
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.
While here also unsign uh_hashmask, it makes little sense to keep that
signed.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148
i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable. Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.
Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.
No functional change intended.
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order. Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued. Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19039
512GB of ZFS ABD ARC means abd_chunk zone of 128M 4KB items. To manage
them UMA tries to allocate 2GB hash table, which size does not fit into
the int variable, causing later allocation failure, which makes ARC shrink
back below the 512GB, not letting it to use more RAM. With this change I
easily reached >700GB ARC size on 768GB RAM machine.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Then bucket_alloc() also selects bucket size based on uz_count. However,
since zone lock is dropped, uz_count may reduce. In this case max may
be greater than ub_entries and that would yield into writing beyond end
of the allocation.
Reported by: pho
The iterator should be reinitialized after every successful slab
allocation. A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]
Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.
Reported and tested by: slavah
Diagnosed by: kib [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
In order to allow single kernel to use PAE pagetables on i386 if
hardware supports it, and fall back to classic two-level paging
structures if not, superpage code should be able to adopt to either 2M
or 4M superpages size. There I make MI VM structures large enough to
track the biggest possible superpage, by allowing architecture to
define VM_NFREEORDER_MAX and VM_LEVEL_0_ORDER_MAX constants.
Corresponding VM_NFREEORDER and VM_LEVEL_0_ORDER symbols can be
defined as runtime values and must be less than the _MAX constants.
If architecture does not define _MAXs, it is assumed that _MAX ==
normal constant.
Reviewed by: markj
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18853
Based on the description in Linux man page.
Reviewed by: markj, ngie (previous version)
Sponsored by: Mellanox Technologies
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18837
This reduces amount of locking required for these zones.
Also, for cache only zones (UMA_ZFLAG_CACHE) accounting uz_items wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA underflowing uz_items.
Tested by: pho
atomic updates and reduces amount of data protected by zone lock.
During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.
Tested by: pho
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
two zones sharing a keg may have different limits. Now this is going
to work:
zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);
Kegs no longer have uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to minimum possible CPU bucket cache size.
For small limits bucket cache can also be reconfigured to be smaller.
Counter uz_items is updated whenever items transition from keg to a
bucket cache or directly to a consumer. If zone has uz_maxitems set and
it is reached, then we are going to sleep.
o Since new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and never
have had a practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
by the caller.
o Sleeping code is improved to account number of sleepers and wake them one
by one, to avoid thundering herd problem.
o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.
o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.
Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
from the local mapping.
Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161
Reviewed by: markj
Discussed with: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18764
Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.
No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15816
The type represents byte offset in the vm_object_t data space, which
does not span negative offsets in FreeBSD VM. The change matches byte
offset signess with the unsignedness of the vm_pindex_t which
represents the type of the page indexes in the objects.
This allows to remove the UOFF_TO_IDX() macro which was used when we
have to forcibly interpret the type as unsigned anyway. Also it fixes
a lot of implicit bugs in the device drivers d_mmap methods.
Reviewed by: alc, markj (previous version)
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
If user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt. Before, it was only
possible to reduce the zone by the tunable.
While there, correct the message which was not correct when zone
creation rounded the size up.
Reported by: jmg
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18381
This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.
Reviewed by: kib, dab
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D2724
Otherwise the free page count will not accurately reflect the physical
page allocator's state. On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync. On 12 the
bug would manifest as threads looping in vm_page_alloc().
PR: 231296
Reported by: mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by: alc, kib, mav
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18374
several pages, but leaves no space for struct uma_slab at the end we
miscalculate number of pages by one. Totally mimic keg_large_init() math
here to cover that problem.
Reported by: gallatin
is calculated to guarantee that struct uma_slab is placed at pointer size
alignment. Calculation of real struct uma_slab size is done in keg_ctor()
and yet again in keg_large_init(), to check if we need an extra page. This
calculation can actually be performed at compile time.
- Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at
an end of a page with alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is a not
a functional change.
- Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since compiler aligns struct uma_slab anyway.
All vmspace_alloc() callers know which kind of pmap they allocate.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18329
According to markj@:
pageproc contains the page daemon and laundry threads, which are
responsible for managing the LRU page queues and writing back dirty
pages. vmproc's main task is to swap out kernel stacks when the system
is under memory pressure, and swap them back in when necessary. It's a
somewhat legacy component of the system and isn't required. You can
build a kernel without it by specifying "options NO_SWAPPING" (which is
a somewhat misleading name), in which vm_swapout_dummy.c is compiled
instead of vm_swapout.c.
Based on this, we want pageproc to emulate kswapd, not vmproc.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D18061
These are used by kms-drm to determine various heuristics relate
memory conditions.
The number of free swap pages is just a variable, and it can be
much cheaper by either adding a new getter, or simply extern'ing
swap_total. However, this patch opts to use the more expensive,
existing interface - since this isn't an operation in a high per
path.
This allows us to remove some more gpl linuxkpi and do the follo
kms-drm:
git rm linuxkpi/gplv2/include/linux/swap.h
Reviewed by: mmacy, Johannes Lundberg <johalun0@gmail.com>
Approved by: emaste (mentor)
Differential Revision: https://reviews.freebsd.org/D18052
functions. Notably, reflow the text of some comments so that they
occupy fewer lines, and introduce an assertion in one of the new
helper functions so that it is not misused by a future caller.
In collaboration with: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17635
In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.
Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666
r336984 exposed the bug fixed in r340241, leading to the initial revert
while the bug was being hunted down. Now that the bug is fixed, we
can revert the revert.
Discussed with: alc
MFC after: 3 days
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934. In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed. After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17897
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.
Reported by: jhb
Reviewed by: alc, jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17827
In practice it is always initialized because nfreed must be positive
in order to trigger background laundering, but this isn't obvious.
CID: 1387997
MFC after: 1 week
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, kib
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D14005
It appears to be responsible for random segfaults observed when lots
of paging activity is taking place, but the root cause is not yet
understood.
Requested by: alc
MFC after: now
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.
This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.
Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704
on-write faults. On a page fault, when we call vm_fault_prefault(), it
probes the pmap and the shadow chain of vm objects to see if there are
opportunities to create read and/or execute-only mappings to neighoring
pages. For example, in the case of hard faults, such effort typically pays
off, that is, mappings are created that eliminate future soft page faults.
However, in the the case of soft, copy-on-write faults, the effort very
rarely pays off. (See the review for some specific data.)
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D17367
Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.
Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc". The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy. Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.
In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations. To that end, refactor the vm_dominset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.
Reviewed by: jeff (previous version)
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17417
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration. With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.
This method still has some performance overhead relative to disabling
NUMA support at compile time.
PR: 231460
Reviewed by: alc, gallatin, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17439
I committed some patches out of order and didn't build-test one of them.
Reported by: Jenkins, O. Hartmann <ohartmann@walstatt.org>
X-MFC with: r339601
On NUMA systems, we would not swap in processes unless all domains
had some free pages. This is too conservative in general. Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.
Reported and tested by: pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17304
cache, then we put new bucket on generic bucket cache. However, code didn't
honor UMA_ZONE_NOBUCKETCACHE flag, so potentially we could start a cache
on a zone that clearly forbids that. Fix this.
Reviewed by: markj
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.
The refactoring will make it easier to add NUMA support for other
architectures.
No functional change intended.
Reviewed by: alc, gallatin, jeff, kib
Tested by: pho (part of a larger patch)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17416
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.
Results in mmap speed up and a 70% increase in brk calls / second.
Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).
Reported by: pho
Reviewed by: kib
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17374
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
after the file mapping was wired.
if a wired map entry is backed by vnode and the file is truncated,
corresponding pages are invalidated. vm_fault_copy_entry() should be
aware of it and allow for invalid pages past end of file. Also, such
pages should be not mapped into userspace. If userspace accesses the
truncated part of the mapping later, it gets a signal, there is no way
kernel can prevent the page fault.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
if the dst_object is not of swap type.
It can only happen when entry does not require copy, otherwise
vm_map_protect() already adds the charge. So the assert was right for
the case where swap object was allocated in the vm_fault_copy_entry(),
but not when it was just copied from src_entry and its type is not
swap.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
allocated dst_object in a single place.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.
Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone. (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)
Reviewed by: kib, markj
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17296
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.
Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.
Reported by: alc
Reviewed by: alc, kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17249
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.
PR: 231038
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17248
This keeps the initialization coupled together with the kmem_* KPI
implementation, which is the main user of these arenas.
No functional change intended.
Reviewed by: alc
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17247
The current cache logic checks the total number of stacks in the kernel,
which even on small boxes significantly exceeds the 128 limit (e.g. an
8-way box with zfs has almost 800 stacks allocated).
Stacks are cached earlier for each main thread.
As a result the code is rarely executed, but when it is then (on boxes like
the above) it always fails. Since there are no provisions made for NUMA and
release time is approaching, just do a quick check to avoid acquiring the
lock.
Approved by: re (kib)
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor. This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.
Reported and tested by: Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, cem
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17209
Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.
This eliminates second most common place of taking the lock while pkg building.
While here don't take the lock in mlockall either.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17210
No functional change intended.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17028
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page. In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.
Reviewed by: alc, kib
Reported and tested by: pho
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17027
other allowed domains if the requested domain is below the minimum paging
threshold. Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.
Submitted by: jeff
Reported by: mjg
Reviewed by: alc, kib, markj
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D16191
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276. As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.
Reviewed by: kib
Tested by: pho
Approved by: re (marius)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17025
that can be coalesced. To be clear, fragmentation of phys_avail[] is not
the cause. This fragmentation of vm_phys_segs[] arises from the "special"
calls to vm_phys_add_seg(), in other words, not those that derive directly
from phys_avail[], but those that we create for the initial kernel page
table pages and now for the kernel and modules loaded at boot time. Since
we sometimes iterate over the physical memory segments, coalescing these
segments at initialization time is a worthwhile change.
Reviewed by: kib, markj
Approved by: re (rgrimes)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16976
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.
Based on the submission by: royger
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Approved by: re (marius)
Differential revision: https://reviews.freebsd.org/D16881
given in random(4).
This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.
Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.
PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845
Expose these counters under the vm.domain sysctl node. The existing
vm.stats.vm.v_pdpages sysctl is preserved.
Reviewed by: alc (previous version)
Differential Revision: https://reviews.freebsd.org/D14666
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held. When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held. Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.
Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.
While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked(). Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.
Reported and tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15980
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)
Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825
The domain and flags parameters suffice. In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D16713
If faultin() was called outside swapper (from PHOLD()), do not allow
swapper to initiate additional swap-ins. Swapper' initiated swap-ins
are serialized because they are synchronous and executed in the
context of the thread0. With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage, usually they do not need to.
Rate-limit swapper' swap-ins to one in the MAXSLP / 2 seconds
interval, counting faultin() swapins.
Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16610
Before r329882 the target would be computed after lowmem handlers run
and free pages. On some systems a significant amount of page
reclamation happens this way. However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.
Instead, adjust the target after running lowmem handlers. Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.
Reviewed by: alc, kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16606
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping. (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.
Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).
In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter(). Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[]. All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.
Reviewed by: kib, markj, mmel
Tested by: mmel
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16555
Before swp_pager_meta_build replaces an old swapblk with an new one,
it frees the old one. To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.
Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib, markj (an earlier version)
Tested by: pho
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13707
Swapped-out process that is WKILLED must be swapped in as soon as
possible. The reason is that such process can be killed by OOM and
its pages can be only freed if the process exits. To exit, the kernel
stack of the process must be mapped.
When allocating pages for the stack of the WKILLED process on swap in,
use VM_ALLOC_SYSTEM requests to increase the chance of the allocation
to succeed.
Add counter of the swapped out processes to avoid unneeded iteration
over the allprocs list when there is no work to do, reducing the
allproc_lock ownership.
Reviewed by: alc, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16489
OBJ_ONEMAPPING flag is set. In other words, allow recycling of existing
but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set.
Such situations are increasingly common with jemalloc >= 5.0. This
change has the expected effect of reducing the number of vm map entry and
object allocations and increasing the number of superpage promotions.
Reviewed by: kib, markj
Tested by: pho
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16501
blocking vm map entry and object coalescing for the calling process.
However, there is no reason that mlockall(MCL_FUTURE) should block
such coalescing. This change enables it.
Reviewed by: kib, markj
Tested by: pho
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16413
There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).
Differential Review: https://reviews.freebsd.org/D16290