r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones. Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.
r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.
Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.
Reported and tested by: pho (previous version)
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D16190
Due to the way rtld creates mappings for the shared objects, each dso
causes unmap of at least three guard map entries. For instance, in
the buildworld load, this change reduces the amount of pmap_remove()
calls by 1/5.
Profiled by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16148
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.
In preparation to fix this create two macros depending on if the data is
global or static.
Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140
On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold()
on arbitrary userspace mapping. If the mapping is backed by the
non-managed cdev pager or by the sg pager, on dense configs we might
access arbitrary element of vm_page_array[], in particular, not
corresponding to a page from the memory segment. Initialize such pages
as fictitious with the corresponding physical address.
Reported by: bde
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16085
- inline atomics in modules on i386 and amd64 (they were always
inline on other arches)
- allow modules to opt in to inlining locks by specifying
MODULE_TIED=1 in the makefile
Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
and vm_phys_alloc_seg_contig() instead of vm_phys_free_contig(). In
short, vm_phys_enq_range() is simpler and faster than the more general
vm_phys_free_contig(), and in the case of vm_phys_alloc_seg_contig(),
vm_phys_free_contig() was placing the excess physical pages at the
wrong end of the queues.
In collaboration with: Doug Moore <dougm@rice.edu>
1. Optimize the order computation.
2. Update the pool for all of the chunks that are removed from the free
page lists, and not just the first chunk.
3. Simplify the code for returning excess pages to the free page lists.
Reviewed by: Doug Moore <dougm@rice.edu>
Previously the linuxulator's linux_brk invoked the FreeBSD sys_break
syscall implementation directly. Instead, move the bulk of the existing
implementation to kern_break, and call that from both sys_break and
linux_brk.
This also addresses a minor bug in linux_brk in that we now return the
actual (rounded up) break address, rather than the requested value.
Reviewed by: brooks (earlier version)
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D16019
that it does not cause rapid fragmentation of the free physical memory.
Reviewed by: jeff, markj (an earlier version)
Differential Revision: https://reviews.freebsd.org/D15976
Allocation explicitely initialized the 3 leading fields. The rest is an
array which is supposed to be NULL-ed prior to deallocation.
Delegate zeroing to the infrequently called object initializator.
This gets rid of one of the most common memset consumers.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D15989
prefetch on 64bit architectures. Prior to this, two lines were needed
for the fast path and each line may fetch an unused adjacent neighbor.
- Move fields used by the fast path into a single line.
- Move constants into the adjacent line which is mostly used for
the spare bucket alloc 'medium path'.
- Unpad the mtx which is only used by the fast path and place it in
a line with rarely used data. This aligns the cachelines better and
eliminates 128 bytes of wasted space.
This gives a 45% improvement on a will-it-scale test on a 24 core machine.
Reviewed by: mmacy
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestage of a need for it to be called something else (eg obreak)
was removed in r225617 which consistantly prefixed all syscall
implementations.
Reviewed by: emaste, kib (older version)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15638
If fault started before vmspace_fork() locked the map, and then during
fork, vm_map_copy_entry()->vm_object_split() is executed, it is
possible that the fault instantiate the page into the original object
when the page was already copied into the new object (see
vm_map_split() for the orig/new objects terminology). This can happen
if split found a busy page (e.g. from the fault) and slept dropping
the objects lock, which allows the swap pager to instantiate
read-behind pages for the fault. Then the restart of the scan can see
a page in the scanned range, where it was already copied to the upper
object.
Fix it by instantiating the read-ahead pages before
swap_pager_getpages() method drops the lock to allocate pbuf. The
object scan would see the whole range prefilled with the busy pages
and not proceed the range.
Note that vm_fault rechecks the map generation count after the object
unlock, so that it restarts the handling if raced with split, and
re-lookups the right page from the upper object.
In collaboration with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.
(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)
This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.
Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.
Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.
PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.
In particular contrary to the claim in r334824, M_ZERO is sometimes being
used for such zones. But the zeroing method is completely different and
braching on it in the fast path for regular zones is a waste of time.
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
Nothing in the tree uses it and pcpu zones have a fundamentally different use
case than the regular zones - they are not supposed to be allocated and freed
all the time.
This reduces pollution in the allocation fast path.
trashing freed memory and checking that allocated memory is properly
trashed, and also of keeping a bitset of freed items. Trashing/checking
creates a lot of CPU cache poisoning, while keeping debugging bitsets
consistent creates a lot of contention on UMA zone lock(s). The performance
difference between INVARIANTS kernel and normal one is mostly attributed
to UMA debugging, rather than to all KASSERT checks in the kernel.
Add loader tunable vm.debug.divisor that allows either to turn off UMA
debugging completely, or turn it on only for a fraction of allocations,
while still running all KASSERTs in kernel. That allows to run INVARIANTS
kernels in production environments without reducing load by orders of
magnitude, but still doing useful extra checks.
Default value is 1, meaning debug every allocation. Value of 0 would
disable UMA debugging completely. Values above 1 enable debugging only
for every N-th item. It isn't possible to strictly follow the number,
but still amount of debugging is reduced roughly by (N-1)/N percent.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15199
Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section. Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.
Avoid this problem and future interoperability issues by simply not
relying on _end. Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface. As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.
PR: 228574
Reviewed by: kib (previous version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D15663
vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases. Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic. vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value. That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.
Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.
Eliminate a redundant error check from kern_madvise(). Add a comment
explaining where the check is performed.
Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value. Since MADV_FREE is passed as the
behavior, the return value will always be zero.
Reviewed by: kib, markj
MFC after: 7 days
It serves little purpose after r308474 and r329882. As a side
effect, the removal fixes a bug in r329882 which caused the
page daemon to periodically invoke lowmem handlers even in the
absence of memory pressure.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D15491
the flag MAP_GUARD. Rather than enumerating the flags that are not
allowed, enumerate the flags that are allowed. The list of allowed flags
is much shorter and less likely to change. (As an aside, one of the
previously enumerated flags, MAP_PREFAULT, was not even a legal flag for
mmap(2). However, because of an earlier check within kern_mmap(), this
misuse of MAP_PREFAULT was harmless.)
Reviewed by: kib
MFC after: 10 days
that the map entry is wired if the caller passes the flag VM_FAULT_WIRE.
Eliminate the same assertion, but spelled differently, at the end of
vm_fault_hold() and vm_fault_populate(). Repeat the assertion only if the
map is unlocked and the map lookup must be repeated.
Reviewed by: kib
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D15582
superpage mappings were already being created by automatic promotion in
vm_fault_populate(), this change reduces the cost of creating those
mappings. Essentially, one pmap_enter(..., psind=1) call takes the place
of 512 pmap_enter(..., psind=0) calls, and that one pmap_enter(...,
psind=1) call eliminates the allocation of a page table page.
Reviewed by: kib
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D15572
The vadvise syscall (aka ovadvise) is undocumented and has always been
implmented as returning EINVAL. Put the syscall under COMPAT11 and
provide a userspace implementation.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15557
This should have been done when they were removed from libc, but was
overlooked in the runup to 11.0. No users should exist.
Approved by: andrew
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15539
The scans are largely independent, so this helps make the code
marginally neater, and makes it easier to incorporate feedback from the
active queue scan into the page daemon control loop.
Improve some comments while here. No functional change intended.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D15490
Such pages are dequeued as they're encountered during the inactive queue
scan, so by the time we get to the active queue scan, they should have
already been subtracted from the inactive queue length.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15479
The value of m->queue must be cached after comparing it with PQ_NONE,
since it may be concurrently changing.
Reported by: glebius
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D15462
vm_page_queue(), added in r333256, generalizes vm_pageout_page_queued(),
so use it instead. No functional change intended.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D15402