Add a <sys/_pv_entry.h> intended for use in <machine/pmap.h> to
define struct pv_entry, pv_chunk, and related macros and inline
functions.
Note that powerpc does not yet use this as while the mmu_radix pmap
in powerpc uses the new scheme (albeit with fewer PV entries in a
chunk than normal due to an used pv_pmap field in struct pv_entry),
the Book-E pmaps for powerpc use the older style PV entries without
chunks (and thus require the pv_pmap field).
Suggested by: kib
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D36685
Only i386 and amd64 print the decoded syscall name in the backtrace.
This de-duplication facilitates further changes and adoption by other
platforms.
Reviewed by: jrtc27, markj, jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D36565
Scoreboard is needed a moment when smp_started == true. If some kernel
daemon thread is started before scoreboard is inited, and does some pmap
operation that requires TLB maintanence, which races with SMP startup,
we might dereference NULL invl_scoreboard. This is particularly easy
to trigger when EARLY_AP_STARTUP is not defined.
Reported by: glebius
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36766
This assertion can be triggered by usermode since vm_map_madvise()
doesn't force advice to be applied to an entire largepage mapping. I
can't see any reason not to permit it, however, since MADV_DONTNEED and
_FREE are advisory and we can simply do nothing when a 1GB mapping is
encountered.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D36675
This code path can be triggered by applying MADV_WILLNEED to a 1GB
mapping.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D36674
pmap_growkernel() may be called when mapping a region above KERNBASE,
typically for a kernel module. If we have enough PTPs left over from
bootstrap, pmap_growkernel() does nothing. However, it's possible to
run out, and in this case pmap_growkernel() will try to grow the kernel
map all the way from kernel_vm_end to somewhere past KERNBASE, which can
easily run the system out of memory. This happens with large kernel
modules such as the nvidia GPU driver. There is also a WIP dtrace
provider which needs to map KVA in the region above KERNBASE (to provide
trampolines which allow a copy of traced kernel instruction to be
executed), and its allocations could potentially trigger this scenario.
This change modifies pmap_growkernel() to manage the two regions
separately, allowing them to grow independently. The end of the
KERNBASE region is tracked by modifying "nkpt".
PR: 265019
Reviewed by: alc, imp, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D36673
This matches the return type of pmap_mapdev/bios.
Reviewed by: kib, markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D36548
This changes the default TCP Congestion Control (CC) to CUBIC.
For small, transactional exchanges (e.g. web objects <15kB), this
will not have a material effect. However, for long duration data
transfers, CUBIC allocates a slightly higher fraction of the
available bandwidth, when competing against NewReno CC.
Reviewed By: tuexen, mav, #transport, guest-ccui, emaste
Relnotes: Yes
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36537
When attempting to promote 4KB user-space mappings to a 2MB user-space
mapping, the address of the struct vm_page representing the page table
page that contains the 4KB mappings is already known to the caller.
Pass that address to the promotion function rather than making the
promotion function recompute it, which on arm64 entails iteration over
the vm_phys_segs array by PHYS_TO_VM_PAGE(). And, while I'm here,
eliminate unnecessary arithmetic from the calculation of the first PTE's
address on arm64.
MFC after: 1 week
Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland.
INIT and Startup IPIs are now returned to userland. Due to backward
compatibility reasons, a new capability is added for enabling
VM_EXITCODE_IPI.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35623
Sponsored by: Beckhoff Automation GmbH & Co. KG
Failure to map RSDP, XSLT and checksum failures are events that can't
happen unless something has gone wrong. As such, they should be reported
always, and not in bootverbose. This has been this way since it was
originally brought in to parse APIC tables.
Sponsored by: Netflix
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D36406
Add missing pmap_unmapbios() calls for when we return 0. Otherwise we
can leave the table mapped when it is of no use.
Sponsored by: Netflix
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D36405
for the "trap with interrupts disabled" warning.
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36302
This applies one of the changes from
5567d6b441 to other architectures
besides arm64.
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D36263
Reject attempts to map host physical address ranges that are not
subsets of a passthrough device's BAR into a guest.
Reviewed by: markj, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36238
It does not work with ULE, which is the default scheduler for over a
decade.
Reviewed by: emaste, kib
Differential Revision: https://reviews.freebsd.org/D36094
Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.
Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.
The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.
Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Otherwise we are using whatever the value was left from the previous
thread run on kernel entry from usermode. Typically it would be the
desired value as is, but it is not guaranteed.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
On physical systems the ram isn't initialized on boot. So, coreboot uses
the cache as ram in this boot phase. When exiting cache as ram, coreboot
calls INVD for making the cache consistent.
In a virtual environment ram is always initialized and the cache is
always consistent. So, we can safely ignore this call.
Reviewed by: jhb, imp
Differential Revision: https://reviews.freebsd.org/D35620
Sponsored by: Beckhoff Automation GmbH & Co. KG
A replacement QAT driver will be imported, but this replacement does not
support Atom C2xxx hardware. So, the existing driver will be kept
around to provide opencrypto offload support for those chipsets.
Reviewed by: pauamma, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35817
With clang 15, the following -Werror warning is produced:
sys/amd64/amd64/pmap.c:8274:22: error: variable 'freed' set but not used [-Werror,-Wunused-but-set-variable]
int allfree, field, freed, i, idx;
^
The 'freed' variable is only used when PV_STATS is defined. Ensure it is
only declared and set in that case.
MFC after: 3 days
This is not completely exhaustive, but covers a large majority of
commands in the tree.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583
Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393
Use a getter macro instead of fetching the sigcode address directly
from a sysent of a given process. It assumes that the sigcode is stored
in the shared page, which is true in all cases, except for a.out
binaries. This will be later useful when the shared page address
randomization is introduced.
No functional change intended.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35392
On x86 systems, the debug.late_console tunable makes it possible to set
up the console before we call pmap_bootstrap. (The tunable is turned
on by default; setting late_console=0 results in consoles being probed
early.)
Unfortunately this is not compatible with using the ACPI SPCR table to
find the console, since consulting ACPI tables requires mapping memory
addresses. As such, we skip the call to uart_cpu_acpi_spcr from
uart_cpu_x86 in the !late_console case.
Reviewed by: imp
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D35794
This is in preparation for removal of OBJT_DEFAULT. In particular, it
is now cheap to check whether an OBJT_SWAP object has any swap blocks
allocated, so the benefit of having a separate OBJT_DEFAULT type is
quite marginal, and the OBJT_DEFAULT->SWAP transition is a source of
bugs.
Reviewed by: alc, hselasky, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35779
While uart could be detected completely through plug and play means, add
it here for two reasons. First, we don't do that from the loader, so
it's not available as a console. Second, even if we did do it from the
loader, there's a limitation in the system today that console drivers
must be compiled into the kernel because the console is selected before
external modules are linked into the kernel. Adding it only increases
the kernel size by ~14k as well.
Sponsored by: Netflix
Idea liked by: des, rpokala, brooks, jhb
This patch fixes the AMD implementation for snapshotting. It removes
unnecessary vmcb fields that should not be saved and duplicates.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33431
Some tools (firecraker loader) only check for notes in PT_NOTE program
headers, so make sure the notes added using the ELFNOTE macro end up
in such header.
Output from readelf -Wl for and amd64 kernel after the change:
Elf file type is EXEC (Executable file)
Entry point 0xffffffff8038a000
There are 11 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0xffffffff80200040 0x0000000000200040 0x000268 0x000268 R 0x8
INTERP 0x0002a8 0xffffffff802002a8 0x00000000002002a8 0x00000d 0x00000d R 0x1
[Requesting program interpreter: /red/herring]
LOAD 0x000000 0xffffffff80200000 0x0000000000200000 0x189e28 0x189e28 R 0x200000
LOAD 0x18a000 0xffffffff8038a000 0x000000000038a000 0xe447e8 0xe447e8 R E 0x200000
LOAD 0xfce7f0 0xffffffff811ce7f0 0x00000000011ce7f0 0x6b955c 0x6b955c R 0x200000
LOAD 0x1800000 0xffffffff81a00000 0x0000000001a00000 0x000140 0x000140 RW 0x200000
LOAD 0x1801000 0xffffffff81a01000 0x0000000001a01000 0x1c8480 0x5ff000 RW 0x200000
DYNAMIC 0x1800000 0xffffffff81a00000 0x0000000001a00000 0x000140 0x000140 RW 0x8
GNU_RELRO 0x1800000 0xffffffff81a00000 0x0000000001a00000 0x000140 0x000140 R 0x1
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0
NOTE 0x1687ae0 0xffffffff81887ae0 0x0000000001887ae0 0x0001c0 0x0001c0 R 0x4
Section to Segment mapping:
Segment Sections...
[...]
10 .note.gnu.build-id .note.Xen
Reported by: cperciva
Fixes: 1a9cdd373a ('xen: add PV/PVH kernel entry point')
Fixes: 93ee134a24 ('Integrate support for xen in to i386 common code.')
Sponsored by: Citrix Systems R&D
Reviewed by: emaste
Differential revision: https://reviews.freebsd.org/D35611
Enhanced REP MOVSB feature of CPUs starting from Ivy Bridge makes
REP MOVSB the fastest way to copy memory in most of cases. However
Intel Optimization Reference Manual says: "setting the DF to force
REP MOVSB to copy bytes from high towards low addresses will expe-
rience significant performance degradation". Measurements on Intel
Cascade Lake and Alder Lake, same as on AMD Zen3 show that it can
drop throughput to as low as 2.5-3.5GB/s, comparing to ~10-30GB/s
of REP MOVSQ or hand-rolled loop, used for non-ERMS CPUs.
This patch keeps ERMS use for forward ordered memory copies, but
removes it for backward overlapped moves where it does not work.
Reviewed by: mjg
MFC after: 2 weeks
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map. In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().
This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory. In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader. But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.
It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time(). This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.
Reported by: mhorne
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35449
PTI page table pages are allocated from a VM object, so must be
exclusively busied when they are freed, e.g., when a thread loses a race
in pmap_pti_pde(). Simply keep PTPs busy at all times, as was done for
some other kernel allocators in commit
e9ceb9dd11.
Also remove some redundant assertions on "ref_count":
vm_page_unwire_noq() already asserts that the page's reference count is
greater than zero.
Reported by: syzkaller
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35466