To determine whether to use the alternate signal stack or not,
we need to use the native signal number, not the one translated
with bsd_to_linux_signal().
In practical terms, this fixes golang.
Reviewed By: dchagin
Fixes: 135dd0cab5
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D31298
When a cmpset for removing the PG_RW bit in pmap_promote_pde() fails,
there is no need to repeat the alignment, PG_A, and PG_V tests just to
reload the PTE's value. The only bit that we need be concerned with at
this point is PG_M. Use fcmpset instead.
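
As a rough userland analogy of the cmpset/fcmpset difference, C11's atomic_compare_exchange_weak() also writes the current value back into the "expected" variable on failure, so the retry loop only has to re-check the bits that can still change. The names X_PG_RW, X_PG_M and clear_rw_if_clean() are hypothetical; this is a sketch of the idea, not the pmap code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not taken from the pmap sources. */
#define X_PG_RW	UINT64_C(0x002)	/* writable */
#define X_PG_M	UINT64_C(0x040)	/* modified (dirty) */

/*
 * Clear the RW bit of a clean entry and return the value finally observed.
 * A failed compare-exchange refreshes 'old', so only the M/RW test is
 * repeated; no other test has to be redone just to reload the entry.
 */
static uint64_t
clear_rw_if_clean(_Atomic uint64_t *pte)
{
	uint64_t old;

	assert(pte != NULL);
	old = atomic_load(pte);
	while ((old & (X_PG_M | X_PG_RW)) == X_PG_RW) {
		if (atomic_compare_exchange_weak(pte, &old, old & ~X_PG_RW)) {
			old &= ~X_PG_RW;
			break;
		}
	}
	return (old);
}
```

A dirty entry is returned unchanged, while a clean writable entry comes back with the RW bit removed.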
MFC after: 1 week
The old expression happens to provide the correct answer, but assumes that
the kernel is loaded at physical address zero, with a 2M gap. Do not use
kernphys to calculate the KVA of the kernel text start, just explicitly write
out KERNBASE and the hole size.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31121
The call to pmap_allow_2m_x_page() in pmap_enter_object() is redundant.
Specifically, even without the call to pmap_allow_2m_x_page() in
pmap_enter_object(), pmap_allow_2m_x_page() is eventually called by
pmap_enter_pde(), so the outcome will be the same. Essentially,
calling pmap_allow_2m_x_page() in pmap_enter_object() amounts to
"optimizing" for the unexpected case.
Reviewed by: kib
MFC after: 1 week
The initial patch from the submitter was adapted by me to prevent
unconditional FUTEX_REQUEUE use.
PR: 255947
Submitted by: Philippe Michaud-Boudreault
Differential Revision: https://reviews.freebsd.org/D30332
The vDSO initialisation order should be as follows:
- native abi init via exec_sysvec_init();
- vDSO symbols queued to the linux_vdso_syms list;
- linux_vdso_install();
- linux_exec_sysvec_init();
Since exec_sysvec_init() is called with SI_ORDER_ANY (last) at the SI_SUB_EXEC
order, move linux_vdso_install() and linux_exec_sysvec_init() to the
SI_SUB_EXEC+1 order.
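
The ordering argument can be sketched as a sort on (subsystem, order) pairs: anything registered at SI_SUB_EXEC+1 runs after everything at SI_SUB_EXEC, including SI_ORDER_ANY entries. The structure and the X_ constants below are hypothetical stand-ins with illustrative values; the real definitions live in sys/kernel.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative values only; see sys/kernel.h for the real constants. */
#define X_SUB_EXEC	0x7400000u
#define X_ORDER_ANY	0xfffffffu

/* Hypothetical stand-in for a registered initialisation step. */
struct init_step {
	unsigned subsystem;
	unsigned order;
	const char *name;
};

/* Compare by subsystem first, then by order within the subsystem. */
static int
init_step_cmp(const void *a, const void *b)
{
	const struct init_step *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

static void
sort_init_steps(struct init_step *steps, size_t n)
{
	assert(steps != NULL || n == 0);
	qsort(steps, n, sizeof(steps[0]), init_step_cmp);
}
```

With this comparison, a step at X_SUB_EXEC + 1 sorts after a step at X_SUB_EXEC even when the latter uses X_ORDER_ANY.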
Reviewed by: trasz
Differential Revision: https://reviews.freebsd.org/D30902
MFC after: 2 weeks
To reduce the diff between architectures, constify the vDSO
install/deinstall functions, as on arm64.
Reviewed by: emaste
Differential revision: https://reviews.freebsd.org/D30901
MFC after: 2 weeks
The vDSO (virtual dynamic shared object) is a small shared library that the
kernel maps R/O into the address space of all Linux processes on image
activation. The vDSO is a fully formed ELF image, shared by all processes
with the same ABI, and has no process-private data.
The primary purpose of the vDSO:
- non-executable stack, signal trampolines not copied to the stack;
- signal trampolines unwind, mandatory for the NPTL;
- to avoid context-switch overhead, frequently used system calls can be
implemented in the vDSO: for now gettimeofday and clock_gettime.
The first two have been implemented, so add the implementation of system
calls.
The system call implementation is based on the native timekeeping code, with
some limitations:
- ifunc can't be used, as the vDSO is mapped r/o into the process VA and rtld
can't relocate symbols;
- reading HPET memory is not implemented for now (TODO).
In case of any error, the vDSO system calls fall back to the kernel system
calls. For unimplemented vDSO system calls, prototypes are added which call
the corresponding kernel system call.
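
The fallback behaviour can be sketched in portable C as follows. The fast-path hook type and both function names are hypothetical; the real vDSO code is not shown here, and clock_gettime(2) merely stands in for "the kernel system call".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical fast-path hook: returns 0 on success or an errno value. */
typedef int (*fast_gettime_fn)(clockid_t, struct timespec *);

static int
sketch_clock_gettime(fast_gettime_fn fast, clockid_t id, struct timespec *ts)
{
	assert(ts != NULL);
	/* Try the userspace timekeeping path first. */
	if (fast != NULL && fast(id, ts) == 0)
		return (0);
	/* On any error, fall back to the kernel system call. */
	return (clock_gettime(id, ts));
}

/* A fast path that always fails, e.g. for an unsupported time source. */
static int
fast_unsupported(clockid_t id, struct timespec *ts)
{
	(void)id;
	(void)ts;
	return (ENOSYS);
}
```

Calling sketch_clock_gettime(fast_unsupported, CLOCK_REALTIME, &ts) still fills in ts, because the failing fast path hands over to the system call.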
Tested by: trasz (arm64)
Differential revision: https://reviews.freebsd.org/D30900
MFC after: 2 weeks
Temporarily add stubs to the Linux emulation layer which call the existing hook.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30911
MFC after: 2 weeks
In preparation for vDSO code revision get rid of incomplete vDSO methods
from locore, but leave .note.Linux section commented out.
The .note.Linux section is used by the glibc rtld to get the kernel version,
which saves one system call. I'll try to implement it later, if I figure out
how to use it with jails.
MFC after: 2 weeks
The underlying types for both are the same so arguably this doesn't
really matter, but using the wrong type is still confusing and
technically incorrect.
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned. This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.
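
From the handler's side, reading the value looks roughly like the sketch below. It uses SIGUSR1 and sigqueue(2) purely as a stand-in for the kernel delivering the signal, so it runs anywhere POSIX realtime signals are available; the function names are hypothetical and the signal actually raised on a capability violation is not modelled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t last_value = -1;

static void
on_signal(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	/* In the capability-violation case this carries the syscall number. */
	last_value = info->si_value.sival_int;
}

/* Install the handler, queue a signal carrying 'value', return what it saw. */
static int
install_and_simulate(int value)
{
	struct sigaction sa;
	union sigval sv;

	assert(value >= 0);
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_signal;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return (-1);
	sv.sival_int = value;
	/* Stand-in for the kernel delivering the signal. */
	if (sigqueue(getpid(), SIGUSR1, sv) != 0)
		return (-1);
	return ((int)last_value);
}
```

A real handler would then forward the recorded number, together with the saved arguments, to the more-privileged process.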
This reapplies 3a522ba1bc with a fix for
the static assertion failure on i386.
Approved by: markj (mentor)
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D29185
pmap_copy() is used to speculatively create mappings, so those mappings
should not have their access bit preset.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31162
Reduce the live ranges for three variables so that they do not span the
call to PHYS_TO_VM_PAGE(). This enables the compiler to generate
slightly smaller machine code.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31161
The ACPI parsing code for rid ranges wrongly assumed that there is
only one start/end device ID range. Besides, ivhd_dev_parse()
never worked as intended: the start/end rid info was always zero.
Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed by any IOMMU unit
(e.g. the IOMMU unit itself) are no-ops from now on. There is also a minor
fix for incorrect %b format string usage.
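
A minimal sketch of the table-driven lookup, assuming each unit keeps a growable array of inclusive rid ranges. The structure and function names are hypothetical and do not mirror the driver's actual softc layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types for this sketch; not the driver's structures. */
struct dev_range {
	uint16_t start_rid;
	uint16_t end_rid;		/* inclusive */
};

struct iommu_unit {
	struct dev_range *ranges;	/* grows on demand */
	size_t count;
	size_t capacity;
};

static bool
unit_add_range(struct iommu_unit *u, uint16_t start, uint16_t end)
{
	assert(u != NULL && start <= end);
	if (u->count == u->capacity) {
		size_t ncap = u->capacity == 0 ? 4 : u->capacity * 2;
		struct dev_range *n = realloc(u->ranges, ncap * sizeof(*n));

		if (n == NULL)
			return (false);
		u->ranges = n;
		u->capacity = ncap;
	}
	u->ranges[u->count++] = (struct dev_range){ start, end };
	return (true);
}

/* Returns the index of the unit governing 'rid', or -1 if none does. */
static int
find_unit(const struct iommu_unit *units, size_t nunits, uint16_t rid)
{
	for (size_t i = 0; i < nunits; i++)
		for (size_t j = 0; j < units[i].count; j++)
			if (rid >= units[i].ranges[j].start_rid &&
			    rid <= units[i].ranges[j].end_rid)
				return ((int)i);
	return (-1);
}
```

A caller that gets -1 back would treat the operation as a no-op, matching the behaviour described for ungoverned devices.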
Tested on my EPYC 7282.
Sponsored by: The FreeBSD Foundation
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D30827
This controller supports 2.5G/1G/100MB/10MB speeds, and allows
tx/rx checksum offload, TSO, LRO, and multi-queue operation.
The driver was derived from code contributed by Intel, and modified
by Netgate to fit into the iflib framework.
Thanks to Mike Karels for testing and feedback on the driver.
Reviewed by: bcr (manpages), kbowling, scottl, erj
MFC after: 1 month
Relnotes: yes
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30668
Remove debugging stuff, since it's arguably not needed in a minimal
setup. Also remove vlan, tuntap and gif, since they can be loaded as modules.
imp didn't include the part of the patch that removed xen guest support.
Xen guest is relatively small and has no way of being loaded.
Reviewed by: imp
PR: 229564
MFC After: 3 days
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned. This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.
Approved by: markj (mentor)
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D29185
Otherwise KASAN may generate false positives if the trapframe was
written into a poisoned region of the stack.
Reported by: pho
Sponsored by: The FreeBSD Foundation
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI. Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.
Requested by: dchagin
Reviewed by: dchagin, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
In a few places, on a failed compare-and-set, both the amd64 pmap and
the arm64 pmap repeat tests on bits that won't change state while the
pmap is locked. Eliminate some of these unnecessary tests.
Reviewed by: andrew, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31014
Implement dumping core for Linux binaries on amd64, for both
32- and 64-bit executables. Some bits are still missing.
This is based on a prototype by chuck@.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30019
Eliminate some unnecessary unlocking and relocking when we have to retry
the operation to avoid deadlock. (All of the other pmap functions that
iterate over a PV list already implemented retries without these same
unlocking and relocking operations.)
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30951
This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`,
and `sv_elf_core_prepare_notes` fields to `struct sysentvec`,
and modifies imgact_elf.c to make use of them instead
of hardcoding FreeBSD-specific values. It also updates all
of the ABI definitions to preserve current behaviour.
This makes it possible to implement non-native ELF coredump
support without unnecessary code duplication. It will be used
for Linux coredumps.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30921
Hide the vDSO defines to the linux32_sysvec as they are not intended to
be used outside of it. Fix LINUX32_PS_STRINGS, use the size of
struct linux32_ps_strings instead of a numeric constant.
MFC after: 2 weeks
Assuming we can't run on i486 or i586 class CPUs, retire the linux_kplatform
variable and use a hardcoded 'machine' value in linux_newuname().
I added linux_kplatform for consistency with linux_platform, which is
placed in the vDSO to avoid an extra copyout to the stack for AT_PLATFORM at
exec time.
This is the first stage of Linuxulator's vdso revision.
Reviewed by: trasz, imp
Differential Revision: https://reviews.freebsd.org/D30774
MFC after: 2 weeks
Stop confusing people: retire the COMPAT_LINUX and COMPAT_LINUX32 kernel
build options. Since we have 32- and 64-bit Linux emulators, we can't build
both emulators into the kernel together. I don't think it matters, as Linux
emulation depends on loadable modules (via rc).
Cut LINPROCFS and LINSYSFS for consistency.
PR: 215061
Reviewed by: bcr (manpages), trasz
Differential Revision: https://reviews.freebsd.org/D30751
MFC after: 2 weeks
This makes it easier to compare the two. This involves moving
the mutex slightly lower down, but there should be no functional
changes.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30541
Currently the Linux emulation layer uses the in-kernel ppoll(2) without
converting the user-supplied fd 'events', and does not convert the
kernel-supplied fd 'revents'.
At least POLLRDHUP is handled differently by FreeBSD than by
Linux. It seems that Linux silently ignores POLLRDHUP on non-socket fds,
unlike FreeBSD, which checks more strictly and fails.
Rework the Linux ppoll, using kern_poll and converting 'events'
and 'revents' values.
While here, move the poll event defines to the MI part of the code, as they
are mostly identical on all arches except arm.
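
The 'events'/'revents' conversion amounts to a table-driven bit translation in both directions. The sketch below uses placeholder bit values and hypothetical names (L_ for the Linux side, N_ for the native side); the real constants and the complete set of flags come from the respective headers.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values chosen for illustration only. */
#define L_POLLIN	0x0001
#define L_POLLOUT	0x0004
#define L_POLLRDHUP	0x2000
#define N_POLLIN	0x0001
#define N_POLLOUT	0x0004
#define N_POLLRDHUP	0x4000

struct poll_bit {
	int linux_bit;
	int native_bit;
};

static const struct poll_bit poll_map[] = {
	{ L_POLLIN, N_POLLIN },
	{ L_POLLOUT, N_POLLOUT },
	{ L_POLLRDHUP, N_POLLRDHUP },
};
#define	POLL_MAP_LEN	(sizeof(poll_map) / sizeof(poll_map[0]))

/* Translate user-supplied Linux 'events'; unknown bits are dropped. */
static int
linux_to_native_events(int levents)
{
	int nevents = 0;

	assert(levents >= 0);
	for (size_t i = 0; i < POLL_MAP_LEN; i++)
		if ((levents & poll_map[i].linux_bit) != 0)
			nevents |= poll_map[i].native_bit;
	return (nevents);
}

/* Translate kernel-supplied native 'revents' back for the Linux caller. */
static int
native_to_linux_revents(int nrevents)
{
	int lrevents = 0;

	assert(nrevents >= 0);
	for (size_t i = 0; i < POLL_MAP_LEN; i++)
		if ((nrevents & poll_map[i].native_bit) != 0)
			lrevents |= poll_map[i].linux_bit;
	return (lrevents);
}
```

A per-descriptor policy, such as ignoring the RDHUP bit on non-socket descriptors, would be applied on top of this translation and is left out here.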
Differential Revision: https://reviews.freebsd.org/D30716
MFC after: 2 weeks
The original %b description string is wrong.
Sponsored by: The FreeBSD Foundation
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D30805
so that PHYS_TO_VM_PAGE() and consequently physcopyin() work for them
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30785
Require queueing of the signals with default action, and disable
dequeueing SIGCHLD on wait for live process.
Reported and tested by: dchagin
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30675
When a process has used sysarch(2) to specify descriptors for its
private LDT, upon rfork(RFMEM) descriptors are copied into the new child
process. Any updates to the descriptors are thus reflected to all other
processes sharing the vmspace. However, this is incorrect in the rather
obscure case where the child process was created before the LDT was
modified. Fix this by only modifying other processes which already
share the LDT.
Reported by: syzkaller
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Retire ksiginfo_to_lsiginfo function, use siginfo_to_lsiginfo instead.
Convert rt_sigtimedwait siginfo variables to well known names.
MFC after: 2 weeks
Otherwise it is copied from the creating thread. Then, if either thread
exits, the other is left with a dangling pointer, typically resulting in
a page fault upon the next context switch.
Reported by: syzkaller
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30607
We only need to ensure that interrupts are disabled when handling a
fault from iret. Otherwise it's possible to trigger the assertion
legitimately, e.g., by copying in from an invalid address.
Fixes: 4a59cbc12
Reported by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30594
When PTI is enabled, we may have been on the trampoline stack when iret
faults. So, we have to switch back to the regular stack before
re-entering trap().
trap() has the somewhat strange behaviour of re-enabling interrupts when
handling certain kernel-mode exceptions. In particular, it was doing
this for exceptions raised during execution of iret. When switching
away from the trampoline stack, however, the thread must not be migrated
to a different CPU. Fix the problem by simply leaving interrupts
disabled during the window.
Reported by: syzbot+6cfa544fd86ad4647ffc@syzkaller.appspotmail.com
Reported by: syzbot+cfdfc9e5a8f28f11a7f5@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30578
Do it under the SI_SUB_CPU sysinit, instead of the much later SI_SUB_DRIVERS.
SI_SUB_DRIVERS survived from the times when the FPU used a real ISA
attachment; now it is only a PnP stub claiming the ID.
PR: 255997
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30512
If one of the copyin family of routines faults, the kernel does clear PSL.AC
on fault entry, but the AC flag of the faulted frame is kept intact. Since
the onfault handler is effectively a jump, AC survives until syscall exit.
Reported by: m00nbsd, via Sony
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
admbugs: 975
While here, fix all links to older en_US.ISO8859-1 documentation
in the src/ tree.
PR: 255026
Reported by: Michael Büker <freebsd@michael-bueker.de>
Reviewed by: dbaio
Approved by: blackend (mentor), re (gjb)
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D30265
The AP startup extern variable declarations are no longer needed,
since PVHv2 uses the native AP startup path using the lapic. Remove
the declarations and make the variables static to mp_machdep.c.
Sponsored by: Citrix Systems R&D
PVHv1 was officially removed from Xen in 4.9, so just axe the related
code from FreeBSD.
Note FreeBSD supports PVHv2, which is the replacement for PVHv1.
Sponsored by: Citrix Systems R&D
Reviewed by: kib, Elliott Mitchell
Differential Revision: https://reviews.freebsd.org/D30228
The change to futex_andl_smap() should have ordered stac before the
load from a user address, otherwise it does not fix anything.
Fixes: fb58045145 ("linux: Fix SMAP-enabled futex routines")
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Some of them were dereferencing the user pointer before disabling SMAP.
PR: 255591
Reviewed by: kib
Tested by: pitwuu@gmail.com
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30276
This fixes strace(1) erroneously reporting return values
as "Function not implemented", combined with reporting the binary
ABI as X32.
Very similar code in linux_ptrace_getregs() is left as it is - it's
probably wrong too, but I don't have a way to test it.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D29927
It is defined as a uint64_t in the UEFI spec. As it's not used as a
pointer by the kernel follow this and define it as the same in the
kernel.
Reviewed by: kib, manu, imp
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D29759
Testing on real hardware uncovered an issue, and since I do not have
access to the machine, disable this until the bug can be fixed.
Reported by: "Pieper, Jeffrey E" <jeffrey.e.pieper@intel.com>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
When setting up trampoline mapping for LA57 switcher, it is possible
that TLB still has some random mapping at that address.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Option `FIB_ALGO` gates new modular fib lookup functionality,
enabling more performant routing table lookups and improving
control plane convergence under load.
Detailed feature description is available in D27401.
Reviewed By: olivier, gnn
Differential Revision: https://reviews.freebsd.org/D28434
Previously we returned the error from native ptrace(2), ENOMEM.
This confused Linux strace(1).
Reviewed By: emaste
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D29925
- Use malloc(9) to allocate the ivhd_hdrs list. The previous assumption
that there are at most 10 IVHDs in a system is not true. A
counterexample would be a system with 4 IOMMUs, where each IOMMU is related
to IVHDs of type 10h, 11h and 40h in the ACPI IVRS table.
- Always scan through the whole ivhd_hdrs list to find IVHDs that have
the same DeviceId but a less prioritized IVHD type.
Sponsored by: The FreeBSD Foundation
MFC with: 74ada297e8
Reviewed by: grehan
Approved by: lwhsu (mentor)
Differential Revision: https://reviews.freebsd.org/D29525
- Initialize KASAN before executing SYSINITs.
- Add a GENERIC-KASAN kernel config, akin to GENERIC-KCSAN.
- Increase the kernel stack size if KASAN is enabled. Some of the
ASAN instrumentation increases stack usage and it's enough to
trigger stack overflows in ZFS.
- Mark the trapframe as valid in interrupt handlers if it is
assigned to td_intr_frame. Otherwise, an interrupt in a function
which creates a poisoned alloca region can trigger false positives.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29455
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map. This region is the shadow map. The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid. Various kernel allocators call kasan_mark() to update
the shadow map.
Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN. UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.
The shadow map uses one byte per eight bytes in the kernel map. In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.
When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map. kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.
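
The one-byte-per-eight-bytes scheme can be sketched in userland as below. The shadow encoding used here (0 means all eight bytes valid, 1 to 7 means only that many leading bytes are valid, negative means poisoned) is the usual ASAN convention and is an assumption of this sketch, as are all the sketch_ names; the kernel runtime is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_SCALE_SHIFT	3	/* one shadow byte per 8 bytes */
#define REGION_SIZE		64	/* stand-in for the tracked map */

static int8_t shadow[REGION_SIZE >> SHADOW_SCALE_SHIFT];

/* Mark the whole region as invalid. */
static void
sketch_poison_all(void)
{
	for (size_t i = 0; i < sizeof(shadow); i++)
		shadow[i] = -1;
}

/* Mark 'len' bytes starting at the 8-byte-aligned offset 'off' as valid. */
static void
sketch_mark_valid(size_t off, size_t len)
{
	size_t g;

	assert((off & 7) == 0 && off + len <= REGION_SIZE);
	g = off >> SHADOW_SCALE_SHIFT;
	for (; len >= 8; len -= 8)
		shadow[g++] = 0;
	if (len > 0)
		shadow[g] = (int8_t)len;	/* partially valid granule */
}

/* Check every byte of an access against its shadow byte. */
static bool
sketch_access_ok(size_t off, size_t len)
{
	for (size_t i = off; i < off + len; i++) {
		int8_t s;

		if (i >= REGION_SIZE)
			return (false);
		s = shadow[i >> SHADOW_SCALE_SHIFT];
		if (s < 0 || (s > 0 && (i & 7) >= (size_t)s))
			return (false);
	}
	return (true);
}
```

After poisoning everything and marking 12 bytes valid at offset 16, an access of 12 bytes at offset 16 passes while a 13-byte access fails on the last byte.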
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29417
This should be a no-op; the purpose of this is to reduce
a spurious difference between Linuxulator and Linux, to make
debugging core dumps slightly easier.
Note that AT_HWCAP2 we pass to Linux binaries is always 0,
instead of being equal to 'cpu_feature2'. This matches what
I've observed under Ubuntu Focal VM.
Reviewed By: chuck, dchagin
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D29609
This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.
Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29692
instead of manually inlining it
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29687
instead of manual zeroing of the debug registers file in pcb.
This centralizes the cleaning code, but the practical difference is
that PCB_DBREGS flag is cleared, saving some operations on context
switching.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29687
Move the code from exec_setregs() to reset debug registers state on exec,
to the x86_clear_dbregs() helper
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29687
On some systems (e.g. Lenovo ThinkPad X240, Apple MacBookPro12,1)
the SMBIOS entry point is not found in the <0xFFFFF space.
Follow the SMBIOS spec and use the EFI Configuration Table for
locating the entry point on EFI systems.
Reviewed by: rpokala, dab
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D29276
This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.
Sponsored by: The FreeBSD Foundation
MFC with: 74ada297e8
Reported by: Oleg Ginzburg <olevole@olevole.ru>
Reviewed by: grehan
Approved by: philip (mentor)
Differential Revision: https://reviews.freebsd.org/D29521
--Eliminate a big ifdef that encompassed all currently-supported
architectures except mips and powerpc32. This applied to the case
in which we've allocated a superpage but the pager-populated range
is insufficient for a superpage mapping. For platforms that don't
support superpages the check should be inexpensive as we shouldn't
get a superpage in the first place. Make the normal-page fallback
logic identical for all platforms and provide a simple implementation
of pmap_ps_enabled() for MIPS and Book-E/AIM32 powerpc.
--Apply the logic for handling pmap_enter() failure if a superpage
mapping can't be supported due to additional protection policy.
Use KERN_PROTECTION_FAILURE instead of KERN_FAILURE for this case,
and note Intel PKU on amd64 as the first example of such protection
policy.
Reviewed by: kib, markj, bdragon
Differential Revision: https://reviews.freebsd.org/D29439
The remote protocol allows for implementations to report more specific
reasons for the break in execution back to the client [1]. This is
entirely optional, so it is only implemented for amd64, arm64, and i386
at the moment.
[1] https://sourceware.org/gdb/current/onlinedocs/gdb/Stop-Reply-Packets.html
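
For illustration, a stop reply carrying a watchpoint reason has the form "T" + two hex digits of signal number + "watch:" + hex address + ";", per the GDB documentation referenced above. The helper below is a hypothetical sketch of formatting such a payload (without the "$...#checksum" framing), not the kernel's gdb stub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reason codes for this sketch. */
enum stop_reason {
	STOP_SIGNAL,	/* no specific reason: plain "S" reply */
	STOP_WATCH,	/* write watchpoint */
	STOP_RWATCH,	/* read watchpoint */
	STOP_AWATCH	/* access watchpoint */
};

/* Format the packet payload; returns the snprintf() result. */
static int
format_stop_reply(char *buf, size_t len, int sig, enum stop_reason why,
    uint64_t addr)
{
	const char *tag;

	assert(buf != NULL && sig >= 0 && sig <= 0xff);
	switch (why) {
	case STOP_WATCH:
		tag = "watch";
		break;
	case STOP_RWATCH:
		tag = "rwatch";
		break;
	case STOP_AWATCH:
		tag = "awatch";
		break;
	default:
		return (snprintf(buf, len, "S%02x", (unsigned)sig));
	}
	return (snprintf(buf, len, "T%02x%s:%llx;", (unsigned)sig, tag,
	    (unsigned long long)addr));
}
```

Clients that do not understand the extra field can still read the signal number from the first three characters of the reply.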
Reviewed by: jhb
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
NetApp PR: 51
Differential Revision: https://reviews.freebsd.org/D29174
Use the new kdb variants. Print more specific error messages.
Reviewed by: jhb, markj
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D29156
Add wrappers around the dbreg interface that can be consumed by MI
kernel debugger code. The dbreg functions themselves are updated to
return error codes, not just -1. dbreg_set_watchpoint() is extended to
accept access bits as an argument.
Reviewed by: jhb, kib, markj
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D29155
Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN. Lay a bit of groundwork for KASAN and KMSAN.
When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h. These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.
No functional change intended.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."
After some tracing, the problem turned out to be due to the interaction between
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table, and ivhdX device_ts are added under the acpi bus, while there is
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI interrupts are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. The IOMMU PCI function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on the IOMMU PCI
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr, so msi_data
and msi_addr stay at the value 0. When pci_driver_added() loops
over the children of a PCI bus and does pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 are written to the MSI
capability of the IOMMU PCI function, thus explaining the errors in
dmesg.
This change includes an amdiommu driver which currently handles attach and
detach and provides DEVMETHODs for setting up and tearing down
interrupts. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.
This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.
Sponsored by: The FreeBSD Foundation
Reviewed by: jhb
Approved by: philip (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28984
e4b8deb222 removed the last in-tree uses of PCPU_INC(). Its
potential benefit is also practically nonexistent. Non-x86
platforms already implement it as PCPU_ADD(..., 1), and according
to [0] there are no recent x86 processors for which the 'inc'
instruction provides a performance benefit over the equivalent
memory-operand form of the 'add' instruction. The only remaining
benefit of 'inc' is smaller instruction size, which in this case
is inconsequential given the limited number of per-CPU data consumers.
[0]: https://www.agner.org/optimize/instruction_tables.pdf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29308
This is a prerequisite to using these functions outside of ddb, but also
provides some cleanup and minor refactoring. This code is almost
entirely duplicated between the two implementations, the only
significant difference being the lack of dbreg synchronization on i386.
Cleanups are:
- demote some internal functions to static
- use the constant NDBREGS instead of a '4' literal
- remove K&R definitions
- some added comments
Reviewed by: kib, jhb
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29153
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.
We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for address updates.
[1]: https://github.com/freebsd/uefi-edk2/pull/9/
Originally by: scottph
MFC After: 1 week
Sponsored by: Intel Corporation
Reviewed by: grehan
Approved by: philip (mentor)
Differential Revision: https://reviews.freebsd.org/D24066
As follow-on work to e4b8deb222, move page table page
allocation and freeing into their own functions. Use these
functions to provide separate kernel vs. user page table page
accounting, and to wrap common tasks such as management of
zero-filled page state.
Requested by: markj, kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29151
POSIX states that new threads created via pthread_create() should
inherit the "floating point environment" from the creating thread.
Discussed with: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29204
- Use update_pcb_bases() when updating FS or GS base addresses to
permit use of FSBASE and GSBASE in Linux processes. This also sets
PCB_FULL_IRET. linux32 was setting PCB_32BIT which should be a
no-op (exec sets it).
- Remove write-only variables to construct unused segment descriptors
for linux32.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29026
Before the pcb is copied to the new thread during cpu_fork() and
cpu_copy_thread(), the kernel re-reads the current register values in
case they are stale. This is done by setting PCB_FULL_IRET in
pcb_flags.
This works fine for user threads, but the creation of kernel processes
and kernel threads does not follow the normal synchronization rules for
pcb_flags. Specifically, new kernel processes are always forked from
thread0, not from curthread, so adjusting pcb_flags via a simple
instruction without the LOCK prefix can race with thread0 running on
another CPU. Similarly, kthread_add() clones from the first thread in
the relevant kernel process, not from curthread. In practice, Netflix
encountered a panic where the pcb_flags in the first kthread of the
KTLS process were trashed due to update_pcb_bases() in
cpu_copy_thread() running from thread0 to create one of the other KTLS
threads racing with the first KTLS kthread calling fpu_kern_thread()
on another CPU. In the panicking case, the write to update pcb_flags
in fpu_kern_thread() was lost triggering an "Unregistered use of FPU
in kernel" panic when the first KTLS kthread later tried to use the
FPU.
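
The lost-update hazard can be shown with a hand-interleaved, single-threaded simulation: two CPUs each load the flags word before either stores it back. The flag names are hypothetical, and the atomic variant is shown only to contrast the two kinds of update; it is not a description of how this change resolves the race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flags for this sketch. */
#define F_A	0x1u	/* set by "CPU 0" */
#define F_B	0x2u	/* set by "CPU 1" */

/* Plain read-modify-write, with the racy interleaving written out by hand. */
static uint32_t
racy_interleaving(uint32_t flags)
{
	uint32_t cpu0, cpu1;

	assert((flags & (F_A | F_B)) == 0);
	cpu0 = flags;		/* CPU 0 loads */
	cpu1 = flags;		/* CPU 1 loads the same old value */
	flags = cpu0 | F_A;	/* CPU 0 stores */
	flags = cpu1 | F_B;	/* CPU 1 stores and overwrites F_A */
	return (flags);
}

/* Atomic read-modify-write: each update is indivisible, so none is lost. */
static uint32_t
atomic_updates(uint32_t initial)
{
	_Atomic uint32_t flags = initial;

	atomic_fetch_or(&flags, F_A);
	atomic_fetch_or(&flags, F_B);
	return (atomic_load(&flags));
}
```

In the racy interleaving only F_B survives, which corresponds to the lost fpu_kern_thread() write described above.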
Reported by: gallatin
Discussed with: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29023
This change converts most of the counters in the amd64 pmap from
global atomics to scalable counter(9) counters. Per discussion
with kib@, it also removes the handrolled per-CPU PCID save count
as it isn't considered generally useful.
The bulk of these counters remain guarded by PV_STATS, as it seems
unlikely that they will be useful outside of very specific debugging
scenarios. However, this change does add two new counters that
are available without PV_STATS. pt_page_count and pv_page_count
track the number of active page table pages and physical-to-virtual
list pages, respectively. These will be useful in evaluating
the memory footprint of pmap structures under various workloads,
which will help to guide future changes in this area.
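
The idea behind per-CPU counters is that each CPU updates its own slot without contention and readers sum all slots. The sketch below is a simplified userland model with hypothetical names; it ignores how counter(9) actually allocates per-CPU storage or pins the updating thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NCPU	4	/* assumed CPU count for the sketch */

struct sketch_counter {
	uint64_t slot[SKETCH_NCPU];	/* one slot per CPU */
};

/* Update only the caller's slot; no shared cache line is written. */
static void
sketch_counter_add(struct sketch_counter *c, unsigned cpu, int64_t delta)
{
	assert(c != NULL && cpu < SKETCH_NCPU);
	/* Unsigned wraparound makes negative deltas sum correctly. */
	c->slot[cpu] += (uint64_t)delta;
}

/* Readers pay the cost: the value is the sum over all slots. */
static uint64_t
sketch_counter_fetch(const struct sketch_counter *c)
{
	uint64_t sum = 0;

	assert(c != NULL);
	for (size_t i = 0; i < SKETCH_NCPU; i++)
		sum += c->slot[i];
	return (sum);
}
```

A page allocated on one CPU and freed on another leaves individual slots "wrong", but the fetched sum is still the correct net count.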
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28923
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29103
invltlb_invpcid_pti_handler() was requesting delayed TLB invalidation
even for processes that aren't using PTI. With an out-of-tree
change to avoid PTI for non-jailed root processes, this caused an
assertion failure in pmap_activate_sw_pcid_pti() when context-switching
between PTI and non-PTI processes.
Reviewed by: bdrewery kib tychon
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D29094
Otherwise during attach newbus prints "nexus0", which is not very
useful.
The generic nexus device is already quiet, as is nexus_acpi on arm64.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The purpose of these checks is to ensure that the address of the
next-level page table page is valid, since nothing is synchronizing with
a concurrent update of the large map and large map PTPs are freed to the
system. However, if PG_PS is set, there is no next level.
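
In other words, a page table walk should only validate and follow the next-level address when the entry is not a leaf. A minimal sketch, using hypothetical X_ names for the x86 valid, page-size and frame bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch. */
#define X_PG_V		UINT64_C(0x001)			/* valid */
#define X_PG_PS		UINT64_C(0x080)			/* large page */
#define X_PG_FRAME	UINT64_C(0x000ffffffffff000)	/* address bits */

/*
 * Returns true and stores the next-level table address only if the entry
 * is valid and is not a large-page mapping. A PG_PS entry is a leaf, so
 * there is no next-level page table page whose address could be checked.
 */
static bool
next_level_table(uint64_t entry, uint64_t *pa)
{
	assert(pa != NULL);
	if ((entry & X_PG_V) == 0)
		return (false);
	if ((entry & X_PG_PS) != 0)
		return (false);
	*pa = entry & X_PG_FRAME;
	return (true);
}
```

Any address-validity check on the next level belongs after the PG_PS test, on the path where the function returns true.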
Reported by: rpokala
Reviewed by: kib
Tested by: rpokala
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28922
In some cases the DELAY implementation on amd64 can recurse on a spin
mutex in the i8254 early delay code. Detect when this is going to
happen and don't call DELAY in this case. It is safe not to delay here;
the only issue is that KCSAN may not detect some data races.
Reviewed by: kib
Tested by: arichardson
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D28895
Add it to the x86 GENERIC and MINIMAL kernels
Sponsored by: Ampere Computing LLC
Submitted by: Klara Inc.
Reviewed by: rpokala
Differential Revision: https://reviews.freebsd.org/D28738