Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN. Lay a bit of groundwork for KASAN and KMSAN.
When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h. These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.
No functional change intended.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."
After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.
This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.
This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.
Sponsored by: The FreeBSD Foundation
Reviewed by: jhb
Approved by: philip (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28984
e4b8deb22227 removed the last in-tree uses of PCPU_INC(). Its
potential benefit is also practically nonexistent. Non-x86
platforms already implement it as PCPU_ADD(..., 1), and according
to [0] there are no recent x86 processors for which the 'inc'
instruction provides a performance benefit over the equivalent
memory-operand form of the 'add' instruction. The only remaining
benefit of 'inc' is smaller instruction size, which in this case
is inconsequential given the limited number of per-CPU data consumers.
[0]: https://www.agner.org/optimize/instruction_tables.pdf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29308
This is a prerequisite to using these functions outside of ddb, but also
provides some cleanup and minor refactoring. This code is almost
entirely duplicated between the two implementations, the only
significant difference being the lack of dbreg synchronization on i386.
Cleanups are:
- demote some internal functions to static
- use the constant NDBREGS instead of a '4' literal
- remove K&R definitions
- some added comments
Reviewed by: kib, jhb
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29153
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.
We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.
[1]: https://github.com/freebsd/uefi-edk2/pull/9/
Originally by: scottph
MFC After: 1 week
Sponsored by: Intel Corporation
Reviewed by: grehan
Approved by: philip (mentor)
Differential Revision: https://reviews.freebsd.org/D24066
As follow-on work to e4b8deb222278b2a, move page table page
allocation and freeing into their own functions. Use these
functions to provide separate kernel vs. user page table page
accounting, and to wrap common tasks such as management of
zero-filled page state.
Requested by: markj, kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29151
POSIX states that new threads created via pthread_create() should
inherit the "floating point environment" from the creating thread.
Discussed with: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29204
- Use update_pcb_bases() when updating FS or GS base addresses to
permit use of FSBASE and GSBASE in Linux processes. This also sets
PCB_FULL_IRET. linux32 was setting PCB_32BIT which should be a
no-op (exec sets it).
- Remove write-only variables to construct unused segment descriptors
for linux32.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29026
Before the pcb is copied to the new thread during cpu_fork() and
cpu_copy_thread(), the kernel re-reads the current register values in
case they are stale. This is done by setting PCB_FULL_IRET in
pcb_flags.
This works fine for user threads, but the creation of kernel processes
and kernel threads do not follow the normal synchronization rules for
pcb_flags. Specifically, new kernel processes are always forked from
thread0, not from curthread, so adjusting pcb_flags via a simple
instruction without the LOCK prefix can race with thread0 running on
another CPU. Similarly, kthread_add() clones from the first thread in
the relevant kernel process, not from curthread. In practice, Netflix
encountered a panic where the pcb_flags in the first kthread of the
KTLS process were trashed due to update_pcb_bases() in
cpu_copy_thread() running from thread0 to create one of the other KTLS
threads racing with the first KTLS kthread calling fpu_kern_thread()
on another CPU. In the panicking case, the write to update pcb_flags
in fpu_kern_thread() was lost triggering an "Unregistered use of FPU
in kernel" panic when the first KTLS kthread later tried to use the
FPU.
Reported by: gallatin
Discussed with: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29023
This change converts most of the counters in the amd64 pmap from
global atomics to scalable counter(9) counters. Per discussion
with kib@, it also removes the handrolled per-CPU PCID save count
as it isn't considered generally useful.
The bulk of these counters remain guarded by PV_STATS, as it seems
unlikely that they will be useful outside of very specific debugging
scenarios. However, this change does add two new counters that
are available without PV_STATS. pt_page_count and pv_page_count
track the number of active physical-to-virtual list pages and page
table pages, respectively. These will be useful in evaluating
the memory footprint of pmap structures under various workloads,
which will help to guide future changes in this area.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28923
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29103
invltlb_invpcid_pti_handler() was requesting delayed TLB invalidation
even for processes that aren't using PTI. With an out-of-tree
change to avoid PTI for non-jailed root processes, this caused an
assertion failure in pmap_activate_sw_pcid_pti() when context-switching
between PTI and non-PTI processes.
Reviewed by: bdrewery kib tychon
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D29094
Otherwise during attach newbus prints "nexus0", which is not very
useful.
The generic nexus device is already quiet, as is nexus_acpi on arm64.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The purpose of these checks is to ensure that the address of the
next-level page table page is valid, since nothing is synchronizing with
a concurrent update of the large map and large map PTPs are freed to the
system. However, if PG_PS is set, there is no next level.
Reported by: rpokala
Reviewed by: kib
Tested by: rpokala
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28922
In some cases the DELAY implementation on amd64 can recurse on a spin
mutex in the i8254 early delay code. Detect when this is going to
happen and don't call delay in this case. It is safe to not delay here
with the only issue being KCSAN may not detect data races.
Reviewed by: kib
Tested by: arichardson
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D28895
Add it to the x86 GENERIC and MINIMAL kernels
Sponsored by: Ampere Computing LLC
Submitted by: Klara Inc.
Reviewed by: rpokala
Differential Revision: https://reviews.freebsd.org/D28738
Tested with glibc test suite.
The C variant in libkern performs excessive branching to find the zero
byte instead of using the bsfq instruction. The same code patched to use
it is still slower than the routine implemented here as the compiler
keeps neglecting to perform certain optimizations (like using leaq).
On top of that the routine can be used as a starting point for copyinstr
which operates on words intead of bytes.
The previous attempt had an instance of swapped operands to andq when
dealing with fully aligned case, which had a side effect of breaking the
code for certain corner cases. Noted by jrtc27.
Sample results:
$(perl -e "print 'A' x 3"):
stock: 211198039
patched:338626619
asm: 465609618
$(perl -e "print 'A' x 100"):
stock: 83151997
patched: 98285919
asm: 120719888
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D28779
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.
In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions. Abstracting this check reduces the diff relative
to FreeBSD. It is perhaps slightly more readable as well.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D28710
linux_shared_page_init() creates an object and grabs and maps a single
page to back the VDSO. When destroying the VDSO object, we failed to
destroy the mapping and free KVA. Fix this.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28696
Allow setting the bootmethod variable from the Xen PVH entry point, in
order to be able to correctly set the underlying firmware mode when
booted as a dom0.
Move the bootmethod variable to be defined in x86/cpu_machdep.c
instead so it can be shared by both i386 and amd64.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D28619
This reverts commit af366d353b84bdc4e730f0fc563853abc338271c.
Trips over '\xa4' byte and terminates early, as found in
lib/libc/gen/setdomainname_test:setdomainname_basic testcase
However, keep moving libkern/strlen.c out of conf/files.
Reported by: lwhsu
The C variant in libkern performs excessive branching to find the
non-zero byte instead of using the bsfq instruction. The same code
patched to use it is still slower than the routine implemented here
as the compiler keeps neglecting to perform certain optimizations
(like using leaq).
On top of that the routine can is a starting point for copyinstr
which operates on words instead of bytes.
Tested with glibc test suite.
Sample results (calls/s):
Haswell:
$(perl -e "print 'A' x 3"):
stock: 211198039
patched:338626619
asm: 465609618
$(perl -e "print 'A' x 100"):
stock: 83151997
patched: 98285919
asm: 120719888
AMD EPYC 7R32:
$(perl -e "print 'A' x 3"):
stock: 282523617
asm: 491498172
$(perl -e "print 'A' x 100"):
stock: 114857172
asm: 112082057
One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.
The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.
Reviewed by: grehan
Differential revision: https://reviews.freebsd.org/D28238
After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.
Reviewed by: grehan
Differential revision: https://reviews.freebsd.org/D28237
vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.
No functional change intended.
Reviewed by: grehan
Differential revision: https://reviews.freebsd.org/D28236
This is a tradeoff which saves jumps for smaller sizes while making
the 8-16 range slower (roughly in line with the other cases).
Tested with glibc test suite.
For example size 3 (most common with vfs namecache) (ops/s):
before: 407086026
after: 461391995
The regressed range of 8-16 (with 8 as example):
before: 540850489
after: 461671032
All page zeroing is using temporal stores with rep movs*, the routine is
unused for several years.
Should a need arise for zeroing using non-temporal stores, a more
optimized variant can be implemented with a more descriptive name.
The ptrace(2) Linux man page claims the syscall returns ESRCH,
if the tracee is not stopped; the native ptrace(2) returns EBUSY.
Sponsored by: The FreeBSD Foundation
Based on discussions on freebsd-arch@, enable KERN_TLS in
GENERIC on amd64, but leave it disabled via the
sysctl kern.ipc.tls.enable. Users wishing to enable
ktls must set kern.ipc.tls.enable=1
While here, fix wording in NOTES to mention that KERN_TLS
also does receive now.
Sponsored by: Netflix
Reviewed by: allanjude
Differential Revision: https://reviews.freebsd.org/D28163
usbhid(4) is disabled by default to avoid conflicts with existing USB HID
drivers. To enable it place following lines to /boot/loader.conf:
hw.usb.usbhid.enable=1
usbhid_load="YES"
Suggested by: jhb
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D28124
This is the superset of the nooptions found in the -DEBUG kernels.
Reviewed by: emaste, manu
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D28152
In particular, using GELI on a root filesystem will only use
accelerated software crypto drivers if they are available before the
root filesystem is mounted. While these modules can be loaded from
the loader, including them in GENERIC provides a better out-of-the-box
experience for users.
Both aesni(4) and armv8crypto(4) provide accelerated implementations
of the default cipher used by GELI (AES-XTS) in addition to other
ciphers.
Reviewed by: mhorne, allanjude, markj
Differential Revision: https://reviews.freebsd.org/D28100