controlled by the TCP_BLACKBOX option.
Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.
Sponsored by: Netflix, Inc.
Bring #includes closer to style(9) and reduce differences between the
(three) MD versions of linux_machdep.c and linux_sysvec.c.
Sponsored by: Turing Robotic Industries Inc.
ptrace_xstate_info).
struct ptrace_xstate_info has 64bit member but ends up with 32bit
one. As result, on amd64 there is a 32bit padding at the end, but not
on i386.
We must clear the padding before doing the copyout. For compat32 case,
we must copyout the structure which does not have the padding at the
end. The later fixes 32bit gdb display of the YMM registers when
running on amd64 kernel.
Reported by: Vlad Tsyrklevich
Reviewed by: brooks (previous version)
Sponsored by: The FreeBSD Foundation
admbugs: 765
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14794
On amd64, efi_enter calls fpu_kern_enter(). This may not be called until
fpuinitstate has been invoked, resulting in a kernel panic with
efirt_load="YES" in loader.conf(5).
Move fpuinitstate a little earlier in SI_SUB_DRIVERS so that we can squeeze
efirt between it and efirtc at SI_SUB_DRIVERS, SI_ORDER_ANY. efidev must be
after efirt and doesn't really need to be at SI_SUB_DEVFS, so drop it at
SI_SUB_DRIVER, SI_ORDER_ANY.
The not immediately obvious dependency of fpuinitstate by efirt has been
noted in both places.
Discussed with: kib, andrew
Reported by: Jakob Alvermark <jakob@alvermark.net>
X-MFC-With: r330868
I accidentally swapped 'linux_fixup_elf' to 'linux_elf_fixup' in amd64's
declaration (only), while bringing this change over from git and
encountering a conflict.
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
context switch code.
Some BIOSes give control to the OS with CR0.WP already set, making the
kernel text read-only before cpu_startup().
Reported by: Peter Lei <peter.lei@ieee.org>
Reviewed by: jtl
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14768
This is a pure syntax patch to create an interface to enable and later
restore write access to the kernel text and other read-only mapped
regions. It is in line with e.g. vm_fault_disable_pagefaults() by
allowing the nesting.
Discussed with: Peter Lei <peter.lei@ieee.org>
Reviewed by: jtl
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14768
It's preferable to have a consistent prefix. This also reduces
differences between the three linux*_sysvec.c files.
Sponsored by: Turing Robotic Industries Inc.
There's a fair amount of duplication between MD linuxulator files.
Make indentation and comments consistent between the three versions of
linux_sysvec.c to reduce diffs when comparing them.
Sponsored by: Turing Robotic Industries Inc.
Three copies of the linuxulator linux_sysvec.c contained identical
BSD to Linux errno translation tables, and future work to support other
architectures will also use the same table. Move the table to a common
file to be used by all. Make it 'const int' to place it in .rodata.
(Some existing Linux architectures use MD errno values, but x86 and Arm
share the generic set.)
This change should introduce no functional change; a followup will add
missing errno values.
MFC after: 3 weeks
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14665
This fixes a problem encountered on the Lenovo Thinkpad X220/Yoga 11e where
runtime services would try to inexplicably jump to other parts of memory
where it shouldn't be when attempting to enumerate EFI vars, causing a
panic.
The virtual mapping is enabled by default and can be disabled by setting
efi_disable_vmap in loader.conf(5).
Reviewed by: kib (earlier version)
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14677
Migrate to modern types before creating MD Linuxolator bits for new
architectures.
Reviewed by: cem
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14676
correctly for the data contained on each memory page.
There are several components to this change:
* Add a variable to indicate the start of the R/W portion of the
initial memory.
* Stop detecting NX bit support for each AP. Instead, use the value
from the BSP and, if supported, activate the feature on the other
APs just before loading the correct page table. (Functionally, we
already assume that the BSP and all APs had the same support or
lack of support for the NX bit.)
* Set the RW and NX bits correctly for the kernel text, data, and
BSS (subject to some caveats below).
* Ensure DDB can write to memory when necessary (such as to set a
breakpoint).
* Ensure GDB can write to memory when necessary (such as to set a
breakpoint). For this purpose, add new MD functions gdb_begin_write()
and gdb_end_write() which the GDB support code can call before and
after writing to memory.
This change is not comprehensive:
* It doesn't do anything to protect modules.
* It doesn't do anything for kernel memory allocated after the kernel
starts running.
* In order to avoid excessive memory inefficiency, it may let multiple
types of data share a 2M page, and assigns the most permissions
needed for data on that page.
Reviewed by: jhb, kib
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14282
therefore, it should be safe to set the NX bit on the PML4E for the
recursive page table mappings. According to the Intel docs, the effect
of the NX bit should propogate to any page reached through a PML4E which
has the NX bit set.
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14333
first available virtual address to a 2MB boundary. After r329071,
create_pagetables() rounds firstaddr up to a 2MB boundary. This ensures
the kernel is mapped in super-pages, which is the point of the logic
in pmap_kmem_choose(). Therefore, it is no longer necessary for
pmap_bootstrap() to round up to the 2MB boundary again.
As pmap_bootstrap() was the only user of pmap_kmem_choose(), we can
delete pmap_kmem_choose().
Reviewed by: kib
MFC after: 2 weeks
X-MFC-with: r329071
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14355
Rename ACPI_IVRS_HARDWARE_NEW to ACPI_IVRS_HARDWARE_EFRSUP, since new definitions add Extended Feature Register support. Use IvrsType to distinguish three types of IVHD - 0x10(legacy), 0x11 and 0x40(with EFR). IVHD 0x40 is also called mixed type since it supports HID device entries.
Fix 2 coverity bugs reported by cem.
Reported by:jkim, cem
Approved by:grehan
Differential Revision://reviews.freebsd.org/D14501
Since that change the system call stack traces look like this:
...
sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe0028e13ac0
amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe0028e13bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0028e13bf0
So, db_nextframe() stopped recognizing the system call frame.
This commit should fix that.
Reviewed by: kib
MFC after: 4 days
imcsmb(4) provides smbus(4) support for the SMBus controller functionality
in the integrated Memory Controllers (iMCs) embedded in Intel Sandybridge-
Xeon, Ivybridge-Xeon, Haswell-Xeon, and Broadwell-Xeon CPUs. Each CPU
implements one or more iMCs, depending on the number of cores; each iMC
implements two SMBus controllers (iMC-SMBs).
*** IMPORTANT NOTE ***
Because motherboard firmware or the BMC might try to use the iMC-SMBs for
monitoring DIMM temperatures and/or managing an NVDIMM, the driver might
need to temporarily disable those functions, or take a hardware interlock,
before using the iMC-SMBs. Details on how to do this may vary from board to
board, and the procedure may be proprietary. It is strongly suggested that
anyone wishing to use this driver contact their motherboard vendor, and
modify the driver as described in the manual page and in the driver itself.
(For what it's worth, the driver as-is has been tested on various SuperMicro
motherboards.)
Reviewed by: avg, jhb
MFC after: 1 week
Relnotes: yes
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D14447
Discussed with: avg, ian, jhb
Tested by: allanjude (previous version), Panasas
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text. To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders. Additional files
still waiting on permission from others are listed in review D14210.
Approved by: dchagin, rdivacky, sos
MFC after: 1 week
MFC with: r329370
Sponsored by: The FreeBSD Foundation
Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify
the guest. In particular, it does not inject any faults or modify
PTEs in the guest when performing an address space translation.
This is used by bhyve's debug server to read and write memory for
the remote debugger.
Reviewed by: grehan
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D14075
The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.
This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.
Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048
When I wrote the patch, I wanted to remove SYSINIT() usage from amd64 code.
There is no reason to keep the divergence any more because iwasaki merged
most amd64 suspend/resume code to i386 with r235622. Note this also fixed
an enge case reported by royger. [1]
Suggested by: jhb
Reviewed by: royger
Tested by: royger [1]
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14400 [1]
No implementation of fpu_kern_enter() can fail, and it was causing needless
error checking boilerplate and confusion. Change the return code to void to
match reality.
(This trivial change took nine days to land because of the commit hook on
sys/dev/random. Please consider removing the hook or otherwise lowering the
bar -- secteam never seems to have free time to review patches.)
Reported by: Lachlan McIlroy <Lachlan.McIlroy AT isilon.com>
Reviewed by: delphij
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14380
These are a convenience for bhyve's debug server to use a single
ioctl for 'g' and 'G' rather than a loop of individual get/set
ioctl requests.
Reviewed by: grehan
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D14074
Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass. If there is no object
associated with the wait, use curthread' policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.
Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.
Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.
Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.
Scetched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D14384
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text. To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders. Additional files
waiting on permission from others are listed in review D14210.
Approved by: kan, marcel, sos, rdivacky
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Small global symbols confuse ddb which matches them against small
unrelated displacements and makes the disassembly ugly.
Reported by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
IVRS can have entry of type legacy and non-legacy present at same time for same AMD-Vi device. ivhd driver will ignore legacy if new IVHD type is present as specified in AMD-Vi specification. Earlier both of IVHD entries used and two ivhd devices were created.
Add support for new IVHD type 0x11 and 0x40 in ACPI. Create new struct of type acpi_ivrs_hardware_new for these new type of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware.
Reviewed by: avg
Approved by: grehan
Differential Revision:https://reviews.freebsd.org/D13160
The idea is, the pmap_qenter() API is now defined to not produce executable
mappings. If you need executable mappings, use another API.
Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable.
Other architectures that support execute-specific permissons on page table
entries should subsequently be updated to match.
Submitted by: Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by: markj
Discussed with: alc, jhb, kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14062
compilation under FreeBSD. The mthca driver was temporarily removed as
part of the Linux 4.9 RoCE/infinband upgrade.
Top commit in Linux source tree:
69973b830859bc6529a7a0468ba0d80ee5117826
Sponsored by: Mellanox Technologies
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
makes them consistent with the way other page-table pages are allocated.
It also provides the rest of the VM system a good clue that these pages
are used.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14269
mappings for the pages used for the kernel and some initial allocations
used for the page table. It maps the kernel and the blocks used for
these initial allocations using 2MB pages.
However, if the kernel does not end on a 2MB boundary, it still maps the
last portion using a 2MB page, but reports that the unused 4K blocks
within this 2MB allocation are free physical blocks. This means that
these same physical blocks could also be mapped elsewhere - for example,
into a user process. Given the proximity to the kernel text and data
area, it seems wise to avoid allowing someone to write data to physical
blocks also mapped into these virtual addresses.
(Note that this isn't a security vulnerability: the direct map makes
most/all memory on the system mapped into kernel space. And, nothing
in the kernel should be trying to access these pages, as the virtual
addresses are unused. It simply seems wise to avoid reusing these
physical blocks while they are mapped to virtual addresses so close
to the kernel text and data area.)
Consequently, let's reserve the physical blocks covered by the
page-table mappings for these initial allocations.
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14268
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator. Clean these up so that new
architectures do not inherit whitespace issues.
Clean up shared Linuxolator files while here.
Sponsored by: Turing Robotic Industries Inc.
Branch Predictors) mitigation.
DOcument 336996-001 promises that CPUs which implement IBRS but not
STIBP silently ignore setting of the bit instead of trapping.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.
For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0. Current status can be checked in sysctl
hw.ibrs_active. The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14029
The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM. This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest. But with a purely software emulated vAPIC the
advantage is also a problem. The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that). Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would. The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered. This creates a window between the actual
interrupt delivery and the update of IRR and ISR. That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.
The described deviation has been observed to cause an interrupt loss in
the following scenario. vCPU0 posts an inter-processor interrupt to
vCPU1. The interrupt is injected as a virtual interrupt by the
hypervisor. The interrupt is delivered to a guest and an interrupt
handler is invoked. The handler performs a requested action and
acknowledges the request by modifying a global variable. So far, there
is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only
after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared
in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.
I saw several possibilities of fixing the problem. One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment. The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts. I opted to use the latter
approach for several reasons. It's equivalent to what VMM/Intel does
(in !VMX case). It appears to be what VirtualBox and KVM do. The code
is already there (to support legacy interrupts).
Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.
Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.
PR: 215972
Reported by: ajschot@hotmail.com, grehan
Tested by: anish, grehan, Nils Beyer <nbe@renzel.net>
Reviewed by: anish, grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13780
This avoids a nested page fault when obtaining a stack trace in DDB if
the address from the first frame does not resolve to a known symbol.
MFC after: 1 week
Sponsored by: Chelsio Communications
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.
I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.
Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address. Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.
On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.
I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.
Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc (some aspects)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D13985
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.
Approved by: kib
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
still active.
Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup. Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.
I peek this trick in some article about Linux implementation.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
DIfferential revision: https://reviews.freebsd.org/D13956
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:
We believe AMD processors are not susceptible due to our use of
privilege level protections within paging architecture and no
mitigation is required.
Thus default KPTI to off for AMD CPUs, and to on for others. This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.
Submitted by: Mitchell Horne
Reviewed by: cem
Relnotes: Yes
Security: CVE-2017-5754
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D13971
Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF. This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs. Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3. Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).
This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.
While here, bypass trap() entirely and just call mca_intr(). This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).
Reviewed by: kib
Tested by: avg (on AMD without PTI)
Differential Revision: https://reviews.freebsd.org/D13962
In the !PTI case the NMI handler jumped past the instructions that set
%rdi to point to the current PCB, but the target instructions assumed %rdi
were set.
Reviewed by: kib
Tested by: pho
Apparently machinde/cpu.h is supposed to contain MD implementations of
MI interfaces. Also, remove kernphys declaration from machdep.c,
since it is already provided by md_var.h.
Requested and reviewed by: bde
MFC after: 13 days
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context. One result is that hardware
watchpoints do not work reliably within a guest under VT-x.
Due to differences in SVM and VT-x, slightly different approaches are
used.
For VT-x:
- Enable debug register save/restore for VM entry/exit in the VMCS for
DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
%rflags for the host. Note that because DR6 is "software" managed
and not stored in the VMCS a kernel debugger which single steps
through VM entry could corrupt the guest DR6 (since a single step
trap taken after loading the guest DR6 could alter the DR6
register). To avoid this, explicitly disable single-stepping via
the trace flag before loading the guest DR6. A determined debugger
could still defeat this by setting a breakpoint after the guest DR6
was loaded and then single-stepping.
For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM
saves the guest DR6 in the VMCB, the race with single-stepping
described for VT-x does not exist.
For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.
Discussed with: avg, grehan
Tested by: avg (SVM), myself (VT-x)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D13229
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability. PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.
The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:
kernel text (but not modules text)
PCPU
GDT/IDT/user LDT/task structures
IST stacks for NMI and doublefault handlers.
Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords. Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.
IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.
On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed. The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame. Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().
Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps. They are registered using
pmap_pti_add_kva(). Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use. In principle, they can be installed into kernel page table per
pmap with some work. Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.
PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation. I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.
The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case. Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled. PCID can be forced on only for developer's benefit.
MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal. The fix is pending.
Reviewed by: markj (partially)
Tested by: pho (previous version)
Discussed with: jeff, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
We already clear %RFLAGS.DF on the kernel entry due to the compiler's
ABI requirements.
Suggested by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Hardware already did it for us due to the mask loaded into the
MSR_SF_MASK msr register.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13838
The symbol is just an offset in the hardware TSS structure, it is not
limited to the common_tss instance.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This is a followup to r307903.
struct svm_softc takes more than 200 kilobytes while what we really need
is 3 contiguous pages for I/O permission map and 2 contiguous pages for
MSR permission map. Other physically mapped structures have a size of
a single page, so a proper alignment is sufficient for their correct
mapping.
Thus, only the permission maps are allocated with contigmalloc now,
the softc is allocated with a regular malloc.
Additionally, this commit adds a check that malloc returns memory with the
expected page alignment and that contigmalloc does not fail.
Unfortunately, at present svm_vminit() is expected to always succeed and
there is no way to report an error.
So, a contigmalloc failure leads to a panic.
We should probably fix this.
MFC after: 2 weeks
Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of
cpu_features, cpu_features2, cpu_stdext_features, and
std_stdext_features2.
The intent is to allow the kernel to see the changes in the CPU
features after micocode update. Of course, the update is not atomic
across variables and not synchronized with readers. See the man page
warning as well.
Reviewed by: imp (previous version), jilles
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13770
It does not change anything in the behavior of trap_pfault(), while
eliminating obfuscation of jumping to the code which checks for the
condition reversed of the goto cause. Also avoid force initialize the
rv variable, since it is now only accessed after storing vm_fault()
return value.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13725
The entry must be logged "manually" using TSRAW rather than TSENTER
since PCPU data structures have not yet been initialized and thus
curthread cannot be accessed; &thread0 is what will become curthread
later in hammer_time.
Other MD initialization code should be similarly instrumented in order
to gain visibility into the time spent before entering mi_startup; this
will require some care and testing from people with access to such
hardware.
They provide relaxed-ordered atomic access semantic. Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses. The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.
The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations. It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.
Suggested by: jhb
Reviewed by: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13534
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.
MFC after: 2 weeks
This variable should be pure MI except possibly for reading it in MD
dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998. There were many imperfections in
r36441. This commit fixes only a small one, to simplify fixing the
others 1 arch at a time. (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
This is better than directly changing PCI configuration space of the
device because it makes the PCI bus aware of the configuration.
Also, the change allows to drop a bunch of code that duplicated
pci_enable_msi() functionality.
I wonder if it's possible to further simplify the code by using
pci_alloc_msi().
ivhd should attach after the root PCI bus and, thus, after the ACPI
Host-PCI bridge off which the bus hangs. This is because ivhd changes
PCI configuration of a PCI IOMMU device that is located on the root bus.
If the bus attaches after ivhd it clears the MSI portion of the
configuration. As a result IOMMU event interrupts would never be
delivered.
For regular ACPI devices the order is calculated as
ACPI_DEV_BASE_ORDER + level * 10
where level is a depth of the device in the ACPI namespace.
I expect the depth of the Host-PCI bridge to be two or three,
so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order
for ivhd.
This should fix the setup of the AMD-Vi event interrupt when vmm is
preloaded (as opposed to kldload-ed).
This ensures that we can receive further event interrupts.
See the description of the bits in the specification for
MMIO Offset 2020h IOMMU Status Register.
The bits are defined as set-by-hardware write-1-to-clear, same as all
the bits in the status register.
Discussed with: anish
Stop issuing pre-assigned number to enumerate all page table pages,
the assignment is incorrect. Instead automatically calculate the next
unused index. This index in fact does not serve any purpose except to
be unique to satisfy vm_page_grab() interface, we do not look up the
page by the index later.
Reported and tested by: emaste
Reviewed by: andrew
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
PR: 223906
Differential revision: https://reviews.freebsd.org/D13273
In the linux ENOADATA is frequently #defined as ENOATTR.
The change is required for an xattrs support implementation.
MFC after: 1 week
Discussed with: netchild
Approved by: pfg
Differential Revision: https://reviews.freebsd.org/D13221
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in
this place. Also use it for all of the other source files, so that the
difference remains as minimal as possible.
MFC after: 2 weeks
Ensure that an opening bracket always has a matching closing one.
Ensure that there is always a new-line at the end of a report line.
Also, add a space before the printed event flag.
Reviewed by: anish
The defined bits are the lower bits, not the higher ones.
Also, the specification has been extended to define bits 0:18 and they
all could potentially be interesting to us, so extend the width of the
field accordingly.
Reviewed by: anish
Many 8-byte entries have zero at byte 4, so the second 4-byte part is
skipped as a 4-byte padding entry. But not all 8-byte entries have that
property and they get misinterpreted.
A real example:
48 00 00 00 ff 01 00 01
This an 8-byte ACPI_IVRS_TYPE_SPECIAL entry for IOAPIC with ID 255 (bogus).
It is reported as:
ivhd0: Unknown dev entry:0xff
Fortunately, it was completely harmless.
Also, bail out early if we encounter an entry of a variable length type.
We do not have proper handling for those yet.
Reviewed by: anish
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).
Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.
Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.
To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.
Reviewed by: kib, jhibbits
Differential Revision: https://reviews.freebsd.org/D13180
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
In reclaim_pv_chunk(), rotate the pv chunks list so that next
invocations of the reclaim do not scan the same pv chunks that could
not be freed. Only do the rotation when there is no parallel scan,
tracked by active_reclaims counter.
To rotate, move all chunks that are before current iteration marker,
after another marker that is inserted at the list tail on start of the
reclaim.
Reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Some callers of fpusetregs()/npxsetregs(), most importantly
set_fpcontext(), clear reserved bits. But some did not. Do the
clearing in fpusetregs() and remove now redundand operation from
set_fpcontext().
Reported by: Maxime Villard <max@m00nbsd.net>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is needed for the HDA emulation with FreeBSD guests.
Reviewed by: marcelo
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12832
timer frequency a power of two. This changes the frequency from 10 to
16.7 MHz (2 ^ 24 HZ). Using a power of two avoids roundoff errors when
doing arithmetic in sbintime_t units.
Testing shows this can fix erratic ntpd behavior in guests using the
hpet timer (which is the default for multicore guests).
Reported by: bsam@
Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated. The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption. Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.
As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.
PR: 222937
Tested by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: grehan
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12661
While here cache-align chains.
This shortens longest found chain during poudriere -j 80 from 32 to 16.
Pushing this higher up will probably require allocation on boot.
HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.
Disable building LINT-VIMAGE with VIMAGE being default.
This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.
Requested by: many
Reviewed by: kristof, emaste, hiren
X-MFC after: never
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12639
Note that pv_list_locks is an array and currently it fits 2 locks per line.
Resizing it and/or putting more locks in different lines requires several tests.
MFC after: 1 week
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.
This is needed to implement support for kernel dump compression in the
MI kernel dump code.
Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11722