Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text. To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders. Additional files
waiting on permission from others are listed in review D14210.
Approved by: kan, marcel, sos, rdivacky
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Small global symbols confuse ddb which matches them against small
unrelated displacements and makes the disassembly ugly.
Reported by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
IVRS can have entry of type legacy and non-legacy present at same time for same AMD-Vi device. ivhd driver will ignore legacy if new IVHD type is present as specified in AMD-Vi specification. Earlier both of IVHD entries used and two ivhd devices were created.
Add support for new IVHD type 0x11 and 0x40 in ACPI. Create new struct of type acpi_ivrs_hardware_new for these new type of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware.
Reviewed by: avg
Approved by: grehan
Differential Revision:https://reviews.freebsd.org/D13160
The idea is, the pmap_qenter() API is now defined to not produce executable
mappings. If you need executable mappings, use another API.
Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable.
Other architectures that support execute-specific permissons on page table
entries should subsequently be updated to match.
Submitted by: Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by: markj
Discussed with: alc, jhb, kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14062
compilation under FreeBSD. The mthca driver was temporarily removed as
part of the Linux 4.9 RoCE/infinband upgrade.
Top commit in Linux source tree:
69973b830859bc6529a7a0468ba0d80ee5117826
Sponsored by: Mellanox Technologies
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
makes them consistent with the way other page-table pages are allocated.
It also provides the rest of the VM system a good clue that these pages
are used.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14269
mappings for the pages used for the kernel and some initial allocations
used for the page table. It maps the kernel and the blocks used for
these initial allocations using 2MB pages.
However, if the kernel does not end on a 2MB boundary, it still maps the
last portion using a 2MB page, but reports that the unused 4K blocks
within this 2MB allocation are free physical blocks. This means that
these same physical blocks could also be mapped elsewhere - for example,
into a user process. Given the proximity to the kernel text and data
area, it seems wise to avoid allowing someone to write data to physical
blocks also mapped into these virtual addresses.
(Note that this isn't a security vulnerability: the direct map makes
most/all memory on the system mapped into kernel space. And, nothing
in the kernel should be trying to access these pages, as the virtual
addresses are unused. It simply seems wise to avoid reusing these
physical blocks while they are mapped to virtual addresses so close
to the kernel text and data area.)
Consequently, let's reserve the physical blocks covered by the
page-table mappings for these initial allocations.
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14268
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator. Clean these up so that new
architectures do not inherit whitespace issues.
Clean up shared Linuxolator files while here.
Sponsored by: Turing Robotic Industries Inc.
Branch Predictors) mitigation.
DOcument 336996-001 promises that CPUs which implement IBRS but not
STIBP silently ignore setting of the bit instead of trapping.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.
For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0. Current status can be checked in sysctl
hw.ibrs_active. The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14029
The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM. This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest. But with a purely software emulated vAPIC the
advantage is also a problem. The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that). Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would. The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered. This creates a window between the actual
interrupt delivery and the update of IRR and ISR. That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.
The described deviation has been observed to cause an interrupt loss in
the following scenario. vCPU0 posts an inter-processor interrupt to
vCPU1. The interrupt is injected as a virtual interrupt by the
hypervisor. The interrupt is delivered to a guest and an interrupt
handler is invoked. The handler performs a requested action and
acknowledges the request by modifying a global variable. So far, there
is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only
after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared
in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.
I saw several possibilities of fixing the problem. One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment. The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts. I opted to use the latter
approach for several reasons. It's equivalent to what VMM/Intel does
(in !VMX case). It appears to be what VirtualBox and KVM do. The code
is already there (to support legacy interrupts).
Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.
Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.
PR: 215972
Reported by: ajschot@hotmail.com, grehan
Tested by: anish, grehan, Nils Beyer <nbe@renzel.net>
Reviewed by: anish, grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13780
This avoids a nested page fault when obtaining a stack trace in DDB if
the address from the first frame does not resolve to a known symbol.
MFC after: 1 week
Sponsored by: Chelsio Communications
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.
I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.
Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address. Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.
On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.
I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.
Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc (some aspects)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D13985
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.
Approved by: kib
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
still active.
Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup. Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.
I peek this trick in some article about Linux implementation.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
DIfferential revision: https://reviews.freebsd.org/D13956
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:
We believe AMD processors are not susceptible due to our use of
privilege level protections within paging architecture and no
mitigation is required.
Thus default KPTI to off for AMD CPUs, and to on for others. This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.
Submitted by: Mitchell Horne
Reviewed by: cem
Relnotes: Yes
Security: CVE-2017-5754
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D13971
Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF. This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs. Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3. Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).
This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.
While here, bypass trap() entirely and just call mca_intr(). This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).
Reviewed by: kib
Tested by: avg (on AMD without PTI)
Differential Revision: https://reviews.freebsd.org/D13962
In the !PTI case the NMI handler jumped past the instructions that set
%rdi to point to the current PCB, but the target instructions assumed %rdi
were set.
Reviewed by: kib
Tested by: pho
Apparently machinde/cpu.h is supposed to contain MD implementations of
MI interfaces. Also, remove kernphys declaration from machdep.c,
since it is already provided by md_var.h.
Requested and reviewed by: bde
MFC after: 13 days
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context. One result is that hardware
watchpoints do not work reliably within a guest under VT-x.
Due to differences in SVM and VT-x, slightly different approaches are
used.
For VT-x:
- Enable debug register save/restore for VM entry/exit in the VMCS for
DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
%rflags for the host. Note that because DR6 is "software" managed
and not stored in the VMCS a kernel debugger which single steps
through VM entry could corrupt the guest DR6 (since a single step
trap taken after loading the guest DR6 could alter the DR6
register). To avoid this, explicitly disable single-stepping via
the trace flag before loading the guest DR6. A determined debugger
could still defeat this by setting a breakpoint after the guest DR6
was loaded and then single-stepping.
For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM
saves the guest DR6 in the VMCB, the race with single-stepping
described for VT-x does not exist.
For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.
Discussed with: avg, grehan
Tested by: avg (SVM), myself (VT-x)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D13229
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability. PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.
The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:
kernel text (but not modules text)
PCPU
GDT/IDT/user LDT/task structures
IST stacks for NMI and doublefault handlers.
Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords. Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.
IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.
On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed. The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame. Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().
Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps. They are registered using
pmap_pti_add_kva(). Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use. In principle, they can be installed into kernel page table per
pmap with some work. Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.
PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation. I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.
The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case. Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled. PCID can be forced on only for developer's benefit.
MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal. The fix is pending.
Reviewed by: markj (partially)
Tested by: pho (previous version)
Discussed with: jeff, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon