Commit Graph

463 Commits

Author SHA1 Message Date
Konstantin Belousov
2343757338 Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068.
SDM rev. 068 was released yesterday and it contains the description of
the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for
all bits present in the document, and decode them in the CPU
identification lines printed on boot.

But also, the document defines SSB_NO as bit 4, while FreeBSD used but
2 to detect the need to work-around Speculative Store Bypass
issue.  Change code to use the bit from SDM.

Similarly, the document describes bit 3 as an indicator that L1TF
issue is not present, in particular, no L1D flush is needed on
VMENTRY.  We used RDCL_NO to avoid flushing, and again I changed the
code to follow new spec from SDM.

In fact my Apollo Lake machine with latest ucode shows this:
    IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO>

Reviewed by:	bwidawsk
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18006
2018-11-16 21:27:11 +00:00
Marcelo Araujo
ec9e3fb095 Merge cases with upper block.
This is a cosmetic change only to simplify code.

Reported by:	anish
Sponsored by:	iXsystems Inc.
2018-10-31 01:27:44 +00:00
Marcelo Araujo
5bae7542d4 Emulate machine check related MSR_EXTFEATURES to allow guest OSes to
boot on AMD FX Series.

PR:		224476
Submitted by:	Keita Uchida <m@jgz.jp>
Reviewed by:	rgrimes
Sponsored by:	iXsystems Inc.
Differential Revision:	https://reviews.freebsd.org/D17713
2018-10-30 10:02:23 +00:00
Yuri Pankov
8d56c80545 Provide basic descriptions for VMX exit reason (from "Intel 64 and IA-32
Architectures Software Developer’s Manual Volume 3").  Add the document
to SEE ALSO in bhyve.8 (and pet manlint here a bit).

Reviewed by:	jhb, rgrimes, 0mp
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D17531
2018-10-27 21:24:28 +00:00
John Baldwin
de679f6efa Reload the LDT selector after an AMD-v #VMEXIT.
cpu_switch() always reloads the LDT, so this can only affect the
hypervisor process itself.  Fix this by explicitly reloading the host
LDT selector after each #VMEXIT.  The stock bhyve process on FreeBSD
never uses a custom LDT, so this change is cosmetic.

Reviewed by:	kib
Tested by:	Mike Tancsa <mike@sentex.net>
Approved by:	re (gjb)
MFC after:	2 weeks
2018-10-15 18:12:25 +00:00
Konstantin Belousov
78a3652794 bhyve: emulate CLFLUSH and CLFLUSHOPT.
Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR.
I do not see that anything useful can be done except emulating page
faults on invalid addresses.

Due to the instruction encoding pecularity, also emulate SFENCE.

PR:	232081
Reported by:	phk
Reviewed by:	araujo, avg, jhb (all: previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17482
2018-10-12 15:30:15 +00:00
John Baldwin
b843f9be5e Fully restore the GDTR, IDTR, and LDTR after VT-x VM exits.
The VT-x VMCS only stores the base address of the GDTR and IDTR.  As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR losing the smaller limits set in when the initial GDT is loaded
on each CPU during boot.  Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.

Similarly, explicitly save and restore the LDT selector.  VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.

PR:		230773
Reported by:	John Levon <levon@movementarian.org>
Reviewed by:	kib
Tested by:	araujo
Approved by:	re (rgrimes)
MFC after:	1 week
2018-10-11 18:27:19 +00:00
Andrew Turner
27d2645787 Handle a guest executing a vm instruction by trapping and raising an
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.

Found with:	syzkaller
Reviewed by:	araujo, tychon (previous version)
Approved by:	re (kib)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17192
2018-09-27 11:16:19 +00:00
Konstantin Belousov
c1141fba00 Update L1TF workaround to sustain L1D pollution from NMI.
Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D.  If NMI occurs after either of
methods is completed, but before VM entry, L1D becomes polluted with
the cache lines touched by NMI handlers.  There is no interesting data
which NMI accesses, but something sensitive might be co-located on the
same cache line, and then L1TF exposes that to a rogue guest.

Use VM entry MSR load list to ensure atomicity of L1D cache and VM
entry if updated microcode was loaded.  If only software flush method
is available, try to help the bhyve sw flusher by also flushing L1D on
NMI exit to kernel mode.

Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16790
2018-08-19 18:47:16 +00:00
Konstantin Belousov
c30578feeb Provide part of the mitigation for L1TF-VMM.
On the guest entry in bhyve, flush L1 data cache, using either L1D
flush command MSR if available, or by reading enough uninteresting
data to fill whole cache.

Flush is automatically enabled on CPUs which do not report RDCL_NO,
and can be disabled with the hw.vmm.l1d_flush tunable/kenv.

Security:	CVE-2018-3646
Reviewed by:	emaste. jhb, Tony Luck <tony.luck@intel.com>
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:29:41 +00:00
Marcelo Araujo
be963beee6 - Add the ability to run bhyve(8) within a jail(8).
This patch adds a new sysctl(8) knob "security.jail.vmm_allowed",
by default this option is disable.

Submitted by:	Shawn Webb <shawn.webb____hardenedbsd.org>
Reviewed by:	jamie@ and myself.
Relnotes:	Yes.
Sponsored by:	HardenedBSD and G2, Inc.
Differential Revision:	https://reviews.freebsd.org/D16057
2018-08-01 00:39:21 +00:00
Marcelo Araujo
ebc3c37c6f Add SPDX tags to vmm(4).
MFC after:	4 weeks.
Sponsored by:	iXsystems Inc.
2018-06-13 07:02:58 +00:00
John Baldwin
9e2154ff1c Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
Antoine Brodin
147d12a7d3 vmmdev: return EFAULT when trying to read beyond VM system memory max address
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.

Reviewed by:	imp, grehan, anish
Approved by:	grehan
Differential Revision:	https://reviews.freebsd.org/D15156
2018-05-15 17:20:58 +00:00
Peter Grehan
adb947a67a Use PCI power-mgmt to reset a device if FLR fails.
A large number of devices don't support PCIe FLR, in particular
graphics adapters. Use PCI power management to perform the
reset if FLR fails or isn't available, by cycling the device
through the D3 state.

This has been tested by a number of users with Nvidia and AMD GPUs.

Submitted and tested by: Matt Macy
Reviewed by:	jhb, imp, rgrimes
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D15268
2018-05-02 17:41:00 +00:00
Konstantin Belousov
b7941dc91e Correct undesirable interaction between caching of %cr4 in bhyve and
invltlb_glob().

Reviewed by:	grehan, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15138
2018-04-24 13:44:19 +00:00
Tycho Nightingale
6ac73777ea Add SDT probes to vmexit on Intel.
Submitted by:	domagoj.stolfa_gmail.com
Reviewed by:	grehan, tychon
Sponsored by:	DARPA/AFRL
Differential Revision:	https://reviews.freebsd.org/D14656
2018-04-13 17:23:05 +00:00
Rodney W. Grimes
01d822d33b Add the ability to control the CPU topology of created VMs
from userland without the need to use sysctls, it allows the old
sysctls to continue to function, but deprecates them at
FreeBSD_version 1200060 (Relnotes for deprecate).

The command line of bhyve is maintained in a backwards compatible way.
The API of libvmmapi is maintained in a backwards compatible way.
The sysctl's are maintained in a backwards compatible way.

Added command option looks like:
bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n]
The optional parts can be specified in any order, but only a single
integer invokes the backwards compatible parse.  [,maxcpus=n] is
hidden by #ifdef until kernel support is added, though the api
is put in place.

bhyvectl --get-cpu-topology option added.

Reviewed by:	grehan (maintainer, earlier version),
Reviewed by:	bcr (manpages)
Approved by:	bde (mentor), phk (mentor)
Tested by:	Oleg Ginzburg <olevole@olevole.ru> (cbsd)
MFC after:	1 week
Relnotes:	Y
Differential Revision:	https://reviews.freebsd.org/D9930
2018-04-08 19:24:49 +00:00
John Baldwin
fc276d92ae Add a way to temporarily suspend and resume virtual CPUs.
This is used as part of implementing run control in bhyve's debug
server.  The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor.  Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).

The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server).  The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs.  The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.

Reviewed by:	avg, grehan
Tested on:	Intel (jhb), AMD (avg)
Differential Revision:	https://reviews.freebsd.org/D14466
2018-04-06 22:03:43 +00:00
Tycho Nightingale
490768e24a Fix a lock recursion introduced in r327065.
Reported by:	kmacy
Reviewed by:	grehan, jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14548
2018-03-07 18:03:22 +00:00
Anish Gupta
9363000dfe Move the new AMD-Vi IVHD [ACPI_IVRS_HARDWARE_NEW]definitions added in r329360 in contrib ACPI to local files till ACPI code adds new definitions reported by jkim.
Rename ACPI_IVRS_HARDWARE_NEW to ACPI_IVRS_HARDWARE_EFRSUP, since new definitions add Extended Feature Register support.  Use IvrsType to distinguish three types of IVHD - 0x10(legacy), 0x11 and 0x40(with EFR). IVHD 0x40 is also called mixed type since it supports HID device entries.
Fix 2 coverity bugs reported by cem.

Reported by:jkim, cem
Approved by:grehan
Differential Revision://reviews.freebsd.org/D14501
2018-03-05 02:28:25 +00:00
John Baldwin
5f8754c077 Add a new variant of the GLA2GPA ioctl for use by the debug server.
Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify
the guest.  In particular, it does not inject any faults or modify
PTEs in the guest when performing an address space translation.

This is used by bhyve's debug server to read and write memory for
the remote debugger.

Reviewed by:	grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D14075
2018-02-26 19:19:05 +00:00
John Baldwin
4f8666989a Add two new ioctls to bhyve for batch register fetch/store operations.
These are a convenience for bhyve's debug server to use a single
ioctl for 'g' and 'G' rather than a loop of individual get/set
ioctl requests.

Reviewed by:	grehan
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D14074
2018-02-22 00:39:25 +00:00
Andriy Gapon
7b394c1066 move vintr_intercept_enabled under INVARIANTS
The function is not used outside of INVARIANTS since r328622.

MFC after:	1 week
2018-02-16 07:02:14 +00:00
Anish Gupta
0b37d3d90e This change fixes duplicate detection of same IOMMU/AMD-Vi device for Ryzen with EFR support.
IVRS can have entry of type legacy and non-legacy present at same time for same AMD-Vi device. ivhd driver will ignore legacy if new IVHD type is present as specified in AMD-Vi specification. Earlier both of IVHD entries used and two ivhd devices were created.
Add support for new IVHD type 0x11 and 0x40 in ACPI. Create new struct of type acpi_ivrs_hardware_new for these new type of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware.

Reviewed by:	avg
Approved by:	grehan
Differential Revision:https://reviews.freebsd.org/D13160
2018-02-16 05:17:00 +00:00
Tycho Nightingale
58a6aaf7ec Provide further mitigation against CVE-2017-5715 by flushing the
return stack buffer (RSB) upon returning from the guest.

This was inspired by this linux commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b

Reviewed by:	grehan
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14272
2018-02-12 14:45:27 +00:00
Andriy Gapon
6a8b7aa424 vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts
The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM.  This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest.  But with a purely software emulated vAPIC the
advantage is also a problem.  The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that).  Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would.  The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered.  This creates a window between the actual
interrupt delivery and the update of IRR and ISR.  That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.

The described deviation has been observed to cause an interrupt loss in
the following scenario.  vCPU0 posts an inter-processor interrupt to
vCPU1.  The interrupt is injected as a virtual interrupt by the
hypervisor.  The interrupt is delivered to a guest and an interrupt
handler is invoked.  The handler performs a requested action and
acknowledges the request by modifying a global variable.  So far, there
is no VMEXIT and the hypervisor is unaware of the events.  Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1.  Only
after that a VMEXIT of vCPU1 occurs.  At that time the vector is cleared
in the IRR and is set in the ISR.  vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.

I saw several possibilities of fixing the problem.  One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment.  The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts.  I opted to use the latter
approach for several reasons.  It's equivalent to what VMM/Intel does
(in !VMX case).  It appears to be what VirtualBox and KVM do.  The code
is already there (to support legacy interrupts).

Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.

Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.

PR:		215972
Reported by:	ajschot@hotmail.com, grehan
Tested by:	anish, grehan,  Nils Beyer <nbe@renzel.net>
Reviewed by:	anish, grehan
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D13780
2018-01-31 11:14:26 +00:00
John Baldwin
65eefbe422 Save and restore guest debug registers.
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context.  One result is that hardware
watchpoints do not work reliably within a guest under VT-x.

Due to differences in SVM and VT-x, slightly different approaches are
used.

For VT-x:

- Enable debug register save/restore for VM entry/exit in the VMCS for
  DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
  %rflags for the host.  Note that because DR6 is "software" managed
  and not stored in the VMCS a kernel debugger which single steps
  through VM entry could corrupt the guest DR6 (since a single step
  trap taken after loading the guest DR6 could alter the DR6
  register).  To avoid this, explicitly disable single-stepping via
  the trace flag before loading the guest DR6.  A determined debugger
  could still defeat this by setting a breakpoint after the guest DR6
  was loaded and then single-stepping.

For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host.  Since SVM
  saves the guest DR6 in the VMCB, the race with single-stepping
  described for VT-x does not exist.

For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.

Discussed with:	avg, grehan
Tested by:	avg (SVM), myself (VT-x)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D13229
2018-01-17 23:11:25 +00:00
Konstantin Belousov
bd50262f70 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
Tycho Nightingale
91fe5fe7e7 Provide some mitigation against CVE-2017-5715 by clearing registers
upon returning from the guest which aren't immediately clobbered by
the host.  This eradicates any remaining guest contents limiting their
usefulness in an exploit gadget.

This was inspired by this linux commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa

Reviewed by:	grehan, rgrimes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13573
2018-01-15 18:37:03 +00:00
Andriy Gapon
091da2dfa5 vmm/svm: contigmalloc of the whole svm_softc is excessive
This is a followup to r307903.

struct svm_softc takes more than 200 kilobytes while what we really need
is 3 contiguous pages for I/O permission map and 2 contiguous pages for
MSR permission map.  Other physically mapped structures have a size of
a single page, so a proper alignment is sufficient for their correct
mapping.

Thus, only the permission maps are allocated with contigmalloc now,
the softc is allocated with a regular malloc.

Additionally, this commit adds a check that malloc returns memory with the
expected page alignment and that contigmalloc does not fail.
Unfortunately, at present svm_vminit() is expected to always succeed and
there is no way to report an error.
So, a contigmalloc failure leads to a panic.
We should probably fix this.

MFC after:	2 weeks
2018-01-09 14:22:18 +00:00
Andriy Gapon
5f3c7d6580 Fix a couple of comments in AMD Virtual Machine Control Block structure
MFC after:	1 week
2018-01-05 19:15:24 +00:00
Tycho Nightingale
9e33a61693 Recognize a pending virtual interrupt while emulating the halt instruction.
Reviewed by:	grehan, rgrimes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13573
2017-12-21 18:30:11 +00:00
Andriy Gapon
a7437a3e9d amd-vi: set iommu msi configuration using pci_enable_msi method
This is better than directly changing PCI configuration space of the
device because it makes the PCI bus aware of the configuration.
Also, the change allows to drop a bunch of code that duplicated
pci_enable_msi() functionality.

I wonder if it's possible to further simplify the code by using
pci_alloc_msi().
2017-12-04 17:10:52 +00:00
Andriy Gapon
df92c28d6a vmm/amd: add ivhd device with a higher order
ivhd should attach after the root PCI bus and, thus, after the ACPI
Host-PCI bridge off which the bus hangs.  This is because ivhd changes
PCI configuration of a PCI IOMMU device that is located on the root bus.
If the bus attaches after ivhd it clears the MSI portion of the
configuration.  As a result IOMMU event interrupts would never be
delivered.

For regular ACPI devices the order is calculated as
    ACPI_DEV_BASE_ORDER + level * 10
where level is a depth of the device in the ACPI namespace.
I expect the depth of the Host-PCI bridge to be two or three,
so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order
for ivhd.

This should fix the setup of the AMD-Vi event interrupt when vmm is
preloaded (as opposed to kldload-ed).
2017-12-04 17:08:03 +00:00
Andriy Gapon
8f09494d1e amd-vi: clear event interrupt and overflow bits upon handling the interrupt
This ensures that we can receive further event interrupts.
See the description of the bits in the specification for
MMIO Offset 2020h IOMMU Status Register.
The bits are defined as set-by-hardware write-1-to-clear, same as all
the bits in the status register.

Discussed with:	anish
2017-12-04 17:02:53 +00:00
Pedro F. Giffuni
c49761dd57 sys/amd64: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:03:07 +00:00
Andriy Gapon
a3bbbd5e40 amd-vi: a small whitespace cleanup
Reviewed by:	anish
2017-11-24 11:37:41 +00:00
Andriy Gapon
685c54fc6a amd-vi: use correct type for pci_rid, start_dev_rid, end_dev_rid sysctls
Previously, the values could look confusing because of unrelated bits from
adjacent memory.

Reviewed by:	anish
2017-11-24 11:36:35 +00:00
Andriy Gapon
eb6c9c128c amd-vi: small improvements to event printing
Ensure that an opening bracket always has a matching closing one.
Ensure that there is always a new-line at the end of a report line.
Also, add a space before the printed event flag.

Reviewed by:	anish
2017-11-24 11:35:43 +00:00
Andriy Gapon
dee38cdc2a amd-vi: print some additional details for INVALID_DEVICE_REQUEST event
Namely, the type of the hardware event and whether the transaction
was a translation request.

Reviewed by:	anish
2017-11-24 11:34:46 +00:00
Andriy Gapon
53d580f984 amd-vi: fix up r326152, the new width requires a wider type
This is my brain-o from extending the width at the last moment.
2017-11-24 11:25:06 +00:00
Andriy Gapon
5a041f2183 amd-vi: fix and extend definition of Command and Event Status Register (0x2020)
The defined bits are the lower bits, not the higher ones.

Also, the specification has been extended to define bits 0:18 and they
all could potentially be interesting to us, so extend the width of the
field accordingly.

Reviewed by:	anish
2017-11-24 11:20:10 +00:00
Andriy Gapon
8523ad24ba vmm/amd: improve iteration over IVHD (type 10h) entries in IVRS table
Many 8-byte entries have zero at byte 4, so the second 4-byte part is
skipped as a 4-byte padding entry.  But not all 8-byte entries have that
property and they get misinterpreted.

A real example:
    48 00 00 00 ff 01 00 01
This an 8-byte ACPI_IVRS_TYPE_SPECIAL entry for IOAPIC with ID 255 (bogus).
It is reported as:
    ivhd0: Unknown dev entry:0xff
Fortunately, it was completely harmless.

Also, bail out early if we encounter an entry of a variable length type.
We do not have proper handling for those yet.

Reviewed by:	anish
2017-11-24 11:10:36 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Peter Grehan
9d210a4a18 Emulate the "OR reg, r/m" instruction (opcode 0BH).
This is	needed for the HDA emulation with FreeBSD guests.

Reviewed by:	marcelo
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12832
2017-11-01 03:26:53 +00:00
Ian Lepore
5deb1573e8 Improve the performance of the hpet timer in bhyve guests by making the
timer frequency a power of two.  This changes the frequency from 10 to
16.7 MHz (2 ^ 24 HZ).  Using a power of two avoids roundoff errors when
doing arithmetic in sbintime_t units.

Testing shows this can fix erratic ntpd behavior in guests using the
hpet timer (which is the default for multicore guests).

Reported by:	bsam@
2017-10-29 20:50:03 +00:00
John Baldwin
6db55a0f3a Rework pass through changes in r305485 to be safer.
Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated.  The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption.  Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.

As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.

PR:		222937
Tested by:	Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12661
2017-10-27 14:57:14 +00:00
Konstantin Belousov
9fc847133e Save KGSBASE in pcb before overriding it with the guest value.
Reported by:	lwhsu, mjoras
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	18 days
2017-08-24 10:49:53 +00:00
Ryan Libby
5228ad10d4 amd-vi: gcc build errors
amdvi_cmp_wait: gcc complained about a malformed string behind an ifdef.

struct amdvi_dte: widen the type of the first reserved bitfield so that
the packed representation would not cross an alignment boundary for that
type. Apparently that causes in-tree gcc (4.2) to insert padding
(despite packed, resulting in a wrong structure definition), and causes
more modern gcc to emit a warning.

ivrs_hdr_iterate_tbl: delete a misleading check about header length
being less than 0 (the type is unsigned) and replace it with a check
that the length doesn't exceed the table size.

Reviewed by:	anish, grehan
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11485
2017-07-07 06:37:19 +00:00