freebsd-dev

Author	SHA1	Message	Date
Corvin Köhne	6171e026be	bhyve: add support for MTRR Some guests or driver might depend on MTRR to work properly. E.g. the nvidia gpu driver won't work without MTRR. Reviewed by: markj MFC after: 2 weeks Sponsored by: Beckhoff Automation GmbH & Co. KG Differential Revision: https://reviews.freebsd.org/D33333	2022-01-14 12:41:44 +01:00
Peter Grehan	15add60d37	Convert vmm_ops calls to IFUNC There is no need for these to be function pointers since they are never modified post-module load. Rename AMD/Intel ops to be more consistent. Submitted by: adam_fenn.io Reviewed by: markj, grehan Approved by: grehan (bhyve) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D27375	2020-11-28 01:16:59 +00:00
Mark Johnston	8e2cbc5660	vmx: Implement pmap (de)activation in C Rewrite the code that maintains pm_active and invalidates EPTP-tagged TLB entries in C. Previously this work was done in vmx_enter_guest(), in assembly, but there is no good reason for that and it makes the TLB invalidation algorithm for nested page tables harder to review. No functional change intended. Now, an error from the invept instruction results in a kernel panic rather than a vmexit. Such errors should occur only as a result of VMM bugs. Reviewed by: grehan, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26830	2020-10-19 15:24:35 +00:00
Peter Grehan	f5f5f1e7d6	Support guest rdtscp and rdpid instructions on Intel VT-x Enable any of rdtscp and/or rdpid for bhyve guests on Intel-based hosts that support the "enable RDTSCP" VM-execution control. Submitted by: adam_fenn.io Reported by: chuck Reviewed by: chuck, grehan, jhb Approved by: jhb (bhyve), grehan MFC after: 3 weeks Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D26003	2020-08-18 07:23:47 +00:00
John Baldwin	cbd03a9df2	Support software breakpoints in the debug server on Intel CPUs. - Allow the userland hypervisor to intercept breakpoint exceptions (BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to enable this feature. These exceptions are reported to userland via a new VM_EXITCODE_BPT that includes the length of the original breakpoint instruction. If userland wishes to pass the exception through to the guest, it must be explicitly re-injected via vm_inject_exception(). - Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH pseudo-register. Injecting a BP# on Intel requires setting this to the length of the breakpoint instruction. AMD SVM currently ignores writes to this register (but reports success) and fails to read it. - Rework the per-vCPU state tracked by the debug server. Rather than a single 'stepping_vcpu' global, add a structure for each vCPU that tracks state about that vCPU ('stepping', 'stepped', and 'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is currently reporting an event. Event handlers for MTRAP and breakpoint exits loop until the associated event is reported to the debugger. Breakpoint events are discarded if the breakpoint is not present when a vCPU resumes in the breakpoint handler to retry submitting the breakpoint event. - Maintain a linked-list of active breakpoints in response to the GDB 'Z0' and 'z0' packets. Reviewed by: markj (earlier version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D20309	2019-12-13 19:21:58 +00:00
Tycho Nightingale	58a6aaf7ec	Provide further mitigation against CVE-2017-5715 by flushing the return stack buffer (RSB) upon returning from the guest. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b Reviewed by: grehan Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14272	2018-02-12 14:45:27 +00:00
John Baldwin	65eefbe422	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229	2018-01-17 23:11:25 +00:00
Pedro F. Giffuni	c49761dd57	sys/amd64: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:03:07 +00:00
Tycho Nightingale	277bdd9950	Support guest writes to the TSC by enabling the "use TSC offsetting" execution control and writing the difference between the host TSC and the guest TSC into the TSC offset in the VMCS upon encountering a write. Reviewed by: neel	2015-06-09 00:14:47 +00:00
Neel Natu	a318f7ddb2	Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore capability of VT-x. This lets bhyve run nested in older VMware versions that don't support the PAT save/restore capability. Note that the actual value programmed by the guest in MSR_PAT is irrelevant because bhyve sets the 'Ignore PAT' bit in the nested PTE. Reported by: marcel Tested by: Leon Dang (ldang@nahannisys.com) Sponsored by: Nahanni Systems MFC after: 2 weeks	2015-02-24 05:35:15 +00:00
Neel Natu	2ce1242309	Clear blocking due to STI or MOV SS in the hypervisor when an instruction is emulated or when the vcpu incurs an exception. This matches the CPU behavior. Remove special case code in HLT processing that was clearing the interrupt shadow. This is now redundant because the interrupt shadow is always cleared when the vcpu is resumed after an instruction is emulated. Reported by: David Reed (david.reed@tidalscale.com) MFC after: 2 weeks	2015-01-06 19:04:02 +00:00
Neel Natu	c3498942a5	Restructure the MSR handling so it is entirely handled by processor-specific code. There are only a handful of MSRs common between the two so there isn't too much duplicate functionality. The VT-x code has the following types of MSRs: - MSRs that are unconditionally saved/restored on every guest/host context switch (e.g., MSR_GSBASE). - MSRs that are restored to guest values on entry to vmx_run() and saved before returning. This is an optimization for MSRs that are not used in host kernel context (e.g., MSR_KGSBASE). - MSRs that are emulated and every access by the guest causes a trap into the hypervisor (e.g., MSR_IA32_MISC_ENABLE). Reviewed by: grehan	2014-09-20 02:35:21 +00:00
Peter Grehan	897bb47e7b	Make the vmx asm code dtrace-fbt-friendly by - inserting frame enter/leave sequences - restructuring the vmx_enter_guest routine so that it subsumes the vm_exit_guest block, which was the #vmexit RIP and not a callable routine. Reviewed by: neel MFC after: 3 weeks	2014-05-18 03:50:17 +00:00
Neel Natu	81d597b736	There is no need to save and restore the host's return address in the 'struct vmxctx'. It is preserved on the host stack across a guest entry and exit and just restoring the host's '%rsp' is sufficient. Pointed out by: grehan@	2014-04-11 20:15:53 +00:00
Neel Natu	dc50650607	Queue pending exceptions in the 'struct vcpu' instead of directly updating the processor-specific VMCS or VMCB. The pending exception will be delivered right before entering the guest. The order of event injection into the guest is: - hardware exception - NMI - maskable interrupt In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt window-exiting and inject it as soon as possible after the hardware exception is injected. Also since interrupts are inherently asynchronous, injecting them after the hardware exception should not affect correctness from the guest perspective. Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict it to only deliver x86 hardware exceptions. This new ioctl is now used to inject a protection fault when the guest accesses an unimplemented MSR. Discussed with: grehan, jhb Reviewed by: jhb	2014-02-26 00:52:05 +00:00
John Baldwin	a0efd3fb34	A first pass at adding support for injecting hardware exceptions for emulated instructions. - Add helper routines to inject interrupt information for a hardware exception from the VM exit callback routines. - Use the new routines to inject GP and UD exceptions for invalid operations when emulating the xsetbv instruction. - Don't directly manipulate the entry interrupt info when a user event is injected. Instead, store the event info in the vmx state and only apply it during a VM entry if a hardware exception or NMI is not already pending. - While here, use HANDLED/UNHANDLED instead of 1/0 in a couple of routines. Reviewed by: neel	2014-02-18 03:07:36 +00:00
Neel Natu	953c2c47eb	Avoid doing unnecessary nested TLB invalidations. Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu and per-hostcpu. In the degenerate case where 'N' vcpus were sharing a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations. Since an 'invept' invalidates mappings for all VPIDs the first 'invept' is sufficient. Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'. If it is known that an 'invept' is going to be done before entering the guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED counts the number of 'invvpid' invalidations that were avoided because they were subsumed by an 'invept'. Discussed with: grehan	2014-02-04 02:45:08 +00:00
Neel Natu	176666c2c9	Enable "Posted Interrupt Processing" if supported by the CPU. This lets us inject interrupts into the guest without causing a VM-exit. This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir" to "0". The following sysctls provide information about this feature: - hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled) - hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification) Tested on a Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.	2014-01-11 04:22:00 +00:00
Neel Natu	f7d4742540	Enable the "Acknowledge Interrupt on VM exit" VM-exit control. This control is needed to enable "Posted Interrupts" and is present in all the Intel VT-x implementations supported by bhyve so enable it as the default. With this VM-exit control enabled the processor will acknowledge the APIC and store the vector number in the "VM-Exit Interruption Information" field. We now call the interrupt handler "by hand" through the IDT entry associated with the vector.	2014-01-11 03:14:05 +00:00
Neel Natu	0492757c70	Restructure the VMX code to enter and exit the guest. In large part this change hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used to enter guest context and vmx_exit_guest() is used to transition back into host context. Fix a longstanding race where a vcpu interrupt notification might be ignored if it happens after vmx_inject_interrupts() but before host interrupts are disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with host interrupts disabled to prevent this. Suggested by: grehan@	2014-01-01 21:17:08 +00:00
Neel Natu	de5ea6b65e	vlapic code restructuring to make it easy to support hardware-assist for APIC emulation. The vlapic initialization and cleanup is done via processor specific vmm_ops. This will allow the VT-x/SVM modules to layer any hardware-assist for APIC emulation or virtual interrupt delivery on top of the vlapic device model. Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic interrupts versus other events (e.g. NMI). This provides an opportunity to use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM) to deliver an interrupt to a guest without causing a VM-exit. Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the vlapic_xxx() counterparts directly. Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'. The 'Apic Page' is intended to be referenced from the Intel VMCS as the 'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.	2013-12-25 06:46:31 +00:00
Neel Natu	49cc03da31	Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose 'invpcid' instruction to the guest. Currently bhyve will try to enable this capability unconditionally if it is available. Consolidate code in bhyve to set the capabilities so it is no longer duplicated in BSP and AP bringup. Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid' instruction is available. Reviewed by: grehan MFC after: 3 days	2013-10-16 18:20:27 +00:00
Neel Natu	318224bbe6	Merge projects/bhyve_npt_pmap into head. Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory. Guest vmspace: Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT. pmap types: The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest. Page Table Entries: The EPT page table entries as mostly similar in functionality to regular page table entries although there are some differences in terms of what bits are used to express that functionality. For e.g. the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M that was previously a macro now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'. An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem. This is achieved by using the following mapping for EPT entries that need emulation of A/D bits: Bit Position Interpreted By PG_V 52 software (accessed bit emulation handler) PG_RW 53 software (dirty bit emulation handler) PG_A 0 hardware (aka EPT_PG_RD) PG_M 1 hardware (aka EPT_PG_WR) The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@). The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap. TLB invalidation: The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB. Guest memory access: Since the guest memory is no longer wired we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose. PCI passthru: Guest's with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that have one or more PCI passthru devices attached to them. Limitations: There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE all guest mappings become automatically executable. Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice. Approved by: re Discussed with: grehan Reviewed by: alc, kib Tested by: pho	2013-10-05 21:22:35 +00:00
Neel Natu	f352ff0ca8	Maintain state regarding NMI delivery to guest vcpu in VT-x independent manner. Also add a stats counter to count the number of NMIs delivered per vcpu. Obtained from: NetApp	2012-10-24 02:54:21 +00:00
Neel Natu	eeefa4e4be	Test for AST pending with interrupts disabled right before entering the guest. If an IPI was delivered to this cpu before interrupts were disabled then return right away via vmx_setjmp() with a return value of VMX_RETURN_AST. Obtained from: NetApp	2012-10-23 02:20:42 +00:00
Neel Natu	ad54f37429	Fix a long standing bug in VMXCTX_GUEST_RESTORE(). There was an assumption by the "callers" of this macro that on "return" the %rsp will be pointing to the 'vmxctx'. The macro was not doing this and thus when trying to restore host state on an error from "vmlaunch" or "vmresume" we were treating the memory locations on the host stack as 'struct vmxctx'. This led to all sorts of weird bugs like double faults or invalid instruction faults. This bug is exposed by the -O2 option used to compile the kernel module. With the -O2 flag the compiler will optimize the following piece of code: int loopstart = 1; ... if (loopstart) { loopstart = 0; vmx_launch(); } else vmx_resume(); into this: vmx_launch(); Since vmx_launch() and vmx_resume() are declared to be __dead2 functions the compiler is free to do this. The compiler has no way to know that the functions return indirectly through vmx_setjmp(). This optimization in turn leads us to trigger the bug in VMXCTX_GUEST_RESTORE(). With this change we can boot a 8.1 guest on a 9.0 host. Reported by: jhb@	2011-05-20 03:23:09 +00:00
Peter Grehan	366f60834f	Import of bhyve hypervisor and utilities, part 1. vmm.ko - kernel module for VT-x, VT-d and hypervisor control bhyve - user-space sequencer and i/o emulation vmmctl - dump of hypervisor register state libvmm - front-end to vmm.ko chardev interface bhyve was designed and implemented by Neel Natu. Thanks to the following folk from NetApp who helped to make this available: Joe CaraDonna Peter Snyder Jeff Heller Sandeep Mann Steve Miller Brian Pawlowski	2011-05-13 04:54:01 +00:00

27 Commits