freebsd-nq

Author	SHA1	Message	Date
Robert Wing	39d87a0235	vmm: fix "set but not used" warnings	2022-02-28 14:55:37 -09:00
Corvin Köhne	6171e026be	bhyve: add support for MTRR Some guests or driver might depend on MTRR to work properly. E.g. the nvidia gpu driver won't work without MTRR. Reviewed by: markj MFC after: 2 weeks Sponsored by: Beckhoff Automation GmbH & Co. KG Differential Revision: https://reviews.freebsd.org/D33333	2022-01-14 12:41:44 +01:00
Andrew Turner	b792434150	Create sys/reg.h for the common code previously in machine/reg.h Move the common kernel function signatures from machine/reg.h to a new sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2). Reviewed by: imp, markj Sponsored by: DARPA, AFRL (original work) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19830	2021-08-30 12:50:53 +01:00
Mark Johnston	41335c6b7f	vmm: Make iommu ops tables const While here, use designated initializers and rename some AMD iommu method implementations to match the corresponding op names. No functional change intended. Reviewed by: grehan MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31462	2021-08-09 13:28:27 -04:00
Mark Johnston	e54ae8258d	amd64: Fix output operand specs for the stmxcsr and vmread intrinsics This does not appear to affect code generation, at least with the default toolchain. Noticed because incorrect output specifications lead to false positives from KMSAN, as the instrumentation uses them to update shadow state for output operands. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31466	2021-08-09 13:28:08 -04:00
Peter Grehan	15add60d37	Convert vmm_ops calls to IFUNC There is no need for these to be function pointers since they are never modified post-module load. Rename AMD/Intel ops to be more consistent. Submitted by: adam_fenn.io Reviewed by: markj, grehan Approved by: grehan (bhyve) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D27375	2020-11-28 01:16:59 +00:00
Mark Johnston	6f5a960678	vmm: Make pmap_invalidate_ept() wait synchronously for guest exits Currently EPT TLB invalidation is done by incrementing a generation counter and issuing an IPI to all CPUs currently running vCPU threads. The VMM inner loop caches the most recently observed generation on each host CPU and invalidates TLB entries before executing the VM if the cached generation number is not the most recent value. pmap_invalidate_ept() issues IPIs to force each vCPU to stop executing guest instructions and reload the generation number. However, it does not actually wait for vCPUs to exit, potentially creating a window where guests may continue to reference stale TLB entries. Fix the problem by bracketing guest execution with an SMR read section which is entered before loading the invalidation generation. Then, pmap_invalidate_ept() increments the current write sequence before loading pm_active and sending IPIs, and polls readers to ensure that all vCPUs potentially operating with stale TLB entries have exited before pmap_invalidate_ept() returns. Also ensure that unsynchronized loads of the generation counter are wrapped with atomic(9), and stop (inconsistently) updating the invalidation counter and pm_active bitmask with acquire semantics. Reviewed by: grehan, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26910	2020-11-11 15:01:17 +00:00
Mark Johnston	8e2cbc5660	vmx: Implement pmap (de)activation in C Rewrite the code that maintains pm_active and invalidates EPTP-tagged TLB entries in C. Previously this work was done in vmx_enter_guest(), in assembly, but there is no good reason for that and it makes the TLB invalidation algorithm for nested page tables harder to review. No functional change intended. Now, an error from the invept instruction results in a kernel panic rather than a vmexit. Such errors should occur only as a result of VMM bugs. Reviewed by: grehan, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26830	2020-10-19 15:24:35 +00:00
John Baldwin	a3f2a9c57e	Clear the upper 32-bits of registers in x86_emulate_cpuid(). Per the Intel manuals, CPUID is supposed to unconditionally zero the upper 32 bits of the involved (rax/rbx/rcx/rdx) registers. Previously, the emulation would cast pointers to the 64-bit register values down to `uint32_t`, which while properly manipulating the lower bits, would leave any garbage in the upper bits uncleared. While no existing guest OSes seem to stumble over this in practice, the bhyve emulation should match x86 expectations. This was discovered through alignment warnings emitted by gcc9, while testing it against SmartOS/bhyve. SmartOS bug: https://smartos.org/bugview/OS-8168 Submitted by: Patrick Mooney Reviewed by: rgrimes Differential Revision: https://reviews.freebsd.org/D24727	2020-10-01 16:45:11 +00:00
Ed Maste	09860d44e4	bhyve: do not permit write access to VMCB / VMCS Reported by: Patrick Mooney Submitted by: jhb Security: CVE-2020-24718	2020-09-15 21:04:27 +00:00
Mateusz Guzik	543769bf83	amd64: clean up empty lines in .c and .h files	2020-09-01 21:16:54 +00:00
Konstantin Belousov	f3eb12e4a6	Add bhyve support for LA57 guest mode. Noted and reviewed by: grehan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25273	2020-08-23 20:37:21 +00:00
Konstantin Belousov	9ce875d9b5	amd64 pmap: LA57 AKA 5-level paging Since LA57 was moved to the main SDM document with revision 072, it seems that we should have a support for it, and silicons are coming. This patch makes pmap support both LA48 and LA57 hardware. The selection of page table level is done at startup, kernel always receives control from loader with 4-level paging. It is not clear how UEFI spec would adapt LA57, for instance it could hand out control in LA57 mode sometimes. To switch from LA48 to LA57 requires turning off long mode, requesting LA57 in CR4, then re-entering long mode. This is somewhat delicate and done in pmap_bootstrap_la57(). AP startup in LA57 mode is much easier, we only need to toggle a bit in CR4 and load right value in CR3. I decided to not change kernel map for now. Single PML5 entry is created that points to the existing kernel_pml4 (KML4Phys) page, and a pml5 entry to create our recursive mapping for vtopte()/vtopde(). This decision is motivated by the fact that we cannot overcommit for KVA, so large space there is unusable until machines start providing wider physical memory addressing. Another reason is that I do not want to break our fragile autotuning, so the KVA expansion is not included into this first step. Nice side effect is that minidumps are compatible. On the other hand, (very) large address space is definitely immediately useful for some userspace applications. For userspace, numbering of pte entries (or page table pages) is always done for 5-level structures even if we operate in 4-level mode. The pmap_is_la57() function is added to report the mode of the specified pmap, this is done not to allow simultaneous 4-/5-levels (which is not allowed by hw), but to accomodate for EPT which has separate level control and in principle might not allow 5-leve EPT despite x86 paging supports it. Anyway, it does not seems critical to have 5-level EPT support now. Tested by: pho (LA48 hardware) Reviewed by: alc Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25273	2020-08-23 20:19:04 +00:00
Peter Grehan	3a3f1e9dfa	Export a routine to provide the TSC_AUX MSR value and use this in vmm. Also, drop an unnecessary set of braces. Requested by: kib Reviewed by: kib MFC after: 3 weeks	2020-08-18 11:36:38 +00:00
Peter Grehan	f5f5f1e7d6	Support guest rdtscp and rdpid instructions on Intel VT-x Enable any of rdtscp and/or rdpid for bhyve guests on Intel-based hosts that support the "enable RDTSCP" VM-execution control. Submitted by: adam_fenn.io Reported by: chuck Reviewed by: chuck, grehan, jhb Approved by: jhb (bhyve), grehan MFC after: 3 weeks Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D26003	2020-08-18 07:23:47 +00:00
John Baldwin	483d953a86	Initial support for bhyve save and restore. Save and restore (also known as suspend and resume) permits a snapshot to be taken of a guest's state that can later be resumed. In the current implementation, bhyve(8) creates a UNIX domain socket that is used by bhyvectl(8) to send a request to save a snapshot (and optionally exit after the snapshot has been taken). A snapshot currently consists of two files: the first holds a copy of guest RAM, and the second file holds other guest state such as vCPU register values and device model state. To resume a guest, bhyve(8) must be started with a matching pair of command line arguments to instantiate the same set of device models as well as a pointer to the saved snapshot. While the current implementation is useful for several uses cases, it has a few limitations. The file format for saving the guest state is tied to the ABI of internal bhyve structures and is not self-describing (in that it does not communicate the set of device models present in the system). In addition, the state saved for some device models closely matches the internal data structures which might prove a challenge for compatibility of snapshot files across a range of bhyve versions. The file format also does not currently support versioning of individual chunks of state. As a result, the current file format is not a fixed binary format and future revisions to save and restore will break binary compatiblity of snapshot files. The goal is to move to a more flexible format that adds versioning, etc. and at that point to commit to providing a reasonable level of compatibility. As a result, the current implementation is not enabled by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option for userland builds, and the kernel option BHYVE_SHAPSHOT. Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz Relnotes: yes Sponsored by: University Politehnica of Bucharest Sponsored by: Matthew Grooms (student scholarships) Sponsored by: iXsystems Differential Revision: https://reviews.freebsd.org/D19495	2020-05-05 00:02:04 +00:00
Michael Reifenberger	1bc51bad2b	Untangle TPR shadowing and APIC virtualization. This speeds up Windows guests tremendously. The patch does: Add a new tuneable 'hw.vmm.vmx.use_tpr_shadowing' to disable TLP shadowing. Also add 'hw.vmm.vmx.cap.tpr_shadowing' to be able to query if TPR shadowing is used. Detach the initialization of TPR shadowing from the initialization of APIC virtualization. APIC virtualization still needs TPR shadowing, but not vice versa. Any CPU that supports APIC virtualization should also support TPR shadowing. When TPR shadowing is used, the APIC page of each vCPU is written to the VMCS_VIRTUAL_APIC field of the VMCS so that the CPU can write directly to the page without intercept. On vm exit, vlapic_update_ppr() is called to update the PPR. Submitted by: Yamagi Burmeister MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22942	2020-03-10 16:53:49 +00:00
Pawel Biernacki	b40598c539	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (4 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Reviewed by: kib Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23625 X-Generally looks fine: jhb	2020-02-15 18:57:49 +00:00
John Baldwin	cbd03a9df2	Support software breakpoints in the debug server on Intel CPUs. - Allow the userland hypervisor to intercept breakpoint exceptions (BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to enable this feature. These exceptions are reported to userland via a new VM_EXITCODE_BPT that includes the length of the original breakpoint instruction. If userland wishes to pass the exception through to the guest, it must be explicitly re-injected via vm_inject_exception(). - Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH pseudo-register. Injecting a BP# on Intel requires setting this to the length of the breakpoint instruction. AMD SVM currently ignores writes to this register (but reports success) and fails to read it. - Rework the per-vCPU state tracked by the debug server. Rather than a single 'stepping_vcpu' global, add a structure for each vCPU that tracks state about that vCPU ('stepping', 'stepped', and 'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is currently reporting an event. Event handlers for MTRAP and breakpoint exits loop until the associated event is reported to the debugger. Breakpoint events are discarded if the breakpoint is not present when a vCPU resumes in the breakpoint handler to retry submitting the breakpoint event. - Maintain a linked-list of active breakpoints in response to the GDB 'Z0' and 'z0' packets. Reviewed by: markj (earlier version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D20309	2019-12-13 19:21:58 +00:00
Mark Johnston	d3588766e1	Correct the scope of several global variables. They are accessed from multiple compilation units. No functional change intended. MFC after: 1 week Sponsored by: Netflix	2019-09-27 21:04:33 +00:00
Mark Johnston	13a7c4d478	Use designated initializers for vmm_ops. MFC after: 3 days	2019-08-07 19:45:44 +00:00
Ed Maste	490d56c527	vmx: use C99 bool, not boolean_t Bhyve's vmm is a self-contained modern component and thus a good candidate for use of C99 types. Reviewed by: jhb, kib, markj, Patrick Mooney MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21036	2019-08-01 02:16:48 +00:00
Scott Long	da761f3b1f	Implement VT-d capability detection on chipsets that have multiple translation units with differing capabilities From the author via Bugzilla: --- When an attempt is made to passthrough a PCI device to a bhyve VM (causing initialisation of IOMMU) on certain Intel chipsets using VT-d the PCI bus stops working entirely. This issue occurs on the E3-1275 v5 processor on C236 chipset and has also been encountered by others on the forums with different hardware in the Skylake series. The chipset has two VT-d translation units. The issue is caused by an attempt to use the VT-d device-IOTLB capability that is supported by only the first unit for devices attached to the second unit which lacks that capability. Only the capabilities of the first unit are checked and are assumed to be the same for all units. Attached is a patch to rectify this issue by determining which unit is responsible for the device being added to a domain and then checking that unit's device-IOTLB capability. In addition to this a few fixes have been made to other instances where the first unit's capabilities are assumed for all units for domains they share. In these cases a mutual set of capabilities is determined. The patch should hopefully fix any bugs for current/future hardware with multiple translation units supporting different capabilities. A description is on the forums at https://forums.freebsd.org/threads/pci-passthrough-bhyve-usb-xhci.65235 The thread includes observations by other users of the bug occurring, and description as well as confirmation of the fix. I'd also like to thank Ordoban for their help. --- Personally tested on a Skylake laptop, Skylake Xeon server, and a Xeon-D-1541, passing through XHCI and NVMe functions. Passthru is hit-or-miss to the point of being unusable without this patch. PR: 229852 Submitted by: callum@aitchison.org MFC after: 1 week	2019-06-19 06:41:07 +00:00
Rodney W. Grimes	a488c9c99a	Add accessor function for vm->maxcpus Replace most VM_MAXCPU constant useses with an accessor function to vm->maxcpus which for now is initialized and kept at the value of VM_MAXCPUS. This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de) work from D10070 to adjust it for the cpu topology changes that occured in r332298 Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de) Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Approved by: bde (mentor), jhb (maintainer) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D18755	2019-04-25 22:51:36 +00:00
John Baldwin	2c352feb3b	Fix missed posted interrupts in VT-x in bhyve. When a vCPU is HLTed, interrupts with a priority below the processor priority (PPR) should not resume the vCPU while interrupts at or above the PPR should. With posted interrupts, bhyve maintains a bitmap of pending interrupts in PIR descriptor along with a single 'pending' bit. This bit is checked by a CPU running in guest mode at various places to determine if it should be checked. In addition, another CPU can force a CPU in guest mode to check for pending interrupts by sending an IPI to a special IDT vector reserved for this purpose. bhyve had a bug in that it would only notify a guest vCPU of an interrupt (e.g. by sending the special IPI or by resuming it if it was idle due to HLT) if an interrupt arrived that was higher priority than PPR and no interrupts were currently pending. This assumed that if the 'pending' bit was set, any needed notification was already in progress. However, if the first interrupt sent to a HLTed vCPU was lower priority than PPR and the second was higher than PPR, the first interrupt would set 'pending' but not notify the vCPU, and the second interrupt would not notify the vCPU because 'pending' was already set. To fix this, track the priority of pending interrupts in a separate per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that is above PPR and higher than any previously-received interrupt. This was found and debugged in the bhyve port to SmartOS maintained by Joyent. Relevant SmartOS bugs with more background: https://smartos.org/bugview/OS-6829 https://smartos.org/bugview/OS-6930 https://smartos.org/bugview/OS-7354 Submitted by: Patrick Mooney <pmooney@pfmooney.com> Reviewed by: tychon, rgrimes Obtained from: SmartOS / Joyent MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19299	2019-03-01 20:43:48 +00:00
Konstantin Belousov	f1dc49f33a	Trim whitespace at EoL, use tabs instead of spaces for indent. PR: 235004 Submitted by: Jose Luis Duran <jlduran@gmail.com> MFC after: 3 days	2019-01-17 05:15:25 +00:00
Konstantin Belousov	2343757338	Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068. SDM rev. 068 was released yesterday and it contains the description of the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for all bits present in the document, and decode them in the CPU identification lines printed on boot. But also, the document defines SSB_NO as bit 4, while FreeBSD used but 2 to detect the need to work-around Speculative Store Bypass issue. Change code to use the bit from SDM. Similarly, the document describes bit 3 as an indicator that L1TF issue is not present, in particular, no L1D flush is needed on VMENTRY. We used RDCL_NO to avoid flushing, and again I changed the code to follow new spec from SDM. In fact my Apollo Lake machine with latest ucode shows this: IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO> Reviewed by: bwidawsk Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D18006	2018-11-16 21:27:11 +00:00
Yuri Pankov	8d56c80545	Provide basic descriptions for VMX exit reason (from "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3"). Add the document to SEE ALSO in bhyve.8 (and pet manlint here a bit). Reviewed by: jhb, rgrimes, 0mp Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D17531	2018-10-27 21:24:28 +00:00
John Baldwin	b843f9be5e	Fully restore the GDTR, IDTR, and LDTR after VT-x VM exits. The VT-x VMCS only stores the base address of the GDTR and IDTR. As a result, VM exits use a fixed limit of 0xffff for the host GDTR and IDTR losing the smaller limits set in when the initial GDT is loaded on each CPU during boot. Explicitly save and restore the full GDTR and IDTR contents around VM entries and exits to restore the correct limit. Similarly, explicitly save and restore the LDT selector. VM exits always clear the host LDTR as if the LDT was loaded with a NULL selector and a userspace hypervisor is probably using a NULL selector anyway, but save and restore the LDT explicitly just to be safe. PR: 230773 Reported by: John Levon <levon@movementarian.org> Reviewed by: kib Tested by: araujo Approved by: re (rgrimes) MFC after: 1 week	2018-10-11 18:27:19 +00:00
Andrew Turner	27d2645787	Handle a guest executing a vm instruction by trapping and raising an undefined instruction exception. Previously we would exit the guest, however an unprivileged user could execute these. Found with: syzkaller Reviewed by: araujo, tychon (previous version) Approved by: re (kib) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17192	2018-09-27 11:16:19 +00:00
Konstantin Belousov	c1141fba00	Update L1TF workaround to sustain L1D pollution from NMI. Current mitigation for L1TF in bhyve flushes L1D either by an explicit WRMSR command, or by software reading enough uninteresting data to fully populate all lines of L1D. If NMI occurs after either of methods is completed, but before VM entry, L1D becomes polluted with the cache lines touched by NMI handlers. There is no interesting data which NMI accesses, but something sensitive might be co-located on the same cache line, and then L1TF exposes that to a rogue guest. Use VM entry MSR load list to ensure atomicity of L1D cache and VM entry if updated microcode was loaded. If only software flush method is available, try to help the bhyve sw flusher by also flushing L1D on NMI exit to kernel mode. Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16790	2018-08-19 18:47:16 +00:00
Konstantin Belousov	c30578feeb	Provide part of the mitigation for L1TF-VMM. On the guest entry in bhyve, flush L1 data cache, using either L1D flush command MSR if available, or by reading enough uninteresting data to fill whole cache. Flush is automatically enabled on CPUs which do not report RDCL_NO, and can be disabled with the hw.vmm.l1d_flush tunable/kenv. Security: CVE-2018-3646 Reviewed by: emaste. jhb, Tony Luck <tony.luck@intel.com> Sponsored by: The FreeBSD Foundation	2018-08-14 17:29:41 +00:00
Marcelo Araujo	ebc3c37c6f	Add SPDX tags to vmm(4). MFC after: 4 weeks. Sponsored by: iXsystems Inc.	2018-06-13 07:02:58 +00:00
John Baldwin	9e2154ff1c	Cleanups related to debug exceptions on x86. - Add constants for fields in DR6 and the reserved fields in DR7. Use these constants instead of magic numbers in most places that use DR6 and DR7. - Refer to T_TRCTRAP as "debug exception" rather than a "trace trap" as it is not just for trace exceptions. - Always read DR6 for debug exceptions and only clear TF in the flags register for user exceptions where DR6.BS is set. - Clear DR6 before returning from a debug exception handler as recommended by the SDM dating all the way back to the 386. This allows debuggers to determine the cause of each exception. For kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value to other parts of the handler (namely, user_dbreg_trap()). For user traps, wait until after trapsignal to clear DR6 so that userland debuggers can read DR6 via PT_GETDBREGS while the thread is stopped in trapsignal(). Reviewed by: kib, rgrimes MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15189	2018-05-22 00:45:00 +00:00
Tycho Nightingale	6ac73777ea	Add SDT probes to vmexit on Intel. Submitted by: domagoj.stolfa_gmail.com Reviewed by: grehan, tychon Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14656	2018-04-13 17:23:05 +00:00
John Baldwin	fc276d92ae	Add a way to temporarily suspend and resume virtual CPUs. This is used as part of implementing run control in bhyve's debug server. The hypervisor now maintains a set of "debugged" CPUs. Attempting to run a debugged CPU will fail to execute any guest instructions and will instead report a VM_EXITCODE_DEBUG exit to the userland hypervisor. Virtual CPUs are placed into the debugged state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl). Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl). The debug server suspends virtual CPUs when it wishes them to stop executing in the guest (for example, when a debugger attaches to the server). The debug server can choose to resume only a subset of CPUs (for example, when single stepping) or it can choose to resume all CPUs. The debug server must explicitly mark a CPU as resumed via vm_resume_cpu() before the virtual CPU will successfully execute any guest instructions. Reviewed by: avg, grehan Tested on: Intel (jhb), AMD (avg) Differential Revision: https://reviews.freebsd.org/D14466	2018-04-06 22:03:43 +00:00
Tycho Nightingale	490768e24a	Fix a lock recursion introduced in r327065. Reported by: kmacy Reviewed by: grehan, jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14548	2018-03-07 18:03:22 +00:00
Tycho Nightingale	58a6aaf7ec	Provide further mitigation against CVE-2017-5715 by flushing the return stack buffer (RSB) upon returning from the guest. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b Reviewed by: grehan Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14272	2018-02-12 14:45:27 +00:00
John Baldwin	65eefbe422	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229	2018-01-17 23:11:25 +00:00
Konstantin Belousov	bd50262f70	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text) PCPU GDT/IDT/user LDT/task structures IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables. then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occured on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline did not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
Tycho Nightingale	91fe5fe7e7	Provide some mitigation against CVE-2017-5715 by clearing registers upon returning from the guest which aren't immediately clobbered by the host. This eradicates any remaining guest contents limiting their usefulness in an exploit gadget. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2018-01-15 18:37:03 +00:00
Tycho Nightingale	9e33a61693	Recognize a pending virtual interrupt while emulating the halt instruction. Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2017-12-21 18:30:11 +00:00
Pedro F. Giffuni	c49761dd57	sys/amd64: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:03:07 +00:00
Konstantin Belousov	9fc847133e	Save KGSBASE in pcb before overriding it with the guest value. Reported by: lwhsu, mjoras Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 18 days	2017-08-24 10:49:53 +00:00
Pedro F. Giffuni	500eb14ae8	vmm(4): Small spelling fixes. Reviewed by: grehan	2016-05-03 22:07:18 +00:00
Neel Natu	9b1aa8d622	Restructure memory allocation in bhyve to support "devmem". devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer where doing a trap-and-emulate for every access is impractical. devmem is a hybrid of system memory (sysmem) and emulated device models. devmem is mapped in the guest address space via nested page tables similar to sysmem. However the address range where devmem is mapped may be changed by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is usually mapped RO or RW as compared to RWX mappings for sysmem. Each devmem segment is named (e.g. "bootrom") and this name is used to create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom). The device node supports mmap(2) and this decouples the host mapping of devmem from its mapping in the guest address space (which can change). Reviewed by: tychon Discussed with: grehan Differential Revision: https://reviews.freebsd.org/D2762 MFC after: 4 weeks	2015-06-18 06:00:17 +00:00
Tycho Nightingale	277bdd9950	Support guest writes to the TSC by enabling the "use TSC offsetting" execution control and writing the difference between the host TSC and the guest TSC into the TSC offset in the VMCS upon encountering a write. Reviewed by: neel	2015-06-09 00:14:47 +00:00
Neel Natu	248e6799e9	Fix non-deterministic delays when accessing a vcpu that was in "running" or "sleeping" state. This is done by forcing the vcpu to transition to "idle" by returning to userspace with an exit code of VM_EXITCODE_REQIDLE. MFC after: 2 weeks	2015-05-28 17:37:01 +00:00
Neel Natu	1c73ea3ef8	Don't rely on the 'VM-exit instruction length' field in the VMCS to always have an accurate length on an EPT violation. This is not needed by the instruction decoding code because it also has to work with AMD/SVM that does not provide a valid instruction length on a Nested Page Fault. In collaboration with: Leon Dang (ldang@nahannisys.com) Discussed with: grehan MFC after: 1 week	2015-05-22 17:34:22 +00:00
Neel Natu	1d29bfc149	Emulate machine check related MSRs to allow guest OSes like Windows to boot. Reported by: Leon Dang (ldang@nahannisys.com) MFC after: 2 weeks	2015-05-02 04:19:11 +00:00

1 2 3 4

193 Commits