Commit Graph

298 Commits

Author SHA1 Message Date
neel
11f176f814 Simplify register state save and restore across a VMRUN:
- Host registers are now stored on the stack instead of a per-cpu host context.

- Host %FS and %GS selectors are not saved and restored across VMRUN.
  - Restoring the %FS/%GS selectors was futile anyways since that only updates
    the low 32 bits of base address in the hidden descriptor state.
  - GS.base is properly updated via the MSR_GSBASE on return from svm_launch().
  - FS.base is not used while inside the kernel so it can be safely ignored.

- Add function prologue/epilogue so svm_launch() can be traced with Dtrace's
  FBT entry/exit probes. They also serve to save/restore the host %rbp across
  VMRUN.

Reviewed by:	grehan
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-27 02:04:58 +00:00
neel
35aaa3ac11 Allow more VMCB fields to be cached:
- CR2
- CR0, CR3, CR4 and EFER
- GDT/IDT base/limit fields
- CS/DS/ES/SS selector/base/limit/attrib fields

The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'.

Restructure the code such that the fields above are only modified in a single
place. This makes it easy to invalidate the VMCB cache when any of these fields
is modified.
2014-09-21 23:42:54 +00:00
neel
5834276187 Get rid of unused stat VMM_HLT_IGNORED. 2014-09-21 18:52:56 +00:00
neel
ef294abb97 IFC r271888.
Restructure MSR emulation so it is all done in processor-specific code.
2014-09-20 21:46:31 +00:00
neel
c9b7ad126a IFC @r271694 2014-09-17 18:46:51 +00:00
neel
eefb10a184 Rework vNMI injection.
Keep track of NMI blocking by enabling the IRET intercept on a successful
vNMI injection. The NMI blocking condition is cleared when the handler
executes an IRET and traps back into the hypervisor.

Don't inject NMI if the processor is in an interrupt shadow to preserve the
atomic nature of "STI;HLT". Take advantage of this and artificially set the
interrupt shadow to prevent NMI injection when restarting the "iret".

Reviewed by:	Anish Gupta (akgupt3@gmail.com), grehan
2014-09-17 00:30:25 +00:00
neel
16169f746d Minor cleanup.
Get rid of unused 'svm_feature' from the softc.

Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check
in vmm.c against 'vm->active_cpus' before the AMD-specific code is called.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-16 04:01:55 +00:00
neel
97bffd44f6 Use V_IRQ, V_INTR_VECTOR and V_TPR to offload APIC interrupt delivery to the
processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector
and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes
care of injecting this vector when the guest is able to receive it.

Legacy PIC interrupts are still delivered via the event injection mechanism.
This is because the vector injected by the PIC must reflect the state of its
pins at the time the CPU is ready to accept the interrupt.

Accesses to the TPR via %CR8 are handled entirely in hardware. This requires
that the emulated TPR must be synced to V_TPR after a #VMEXIT.

The guest can also modify the TPR via the memory mapped APIC. This requires
that the V_TPR must be synced with the emulated TPR before a VMRUN.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-16 03:31:40 +00:00
neel
cbc92dc709 Set the 'vmexit->inst_length' field properly depending on the type of the
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.

Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.

Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.

Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2014-09-14 04:39:04 +00:00
neel
3e1af2f123 Bug fixes.
- Don't enable the HLT intercept by default. It will be enabled by bhyve(8)
  if required. Prior to this change HLT exiting was always enabled making
  the "-H" option to bhyve(8) meaningless.

- Recognize a VM exit triggered by a non-maskable interrupt. Prior to this
  change the exit would be punted to userspace and the virtual machine would
  terminate.
2014-09-13 23:48:43 +00:00
neel
a38d07e455 style(9): insert an empty line if the function has no local variables
Pointed out by:	grehan
2014-09-13 22:45:04 +00:00
neel
32e0378b35 AMD processors that have the SVM decode assist capability will store the
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.

vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.

The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.

The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
Discussed with:	grehan
2014-09-13 22:16:40 +00:00
neel
b2ca87a5d0 Optimize the common case of injecting an interrupt into a vcpu after a HLT
by explicitly moving it out of the interrupt shadow. The hypervisor is done
"executing" the HLT and by definition this moves the vcpu out of the
1-instruction interrupt shadow.

Prior to this change the interrupt would be held pending because the VMCS
guest-interruptibility-state would indicate that "blocking by STI" was in
effect. This resulted in an unnecessary round trip into the guest before
the pending interrupt could be injected.

Reviewed by:	grehan
2014-09-12 06:15:20 +00:00
neel
45bb1086d6 style(9): indent the switch, don't indent the case, indent case body one tab. 2014-09-11 06:17:56 +00:00
neel
e108397c4a Repurpose the V_IRQ interrupt injection to implement VMX-style interrupt
window exiting. This simply involves setting V_IRQ and enabling the VINTR
intercept. This instructs the CPU to trap back into the hypervisor as soon
as an interrupt can be injected into the guest. The pending interrupt is
then injected via the traditional event injection mechanism.

Rework vcpu interrupt injection so that Linux guests now idle with host cpu
utilization close to 0%.

Reviewed by:	Anish Gupta (earlier version)
Discussed with:	grehan
2014-09-11 02:37:02 +00:00
neel
907c0aec17 Allow intercepts and irq fields to be cached by the VMCB.
Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete
VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated
when intercepts are modified.

Each intercept is identified as a (index,bitmask) tuple. For e.g., the
VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR).
The first 20 bytes in control area that are used to enable intercepts
are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'.

Modify svm_setcap() and svm_getcap() to use the new APIs.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 03:13:40 +00:00
neel
3c61bfaac6 Move the VMCB initialization into svm.c in preparation for changes to the
interrupt injection logic.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 02:35:19 +00:00
neel
bbc4ff544f Move the event injection function into svm.c and add KTR logging for
every event injection.

This in in preparation for changes to SVM guest interrupt injection.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 02:20:32 +00:00
neel
5b8dd1ad82 Remove a bogus check that flagged an error if the guest %rip was zero.
An AP begins execution with %rip set to 0 after a startup IPI.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:46:22 +00:00
neel
77b2918286 Make the KTR tracepoints uniform and ensure that every VM-exit is logged.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:37:32 +00:00
neel
d07afdc371 Allow guest read access to MSR_EFER without hypervisor intervention.
Dirty the VMCB_CACHE_CR state cache when MSR_EFER is modified.
2014-09-10 01:10:53 +00:00
neel
d6f50ad39f Remove gratuitous forward declarations.
Remove tabs on empty lines.
2014-09-09 23:39:43 +00:00
neel
5824b2ab21 Do proper ASID management for guest vcpus.
Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. The meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.

With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.

Discussed with:	grehan
Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-06 19:02:52 +00:00
neel
8decf07cf3 Merge svm_set_vmcb() and svm_init_vmcb() into a single function that is called
just once when a vcpu is initialized.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-05 03:33:16 +00:00
neel
9c9ff7e250 Remove unused header file.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:07:32 +00:00
neel
4038b1d8f9 Consolidate the code to restore the host TSS after a #VMEXIT into a single
function restore_host_tss().

Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:00:18 +00:00
neel
e5d2f8730c IFC @r269962
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-02 04:22:42 +00:00
neel
15d3c1ec17 The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.

While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.

Update the status bits in %rflags when emulating AND and OR opcodes.

Reviewed by:	grehan
2014-08-30 19:59:42 +00:00
grehan
4900dd0aa5 Implement the 0x2B SUB instruction, and the OR variant of 0x81.
Found with local APIC accesses from bitrig/amd64 bsd.rd, 07/15-snap.

Reviewed by:	neel
MFC after:	3 days
2014-08-27 00:53:56 +00:00
neel
37c34582e9 An exception is allowed to be injected even if the vcpu is in an interrupt
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.

Change return type of 'vmcb_eventinject()' to a void and convert all error
returns into KASSERTs.

Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.

Reviewed by:    Anish Gupta (akgupt3@gmail.com)
2014-08-25 00:58:20 +00:00
neel
ca77633f85 Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.

Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID
leaf 0xb. This is intended for testing guest behavior when it falls back
on using CPUID leaf 0x4 to deduce CPU topology.

The default behavior is to advertise each vcpu as a core in a separate soket.
2014-08-24 01:10:06 +00:00
neel
33474e62d3 Fix a bug in the emulation of CPUID leaf 0x4 where bhyve was claiming that
the vcpu had no caches at all. This causes problems when executing applications
in the guest compiled with the Intel compiler.

Submitted by:	Mark Hill (mark.hill@tidalscale.com)
2014-08-23 22:44:31 +00:00
neel
ea3ac18997 Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot
find any unmasked pin with an interrupt asserted.

Reviewed by:	tychon
CR:		https://reviews.freebsd.org/D669
MFC after:	1 week
2014-08-23 21:16:26 +00:00
neel
69905f3734 Reword comment to match the interrupt mode names from the MPtable spec.
Reviewed by:	tychon
2014-08-14 18:03:38 +00:00
neel
53369652fd Use the max guest memory address when creating its iommu domain.
Also, assert that the GPA being mapped in the domain is less than its maxaddr.

Reviewed by:	grehan
Pointed out by:	Anish Gupta (akgupt3@gmail.com)
2014-08-14 05:00:45 +00:00
neel
6e79e6c83b Support PCI extended config space in bhyve.
Add the ACPI MCFG table to advertise the extended config memory window.

Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.

Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.

CR:		https://phabric.freebsd.org/D505
Reviewed by:	jhb, grehan
Requested by:	tychon
2014-08-08 03:49:01 +00:00
jhb
05d21354c3 - Output a summary of optional VT-x features in dmesg similar to CPU
features.  If bootverbose is enabled, a detailed list is provided;
  otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
  under a new hw.vmm.vmx.cap node. Move a few exiting sysctls that
  indicate the presence of optional capabilities under this node.

CR:		https://phabric.freebsd.org/D498
Reviewed by:	grehan, neel
MFC after:	1 week
2014-07-30 00:00:12 +00:00
neel
20e3e8762f If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps
forever in vm_handle_hlt().

This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.

Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.

Reported by:	Leon Dang
Sponsored by:	Nahanni Systems
2014-07-26 02:53:51 +00:00
neel
62d591cec9 Don't return -1 from the push emulation handler. Negative return values are
interpreted specially on return from sys_ioctl() and may cause undesirable
side-effects like restarting the system call.
2014-07-26 02:51:46 +00:00
neel
563faffd70 Fix a couple of issues in the PUSH emulation:
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.

vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
2014-07-24 23:01:53 +00:00
neel
4535fa67c4 Fix fault injection in bhyve.
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.

A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
2014-07-24 01:38:11 +00:00
neel
e972917c13 Emulate instructions emitted by OpenBSD/i386 version 5.5:
- CMP REG, r/m
- MOV AX/EAX/RAX, moffset
- MOV moffset, AX/EAX/RAX
- PUSH r/m
2014-07-23 04:28:51 +00:00
neel
7b1204cc02 Fix build without INVARIANTS defined by getting rid of unused variable 'exc'.
Reported by:	adrian, stefanf
2014-07-20 16:34:35 +00:00
neel
1f15eea2e0 Handle nested exceptions in bhyve.
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.

vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.

vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
2014-07-19 20:59:08 +00:00
neel
5046d9cb8a Add emulation for legacy x86 task switching mechanism.
FreeBSD/i386 uses task switching to handle double fault exceptions and this
change enables that to work.

Reported by:	glebius
2014-07-16 21:26:26 +00:00
neel
eb07e4ed55 Add support for operand size and address size override prefixes in bhyve's
instruction emulation [1].

Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.

Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.

Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].

Tested by:	Leon Dang [1]
Reported by:	Peter Grehan [2]
Sponsored by:	Nahanni Systems
2014-07-15 17:37:17 +00:00
neel
307c44649f Use the correct offset when converting a logical address (segment:offset)
to a linear address.
2014-07-11 01:23:38 +00:00
neel
845f7be2e3 Accurately identify the vcpu's operating mode as 64-bit, compatibility,
protected or real.
2014-07-08 21:48:57 +00:00
neel
d5633f89da Invalidate guest TLB mappings as a side-effect of its CR3 being updated.
This is a pre-requisite for task switch emulation since the CR3 is loaded
from the new TSS.
2014-07-08 20:51:03 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00