Commit Graph

405 Commits

Author SHA1 Message Date
neel
16cbb8793a IFC @r272481 2014-10-05 01:28:21 +00:00
neel
f6b1385c0e Get rid of code that dealt with the hardware not being able to save/restore
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.

All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.

Discussed with:	grehan
2014-10-02 05:32:29 +00:00
neel
09ab9e2937 IFC @r272185 2014-09-27 22:15:50 +00:00
neel
11f176f814 Simplify register state save and restore across a VMRUN:
- Host registers are now stored on the stack instead of a per-cpu host context.

- Host %FS and %GS selectors are not saved and restored across VMRUN.
  - Restoring the %FS/%GS selectors was futile anyways since that only updates
    the low 32 bits of base address in the hidden descriptor state.
  - GS.base is properly updated via the MSR_GSBASE on return from svm_launch().
  - FS.base is not used while inside the kernel so it can be safely ignored.

- Add function prologue/epilogue so svm_launch() can be traced with Dtrace's
  FBT entry/exit probes. They also serve to save/restore the host %rbp across
  VMRUN.

Reviewed by:	grehan
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-27 02:04:58 +00:00
grehan
8be950fc2c Allow the PIC's IMR register to be read before ICW initialisation.
As of git submit e179f6914152eca9, the Linux kernel does a simple
probe of the PIC by writing a pattern to the IMR and then reading it
back, prior to the init sequence of ICW words.

The bhyve PIC emulation wasn't allowing the IMR to be read until
the ICW sequence was complete. This limitation isn't required so
relax the test.

With this change, Linux kernels 3.15-rc2 and later won't hang
on boot when calibrating the local APIC.

Reviewed by:	tychon
MFC after:	3 days
2014-09-27 01:15:24 +00:00
neel
35aaa3ac11 Allow more VMCB fields to be cached:
- CR2
- CR0, CR3, CR4 and EFER
- GDT/IDT base/limit fields
- CS/DS/ES/SS selector/base/limit/attrib fields

The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'.

Restructure the code such that the fields above are only modified in a single
place. This makes it easy to invalidate the VMCB cache when any of these fields
is modified.
2014-09-21 23:42:54 +00:00
neel
5834276187 Get rid of unused stat VMM_HLT_IGNORED. 2014-09-21 18:52:56 +00:00
neel
ef294abb97 IFC r271888.
Restructure MSR emulation so it is all done in processor-specific code.
2014-09-20 21:46:31 +00:00
neel
0350d078cb Add some more KTR events to help debugging. 2014-09-20 05:13:03 +00:00
neel
f6a11e15ea MSR_KGSBASE is no longer saved and restored from the guest MSR save area. This
behavior was changed in r271888 so update the comment block to reflect this.

MSR_KGSBASE is accessible from the guest without triggering a VM-exit. The
permission bitmap for MSR_KGSBASE is modified by vmx_msr_guest_init() so get
rid of redundant code in vmx_vminit().
2014-09-20 05:12:34 +00:00
neel
46721cc2c7 Restructure the MSR handling so it is entirely handled by processor-specific
code. There are only a handful of MSRs common between the two so there isn't
too much duplicate functionality.

The VT-x code has the following types of MSRs:

- MSRs that are unconditionally saved/restored on every guest/host context
  switch (e.g., MSR_GSBASE).

- MSRs that are restored to guest values on entry to vmx_run() and saved
  before returning. This is an optimization for MSRs that are not used in
  host kernel context (e.g., MSR_KGSBASE).

- MSRs that are emulated and every access by the guest causes a trap into
  the hypervisor (e.g., MSR_IA32_MISC_ENABLE).

Reviewed by:	grehan
2014-09-20 02:35:21 +00:00
neel
c9b7ad126a IFC @r271694 2014-09-17 18:46:51 +00:00
neel
eefb10a184 Rework vNMI injection.
Keep track of NMI blocking by enabling the IRET intercept on a successful
vNMI injection. The NMI blocking condition is cleared when the handler
executes an IRET and traps back into the hypervisor.

Don't inject NMI if the processor is in an interrupt shadow to preserve the
atomic nature of "STI;HLT". Take advantage of this and artificially set the
interrupt shadow to prevent NMI injection when restarting the "iret".

Reviewed by:	Anish Gupta (akgupt3@gmail.com), grehan
2014-09-17 00:30:25 +00:00
neel
16169f746d Minor cleanup.
Get rid of unused 'svm_feature' from the softc.

Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check
in vmm.c against 'vm->active_cpus' before the AMD-specific code is called.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-16 04:01:55 +00:00
neel
97bffd44f6 Use V_IRQ, V_INTR_VECTOR and V_TPR to offload APIC interrupt delivery to the
processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector
and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes
care of injecting this vector when the guest is able to receive it.

Legacy PIC interrupts are still delivered via the event injection mechanism.
This is because the vector injected by the PIC must reflect the state of its
pins at the time the CPU is ready to accept the interrupt.

Accesses to the TPR via %CR8 are handled entirely in hardware. This requires
that the emulated TPR must be synced to V_TPR after a #VMEXIT.

The guest can also modify the TPR via the memory mapped APIC. This requires
that the V_TPR must be synced with the emulated TPR before a VMRUN.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-16 03:31:40 +00:00
neel
cbc92dc709 Set the 'vmexit->inst_length' field properly depending on the type of the
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.

Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.

Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.

Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2014-09-14 04:39:04 +00:00
neel
3e1af2f123 Bug fixes.
- Don't enable the HLT intercept by default. It will be enabled by bhyve(8)
  if required. Prior to this change HLT exiting was always enabled making
  the "-H" option to bhyve(8) meaningless.

- Recognize a VM exit triggered by a non-maskable interrupt. Prior to this
  change the exit would be punted to userspace and the virtual machine would
  terminate.
2014-09-13 23:48:43 +00:00
neel
a38d07e455 style(9): insert an empty line if the function has no local variables
Pointed out by:	grehan
2014-09-13 22:45:04 +00:00
neel
32e0378b35 AMD processors that have the SVM decode assist capability will store the
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.

vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.

The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.

The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
Discussed with:	grehan
2014-09-13 22:16:40 +00:00
neel
b2ca87a5d0 Optimize the common case of injecting an interrupt into a vcpu after a HLT
by explicitly moving it out of the interrupt shadow. The hypervisor is done
"executing" the HLT and by definition this moves the vcpu out of the
1-instruction interrupt shadow.

Prior to this change the interrupt would be held pending because the VMCS
guest-interruptibility-state would indicate that "blocking by STI" was in
effect. This resulted in an unnecessary round trip into the guest before
the pending interrupt could be injected.

Reviewed by:	grehan
2014-09-12 06:15:20 +00:00
neel
45bb1086d6 style(9): indent the switch, don't indent the case, indent case body one tab. 2014-09-11 06:17:56 +00:00
neel
e108397c4a Repurpose the V_IRQ interrupt injection to implement VMX-style interrupt
window exiting. This simply involves setting V_IRQ and enabling the VINTR
intercept. This instructs the CPU to trap back into the hypervisor as soon
as an interrupt can be injected into the guest. The pending interrupt is
then injected via the traditional event injection mechanism.

Rework vcpu interrupt injection so that Linux guests now idle with host cpu
utilization close to 0%.

Reviewed by:	Anish Gupta (earlier version)
Discussed with:	grehan
2014-09-11 02:37:02 +00:00
neel
907c0aec17 Allow intercepts and irq fields to be cached by the VMCB.
Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete
VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated
when intercepts are modified.

Each intercept is identified as a (index,bitmask) tuple. For e.g., the
VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR).
The first 20 bytes in control area that are used to enable intercepts
are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'.

Modify svm_setcap() and svm_getcap() to use the new APIs.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 03:13:40 +00:00
neel
3c61bfaac6 Move the VMCB initialization into svm.c in preparation for changes to the
interrupt injection logic.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 02:35:19 +00:00
neel
bbc4ff544f Move the event injection function into svm.c and add KTR logging for
every event injection.

This in in preparation for changes to SVM guest interrupt injection.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 02:20:32 +00:00
neel
5b8dd1ad82 Remove a bogus check that flagged an error if the guest %rip was zero.
An AP begins execution with %rip set to 0 after a startup IPI.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:46:22 +00:00
neel
77b2918286 Make the KTR tracepoints uniform and ensure that every VM-exit is logged.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:37:32 +00:00
neel
d07afdc371 Allow guest read access to MSR_EFER without hypervisor intervention.
Dirty the VMCB_CACHE_CR state cache when MSR_EFER is modified.
2014-09-10 01:10:53 +00:00
neel
d6f50ad39f Remove gratuitous forward declarations.
Remove tabs on empty lines.
2014-09-09 23:39:43 +00:00
neel
5824b2ab21 Do proper ASID management for guest vcpus.
Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. The meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.

With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.

Discussed with:	grehan
Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-06 19:02:52 +00:00
neel
8decf07cf3 Merge svm_set_vmcb() and svm_init_vmcb() into a single function that is called
just once when a vcpu is initialized.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-05 03:33:16 +00:00
neel
9c9ff7e250 Remove unused header file.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:07:32 +00:00
neel
4038b1d8f9 Consolidate the code to restore the host TSS after a #VMEXIT into a single
function restore_host_tss().

Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:00:18 +00:00
neel
e5d2f8730c IFC @r269962
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-02 04:22:42 +00:00
neel
15d3c1ec17 The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.

While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.

Update the status bits in %rflags when emulating AND and OR opcodes.

Reviewed by:	grehan
2014-08-30 19:59:42 +00:00
grehan
4900dd0aa5 Implement the 0x2B SUB instruction, and the OR variant of 0x81.
Found with local APIC accesses from bitrig/amd64 bsd.rd, 07/15-snap.

Reviewed by:	neel
MFC after:	3 days
2014-08-27 00:53:56 +00:00
neel
37c34582e9 An exception is allowed to be injected even if the vcpu is in an interrupt
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.

Change return type of 'vmcb_eventinject()' to a void and convert all error
returns into KASSERTs.

Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.

Reviewed by:    Anish Gupta (akgupt3@gmail.com)
2014-08-25 00:58:20 +00:00
neel
ca77633f85 Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.

Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID
leaf 0xb. This is intended for testing guest behavior when it falls back
on using CPUID leaf 0x4 to deduce CPU topology.

The default behavior is to advertise each vcpu as a core in a separate soket.
2014-08-24 01:10:06 +00:00
neel
33474e62d3 Fix a bug in the emulation of CPUID leaf 0x4 where bhyve was claiming that
the vcpu had no caches at all. This causes problems when executing applications
in the guest compiled with the Intel compiler.

Submitted by:	Mark Hill (mark.hill@tidalscale.com)
2014-08-23 22:44:31 +00:00
neel
ea3ac18997 Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot
find any unmasked pin with an interrupt asserted.

Reviewed by:	tychon
CR:		https://reviews.freebsd.org/D669
MFC after:	1 week
2014-08-23 21:16:26 +00:00
neel
69905f3734 Reword comment to match the interrupt mode names from the MPtable spec.
Reviewed by:	tychon
2014-08-14 18:03:38 +00:00
neel
53369652fd Use the max guest memory address when creating its iommu domain.
Also, assert that the GPA being mapped in the domain is less than its maxaddr.

Reviewed by:	grehan
Pointed out by:	Anish Gupta (akgupt3@gmail.com)
2014-08-14 05:00:45 +00:00
neel
6e79e6c83b Support PCI extended config space in bhyve.
Add the ACPI MCFG table to advertise the extended config memory window.

Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.

Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.

CR:		https://phabric.freebsd.org/D505
Reviewed by:	jhb, grehan
Requested by:	tychon
2014-08-08 03:49:01 +00:00
jhb
05d21354c3 - Output a summary of optional VT-x features in dmesg similar to CPU
features.  If bootverbose is enabled, a detailed list is provided;
  otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
  under a new hw.vmm.vmx.cap node. Move a few exiting sysctls that
  indicate the presence of optional capabilities under this node.

CR:		https://phabric.freebsd.org/D498
Reviewed by:	grehan, neel
MFC after:	1 week
2014-07-30 00:00:12 +00:00
neel
20e3e8762f If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps
forever in vm_handle_hlt().

This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.

Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.

Reported by:	Leon Dang
Sponsored by:	Nahanni Systems
2014-07-26 02:53:51 +00:00
neel
62d591cec9 Don't return -1 from the push emulation handler. Negative return values are
interpreted specially on return from sys_ioctl() and may cause undesirable
side-effects like restarting the system call.
2014-07-26 02:51:46 +00:00
neel
563faffd70 Fix a couple of issues in the PUSH emulation:
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.

vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
2014-07-24 23:01:53 +00:00
neel
4535fa67c4 Fix fault injection in bhyve.
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.

A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
2014-07-24 01:38:11 +00:00
neel
e972917c13 Emulate instructions emitted by OpenBSD/i386 version 5.5:
- CMP REG, r/m
- MOV AX/EAX/RAX, moffset
- MOV moffset, AX/EAX/RAX
- PUSH r/m
2014-07-23 04:28:51 +00:00
neel
7b1204cc02 Fix build without INVARIANTS defined by getting rid of unused variable 'exc'.
Reported by:	adrian, stefanf
2014-07-20 16:34:35 +00:00
neel
1f15eea2e0 Handle nested exceptions in bhyve.
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.

vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.

vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
2014-07-19 20:59:08 +00:00
neel
5046d9cb8a Add emulation for legacy x86 task switching mechanism.
FreeBSD/i386 uses task switching to handle double fault exceptions and this
change enables that to work.

Reported by:	glebius
2014-07-16 21:26:26 +00:00
neel
eb07e4ed55 Add support for operand size and address size override prefixes in bhyve's
instruction emulation [1].

Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.

Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.

Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].

Tested by:	Leon Dang [1]
Reported by:	Peter Grehan [2]
Sponsored by:	Nahanni Systems
2014-07-15 17:37:17 +00:00
neel
307c44649f Use the correct offset when converting a logical address (segment:offset)
to a linear address.
2014-07-11 01:23:38 +00:00
neel
845f7be2e3 Accurately identify the vcpu's operating mode as 64-bit, compatibility,
protected or real.
2014-07-08 21:48:57 +00:00
neel
d5633f89da Invalidate guest TLB mappings as a side-effect of its CR3 being updated.
This is a pre-requisite for task switch emulation since the CR3 is loaded
from the new TSS.
2014-07-08 20:51:03 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
tychon
816d8c3faa Add support for emulating the move instruction: "mov r/m8, imm8".
Reviewed by:	neel
2014-06-26 17:15:41 +00:00
grehan
54db9f3822 Expose the amount of resident and wired memory from the guest's vmspace.
This is different than the amount shown for the process e.g. by
/usr/bin/top - that is the mappings faulted in by the mmap'd region
of guest memory.

The values can be fetched with bhyvectl

 # bhyvectl --get-stats --vm=myvm
 ...
 Resident memory                         	413749248
 Wired memory                            	0
 ...

vmm_stat.[ch] -
 Modify the counter code in bhyve to allow direct setting of a counter
as opposed to incrementing, and providing a callback to fetch a
counter's value.

Reviewed by:	neel
2014-06-25 22:13:35 +00:00
tychon
bb415f07f0 Bring an overly enthusiastic KASSERT inline with the Intel SDM.
Reviewed by:	neel
2014-06-16 22:59:18 +00:00
neel
8c7e29c295 Disable global interrupts early so all the software state maintained by bhyve
is sampled "atomically". Any interrupts after this point will be held pending
by the CPU until the guest starts executing and will immediately trigger a
#VMEXIT.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-06-11 17:48:07 +00:00
neel
e48c89801a Add helper functions to populate VM exit information for rendezvous and
astpending exits. This is to reduce code duplication between VT-x and
SVM implementations.
2014-06-10 16:45:58 +00:00
neel
32f7809be1 Turn on interrupt window exiting unconditionally when an ExtINT is being
injected into the guest. This allows the hypervisor to inject another
ExtINT or APIC vector as soon as the guest is able to process interrupts.

This change is not to address any correctness issue but to guarantee that
any pending APIC vector that was preempted by the ExtINT will be injected
as soon as possible. Prior to this change such pending interrupts could be
delayed until the next VM exit.
2014-06-10 01:38:02 +00:00
grehan
adbd7deabf Temporary fix for guest idle detection.
Handle ExtINT injection for SVM. The HPET emulation
will inject a legacy interrupt at startup, and if this
isn't handled, will result in the HLT-exit code assuming
there are outstanding ExtINTs and return without sleeping.

svm_inj_interrupts() needs more changes to bring it up
to date with the VT-x version: these are forthcoming.

Reviewed by:	neel
2014-06-09 21:02:48 +00:00
neel
d4bb0b204a Add reserved bit checking when doing %CR8 emulation and inject #GP if required.
Pointed out by:	grehan
Reviewed by:	tychon
2014-06-09 20:51:08 +00:00
grehan
fe997346e0 Allow the TSC MSR to be accessed directly from the guest. 2014-06-07 23:08:06 +00:00
grehan
afc0a3433a Set the guest PAT MSR in the VMCB to power-on defaults.
Linux guests accept the values in this register, while *BSD
guests reprogram it. Default values of zero correspond to
PAT_UNCACHEABLE, resulting in glacial performance.

Thanks to Willem Jan Withagen for first reporting this and
helping out with the investigation.
2014-06-07 23:05:12 +00:00
neel
80a67d54c4 Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained
by vmm.ko. This allows the virtual machine to be restarted without having
to destroy it first.

Reviewed by:	grehan
2014-06-07 21:36:52 +00:00
tychon
c04c953593 Support guest accesses to %cr8.
Reviewed by:	neel
2014-06-06 18:23:49 +00:00
grehan
f1ed4b50ae ins/outs support for SVM. Modelled on the Intel VT-x code.
Remove CR2 save/restore - the guest restore/save is done
in hardware, and there is no need to save/restore the host
version (same as VT-x).

Submitted by:	neel (SVM segment descriptor 'P' bit code)
Reviewed by:	neel
2014-06-06 02:55:18 +00:00
grehan
39adc03910 Allow the guest's CR2 value to be read/written.
This is required for page-fault injection.
2014-06-05 06:29:18 +00:00
grehan
2374fa6276 Use API call when VM is detected as suspended. This fixes
the (harmless) error message on exit:

  vmexit_suspend: invalid reason 217645057

Reviewed by:	neel, Anish Gupta (akgupt3@gmail.com)
2014-06-03 22:26:46 +00:00
grehan
5e6423ee3b Bring (almost) up-to-date with HEAD.
- use the new virtual APIC page
- update to current bhyve APIs

Tested by Anish with multiple FreeBSD SMP VMs on a Phenom,
and verified by myself with light FreeBSD VM testing
on a Sempron 3850 APU.

The issues reported with Linux guests are very likely to still
be here, but this sync eliminates the skew between the
project branch and CURRENT, and should help to determine
the causes.

Some follow-on commits will fix minor cosmetic issues.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-06-03 06:56:54 +00:00
grehan
95f7c2f56c MFC @ r266724
An SVM update will follow this.
2014-06-03 02:34:21 +00:00
neel
9c2a942387 Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing
it implicitly in vmm.ko.

Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.

This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.
2014-05-31 23:37:34 +00:00
tychon
61025dc75e If VMX isn't enabled so long as the lock bit isn't set yet in MSR
IA32_FEATURE_CONTROL it still can be.

Approved by:	grehan (co-mentor)
2014-05-30 23:37:31 +00:00
jhb
e3d386d9fe - Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to the
guest for which the rules regarding xsetbv emulation are known.  In
  particular future extensions like AVX-512 have interdependencies among
  feature bits that could allow a guest to trigger a GP# in the host with
  the current approach of allowing anything the host supports.
- Add proper checking of Intel MPX and AVX-512 XSAVE features in the
  xsetbv emulation and allow these features to be exposed to the guest if
  they are enabled in the host.
- Expose a subset of known-safe features from leaf 0 of the structured
  extended features to guests if they are supported on the host including
  RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM.  Aside
  from AVX-512, these features are all new instructions available for use
  in ring 3 with no additional hypervisor changes needed.

Reviewed by:	neel
2014-05-27 19:04:38 +00:00
neel
4b40e47cf8 Add segment protection and limits violation checks in vie_calculate_gla()
for 32-bit x86 guests.

Tested using ins/outs executed in a FreeBSD/i386 guest.
2014-05-27 04:26:22 +00:00
neel
07a8a1c99a Remove restriction on insb/insw/insl emulation. These instructions are
properly emulated.
2014-05-25 02:05:23 +00:00
neel
ffc6a38259 Do the linear address calculation for the ins/outs emulation using a new
API function 'vie_calculate_gla()'.

While the current implementation is simplistic it forms the basis of doing
segmentation checks if the guest is in 32-bit protected mode.
2014-05-25 00:57:24 +00:00
neel
51a05acc08 Add libvmmapi functions vm_copyin() and vm_copyout() to copy into and out
of the guest linear address space. These APIs in turn use a new ioctl
'VM_GLA2GPA' to convert the guest linear address to guest physical.

Use the new copyin/copyout APIs when emulating ins/outs instruction in
bhyve(8).
2014-05-24 23:12:30 +00:00
neel
6a6e13c407 Consolidate all the information needed by the guest page table walker into
'struct vm_guest_paging'.

Check for canonical addressing in vmm_gla2gpa() and inject a protection
fault into the guest if a violation is detected.

If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to
point to the root of the page tables.
2014-05-24 20:26:57 +00:00
neel
52a4f11861 When injecting a page fault into the guest also update the guest's %cr2 to
indicate the faulting linear address.

If the guest PML4 entry has the PG_PS bit set then inject a page fault into
the guest with the PGEX_RSV bit set in the error_code.

Get rid of redundant checks for the PG_RW violations when walking the page
tables.
2014-05-24 19:13:25 +00:00
neel
2ccda87aca Check for alignment check violation when processing in/out string instructions. 2014-05-23 19:59:14 +00:00
neel
8f99933d82 Add emulation of the "outsb" instruction. NetBSD guests use this to write to
the UART FIFO.

The emulation is constrained in a number of ways: 64-bit only, doesn't check
for all exception conditions, limited to i/o ports emulated in userspace.

Some of these constraints will be relaxed in followup commits.

Requested by:	grehan
Reviewed by:	tychon (partially and a much earlier version)
2014-05-23 05:15:17 +00:00
neel
062bfb5ea3 A Centos 6.4 guest will write 0xff to the 8259 mask register before beginning
the proper ICWx initialization sequence. It assumes, probably correctly, that
the boot firmware has done the 8259 initialization.

Since grub-bhyve does not initialize the 8259 this write to the mask register
takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0).

Fix this by initializing 'error' at the start of the function.
2014-05-23 05:04:50 +00:00
neel
f33a3d02f0 Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in the
VCPU_RUNNING state. This will let the VMX exit handler inspect the vcpu's
segment descriptors without having to exit the critical section.
2014-05-22 17:22:37 +00:00
neel
645d479a58 Inject page fault into the guest if the page table walker detects an invalid
translation for the guest linear address.
2014-05-22 03:14:54 +00:00
neel
6071e2741b Add PG_RW check when translating a guest linear to guest physical address.
Set the accessed and dirty bits in the page table entry. If it fails then
restart the page table walk from the beginning. This might happen if another
vcpu modifies the page tables simultaneously.

Reviewed by:	alc, kib
2014-05-20 20:30:28 +00:00
neel
b0752c3683 Add PG_U (user/supervisor) checks when translating a guest linear address
to a guest physical address.

PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now
checked only in non-terminal paging entries.

Ignore the upper 32-bits of the CR3 for PAE paging.
2014-05-19 03:50:07 +00:00
grehan
9fa48763c0 Make the vmx asm code dtrace-fbt-friendly by
- inserting frame enter/leave sequences
 - restructuring the vmx_enter_guest routine so that it subsumes
   the vm_exit_guest block, which was the #vmexit RIP and not a
   callable routine.

Reviewed by:	neel
MFC after:	3 weeks
2014-05-18 03:50:17 +00:00
jhb
f558af85b7 Implement a PCI interrupt router to route PCI legacy INTx interrupts to
the legacy 8259A PICs.
- Implement an ICH-comptabile PCI interrupt router on the lpc device with
  8 steerable pins configured via config space access to byte-wide
  registers at 0x60-63 and 0x68-6b.
- For each configured PCI INTx interrupt, route it to both an I/O APIC
  pin and a PCI interrupt router pin.  When a PCI INTx interrupt is
  asserted, ensure that both pins are asserted.
- Provide an initial routing of PCI interrupt router (PIRQ) pins to
  8259A pins (ISA IRQs) and initialize the interrupt line config register
  for the corresponding PCI function with the ISA IRQ as this matches
  existing hardware.
- Add a global _PIC method for OSPM to select the desired interrupt routing
  configuration.
- Update the _PRT methods for PCI bridges to provide both APIC and legacy
  PRT tables and return the appropriate table based on the configured
  routing configuration.  Note that if the lpc device is not configured, no
  routing information is provided.
- When the lpc device is enabled, provide ACPI PCI link devices corresponding
  to each PIRQ pin.
- Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A
  pins via the ELCR.
- Mark the power management SCI as level triggered.
- Don't hardcode the number of elements in Packages in the source for
  the DSDT.  iasl(8) will fill in the actual number of elements, and
  this makes it simpler to generate a Package with a variable number of
  elements.

Reviewed by:	tycho
2014-05-15 14:16:55 +00:00
neel
5fd692c3b5 Virtual machine halt detection is turned on by default. Allow it to be
disabled via the tunable 'hw.vmm.halt_detection'.
2014-05-05 16:19:24 +00:00
neel
b735ae5b9a Add logic in the HLT exit handler to detect if the guest has put all vcpus
to sleep permanently by executing a HLT with interrupts disabled.

When this condition is detected the guest with be suspended with a reason of
VM_SUSPEND_HALT and the bhyve(8) process will exit.

Tested by executing "halt" inside a RHEL7-beta guest.

Discussed with:	grehan@
Reviewed by:	jhb@, tychon@
2014-05-02 00:33:56 +00:00
neel
0601994645 Ignore writes to microcode update MSR. This MSR is accessed by RHEL7 guest.
Add KTR tracepoints to annotate wrmsr and rdmsr VM exits.
2014-04-30 02:08:27 +00:00
neel
9c85092013 Some Linux guests will implement a 'halt' by disabling the APIC and executing
the 'HLT' instruction. This condition was detected by 'vm_handle_hlt()' and
converted into the SPINDOWN_CPU exitcode . The bhyve(8) process would exit
the vcpu thread in response to a SPINDOWN_CPU and when the last vcpu was
spun down it would reset the virtual machine via vm_suspend(VM_SUSPEND_RESET).

This functionality was broken in r263780 in a way that made it impossible
to kill the bhyve(8) process because it would loop forever in
vm_handle_suspend().

Unbreak this by removing the code to spindown vcpus. Thus a 'halt' from
a Linux guest will appear to be hung but this is consistent with the
behavior on bare metal. The guest can be rebooted by using the bhyvectl
options '--force-reset' or '--force-poweroff'.

Reviewed by:	grehan@
2014-04-29 18:42:56 +00:00
neel
b616a9a2e4 Allow a virtual machine to be forcibly reset or powered off. This is done
by adding an argument to the VM_SUSPEND ioctl that specifies how the virtual
machine should be suspended, viz. VM_SUSPEND_RESET or VM_SUSPEND_POWEROFF.

The disposition of VM_SUSPEND is also made available to the exit handler
via the 'u.suspended' member of 'struct vm_exit'.

This capability is exposed via the '--force-reset' and '--force-poweroff'
arguments to /usr/sbin/bhyvectl.

Discussed with:	grehan@
2014-04-28 22:06:40 +00:00
neel
ebf51bf73a A VMCS is always inactive when it exits the vmx_run() loop.
Remove redundant code and the misleading comment that suggest otherwise.

Reviewed by:	grehan@
2014-04-26 22:37:56 +00:00
grehan
4f6dd265e1 Allow the guest to read the TSC via MSR 0x10.
NetBSD/amd64 does this, as does Linux on AMD CPUs.

Reviewed by:	neel
MFC after:	3 weeks
2014-04-24 00:27:34 +00:00
neel
360d54aa50 Change the vlapic timer frequency to be in the ballpark of contemporary
hardware. This also decouples the vlapic emulation from the host's TSC
frequency.

Requested by:	grehan@
2014-04-23 16:50:40 +00:00
tychon
f44a06b5a1 Factor out common ioport handler code for better hygiene -- pointed
out by neel@.

Approved by:	neel (co-mentor)
2014-04-22 16:13:56 +00:00
tychon
d8c307b493 Add support for the PIT 'readback' command -- based on a patch by grehan@.
Approved by:	grehan (co-mentor)
2014-04-18 16:05:12 +00:00
tychon
2c52df9a16 Respect the destination operand size of the 'Input from Port' instruction.
Approved by:	grehan (co-mentor)
2014-04-18 15:22:56 +00:00
tychon
4d0de44f39 Add support for reading the PIT Counter 2 output signal via the NMI
Status and Control register at port 0x61.

Be more conservative about "catching up" callouts that were supposed
to fire in the past by skipping an interrupt if it was
scheduled too far in the past.

Restore the PIT ACPI DSDT entries and add an entry for NMISC too.

Approved by:	neel (co-mentor)
2014-04-18 00:02:06 +00:00
jhb
03a8cfa7d9 Don't spindown the BSP if it executes hlt with the APIC disabled. A
guest that doesn't use the APIC at all can trigger this, plus the BSP
always needs to execute as it should trigger a reset, etc.

Reviewed by:	tychon
2014-04-15 20:53:53 +00:00
tychon
bbe78c2d72 Local APIC access via 32-bit naturally-aligned loads is merely
suggested in the SDM.  Since some OSes have implemented otherwise
don't be too rigorous in enforcing it.

Approved by:	grehan (co-mentor)
2014-04-15 17:06:26 +00:00
tychon
04f26f5235 Add support for emulating the byte move and sign extend instructions:
"movsx r/m8, r32" and "movsx r/m8, r64".

Approved by:	grehan (co-mentor)
2014-04-15 15:11:10 +00:00
tychon
5906c6773b Add support for emulating the slave PIC.
Reviewed by:	grehan, jhb
Approved by:	grehan (co-mentor)
2014-04-14 19:00:20 +00:00
neel
335d93f16d There is no need to save and restore the host's return address in the
'struct vmxctx'. It is preserved on the host stack across a guest entry
and exit and just restoring the host's '%rsp' is sufficient.

Pointed out by:	grehan@
2014-04-11 20:15:53 +00:00
tychon
45ec65c336 Account for the "plus 1" encoding of the CPUID Function 4 reported
core per package and cache sharing values.

Approved by:	grehan (co-mentor)
2014-04-11 18:19:21 +00:00
grehan
2ac5c08506 Rework r264179.
- remove redundant code
- remove erroneous setting of the error return
  in vmmdev_ioctl()
- use style(9) initialization
- in vmx_inject_pir(), document the race condition
  that the final conditional statement was detecting,

Tested with both gcc and clang builds.

Reviewed by:	neel
2014-04-10 19:15:58 +00:00
imp
e8da9ba992 Make the vmm code compile with gcc too. Not entirely sure things are
correct for the pirbase test (since I'd have thought we'd need to do
something even when the offset is 0 and that test looks like a
misguided attempt to not use an uninitialized variable), but it is at
least the same as today.
2014-04-05 22:43:23 +00:00
rstone
35a855d80f Re-write bhyve's I/O MMU handling in terms of PCI RID.
Reviewed by:	neel
MFC after:	2 months
Sponsored by:	Sandvine Inc.
2014-04-01 15:54:03 +00:00
rstone
120bf54d08 Revert PCI RID changes.
My PCI RID changes somehow got intermixed with my PCI ARI patch when I
committed it.  I may have accidentally applied a patch to a non-clean
working tree.  Revert everything while I figure out what went wrong.

Pointy hat to: rstone
2014-04-01 15:06:03 +00:00
rstone
4df7085933 Re-write bhyve's I/O MMU handling in terms of PCI RIDs
Reviewed by:	neel
Sponsored by:	Sandvine Inc
2014-04-01 14:54:43 +00:00
neel
3e49998fdf Add an ioctl to suspend a virtual machine (VM_SUSPEND). The ioctl can be called
from any context i.e., it is not required to be called from a vcpu thread. The
ioctl simply sets a state variable 'vm->suspend' to '1' and returns.

The vcpus inspect 'vm->suspend' in the run loop and if it is set to '1' the
vcpu breaks out of the loop with a reason of 'VM_EXITCODE_SUSPENDED'. The
suspend handler waits until all 'vm->active_cpus' have transitioned to
'vm->suspended_cpus' before returning to userspace.

Discussed with:	grehan
2014-03-26 23:34:27 +00:00
tychon
58699bc5fc Move the atpit device model from userspace into vmm.ko for better
precision and lower latency.

Approved by:	grehan (co-mentor)
2014-03-25 19:20:34 +00:00
neel
3818d66305 When a vcpu is deactivated it must also unblock any rendezvous that may be
blocked on it.

This is done by issuing a wakeup after clearing the 'vcpuid' from 'active_cpus'.
Also, use CPU_CLR_ATOMIC() to guarantee visibility of the updated 'active_cpus'
across all host cpus.
2014-03-18 02:49:28 +00:00
neel
9e498dc116 Notify vcpus participating in the rendezvous of the pending event to ensure
that they execute the rendezvous function as soon as possible.
2014-03-17 23:30:38 +00:00
tychon
5460439295 Fix a race wherein the source of an interrupt vector is wrongly
attributed if an ExtINT arrives during interrupt injection.

Also, fix a spurious interrupt if the PIC tries to raise an interrupt
before the outstanding one is accepted.

Finally, improve the PIC interrupt latency when another interrupt is
raised immediately after the outstanding one is accepted by creating a
vmexit rather than waiting for one to occur by happenstance.

Approved by:	neel (co-mentor)
2014-03-15 23:09:34 +00:00
tychon
9affb68b8d Don't try to return a vector to a caller that only cares if a vector
is pending or not.

Approved by:	neel (co-mentor)
2014-03-11 22:12:12 +00:00
tychon
25c8b61cfd Replace the userspace atpic stub with a more functional vmm.ko model.
New ioctls VM_ISA_ASSERT_IRQ, VM_ISA_DEASSERT_IRQ and VM_ISA_PULSE_IRQ
can be used to manipulate the pic, and optionally the ioapic, pin state.

Reviewed by:	jhb, neel
Approved by:	neel (co-mentor)
2014-03-11 16:56:00 +00:00
neel
4e6374765e Fix a race between VMRUN() and vcpu_notify_event() due to 'vcpu->hostcpu'
being updated outside of the vcpu_lock(). The race is benign and could
potentially result in a missed notification about a pending interrupt to
a vcpu. The interrupt would not be lost but rather delayed until the next
VM exit.

The vcpu's hostcpu is now updated concurrently with the vcpu state change.
When the vcpu transitions to the RUNNING state the hostcpu is set to 'curcpu'.
It is set to 'NOCPU' in all other cases.

Reviewed by:	grehan
2014-03-01 03:17:58 +00:00
jhb
4905c0f870 Correct VMware capitalization.
Submitted by:	joeld
2014-02-28 21:33:40 +00:00
jhb
4ce54b93eb Workaround an apparent bug in VMWare Fusion's nested VT support where it
triggers a VM exit with the exit reason of an external interrupt but
without a valid interrupt set in the exit interrupt information.

Tested by:	Michael Dexter
Reviewed by:	neel
MFC after:	1 week
2014-02-28 19:07:55 +00:00
neel
e01c440dae Queue pending exceptions in the 'struct vcpu' instead of directly updating the
processor-specific VMCS or VMCB. The pending exception will be delivered right
before entering the guest.

The order of event injection into the guest is:
- hardware exception
- NMI
- maskable interrupt

In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt
window-exiting and inject it as soon as possible after the hardware exception
is injected. Also since interrupts are inherently asynchronous, injecting
them after the hardware exception should not affect correctness from the
guest perspective.

Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict
it to only deliver x86 hardware exceptions. This new ioctl is now used to
inject a protection fault when the guest accesses an unimplemented MSR.

Discussed with:	grehan, jhb
Reviewed by:	jhb
2014-02-26 00:52:05 +00:00
grehan
e0e0829e5e MFC @ r259635
This brings in the "-w" option from bhyve to ignore unknown MSRs.
It will make debugging Linux guests a bit easier.

Suggested by:	Willem Jan Withagen (wjw at digiware nl)
2014-02-25 06:29:56 +00:00
neel
3e0732cf3e Add support for x2APIC virtualization assist in Intel VT-x.
The vlapic.ops handler 'enable_x2apic_mode' is called when the vlapic mode
is switched to x2APIC. The VT-x implementation of this handler turns off the
APIC-access virtualization and enables the x2APIC virtualization in the VMCS.

The x2APIC virtualization is done by allowing guest read access to a subset
of MSRs in the x2APIC range. In non-root operation the processor will satisfy
an 'rdmsr' access to these MSRs by reading from the virtual APIC page instead.

The guest is also given write access to TPR, EOI and SELF_IPI MSRs which
get special treatment in non-root operation. This is documented in the
Intel SDM section titled "Virtualizing MSR-Based APIC Accesses".

Enforce that APIC-write and APIC-access VM-exits are handled only if
APIC-access virtualization is enabled. The one exception to this is
SELF_IPI virtualization which may result in an APIC-write VM-exit.
2014-02-21 06:03:54 +00:00
neel
4626d164b8 Simplify APIC mode switching from MMIO to x2APIC. In part this is done to
simplify the implementation of the x2APIC virtualization assist in VT-x.

Prior to this change the vlapic allowed the guest to change its mode from
xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked
when the virtual machine is created. This is not very constraining because
operating systems already have to deal with BIOS setting up the APIC in
x2APIC mode at boot.

Fix a bug in the CPUID emulation where the x2APIC capability was leaking
from the host to the guest.

Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore
MSR accesses to the vlapic when it is in xAPIC mode.

The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8)
can be used to change the mode to x2APIC instead.

Discussed with:	grehan@
2014-02-20 01:48:25 +00:00
jhb
521737384d A first pass at adding support for injecting hardware exceptions for
emulated instructions.
- Add helper routines to inject interrupt information for a hardware
  exception from the VM exit callback routines.
- Use the new routines to inject GP and UD exceptions for invalid
  operations when emulating the xsetbv instruction.
- Don't directly manipulate the entry interrupt info when a user event
  is injected.  Instead, store the event info in the vmx state and
  only apply it during a VM entry if a hardware exception or NMI is
  not already pending.
- While here, use HANDLED/UNHANDLED instead of 1/0 in a couple of
  routines.

Reviewed by:	neel
2014-02-18 03:07:36 +00:00
neel
f9781635be Handle writes to the SELF_IPI MSR by the guest when the vlapic is configured
in x2apic mode. Reads to this MSR are currently ignored but should cause a
general proctection exception to be injected into the vcpu.

All accesses to the corresponding offset in xAPIC mode are ignored.

Also, do not panic the host if there is mismatch between the trigger mode
programmed in the TMR and the actual interrupt being delivered. Instead the
anomaly is logged to aid debugging and to prevent a misbehaving guest from
panicking the host.
2014-02-17 23:07:16 +00:00
neel
19c649a6fb Use spinlocks to lock accesses to the vioapic.
This is necessary because if the vlapic is configured in x2apic mode the
vioapic_process_eoi() function is called inside the critical section
established by vm_run().
2014-02-17 22:57:51 +00:00
jhb
0cdfc370bf Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and
XSAVE-enabled features like AVX.
- Store a per-cpu guest xcr0 register.  When switching to the guest FPU
  state, switch to the guest xcr0 value.  Note that the guest FPU state is
  saved and restored using the host's xcr0 value and xcr0 is saved/restored
  "inside" of saving/restoring the guest FPU state.
- Handle VM exits for the xsetbv instruction by updating the guest xcr0.
- Expose the XSAVE feature to the guest only if the host has enabled XSAVE,
  and only advertise XSAVE features enabled by the host to the guest.
  This ensures that the guest will only adjust FPU state that is a subset
  of the guest FPU state saved and restored by the host.

Reviewed by:	grehan
2014-02-08 16:37:54 +00:00
neel
5b2777fc37 Add a counter to differentiate between VM-exits due to nested paging faults
and instruction emulation faults.
2014-02-08 06:22:09 +00:00
neel
d79f5b61e6 Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI).
If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the
CPU when the VM-exit is completed. No more NMIs will be recognized until
the execution of an "iret".

Prior to this change the NMI handler was dispatched via a software interrupt
with interrupts enabled. This meant that an interrupt could be recognized
by the processor before the NMI handler completed its execution. The "iret"
issued by the interrupt handler would then cause the "blocking by NMI" to
be cleared prematurely.

This is now fixed by handling the NMI with interrupts disabled in addition
to "blocking by NMI" already established by the VM-exit.
2014-02-08 05:04:34 +00:00
jhb
f4e46bef98 Add support for FreeBSD/i386 guests under bhyve.
- Similar to the hack for bootinfo32.c in userboot, define
  _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot.
  This allows userboot to load 32-bit kernels and modules.
- Copy the SMAP generation code out of bootinfo64.c and into its own
  file so it can be shared with bootinfo32.c to pass an SMAP to the i386
  kernel.
- Use uint32_t instead of u_long when aligning module metadata in
  bootinfo32.c in userboot, as otherwise the metadata used 64-bit
  alignment which corrupted the layout.
- Populate the basemem and extmem members of the bootinfo struct passed
  to 32-bit kernels.
- Fix the 32-bit stack in userboot to start at the top of the stack
  instead of the bottom so that there is room to grow before the
  kernel switches to its own stack.
- Push a fake return address onto the 32-bit stack in addition to the
  arguments normally passed to exec() in the loader.  This return
  address is needed to convince recover_bootinfo() in the 32-bit
  locore code that it is being invoked from a "new" boot block.
- Add a routine to libvmmapi to setup a 32-bit flat mode register state
  including a GDT and TSS that is able to start the i386 kernel and
  update bhyveload to use it when booting an i386 kernel.
- Use the guest register state to determine the CPU's current instruction
  mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long
  mode) in the instruction emulation code.  Update the gla2gpa() routine
  used when fetching instructions to handle flat mode, 32-bit paging, and
  PAE paging in addition to long mode paging.  Don't look for a REX
  prefix when the CPU is in 32-bit mode, and use the detected mode to
  enable the existing 32-bit mode code when decoding the mod r/m byte.

Reviewed by:	grehan, neel
MFC after:	1 month
2014-02-05 04:39:03 +00:00
tychon
c9758b2e1c Add support for emulating the byte move and zero extend instructions:
"mov r/m8, r32" and "mov r/m8, r64".

Approved by:	neel (co-mentor)
2014-02-05 02:01:08 +00:00
grehan
9b01253af1 Changes to the SVM code to bring it up to r259205
- Convert VMM_CTR to VCPU_CTR KTR macros
 - Special handling of halt, save rflags for VMM layer to emulate
   halt for vcpu(sleep to be awakened by interrupt or stop it)
 - Cleanup of RVI exit handling code

Submitted by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2014-02-04 07:13:56 +00:00
grehan
db98868c2f MFC @ r259205 in preparation for some SVM updates. (for real this time) 2014-02-04 06:59:08 +00:00
grehan
f115890f37 Roll back botched partial MFC :( 2014-02-04 05:03:14 +00:00
neel
8f40ca632d Avoid doing unnecessary nested TLB invalidations.
Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu
and per-hostcpu. In the degenerate case where 'N' vcpus were sharing
a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations.
Since an 'invept' invalidates mappings for all VPIDs the first 'invept'
is sufficient.

Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'.

If it is known that an 'invept' is going to be done before entering the
guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED
counts the number of 'invvpid' invalidations that were avoided because
they were subsumed by an 'invept'.

Discussed with:	grehan
2014-02-04 02:45:08 +00:00
grehan
0ee348bd50 MFC @ r259205 in preparation for some SVM updates. 2014-02-04 02:41:54 +00:00
jhb
3f6ca218c6 Enhance the support for PCI legacy INTx interrupts and enable them in
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
  to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
  ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
  interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
  PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
  that share a slot attempt to distribute their INTx interrupts across
  the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
  and when the INTx DIS bit is set in a function's PCI command register.
  Either assert or deassert the associated I/O APIC pin when the
  state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.

Submitted by:	neel (7)
Reviewed by:	neel
MFC after:	2 weeks
2014-01-29 14:56:48 +00:00
neel
872f585846 Support level triggered interrupts with VT-x virtual interrupt delivery.
The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector.
If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger
an EOI-induced VM-exit when it is doing EOI virtualization.

The EOI-induced VM-exit results in the EOI being forwarded to the vioapic
so that level triggered interrupts can be properly handled.

Tested by:	Anish Gupta (akgupt3@gmail.com)
2014-01-25 20:58:05 +00:00
jhb
b2533ec507 Move <machine/apicvar.h> to <x86/apicvar.h>. 2014-01-23 20:10:22 +00:00
neel
232e34343e Set "Interrupt Window Exiting" in the case where there is a vector to be
injected into the vcpu but the VM-entry interruption information field
already has the valid bit set.

Pointed out by:	David Reed (david.reed@tidalscale.com)
2014-01-23 06:06:50 +00:00
neel
e971318156 Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler
via a software interrupt.

This is safe to do because the logical processor is already cognizant of the
NMI and further NMIs are blocked until the host's NMI handler executes "iret".
2014-01-22 04:03:11 +00:00
neel
daa658a5dd There is no need to initialize the IOMMU if no passthru devices have been
configured for bhyve to use.

Suggested by:	grehan@
2014-01-21 03:01:34 +00:00