Commit Graph

405 Commits

Author SHA1 Message Date
neel
4eebcc1fbb Fix the RTC device model to operate correctly in 12-hour mode. The following
table documents the values in the RTC 'hour' field in the two modes:

Hour-of-the-day		12-hour mode	24-hour mode
12	AM		12		0
[1-11]	AM		[1-11]		[1-11]
12	PM		0x80 | 12	12
[1-11]	PM		0x80 | [1-11]	[13-23]

Reported by:	Julian Hsiao (madoka@nyanisore.net)
MFC after:	1 week
2015-03-28 02:55:16 +00:00
tychon
b925086de0 When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.

Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.

Reviewed by:	neel
2015-03-24 17:12:36 +00:00
mav
af4c17529a Report ARAT (APIC-Timer-always-running) feature for virtual CPU.
This makes FreeBSD guest to not avoid using LAPIC timer, preferring HPET
due to worries about non-existing for virtual CPUs deep sleep states.

Benchmarks of usleep(1) on guest and host show such extra latencies:
 - 51us for virtual HPET,
 - 22us for virtual LAPIC timer,
 - 22us for host HPET and
 - 3us for host LAPIC timer.

MFC after:	2 weeks
2015-03-16 11:57:03 +00:00
neel
e16d18eeb8 Use lapic_ipi_alloc() to dynamically allocate IPI slots needed by bhyve when
vmm.ko is loaded.

Also relocate the 'justreturn' IPI handler to be alongside all other handlers.

Requested by:	kib
2015-03-14 02:32:08 +00:00
tychon
eedd9cfb16 When ICW1 is issued the edge sense circuit is reset which means that
following an initialization a low-to-high transistion is necesary to
generate an interrupt.

Reviewed by:	neel
2015-03-06 02:05:45 +00:00
neel
e549774cf3 Fix warnings/errors when building vmm.ko with gcc:
- fix warning about comparison of 'uint8_t v_tpr >= 0' always being true.

- fix error triggered by an empty clobber list in the inline assembly for
  "clgi" and "stgi"

- fix error when compiling "vmload %rax", "vmrun %rax" and "vmsave %rax". The
  gcc assembler does not like the explicit operand "%rax" while the clang
  assembler requires specifying the operand "%rax". Fix this by encoding the
  instructions using the ".byte" directive.

Reported by:	julian
MFC after:	1 week
2015-03-02 20:13:49 +00:00
rstone
29fedc4c2c Allow passthrough devices to be hinted.
Allow the ppt driver to attach to devices that were hinted to be
passthrough devices by the PCI code creating them with a driver
name of "ppt".

Add a tunable that allows the IOMMU to be forced to be used.  With
SR-IOV passthrough devices the VFs may be created after vmm.ko is
loaded.  The current code will not initialize the IOMMU in that
case, meaning that the passthrough devices can't actually be used.

Differential Revision:	https://reviews.freebsd.org/D73
Reviewed by:		neel
MFC after: 		1 month
Sponsored by:		Sandvine Inc.
2015-03-01 00:39:48 +00:00
neel
f458d8769b Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore
capability of VT-x. This lets bhyve run nested in older VMware versions that
don't support the PAT save/restore capability.

Note that the actual value programmed by the guest in MSR_PAT is irrelevant
because bhyve sets the 'Ignore PAT' bit in the nested PTE.

Reported by:	marcel
Tested by:	Leon Dang (ldang@nahannisys.com)
Sponsored by:	Nahanni Systems
MFC after:	2 weeks
2015-02-24 05:35:15 +00:00
kib
0754f0eac9 Add x2APIC support. Enable it by default if CPU is capable. The
hw.x2apic_enable tunable allows disabling it from the loader prompt.

To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications.  This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered.  This
may be changed later.

In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.

Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.

Reviewed by:	neel
Tested by:	pho (real hardware), neel (on bhyve)
Discussed with:	jhb, grehan
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2015-02-09 21:00:56 +00:00
neel
5c373339e4 Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.

Submitted by:	Dmitry Luhtionov (dmitryluhtionov@gmail.com)
2015-01-24 00:35:49 +00:00
neel
e901ab485a MOVS instruction emulation.
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.

Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.

Tested by:	tychon
MFC after:	1 week
2015-01-19 06:53:31 +00:00
neel
d9f07f9841 Simplify instruction restart logic in bhyve.
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.

Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.

bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
  as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
  as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())

Differential Revision:	https://reviews.freebsd.org/D1526
Reviewed by:		grehan
MFC after:		2 weeks
2015-01-18 03:08:30 +00:00
neel
4091be74c6 Fix typo (missing comma).
MFC after:	3 days
2015-01-14 07:18:51 +00:00
neel
5c965bc583 'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.

Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.

Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.

MFC after:      1 week
2015-01-13 22:00:47 +00:00
neel
e72e75f02d Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception. This matches the CPU behavior.

Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.

Reported by:	David Reed (david.reed@tidalscale.com)
MFC after:	2 weeks
2015-01-06 19:04:02 +00:00
neel
b9de0c114c Initialize all fields of 'struct vm_exception exception' before passing it to
vm_inject_exception(). This fixes the issue that 'exception.cpuid' is
uninitialized when calling 'vm_inject_exception()'.

However, in practice this change is a no-op because vm_inject_exception()
does not use 'exception.cpuid' for anything.

Reported by:    Coverity Scan
CID:            1261297
MFC after:      3 days
2014-12-30 23:38:31 +00:00
neel
7aa6460c48 Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.

Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.

The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D1385
MFC after:	2 weeks
2014-12-30 22:19:34 +00:00
neel
2908c65b8a Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on
an AMD/SVM host.

MFC after:	1 week
2014-12-30 02:44:33 +00:00
neel
baa3b938e5 Implement "special mask mode" in vatpic.
OpenBSD guests always enable "special mask mode" during boot. As a result of
r275952 this is flagged as an error and the guest cannot boot.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D1384
MFC after:	1 week
2014-12-28 00:53:52 +00:00
neel
4d52b44708 Allow ktr(4) tracing of all guest exceptions via the tunable
"hw.vmm.trace_guest_exceptions".  To enable this feature set the tunable
to "1" before loading vmm.ko.

Tracing the guest exceptions can be useful when debugging guest triple faults.

Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.

Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".

Discussed with:	grehan
MFC after:	2 weeks
2014-12-23 02:14:49 +00:00
neel
9abef2383d Emulate writes to the IA32_MISC_ENABLE MSR.
PR:		196093
Reported by:	db
Tested by:	db
Discussed with:	grehan
MFC after:	1 week
2014-12-20 19:47:51 +00:00
neel
e4f07b01b4 Various 8259 device model improvements:
- implement 8259 "polled" mode.
- set 'atpic->sfn' if bit 4 in ICW4 is set during master initialization.
- report error if guest tries to enable the "special mask" mode.

Differential Revision:	https://reviews.freebsd.org/D1328
Reviewed by:		tychon
Reported by:		grehan
Tested by:		grehan
MFC after:		1 week
2014-12-20 04:57:45 +00:00
neel
36aadd747e Fix 8259 IRQ priority resolver.
Initialize the 8259 such that IRQ7 is the lowest priority.

Reviewed by:		tychon
Differential Revision:	https://reviews.freebsd.org/D1322
MFC after:		1 week
2014-12-17 03:04:43 +00:00
neel
bdae65463e For level triggered interrupts clear the PIC IRR bit when the interrupt pin
is deasserted. Prior to this change each assertion on a level triggered irq
pin resulted in two interrupts being delivered to the CPU.

Differential Revision:	https://reviews.freebsd.org/D1310
Reviewed by:	tychon
MFC after:	1 week
2014-12-16 06:33:57 +00:00
grehan
205a8351f4 Change the lower bound for guest vmspace allocation to 0 instead of
using the VM_MIN_ADDRESS constant.

HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in
bhyve VM startup failing. Guest memory is always assumed to start
at 0 so use the absolute value instead.

Reported by:	Shawn Webb, lattera at gmail com
Reviewed by:	neel, grehan
Obtained from:	Oliver Pinter via HardenedBSD
23bd719ce1
MFC after:	1 week
2014-11-23 23:07:21 +00:00
araujo
8a51fd3c94 Reported by: Coverity
CID:		1249760
Reviewed by:	neel
Approved by:	neel
Sponsored by:	QNAP Systems Inc.
2014-10-28 07:19:02 +00:00
grehan
4789422fd1 Remove bhyve SVM feature printf's now that they are available in the
general CPU feature detection code.

Reviewed by:	neel
2014-10-27 22:20:51 +00:00
neel
4ee18dab17 Change the type of the first argument to the I/O emulation handlers to
'struct vm *'. Previously it used to be a 'void *' but there is no reason
to hide the actual type from the handler.

Discussed with:	tychon
MFC after:	1 week
2014-10-26 19:03:06 +00:00
neel
eb58530f9b Move the ACPI PM timer emulation into vmm.ko.
This reduces variability during timer calibration by keeping the emulation
"close" to the guest. Additionally having all timer emulations in the kernel
will ease the transition to a per-VM clock source (as opposed to using the
host's uptime keep track of time).

Discussed with:	grehan
2014-10-26 04:44:28 +00:00
neel
a76d8ee4e9 Don't pass the 'error' return from an I/O port handler directly to vm_run().
Most I/O port handlers return -1 to signal an error. If this value is returned
without modification to vm_run() then it leads to incorrect behavior because
'-1' is interpreted as ERESTART at the system call level.

Fix this by always returning EIO to signal an error from an I/O port handler.

MFC after:	1 week
2014-10-26 03:03:41 +00:00
neel
dd2febd6f2 IFC @r273214 2014-10-20 02:57:30 +00:00
neel
53c23ba9a2 IFC @r273206 2014-10-19 23:05:18 +00:00
neel
13e9198693 Don't advertise the "OS visible workarounds" feature in cpuid.80000001H:ECX.
bhyve doesn't emulate the MSRs needed to support this feature at this time.

Don't expose any model-specific RAS and performance monitoring features in
cpuid leaf 80000007H.

Emulate a few more MSRs for AMD: TSEG base address, TSEG address mask and
BIOS signature and P-state related MSRs.

This eliminates all the unimplemented MSRs accessed by Linux/x86_64 kernels
2.6.32, 3.10.0 and 3.17.0.
2014-10-19 21:38:58 +00:00
neel
d5ca76422a Don't advertise support for the NodeID MSR since bhyve doesn't emulate it. 2014-10-18 05:39:32 +00:00
neel
170f49cd34 Don't advertise the Instruction Based Sampling feature because it requires
emulating a large number of MSRs.

Ignore writes to a couple more AMD-specific MSRs and return 0 on read.

This further reduces the unimplemented MSRs accessed by a Linux guest on boot.
2014-10-17 06:23:04 +00:00
neel
b194fc5a2b Hide extended PerfCtr MSRs on AMD processors by clearing bits 23, 24 and 28 in
CPUID.80000001H:ECX.

Handle accesses to PerfCtrX and PerfEvtSelX MSRs by ignoring writes and
returning 0 on reads.

This further reduces the number of unimplemented MSRs hit by a Linux guest
during boot.
2014-10-17 03:04:38 +00:00
neel
8bc5fc92e5 Use the correct fault type (VM_PROT_EXECUTE) for an instruction fetch. 2014-10-16 18:16:31 +00:00
neel
09fbd3f7d9 Fix topology enumeration issues exposed by AMD Bulldozer Family 15h processor.
Initialize CPUID.80000008H:ECX[7:0] with the number of logical processors in
the package. This fixes a panic during early boot in NetBSD 7.0 BETA.

Clear the Topology Extension feature bit from CPUID.80000001H:ECX since we
don't emulate leaves 0x8000001D and 0x8000001E. This fixes a divide by zero
panic in early boot in Centos 6.4.

Tested on an "AMD Opteron 6320" courtesy of Ben Perrault.

Reviewed by:	grehan
2014-10-16 18:13:10 +00:00
davide
e88bd26b3f Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
neel
c4ba9b78fd Actually hide the SVM capability by clearing CPUID.80000001H:ECX[bit 3]
after it has been initialized by cpuid_count().

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-10-15 04:29:03 +00:00
neel
268d74c73a Emulate "POP r/m".
This is needed to boot OpenBSD/i386 MP kernel in bhyve.

Reported by:	grehan
MFC after:	1 week
2014-10-14 21:02:33 +00:00
neel
651ecc1e21 Remove extraneous comments. 2014-10-11 04:57:17 +00:00
neel
0d5ab4a163 Get rid of unused headers.
Restrict scope of malloc types M_SVM and M_SVM_VLAPIC by making them static.
Replace ERR() with KASSERT().
style(9) cleanup.
2014-10-11 04:41:21 +00:00
neel
bbecfc3c42 Get rid of unused forward declaration of 'struct svm_softc'. 2014-10-11 03:21:33 +00:00
neel
3e5970c8ec style(9) fixes.
Get rid of unused headers.
2014-10-11 03:19:26 +00:00
neel
c09e1bd7e2 Use a consistent style for messages emitted when the module is loaded. 2014-10-11 03:09:34 +00:00
neel
bfa9f55e70 IFC @r272887 2014-10-10 23:52:56 +00:00
neel
ae1ff84ece Fix bhyvectl so it works correctly on AMD/SVM hosts. Also, add command line
options to display some key VMCB fields.

The set of valid options that can be passed to bhyvectl now depends on the
processor type. AMD-specific options are identified by a "--vmcb" or "--avic"
in the option name. Intel-specific options are identified by a "--vmcs" in
the option name.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-10-10 21:48:59 +00:00
neel
f443215307 Support Intel-specific MSRs that are accessed when booting up a linux in bhyve:
- MSR_PLATFORM_INFO
- MSR_TURBO_RATIO_LIMITx
- MSR_RAPL_POWER_UNIT

Reviewed by:	grehan
MFC after:	1 week
2014-10-09 19:13:33 +00:00
neel
ce319c48f4 Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.
The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting
CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to
be present anyways.

Discussed with:	grehan
2014-10-06 20:48:01 +00:00
neel
16cbb8793a IFC @r272481 2014-10-05 01:28:21 +00:00
neel
f6b1385c0e Get rid of code that dealt with the hardware not being able to save/restore
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.

All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.

Discussed with:	grehan
2014-10-02 05:32:29 +00:00
neel
09ab9e2937 IFC @r272185 2014-09-27 22:15:50 +00:00
neel
11f176f814 Simplify register state save and restore across a VMRUN:
- Host registers are now stored on the stack instead of a per-cpu host context.

- Host %FS and %GS selectors are not saved and restored across VMRUN.
  - Restoring the %FS/%GS selectors was futile anyways since that only updates
    the low 32 bits of base address in the hidden descriptor state.
  - GS.base is properly updated via the MSR_GSBASE on return from svm_launch().
  - FS.base is not used while inside the kernel so it can be safely ignored.

- Add function prologue/epilogue so svm_launch() can be traced with Dtrace's
  FBT entry/exit probes. They also serve to save/restore the host %rbp across
  VMRUN.

Reviewed by:	grehan
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-27 02:04:58 +00:00
grehan
8be950fc2c Allow the PIC's IMR register to be read before ICW initialisation.
As of git submit e179f6914152eca9, the Linux kernel does a simple
probe of the PIC by writing a pattern to the IMR and then reading it
back, prior to the init sequence of ICW words.

The bhyve PIC emulation wasn't allowing the IMR to be read until
the ICW sequence was complete. This limitation isn't required so
relax the test.

With this change, Linux kernels 3.15-rc2 and later won't hang
on boot when calibrating the local APIC.

Reviewed by:	tychon
MFC after:	3 days
2014-09-27 01:15:24 +00:00
neel
35aaa3ac11 Allow more VMCB fields to be cached:
- CR2
- CR0, CR3, CR4 and EFER
- GDT/IDT base/limit fields
- CS/DS/ES/SS selector/base/limit/attrib fields

The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'.

Restructure the code such that the fields above are only modified in a single
place. This makes it easy to invalidate the VMCB cache when any of these fields
is modified.
2014-09-21 23:42:54 +00:00
neel
5834276187 Get rid of unused stat VMM_HLT_IGNORED. 2014-09-21 18:52:56 +00:00
neel
ef294abb97 IFC r271888.
Restructure MSR emulation so it is all done in processor-specific code.
2014-09-20 21:46:31 +00:00
neel
0350d078cb Add some more KTR events to help debugging. 2014-09-20 05:13:03 +00:00
neel
f6a11e15ea MSR_KGSBASE is no longer saved and restored from the guest MSR save area. This
behavior was changed in r271888 so update the comment block to reflect this.

MSR_KGSBASE is accessible from the guest without triggering a VM-exit. The
permission bitmap for MSR_KGSBASE is modified by vmx_msr_guest_init() so get
rid of redundant code in vmx_vminit().
2014-09-20 05:12:34 +00:00
neel
46721cc2c7 Restructure the MSR handling so it is entirely handled by processor-specific
code. There are only a handful of MSRs common between the two so there isn't
too much duplicate functionality.

The VT-x code has the following types of MSRs:

- MSRs that are unconditionally saved/restored on every guest/host context
  switch (e.g., MSR_GSBASE).

- MSRs that are restored to guest values on entry to vmx_run() and saved
  before returning. This is an optimization for MSRs that are not used in
  host kernel context (e.g., MSR_KGSBASE).

- MSRs that are emulated and every access by the guest causes a trap into
  the hypervisor (e.g., MSR_IA32_MISC_ENABLE).

Reviewed by:	grehan
2014-09-20 02:35:21 +00:00
neel
c9b7ad126a IFC @r271694 2014-09-17 18:46:51 +00:00
neel
eefb10a184 Rework vNMI injection.
Keep track of NMI blocking by enabling the IRET intercept on a successful
vNMI injection. The NMI blocking condition is cleared when the handler
executes an IRET and traps back into the hypervisor.

Don't inject NMI if the processor is in an interrupt shadow to preserve the
atomic nature of "STI;HLT". Take advantage of this and artificially set the
interrupt shadow to prevent NMI injection when restarting the "iret".

Reviewed by:	Anish Gupta (akgupt3@gmail.com), grehan
2014-09-17 00:30:25 +00:00
neel
16169f746d Minor cleanup.
Get rid of unused 'svm_feature' from the softc.

Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check
in vmm.c against 'vm->active_cpus' before the AMD-specific code is called.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-16 04:01:55 +00:00
neel
97bffd44f6 Use V_IRQ, V_INTR_VECTOR and V_TPR to offload APIC interrupt delivery to the
processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector
and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes
care of injecting this vector when the guest is able to receive it.

Legacy PIC interrupts are still delivered via the event injection mechanism.
This is because the vector injected by the PIC must reflect the state of its
pins at the time the CPU is ready to accept the interrupt.

Accesses to the TPR via %CR8 are handled entirely in hardware. This requires
that the emulated TPR must be synced to V_TPR after a #VMEXIT.

The guest can also modify the TPR via the memory mapped APIC. This requires
that the V_TPR must be synced with the emulated TPR before a VMRUN.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-16 03:31:40 +00:00
neel
cbc92dc709 Set the 'vmexit->inst_length' field properly depending on the type of the
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.

Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.

Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.

Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2014-09-14 04:39:04 +00:00
neel
3e1af2f123 Bug fixes.
- Don't enable the HLT intercept by default. It will be enabled by bhyve(8)
  if required. Prior to this change HLT exiting was always enabled making
  the "-H" option to bhyve(8) meaningless.

- Recognize a VM exit triggered by a non-maskable interrupt. Prior to this
  change the exit would be punted to userspace and the virtual machine would
  terminate.
2014-09-13 23:48:43 +00:00
neel
a38d07e455 style(9): insert an empty line if the function has no local variables
Pointed out by:	grehan
2014-09-13 22:45:04 +00:00
neel
32e0378b35 AMD processors that have the SVM decode assist capability will store the
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.

vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.

The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.

The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
Discussed with:	grehan
2014-09-13 22:16:40 +00:00
neel
b2ca87a5d0 Optimize the common case of injecting an interrupt into a vcpu after a HLT
by explicitly moving it out of the interrupt shadow. The hypervisor is done
"executing" the HLT and by definition this moves the vcpu out of the
1-instruction interrupt shadow.

Prior to this change the interrupt would be held pending because the VMCS
guest-interruptibility-state would indicate that "blocking by STI" was in
effect. This resulted in an unnecessary round trip into the guest before
the pending interrupt could be injected.

Reviewed by:	grehan
2014-09-12 06:15:20 +00:00
neel
45bb1086d6 style(9): indent the switch, don't indent the case, indent case body one tab. 2014-09-11 06:17:56 +00:00
neel
e108397c4a Repurpose the V_IRQ interrupt injection to implement VMX-style interrupt
window exiting. This simply involves setting V_IRQ and enabling the VINTR
intercept. This instructs the CPU to trap back into the hypervisor as soon
as an interrupt can be injected into the guest. The pending interrupt is
then injected via the traditional event injection mechanism.

Rework vcpu interrupt injection so that Linux guests now idle with host cpu
utilization close to 0%.

Reviewed by:	Anish Gupta (earlier version)
Discussed with:	grehan
2014-09-11 02:37:02 +00:00
neel
907c0aec17 Allow intercepts and irq fields to be cached by the VMCB.
Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete
VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated
when intercepts are modified.

Each intercept is identified as a (index,bitmask) tuple. For e.g., the
VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR).
The first 20 bytes in control area that are used to enable intercepts
are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'.

Modify svm_setcap() and svm_getcap() to use the new APIs.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 03:13:40 +00:00
neel
3c61bfaac6 Move the VMCB initialization into svm.c in preparation for changes to the
interrupt injection logic.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 02:35:19 +00:00
neel
bbc4ff544f Move the event injection function into svm.c and add KTR logging for
every event injection.

This in in preparation for changes to SVM guest interrupt injection.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 02:20:32 +00:00
neel
5b8dd1ad82 Remove a bogus check that flagged an error if the guest %rip was zero.
An AP begins execution with %rip set to 0 after a startup IPI.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:46:22 +00:00
neel
77b2918286 Make the KTR tracepoints uniform and ensure that every VM-exit is logged.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:37:32 +00:00
neel
d07afdc371 Allow guest read access to MSR_EFER without hypervisor intervention.
Dirty the VMCB_CACHE_CR state cache when MSR_EFER is modified.
2014-09-10 01:10:53 +00:00
neel
d6f50ad39f Remove gratuitous forward declarations.
Remove tabs on empty lines.
2014-09-09 23:39:43 +00:00
neel
5824b2ab21 Do proper ASID management for guest vcpus.
Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. The meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.

With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.

Discussed with:	grehan
Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-06 19:02:52 +00:00
neel
8decf07cf3 Merge svm_set_vmcb() and svm_init_vmcb() into a single function that is called
just once when a vcpu is initialized.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-05 03:33:16 +00:00
neel
9c9ff7e250 Remove unused header file.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:07:32 +00:00
neel
4038b1d8f9 Consolidate the code to restore the host TSS after a #VMEXIT into a single
function restore_host_tss().

Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:00:18 +00:00
neel
e5d2f8730c IFC @r269962
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-02 04:22:42 +00:00
neel
15d3c1ec17 The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.

While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.

Update the status bits in %rflags when emulating AND and OR opcodes.

Reviewed by:	grehan
2014-08-30 19:59:42 +00:00
grehan
4900dd0aa5 Implement the 0x2B SUB instruction, and the OR variant of 0x81.
Found with local APIC accesses from bitrig/amd64 bsd.rd, 07/15-snap.

Reviewed by:	neel
MFC after:	3 days
2014-08-27 00:53:56 +00:00
neel
37c34582e9 An exception is allowed to be injected even if the vcpu is in an interrupt
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.

Change return type of 'vmcb_eventinject()' to a void and convert all error
returns into KASSERTs.

Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.

Reviewed by:    Anish Gupta (akgupt3@gmail.com)
2014-08-25 00:58:20 +00:00
neel
ca77633f85 Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.

Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID
leaf 0xb. This is intended for testing guest behavior when it falls back
on using CPUID leaf 0x4 to deduce CPU topology.

The default behavior is to advertise each vcpu as a core in a separate soket.
2014-08-24 01:10:06 +00:00
neel
33474e62d3 Fix a bug in the emulation of CPUID leaf 0x4 where bhyve was claiming that
the vcpu had no caches at all. This causes problems when executing applications
in the guest compiled with the Intel compiler.

Submitted by:	Mark Hill (mark.hill@tidalscale.com)
2014-08-23 22:44:31 +00:00
neel
ea3ac18997 Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot
find any unmasked pin with an interrupt asserted.

Reviewed by:	tychon
CR:		https://reviews.freebsd.org/D669
MFC after:	1 week
2014-08-23 21:16:26 +00:00
neel
69905f3734 Reword comment to match the interrupt mode names from the MPtable spec.
Reviewed by:	tychon
2014-08-14 18:03:38 +00:00
neel
53369652fd Use the max guest memory address when creating its iommu domain.
Also, assert that the GPA being mapped in the domain is less than its maxaddr.

Reviewed by:	grehan
Pointed out by:	Anish Gupta (akgupt3@gmail.com)
2014-08-14 05:00:45 +00:00
neel
6e79e6c83b Support PCI extended config space in bhyve.
Add the ACPI MCFG table to advertise the extended config memory window.

Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.

Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.

CR:		https://phabric.freebsd.org/D505
Reviewed by:	jhb, grehan
Requested by:	tychon
2014-08-08 03:49:01 +00:00
jhb
05d21354c3 - Output a summary of optional VT-x features in dmesg similar to CPU
features.  If bootverbose is enabled, a detailed list is provided;
  otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
  under a new hw.vmm.vmx.cap node. Move a few exiting sysctls that
  indicate the presence of optional capabilities under this node.

CR:		https://phabric.freebsd.org/D498
Reviewed by:	grehan, neel
MFC after:	1 week
2014-07-30 00:00:12 +00:00
neel
20e3e8762f If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps
forever in vm_handle_hlt().

This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.

Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.

Reported by:	Leon Dang
Sponsored by:	Nahanni Systems
2014-07-26 02:53:51 +00:00
neel
62d591cec9 Don't return -1 from the push emulation handler. Negative return values are
interpreted specially on return from sys_ioctl() and may cause undesirable
side-effects like restarting the system call.
2014-07-26 02:51:46 +00:00
neel
563faffd70 Fix a couple of issues in the PUSH emulation:
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.

vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
2014-07-24 23:01:53 +00:00
neel
4535fa67c4 Fix fault injection in bhyve.
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.

A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
2014-07-24 01:38:11 +00:00
neel
e972917c13 Emulate instructions emitted by OpenBSD/i386 version 5.5:
- CMP REG, r/m
- MOV AX/EAX/RAX, moffset
- MOV moffset, AX/EAX/RAX
- PUSH r/m
2014-07-23 04:28:51 +00:00
neel
7b1204cc02 Fix build without INVARIANTS defined by getting rid of unused variable 'exc'.
Reported by:	adrian, stefanf
2014-07-20 16:34:35 +00:00