6858 Commits

Author SHA1 Message Date
emaste
2ab45c505b Disable amd64 TLB Context ID (pcid) by default for now
There are a number of reports of userspace application crashes that
are "solved" by setting vm.pmap.pcid_enabled=0, including Java and the
x11/mate-terminal port (PR ports/184362).

I originally planned to disable this only in stable/10 (in r262753), but
it has been pointed out that additional crash reports on HEAD are not
likely to provide new insight into the problem.  The feature can easily
be enabled for testing.
2014-03-05 01:34:10 +00:00
jkim
9b4d3b43ca Move fpusave() wrapper for suspend hander to sys/amd64/amd64/fpu.c.
Inspired by:	jhb
2014-03-04 21:35:57 +00:00
jkim
e6f1aee9e4 Revert accidentally committed changes in 262748. 2014-03-04 20:16:00 +00:00
jkim
74dcbf5843 Properly save and restore CR0.
MFC after:	3 days
2014-03-04 20:07:36 +00:00
jkim
373cea9476 Remove dead code since r230426, fix a comment, and tidy up.
Reported by:	jhb
MFC after:	3 days
2014-03-04 19:41:16 +00:00
neel
4e6374765e Fix a race between VMRUN() and vcpu_notify_event() due to 'vcpu->hostcpu'
being updated outside of the vcpu_lock(). The race is benign and could
potentially result in a missed notification about a pending interrupt to
a vcpu. The interrupt would not be lost but rather delayed until the next
VM exit.

The vcpu's hostcpu is now updated concurrently with the vcpu state change.
When the vcpu transitions to the RUNNING state the hostcpu is set to 'curcpu'.
It is set to 'NOCPU' in all other cases.

Reviewed by:	grehan
2014-03-01 03:17:58 +00:00
jhb
4905c0f870 Correct VMware capitalization.
Submitted by:	joeld
2014-02-28 21:33:40 +00:00
jhb
4ce54b93eb Workaround an apparent bug in VMWare Fusion's nested VT support where it
triggers a VM exit with the exit reason of an external interrupt but
without a valid interrupt set in the exit interrupt information.

Tested by:	Michael Dexter
Reviewed by:	neel
MFC after:	1 week
2014-02-28 19:07:55 +00:00
neel
e01c440dae Queue pending exceptions in the 'struct vcpu' instead of directly updating the
processor-specific VMCS or VMCB. The pending exception will be delivered right
before entering the guest.

The order of event injection into the guest is:
- hardware exception
- NMI
- maskable interrupt

In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt
window-exiting and inject it as soon as possible after the hardware exception
is injected. Also since interrupts are inherently asynchronous, injecting
them after the hardware exception should not affect correctness from the
guest perspective.

Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict
it to only deliver x86 hardware exceptions. This new ioctl is now used to
inject a protection fault when the guest accesses an unimplemented MSR.

Discussed with:	grehan, jhb
Reviewed by:	jhb
2014-02-26 00:52:05 +00:00
alc
ebe945ff9f When the kernel is running in a virtual machine, it cannot rely upon the
processor family to determine if the workaround for AMD Family 10h Erratum
383 should be enabled.  To enable virtual machine migration among a
heterogeneous collection of physical machines, the hypervisor may have
been configured to report an older processor family with a reduced feature
set.  Effectively, the reported processor family and its features are like
a "least common denominator" for the collection of machines.

Therefore, when the kernel is running in a virtual machine, instead of
relying upon the processor family, we now test for features that prove
that the underlying processor is not affected by the erratum.  (The
features that we test for are unlikely to ever be emulated in software
on an affected physical processor.)

PR:		186061
Tested by:	Simon Matter
Discussed with:	jhb, neel
MFC after:	2 weeks
2014-02-22 18:53:42 +00:00
neel
3e0732cf3e Add support for x2APIC virtualization assist in Intel VT-x.
The vlapic.ops handler 'enable_x2apic_mode' is called when the vlapic mode
is switched to x2APIC. The VT-x implementation of this handler turns off the
APIC-access virtualization and enables the x2APIC virtualization in the VMCS.

The x2APIC virtualization is done by allowing guest read access to a subset
of MSRs in the x2APIC range. In non-root operation the processor will satisfy
an 'rdmsr' access to these MSRs by reading from the virtual APIC page instead.

The guest is also given write access to TPR, EOI and SELF_IPI MSRs which
get special treatment in non-root operation. This is documented in the
Intel SDM section titled "Virtualizing MSR-Based APIC Accesses".

Enforce that APIC-write and APIC-access VM-exits are handled only if
APIC-access virtualization is enabled. The one exception to this is
SELF_IPI virtualization which may result in an APIC-write VM-exit.
2014-02-21 06:03:54 +00:00
neel
4626d164b8 Simplify APIC mode switching from MMIO to x2APIC. In part this is done to
simplify the implementation of the x2APIC virtualization assist in VT-x.

Prior to this change the vlapic allowed the guest to change its mode from
xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked
when the virtual machine is created. This is not very constraining because
operating systems already have to deal with BIOS setting up the APIC in
x2APIC mode at boot.

Fix a bug in the CPUID emulation where the x2APIC capability was leaking
from the host to the guest.

Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore
MSR accesses to the vlapic when it is in xAPIC mode.

The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8)
can be used to change the mode to x2APIC instead.

Discussed with:	grehan@
2014-02-20 01:48:25 +00:00
jhb
521737384d A first pass at adding support for injecting hardware exceptions for
emulated instructions.
- Add helper routines to inject interrupt information for a hardware
  exception from the VM exit callback routines.
- Use the new routines to inject GP and UD exceptions for invalid
  operations when emulating the xsetbv instruction.
- Don't directly manipulate the entry interrupt info when a user event
  is injected.  Instead, store the event info in the vmx state and
  only apply it during a VM entry if a hardware exception or NMI is
  not already pending.
- While here, use HANDLED/UNHANDLED instead of 1/0 in a couple of
  routines.

Reviewed by:	neel
2014-02-18 03:07:36 +00:00
neel
f9781635be Handle writes to the SELF_IPI MSR by the guest when the vlapic is configured
in x2apic mode. Reads to this MSR are currently ignored but should cause a
general proctection exception to be injected into the vcpu.

All accesses to the corresponding offset in xAPIC mode are ignored.

Also, do not panic the host if there is mismatch between the trigger mode
programmed in the TMR and the actual interrupt being delivered. Instead the
anomaly is logged to aid debugging and to prevent a misbehaving guest from
panicking the host.
2014-02-17 23:07:16 +00:00
neel
19c649a6fb Use spinlocks to lock accesses to the vioapic.
This is necessary because if the vlapic is configured in x2apic mode the
vioapic_process_eoi() function is called inside the critical section
established by vm_run().
2014-02-17 22:57:51 +00:00
dim
a8b6bed223 Upgrade our copy of llvm/clang to 3.4 release. This version supports
all of the features in the current working draft of the upcoming C++
standard, provisionally named C++1y.

The code generator's performance is greatly increased, and the loop
auto-vectorizer is now enabled at -Os and -O2 in addition to -O3.  The
PowerPC backend has made several major improvements to code generation
quality and compile time, and the X86, SPARC, ARM32, Aarch64 and SystemZ
backends have all seen major feature work.

Release notes for llvm and clang can be found here:
<http://llvm.org/releases/3.4/docs/ReleaseNotes.html>
<http://llvm.org/releases/3.4/tools/clang/docs/ReleaseNotes.html>

MFC after:	1 month
2014-02-16 19:44:07 +00:00
brueffer
4c9c4234e2 Retire the nve(4) driver; nfe(4) has been the default driver for NVIDIA
nForce MCP adapters for a long time.

Yays:	jhb, remko, yongari
Nays:	none on the current and stable lists
2014-02-16 12:22:43 +00:00
avg
84460fd88d provide fast versions of ffsl and flsl for i386; ffsll and flsll for amd64
Reviewed by:	jhb
MFC after:	10 days
X-MFC note:	consider thirdparty modules depending on these symbols
Sponsored by:	HybridCluster
2014-02-14 15:18:37 +00:00
jhb
6e6e271c34 Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
  PCI domain/segment.  Host-PCI bridge drivers use this API to allocate
  bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
  their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
  full range of bus numbers from secbus to subbus from their parent bridge.
  The drivers also always program their primary bus register.  The bridge
  drivers also support growing their bus range by extending the bus resource
  and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
  used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.

Reviewed by:	imp
MFC after:	1 month
2014-02-12 04:30:37 +00:00
jhb
22ac1edc7b Don't waste a page of KVA for the boot-time memory test on x86. For amd64,
reuse the first page of the crashdumpmap as CMAP1/CADDR1.  For i386,
remove CMAP1/CADDR1 entirely and reuse CMAP3/CADDR3 for the memory test.

Reviewed by:	alc, peter
MFC after:	2 weeks
2014-02-11 22:02:40 +00:00
jhb
0cdfc370bf Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and
XSAVE-enabled features like AVX.
- Store a per-cpu guest xcr0 register.  When switching to the guest FPU
  state, switch to the guest xcr0 value.  Note that the guest FPU state is
  saved and restored using the host's xcr0 value and xcr0 is saved/restored
  "inside" of saving/restoring the guest FPU state.
- Handle VM exits for the xsetbv instruction by updating the guest xcr0.
- Expose the XSAVE feature to the guest only if the host has enabled XSAVE,
  and only advertise XSAVE features enabled by the host to the guest.
  This ensures that the guest will only adjust FPU state that is a subset
  of the guest FPU state saved and restored by the host.

Reviewed by:	grehan
2014-02-08 16:37:54 +00:00
neel
5b2777fc37 Add a counter to differentiate between VM-exits due to nested paging faults
and instruction emulation faults.
2014-02-08 06:22:09 +00:00
neel
d79f5b61e6 Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI).
If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the
CPU when the VM-exit is completed. No more NMIs will be recognized until
the execution of an "iret".

Prior to this change the NMI handler was dispatched via a software interrupt
with interrupts enabled. This meant that an interrupt could be recognized
by the processor before the NMI handler completed its execution. The "iret"
issued by the interrupt handler would then cause the "blocking by NMI" to
be cleared prematurely.

This is now fixed by handling the NMI with interrupts disabled in addition
to "blocking by NMI" already established by the VM-exit.
2014-02-08 05:04:34 +00:00
jhb
f4e46bef98 Add support for FreeBSD/i386 guests under bhyve.
- Similar to the hack for bootinfo32.c in userboot, define
  _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot.
  This allows userboot to load 32-bit kernels and modules.
- Copy the SMAP generation code out of bootinfo64.c and into its own
  file so it can be shared with bootinfo32.c to pass an SMAP to the i386
  kernel.
- Use uint32_t instead of u_long when aligning module metadata in
  bootinfo32.c in userboot, as otherwise the metadata used 64-bit
  alignment which corrupted the layout.
- Populate the basemem and extmem members of the bootinfo struct passed
  to 32-bit kernels.
- Fix the 32-bit stack in userboot to start at the top of the stack
  instead of the bottom so that there is room to grow before the
  kernel switches to its own stack.
- Push a fake return address onto the 32-bit stack in addition to the
  arguments normally passed to exec() in the loader.  This return
  address is needed to convince recover_bootinfo() in the 32-bit
  locore code that it is being invoked from a "new" boot block.
- Add a routine to libvmmapi to setup a 32-bit flat mode register state
  including a GDT and TSS that is able to start the i386 kernel and
  update bhyveload to use it when booting an i386 kernel.
- Use the guest register state to determine the CPU's current instruction
  mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long
  mode) in the instruction emulation code.  Update the gla2gpa() routine
  used when fetching instructions to handle flat mode, 32-bit paging, and
  PAE paging in addition to long mode paging.  Don't look for a REX
  prefix when the CPU is in 32-bit mode, and use the detected mode to
  enable the existing 32-bit mode code when decoding the mod r/m byte.

Reviewed by:	grehan, neel
MFC after:	1 month
2014-02-05 04:39:03 +00:00
tychon
c9758b2e1c Add support for emulating the byte move and zero extend instructions:
"mov r/m8, r32" and "mov r/m8, r64".

Approved by:	neel (co-mentor)
2014-02-05 02:01:08 +00:00
neel
8f40ca632d Avoid doing unnecessary nested TLB invalidations.
Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu
and per-hostcpu. In the degenerate case where 'N' vcpus were sharing
a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations.
Since an 'invept' invalidates mappings for all VPIDs the first 'invept'
is sufficient.

Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'.

If it is known that an 'invept' is going to be done before entering the
guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED
counts the number of 'invvpid' invalidations that were avoided because
they were subsumed by an 'invept'.

Discussed with:	grehan
2014-02-04 02:45:08 +00:00
jhb
3f6ca218c6 Enhance the support for PCI legacy INTx interrupts and enable them in
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
  to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
  ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
  interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
  PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
  that share a slot attempt to distribute their INTx interrupts across
  the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
  and when the INTx DIS bit is set in a function's PCI command register.
  Either assert or deassert the associated I/O APIC pin when the
  state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.

Submitted by:	neel (7)
Reviewed by:	neel
MFC after:	2 weeks
2014-01-29 14:56:48 +00:00
jhb
0fa3283d5a Add support for 'clac' and 'stac' to DDB's disassembler on amd64. 2014-01-27 18:53:18 +00:00
neel
872f585846 Support level triggered interrupts with VT-x virtual interrupt delivery.
The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector.
If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger
an EOI-induced VM-exit when it is doing EOI virtualization.

The EOI-induced VM-exit results in the EOI being forwarded to the vioapic
so that level triggered interrupts can be properly handled.

Tested by:	Anish Gupta (akgupt3@gmail.com)
2014-01-25 20:58:05 +00:00
grehan
e0134ab4ff Change RWX to XWR in comments to match intent and bit patterns
in discussion of valid EPT pte protections.

Discussed with:	neel
MFC after:	3 days
2014-01-25 06:58:41 +00:00
jhb
b2533ec507 Move <machine/apicvar.h> to <x86/apicvar.h>. 2014-01-23 20:10:22 +00:00
neel
232e34343e Set "Interrupt Window Exiting" in the case where there is a vector to be
injected into the vcpu but the VM-entry interruption information field
already has the valid bit set.

Pointed out by:	David Reed (david.reed@tidalscale.com)
2014-01-23 06:06:50 +00:00
neel
e971318156 Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler
via a software interrupt.

This is safe to do because the logical processor is already cognizant of the
NMI and further NMIs are blocked until the host's NMI handler executes "iret".
2014-01-22 04:03:11 +00:00
neel
daa658a5dd There is no need to initialize the IOMMU if no passthru devices have been
configured for bhyve to use.

Suggested by:	grehan@
2014-01-21 03:01:34 +00:00
emaste
13c3304ef6 Add VT kernel configuration to ease testing of vt(9), aka Newcons 2014-01-19 18:46:38 +00:00
neel
e74998b34d Some processor's don't allow NMI injection if the STI_BLOCKING bit is set in
the Guest Interruptibility-state field. However, there isn't any way to
figure out which processors have this requirement.

So, inject a pending NMI only if NMI_BLOCKING, MOVSS_BLOCKING, STI_BLOCKING
are all clear. If any of these bits are set then enable "NMI window exiting"
and inject the NMI in the VM-exit handler.
2014-01-18 21:47:12 +00:00
bryanv
31dc7c36ca Add very simple virtio_random(4) driver to harvest entropy from host
Reviewed by:	markm (random bits only)
2014-01-18 06:14:38 +00:00
neel
4dec904890 If the guest exits due to a fault while it is executing IRET then restore
the state of "Virtual NMI blocking" in the guest's interruptibility-state
field before resuming the guest.
2014-01-18 02:20:10 +00:00
neel
77ef1cf997 If a VM-exit happens during an NMI injection then clear the "NMI Blocking" bit
in the Guest Interruptibility-state VMCS field.

If we fail to do this then a subsequent VM-entry will fail because it is an
error to inject an NMI into the guest while "NMI Blocking" is turned on. This
is described in "Checks on Guest Non-Register State" in the Intel SDM.

Submitted by:	David Reed (david.reed@tidalscale.com)
2014-01-17 04:21:39 +00:00
neel
0bd53a85fb Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous
can be initiated in the context of a vcpu thread or from the bhyve(8) control
process.

The first use of this functionality is to update the vlapic trigger-mode
register when the IOAPIC pin configuration is changed.

Prior to this change we would update the TMR in the virtual-APIC page at
the time of interrupt delivery. But this doesn't work with Posted Interrupts
because there is no way to program the EOI_exit_bitmap[] in the VMCS of
the target at the time of interrupt delivery.

Discussed with:	grehan@
2014-01-14 01:55:58 +00:00
gavin
6265cc52b1 Remove spaces from boot messages when we print the CPU ID/Family/Stepping
to match the rest of the CPU identification lines, and once again fit
into 80 columns in the usual case.
2014-01-11 22:41:10 +00:00
neel
ec09639132 Enable "Posted Interrupt Processing" if supported by the CPU. This lets us
inject interrupts into the guest without causing a VM-exit.

This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir"
to "0".

The following sysctls provide information about this feature:
- hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled)
- hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification)

Tested on a Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.
2014-01-11 04:22:00 +00:00
neel
41814122f1 Enable the "Acknowledge Interrupt on VM exit" VM-exit control.
This control is needed to enable "Posted Interrupts" and is present in all
the Intel VT-x implementations supported by bhyve so enable it as the default.

With this VM-exit control enabled the processor will acknowledge the APIC and
store the vector number in the "VM-Exit Interruption Information" field. We
now call the interrupt handler "by hand" through the IDT entry associated
with the vector.
2014-01-11 03:14:05 +00:00
neel
00a86f71de Don't expose 'vmm_ipinum' as a global. 2014-01-09 03:25:54 +00:00
neel
ab2de99290 Use the 'Virtual Interrupt Delivery' feature of Intel VT-x if supported by
hardware. It is possible to turn this feature off and fall back to software
emulation of the APIC by setting the tunable hw.vmm.vmx.use_apic_vid to 0.

We now start handling two new types of VM-exits:

APIC-access: This is a fault-like VM-exit and is triggered when the APIC
register access is not accelerated (e.g. apic timer CCR). In response to
this we do emulate the instruction that triggered the APIC-access exit.

APIC-write: This is a trap-like VM-exit which does not require any instruction
emulation but it does require the hypervisor to emulate the access to the
specified register (e.g. icrlo register).

Introduce 'vlapic_ops' which are function pointers to vector the various
vlapic operations into processor-dependent code. The 'Virtual Interrupt
Delivery' feature installs 'ops' for setting the IRR bits in the virtual
APIC page and to return whether any interrupts are pending for this vcpu.

Tested on an "Intel Xeon E5-2620 v2" courtesy of Allan Jude at ScaleEngine.
2014-01-07 21:04:49 +00:00
neel
23ea3a1c59 Fix a bug introduced in r260167 related to VM-exit tracing.
Keep a copy of the 'rip' and the 'exit_reason' and use that when calling
vmx_exit_trace(). This is because both the 'rip' and 'exit_reason' can
be changed by 'vmx_exit_process()' and can lead to very misleading traces.
2014-01-07 18:53:14 +00:00
neel
b47601c298 Allow vlapic_set_intr_ready() to return a value that indicates whether or not
the vcpu should be kicked to process a pending interrupt. This will be useful
in the implementation of the Posted Interrupt APICv feature.

Change the return value of 'vlapic_pending_intr()' to indicate whether or not
an interrupt is available to be delivered to the vcpu depending on the value
of the PPR.

Add KTR tracepoints to debug guest IPI delivery.
2014-01-07 00:38:22 +00:00
neel
35066c7a68 Split the VMCS setup between 'vmcs_init()' that does initialization and
'vmx_vminit()' that does customization.

This makes it easier to turn on optional features (e.g. APICv) without
having to keep adding new parameters to 'vmcs_set_defaults()'.

Reviewed by:	grehan@
2014-01-06 23:16:39 +00:00
schweikh
916cce9b6e Correct a grammo in a comment; remove white space at EOL. 2014-01-06 17:23:22 +00:00
neel
96b61bcd13 Use the same label name for ENTRY() and END() macros for 'vmx_enter_guest'.
Pointed out by:	rmh@
2014-01-03 19:29:33 +00:00