Commit Graph

129 Commits

Author SHA1 Message Date
grehan
f115890f37 Roll back botched partial MFC :( 2014-02-04 05:03:14 +00:00
grehan
0ee348bd50 MFC @ r259205 in preparation for some SVM updates. 2014-02-04 02:41:54 +00:00
grehan
afd2340c15 Enable memory overcommit for AMD processors.
- No emulation of A/D bits is required since AMD-V RVI
supports A/D bits.
 - Enable pmap PT_RVI support(w/o PAT) which is required for
memory over-commit support.
 - Other minor fixes:
 * Make use of VMCB EXITINTINFO field. If a #VMEXIT happens while
delivering an interrupt, EXITINTINFO has all the details that bhyve
needs to inject the same interrupt.
 * SVM h/w decode assist code was incomplete - removed for now.
 * Some minor code clean-up (more coming).

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-12-18 23:39:42 +00:00
grehan
aacc216bbf MFC @ r256071
This is the change where the bhyve_npt_pmap branch was
merged in to head.

The SVM changes to work with this will be in a follow-on
submit.
2013-12-18 22:31:53 +00:00
neel
3d4a180923 Fix x2apic support in bhyve.
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.

Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.

Reviewed by:	grehan@
2013-12-10 22:56:51 +00:00
neel
e7ebb9541a Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit.
This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is
now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'.

Discussed with:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 23:11:12 +00:00
neel
127d791c3e If a vcpu disables its local apic and then executes a 'HLT' then spin down the
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.

With this change "halt" from a Linux guest works correctly.

Reviewed by:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 22:18:36 +00:00
neel
d3a0e92020 The 'protection' field in the VM exit collateral for the PAGING exit is not
used - get rid of it.
2013-12-03 01:21:21 +00:00
neel
a581c446d6 Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.

Reviewed by:	grehan@
2013-12-03 00:43:31 +00:00
eadler
44c01df173 Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this
shifts into the sign bit.  Instead use (1U << 31) which gets the
expected result.

This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.

A similar change was made in OpenBSD.

Discussed with:	-arch, rdivacky
Reviewed by:	cperciva
2013-11-30 22:17:27 +00:00
neel
633860276f Add support for level triggered interrupt pins on the vioapic. Prior to this
commit level triggered interrupts would work as long as the pin was not shared
among multiple interrupt sources.

The vlapic now keeps track of level triggered interrupts in the trigger mode
register and will forward the EOI for a level triggered interrupt to the
vioapic. The vioapic in turn uses the EOI to sample the level on the pin and
re-inject the vector if the pin is still asserted.

The vhpet is the first consumer of level triggered interrupts and advertises
that it can generate interrupts on pins 20 through 23 of the vioapic.

Discussed with:	grehan@
2013-11-27 22:18:08 +00:00
neel
89dbc92028 Add HPET device emulation to bhyve.
bhyve supports a single timer block with 8 timers. The timers are all 32-bit
and capable of being operated in periodic mode. All timers support interrupt
delivery using MSI. Timers 0 and 1 also support legacy interrupt routing.

At the moment the timers are not connected to any ioapic pins but that will
be addressed in a subsequent commit.

This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
2013-11-25 19:04:51 +00:00
neel
3b87354d1e Add an ioctl to assert and deassert an ioapic pin atomically. This will be used
to inject edge triggered legacy interrupts into the guest.

Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-23 03:56:03 +00:00
neel
023e2d37dd Eliminate redundant information about the host cpu in bhyve's KTR trace points.
This is always tracked by ktr(4) and can be displayed using the "-c" option
of ktrdump(8).

Discussed with:	grehan
2013-11-22 18:57:22 +00:00
neel
384d86e888 Move the ioapic device model from userspace into vmm.ko. This is needed for
upcoming in-kernel device emulations like the HPET.

The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.

Discussed with:	grehan@
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-12 22:51:03 +00:00
neel
94bdd999bb Remove the 'vdev' abstraction that was meant to sit on top of device models
in the kernel. This abstraction was redundant because the only device emulated
inside vmm.ko is the local apic and it is always at a fixed guest physical
address.

Discussed with:	grehan
2013-11-04 23:25:07 +00:00
neel
74368cc42d Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these
tracepoints are vcpu-specific.

Add support for tracepoints that are global to the virtual machine - these
tracepoints are called VM_CTRx().
2013-10-31 05:20:11 +00:00
grehan
12f698f0c7 MFC @ r256071
This is just prior to the bhyve_npt_pmap import so will allow
just the change to be merged for easier debug.
2013-10-30 00:05:02 +00:00
neel
7ec78b8068 Remove unnecessary includes of <machine/pmap.h>
Requested by:	alc@
2013-10-29 02:25:18 +00:00
neel
e477631325 The ASID allocation in SVM is incorrect because it allocates a single ASID for
all vcpus belonging to a guest. This means that when different vcpus belonging
to the same guest are executing on the same host cpu there may be "leakage"
in the mappings created by one vcpu to another.

The proper fix for this is being worked on and will be committed shortly.

In the meantime workaround this bug by flushing the guest TLB entries on every
VM entry.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-10-21 23:46:37 +00:00
neel
75369cb181 Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose
'invpcid' instruction to the guest. Currently bhyve will try to enable this
capability unconditionally if it is available.

Consolidate code in bhyve to set the capabilities so it is no longer
duplicated in BSP and AP bringup.

Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid'
instruction is available.

Reviewed by:	grehan
MFC after:	3 days
2013-10-16 18:20:27 +00:00
grehan
8afdf0fcc3 Fix SVM handling of ASTPENDING, which manifested as a
hang on console output (due to a missing interrupt).

SVM does exit processing and then handles ASTPENDING which
overwrites the already handled SVM exit cause and corrupts
virtual machine state. For example, if the SVM exit was due to
an I/O port access but the main loop detected an ASTPENDING,
the exit would be processed as ASTPENDING and leave the
device (e.g. emulated UART) for that I/O port in bad state.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2013-10-16 05:43:03 +00:00
neel
2d5c7e881c Fix the witness warning that warned against calling uiomove() while holding
the 'vmmdev_mtx' in vmmdev_rw().

Rely on the 'si_threadcount' accounting to ensure that we never destroy the
VM device node while it has operations in progress (e.g. ioctl, mmap etc).

Reported by:	grehan
Reviewed by:	grehan
2013-10-16 00:58:47 +00:00
neel
aed205d5cd Merge projects/bhyve_npt_pmap into head.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.

Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.

pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.

The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.

Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.

An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
               Bit Position           Interpreted By
PG_V               52                 software (accessed bit emulation handler)
PG_RW              53                 software (dirty bit emulation handler)
PG_A               0                  hardware (aka EPT_PG_RD)
PG_M               1                  hardware (aka EPT_PG_WR)

The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).

The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.

TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.

Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.

PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.

Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.

Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.

Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.

Special thanks to Peter Holm for testing the patch on short notice.

Approved by:	re
Discussed with:	grehan
Reviewed by:	alc, kib
Tested by:	pho
2013-10-05 21:22:35 +00:00
grehan
b2189384ce Return 0 for a rdmsr of MSR_IA32_PLATFORM_ID. This
is enough to get Ubuntu 12.0.4/13.0.4 to boot.

Approved by:	re@ (blanket)
2013-09-27 14:55:59 +00:00
grehan
d919ede3d3 IFC @ r255692
Comment out IA32_MISC_ENABLE MSR access - this doesn't exist on AMD.
Need to sort out how arch-specific MSRs will be handled.
2013-09-20 00:46:29 +00:00
grehan
a83c14de5c Hide TSC-deadline APIC timer support from guests. This mode
isn't yet implemented in bhyve's APIC emulation.

Reviewed by:	neel
Approved by:	re@ (blanket)
2013-09-17 17:56:53 +00:00
neel
16db26f84b Fix a bug in decoding an instruction that has an SIB byte as well as an
immediate operand. The presence of an SIB byte in decoding the ModR/M field
would cause 'imm_bytes' to not be set to the correct value.

Fix this by initializing 'imm_bytes' independent of the ModR/M decoding.

Reported by: grehan@
Approved by: re@
2013-09-17 16:06:07 +00:00
neel
348fe8d4ce Fix a limitation in bhyve that would limit the number of virtual machines to
the maximum number of VT-d domains (256 on a Sandybridge). We now allocate a
VT-d domain for a guest only if the administrator has explicitly configured
one or more PCI passthru device(s).

If there are no PCI passthru devices configured (the common case) then the
number of virtual machines is no longer limited by the maximum number of
VT-d domains.

Reviewed by: grehan@
Approved by: re@
2013-09-11 07:11:14 +00:00
neel
1306e6ef77 Allocate VPIDs by using the unit number allocator to keep do the bookkeeping.
Also deal with VPID exhaustion by allocating out of a reserved range as the
last resort.
2013-09-07 05:30:34 +00:00
grehan
0fa52a8a31 Mask off the vector from the MSI-x data word.
Some o/s's set the trigger-mode level bit which
results in an invalid vector and pass-thru interrupts
not being delivered.
2013-09-07 03:33:36 +00:00
grehan
5110b054b2 Emulate reading of the IA32_MISC_ENABLE MSR, by returning
the host MSR and masking off features that aren't supported.
Linux reads this MSR to detect if NX has been disabled via
BIOS.
2013-09-06 05:20:11 +00:00
grehan
37b6e3223f Allow CPUID leaf 0xD to be read as zeroes.
Linux reads this even though extended features
aren't exposed.

Support for 0xD will be expanded once AVX[2]
is exposed to the guest in upcoming work.
2013-09-06 05:16:10 +00:00
neel
99ab2bf08e Add support for emulating the byte move instruction "mov r/m8, r8".
This emulation is required when dumping MMIO space via the ddb "examine"
command.
2013-08-27 16:49:20 +00:00
grehan
74160a9a77 Add in last remaining files to get AMD-SVM operational.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-08-23 00:37:26 +00:00
grehan
f9c9379bb7 HLT_IGNORED stat is used by both vmx and svm - move to common stats.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-08-22 22:29:27 +00:00
grehan
c954556ddf Handle VM_PROT_NONE in nested page table code.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-08-22 22:26:46 +00:00
neel
44099f4092 Do not create superpage mappings in the iommu.
This is a workaround to hide the fact that we do not have any code to
demote a superpage mapping before we unmap a single page that is part
of the superpage.
2013-08-20 06:46:40 +00:00
neel
15659a9ddf Extract the location of the remapping hardware units from the ACPI DMAR table.
Submitted by:	Gopakumar T (gopakumar_thekkedath@yahoo.co.in)
2013-08-20 06:20:05 +00:00
grehan
baae5efeab IFC @ r254014 2013-08-07 00:09:49 +00:00
grehan
be25b04bb5 Follow-up commit to fix CR0 issues. Maintain
architectural state on CR vmexits by guaranteeing
that EFER, CR0 and the VMCS entry controls are
all in sync when transitioning to IA-32e mode.

Submitted by:	Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
2013-08-03 03:16:42 +00:00
grehan
41f621778a Moved clearing of vmm_initialized to avoid the case
of unloading the module while VMs existed. This would
result in EBUSY, but would prevent further operations
on VMs resulting in the module being impossible to
unload.

Submitted by:   Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
Reviewed by:	grehan, neel
2013-08-01 05:59:28 +00:00
grehan
045ef38328 Correctly maintain the CR0/CR4 shadow registers.
This was exposed with AP spinup of Linux, and
booting OpenBSD, where the CR0 register is unconditionally
written to prior to the longjump to enter protected
mode. The CR-vmexit handling was not updating CPU state which
resulted in a vmentry failure with invalid guest state.

A follow-on submit will fix the CPU state issue, but this
fix prevents the CR-vmexit prior to entering protected
mode by properly initializing and maintaining CR* state.

Reviewed by:	neel
Reported by:	Gopakumar.T @ netapp
2013-08-01 01:18:51 +00:00
neel
8a3e57621d Add support for emulation of the "or r/m, imm8" instruction.
Submitted by:	Zhixiang Yu (zxyu.core@gmail.com)
Obtained from:	GSoC 2013 (AHCI device emulation for bhyve)
2013-07-23 23:43:00 +00:00
neel
468b664f74 Verify that all bytes in the instruction buffer are consumed during decoding.
Suggested by:	grehan
2013-07-03 23:05:17 +00:00
grehan
cba9c47f36 Ignore guest PAT settings by default in EPT mappings.
From experimentation, other hypervisors also do this.

Diagnosed by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
2013-07-01 20:05:43 +00:00
grehan
abdbf599d1 Make sure all CPUID values are handled, instead of exiting the
bhyve process when an unhandled one is encountered.

Hide some additional capabilities from the guest (e.g. debug store).

This fixes the issue with FreeBSD 9.1 MP guests exiting the VM on
AP spinup (where CPUID is used when sync'ing the TSCs) and the
issue with the Java build where CPUIDs are issued from a guest
userspace.

Submitted by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
Reported by:	many
2013-06-28 06:05:33 +00:00
pluknet
6509325017 Fix a gcc warning uncovered after r251745.
Reported by:	Sergey V. Dyatko
Reviewed by:	neel
2013-06-18 23:31:09 +00:00
pluknet
672cd692ff Replace cpusetffs_obj with CPU_FFS, missed in r251703.
Reported by:	bdrewery, O. Hartmann
2013-06-14 10:26:38 +00:00
neel
c5e619b651 Support array-type of stats in bhyve.
An array-type stat in vmm.ko is defined as follows:
VMM_STAT_ARRAY(IPIS_SENT, VM_MAXCPU, "ipis sent to vcpu");

It is incremented as follows:
vmm_stat_array_incr(vm, vcpuid, IPIS_SENT, array_index, 1);

And output of 'bhyvectl --get-stats' looks like:
ipis sent to vcpu[0]     3114
ipis sent to vcpu[1]     0

Reviewed by:	grehan
Obtained from:	NetApp
2013-05-10 02:59:49 +00:00