Get rid of unused 'svm_feature' from the softc.
Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check
in vmm.c against 'vm->active_cpus' before the AMD-specific code is called.
Submitted by: Anish Gupta (akgupt3@gmail.com)
processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector
and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes
care of injecting this vector when the guest is able to receive it.
Legacy PIC interrupts are still delivered via the event injection mechanism.
This is because the vector injected by the PIC must reflect the state of its
pins at the time the CPU is ready to accept the interrupt.
Accesses to the TPR via %CR8 are handled entirely in hardware. This requires
that the emulated TPR be synced to V_TPR after a #VMEXIT.
The guest can also modify the TPR via the memory-mapped APIC. This requires
that V_TPR be synced with the emulated TPR before a VMRUN.
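A minimal sketch of this flow, using a stand-in structure for the relevant VMCB
control fields; only the V_TPR/V_IRQ/V_INTR_VECTOR names come from the
hardware, the structure and helper names are illustrative:

    #include <stdint.h>

    /* Illustrative subset of the VMCB virtual-interrupt control fields. */
    struct vintr_ctrl {
        uint8_t v_tpr;          /* V_TPR: virtual task priority */
        uint8_t v_irq;          /* V_IRQ: interrupt pending for the guest */
        uint8_t v_intr_vector;  /* V_INTR_VECTOR: vector to deliver */
    };

    /* Before VMRUN: queue a vAPIC interrupt and sync the emulated TPR. */
    static void
    inject_virq(struct vintr_ctrl *ctrl, uint8_t vector, uint8_t tpr)
    {
        ctrl->v_intr_vector = vector;
        ctrl->v_irq = 1;    /* hardware injects it when the guest can take it */
        ctrl->v_tpr = tpr;  /* guest %CR8 accesses hit this field directly */
    }

    /* After #VMEXIT: pull the possibly guest-modified TPR back into the vAPIC. */
    static uint8_t
    sync_tpr_on_exit(const struct vintr_ctrl *ctrl)
    {
        return (ctrl->v_tpr);
    }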
Reviewed by: Anish Gupta (akgupt3@gmail.com)
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.
Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.
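For reference, a sketch of the capability checks implied here. The bit
positions are from CPUID Fn8000_000A %edx (NRIPS is bit 3, DecodeAssists is
bit 7); the variable and function names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* SVM feature bits reported in CPUID Fn8000_000A, %edx. */
    #define AMD_CPUID_SVM_NRIP_SAVE     (1 << 3)  /* next RIP saved on #VMEXIT */
    #define AMD_CPUID_SVM_DECODE_ASSIST (1 << 7)  /* decode assists in EXITINFO1 */

    static uint32_t svm_feature;  /* cached %edx of CPUID 0x8000000A */

    static bool
    nrip_valid(void)
    {
        /* Only defer the %rip update if the CPU saves the next RIP. */
        return ((svm_feature & AMD_CPUID_SVM_NRIP_SAVE) != 0);
    }

    static bool
    decode_assist_available(void)
    {
        /* Without this the effective segment in EXITINFO1 is not valid. */
        return ((svm_feature & AMD_CPUID_SVM_DECODE_ASSIST) != 0);
    }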
Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.
Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).
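A sketch of the bhyve(8) side; the exitcode/exitinfo1/exitinfo2 fields are the
ones described above, while the handler name and surrounding plumbing are
assumed:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed shape of the collateral passed up with VM_EXITCODE_SVM. */
    struct svm_exit_info {
        uint64_t exitcode;
        uint64_t exitinfo1;
        uint64_t exitinfo2;
    };

    /* Dump the unhandled SVM exit so the VMCB state can be inspected. */
    static int
    vmexit_svm_sketch(int vcpu, const struct svm_exit_info *info)
    {
        fprintf(stderr, "unhandled SVM exit on vcpu %d\n", vcpu);
        fprintf(stderr, "\texitcode\t%#lx\n", (unsigned long)info->exitcode);
        fprintf(stderr, "\texitinfo1\t%#lx\n", (unsigned long)info->exitinfo1);
        fprintf(stderr, "\texitinfo2\t%#lx\n", (unsigned long)info->exitinfo2);
        return (-1);  /* caller aborts the virtual machine */
    }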
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan
- Don't enable the HLT intercept by default. It will be enabled by bhyve(8)
if required. Prior to this change HLT exiting was always enabled making
the "-H" option to bhyve(8) meaningless.
- Recognize a VM exit triggered by a non-maskable interrupt. Prior to this
change the exit would be punted to userspace and the virtual machine would
terminate.
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.
vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.
The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.
The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.
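A minimal sketch of the reworked vie_init(), assuming 'struct vie' is shaped as
named above (an 'inst[]' array of up to 15 bytes plus 'num_valid'); the
emulation handler then calls vmm_fetch_instruction() only when 'num_valid' is
zero:

    #include <stdint.h>
    #include <string.h>

    #define VIE_INST_SIZE 15  /* maximum x86 instruction length */

    /* Assumed shape of the decode context named above. */
    struct vie {
        uint8_t inst[VIE_INST_SIZE];  /* pre-fetched instruction bytes */
        uint8_t num_valid;            /* how many of them are valid */
    };

    /*
     * Seed the context with the bytes the CPU captured in the VMCB so the
     * emulation path can skip the guest page table walk. Passing a zero
     * 'inst_len' (e.g. when hw.vmm.svm.disable_npf_assist is set) falls
     * back to the old fetch-from-guest-memory behavior.
     */
    static void
    vie_init_sketch(struct vie *vie, const char *inst_bytes, int inst_len)
    {
        memset(vie, 0, sizeof(struct vie));
        if (inst_bytes != NULL && inst_len > 0 && inst_len <= VIE_INST_SIZE) {
            memcpy(vie->inst, inst_bytes, inst_len);
            vie->num_valid = inst_len;
        }
    }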
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Discussed with: grehan
window exiting. This simply involves setting V_IRQ and enabling the VINTR
intercept. This instructs the CPU to trap back into the hypervisor as soon
as an interrupt can be injected into the guest. The pending interrupt is
then injected via the traditional event injection mechanism.
Rework vcpu interrupt injection so that Linux guests now idle with host cpu
utilization close to 0%.
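Roughly, enabling the interrupt window looks like the sketch below; the
structure and callback are stand-ins for the VMCB fields and the intercept API
named further down in this log:

    #include <stdint.h>

    /* Illustrative subset of 'struct vmcb_ctrl'; only V_IRQ/V_INTR_VECTOR shown. */
    struct vmcb_ctrl_vintr {
        uint8_t v_irq;
        uint8_t v_intr_vector;
    };

    /*
     * Request a #VMEXIT as soon as the guest can accept an interrupt: mark a
     * (dummy) virtual interrupt pending and trap on VINTR. When the VINTR
     * exit arrives, the real vector is injected via event injection and the
     * window is torn down again. 'enable_vintr_intercept' stands in for
     * svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_VINTR).
     */
    static void
    enable_intr_window_exiting(struct vmcb_ctrl_vintr *ctrl,
        void (*enable_vintr_intercept)(void))
    {
        ctrl->v_irq = 1;
        ctrl->v_intr_vector = 0;  /* placeholder; not what gets delivered */
        enable_vintr_intercept();
    }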
Reviewed by: Anish Gupta (earlier version)
Discussed with: grehan
Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete
VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated
when intercepts are modified.
Each intercept is identified by an (index, bitmask) tuple. For example, the
VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR).
The first 20 bytes in the control area that are used to enable intercepts
are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'.
Modify svm_setcap() and svm_getcap() to use the new APIs.
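A simplified sketch of the API; the 'intercept[5]' array is as described above
and the VMCB state-cache invalidation is reduced to a single 'dirty' flag for
illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* First 20 bytes of the VMCB control area, as described above. */
    struct vmcb_ctrl_sketch {
        uint32_t intercept[5];
        /* ... remainder of the control area ... */
    };

    /* svm_enable_intercept()/svm_disable_intercept() are thin wrappers. */
    static void
    svm_set_intercept_sketch(struct vmcb_ctrl_sketch *ctrl, uint32_t *dirty,
        int idx, uint32_t bitmask, bool enabled)
    {
        uint32_t oldval = ctrl->intercept[idx];

        if (enabled)
            ctrl->intercept[idx] |= bitmask;
        else
            ctrl->intercept[idx] &= ~bitmask;

        /* Invalidate the VMCB state cache whenever an intercept changes. */
        if (ctrl->intercept[idx] != oldval)
            *dirty |= (1u << 0);  /* the intercepts clean bit */
    }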
Discussed with: Anish Gupta (akgupt3@gmail.com)
Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. This meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.
With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.
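The allocation strategy can be sketched as a per-host-cpu (generation, counter)
pair; the actual bookkeeping in svm.c differs in detail, so the names and
layout below are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    struct asid_state {
        uint64_t gen;  /* epoch; bumped whenever the ASID space rolls over */
        uint32_t next; /* next ASID to hand out on this host cpu */
    };

    /* A vcpu reuses its cached ASID as long as its generation is current. */
    static bool
    vcpu_asid_stale(const struct asid_state *hostcpu, uint64_t vcpu_gen)
    {
        return (vcpu_gen == 0 || vcpu_gen != hostcpu->gen);
    }

    /* Allocate a fresh ASID; flush the TLB only when the space rolls over. */
    static uint32_t
    asid_alloc(struct asid_state *hostcpu, uint32_t max_asid, bool *flush_tlb)
    {
        *flush_tlb = false;
        if (++hostcpu->next >= max_asid) {
            hostcpu->next = 1;  /* ASID 0 is reserved for the host */
            hostcpu->gen++;
            *flush_tlb = true;  /* one full flush per rollover */
        }
        return (hostcpu->next);
    }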
Discussed with: grehan
Reviewed by: Anish Gupta (akgupt3@gmail.com)
function restore_host_tss().
Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.
Discussed with: Anish Gupta (akgupt3@gmail.com)
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.
Change the return type of 'vmcb_eventinject()' to void and convert all error
returns into KASSERTs.
Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.
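For reference, the corrected accessors written against the architectural
EXITINTINFO layout (vector in bits 7:0, type in bits 10:8, error code in bits
63:32); only the two macros named above needed fixing, the others are shown
for context:

    #define VMCB_EXITINTINFO_VECTOR(x)   ((x) & 0xff)
    #define VMCB_EXITINTINFO_TYPE(x)     (((x) >> 8) & 0x7)
    #define VMCB_EXITINTINFO_EC_VALID(x) (((x) & (1ULL << 11)) != 0)
    #define VMCB_EXITINTINFO_VALID(x)    (((x) & (1ULL << 31)) != 0)
    #define VMCB_EXITINTINFO_EC(x)       (((x) >> 32) & 0xffffffff)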
Reviewed by: Anish Gupta (akgupt3@gmail.com)
mapping size (currently unused). The flags include the fault access
bits, the wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that the pmap should not sleep.
For powerpc AIM, both 32 and 64 bit, fix the implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified; in particular, wait for the memory required to proceed.
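A sketch of a caller under the new KBI; the pmap_enter() prototype follows the
description above and the wrapper itself is illustrative:

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/pmap.h>
    #include <vm/vm_page.h>

    /*
     * Create a wired mapping without letting the pmap sleep. The access
     * bits travel in the low bits of 'flags'; with PMAP_ENTER_NOSLEEP the
     * pmap returns KERN_RESOURCE_SHORTAGE instead of blocking, and the
     * caller may retry.
     */
    static int
    wire_page_nosleep(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
    {
        return (pmap_enter(pmap, va, m, prot,
            prot | PMAP_ENTER_WIRED | PMAP_ENTER_NOSLEEP, 0));
    }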
In collaboration with: alc
Tested by: nwhitehorn (ppc aim32 and booke)
Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after: 2 weeks
Add the ACPI MCFG table to advertise the extended config memory window.
Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.
Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.
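Here "movzw" is the two-byte opcode 0F B7 (zero-extend a 16-bit source into a
32/64-bit register), so the core of the emulation is a 2-byte read followed by
zero extension; the read callback below is a hypothetical stand-in for the
MMIO read path:

    #include <stdint.h>

    typedef uint64_t (*mmio_read_t)(uint64_t gpa, int size);

    /* Zero-extend a 16-bit extended config space read (opcode 0F B7). */
    static uint64_t
    emulate_movzw_read(mmio_read_t mmio_read, uint64_t gpa)
    {
        return (mmio_read(gpa, 2) & 0xffff);
    }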
CR: https://phabric.freebsd.org/D505
Reviewed by: jhb, grehan
Requested by: tychon
The MD allocators had a lot in common, differing only in minor details. These
differences were all consolidated in the MI allocator, under ifdefs. The
defines from machine/vmparam.h turn on the features required for a particular
machine. For details, see the comment in sys/sf_buf.h.
As a result, no MD code is left in sys/*/*/vm_machdep.c. Some arches still
have machine/sf_buf.h, which is usually quite small.
Tested by: glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
We continue to use pmap_enter() for that. For unwiring virtual pages, we
now use pmap_unwire(), which unwires a range of virtual addresses instead
of a single virtual page.
Sponsored by: EMC / Isilon Storage Division
features. If bootverbose is enabled, a detailed list is provided;
otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
under a new hw.vmm.vmx.cap node. Move a few existing sysctls that
indicate the presence of optional capabilities under this node.
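The new node and its entries look roughly like the sketch below; the
particular capability names shown ('halt_exit', 'pause_exit') are
illustrative:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    static int cap_halt_exit, cap_pause_exit;

    SYSCTL_DECL(_hw_vmm_vmx);
    SYSCTL_NODE(_hw_vmm_vmx, OID_AUTO, cap, CTLFLAG_RW, NULL,
        "VT-x capabilities");
    SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, halt_exit, CTLFLAG_RD,
        &cap_halt_exit, 0, "HLT triggers a VM-exit");
    SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, pause_exit, CTLFLAG_RD,
        &cap_pause_exit, 0, "PAUSE triggers a VM-exit");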
CR: https://phabric.freebsd.org/D498
Reviewed by: grehan, neel
MFC after: 1 week
forever in vm_handle_hlt().
This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because its sleep cannot be
interrupted.
Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.
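The sleep loop now behaves roughly like the sketch below; locking is
simplified and the 'still_halted' callback stands in for the wakeup condition
checked by vm_handle_hlt():

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/proc.h>

    /* 'lock' is the vcpu spin mutex and must be held on entry. */
    static int
    halted_vcpu_sleep(void *chan, struct mtx *lock, int (*still_halted)(void *),
        void *arg)
    {
        while (still_halted(arg)) {
            /* Sleep at most one tick so a pending AST is noticed promptly. */
            msleep_spin(chan, lock, "vmidle", hz);
            if (curthread->td_flags & (TDF_ASTPENDING | TDF_NEEDRESCHED))
                return (1);  /* bail out of vm_handle_hlt() */
        }
        return (0);
    }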
Reported by: Leon Dang
Sponsored by: Nahanni Systems
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.
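The operand-size rule reduces to a helper like the one below, assuming the
decoder has already recorded whether a 0x66 prefix was seen and whether the
vcpu is in 64-bit mode:

    #include <stdbool.h>

    /*
     * PUSH operand size: 8 bytes by default in 64-bit mode, 2 bytes with the
     * 0x66 override; a 4-byte push is not encodable there. The else branch
     * assumes a 32-bit default code segment.
     */
    static int
    push_operand_size(bool in_64bit_mode, bool opsize_override)
    {
        if (in_64bit_mode)
            return (opsize_override ? 2 : 8);
        else
            return (opsize_override ? 2 : 4);
    }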
vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
involves updating the corresponding page tables followed by accesses to the
pages in question. This sequence is subject to the situation exactly described
in the "AMD64 Architecture Programmer's Manual Volume 2: System Programming"
rev. 3.23, "7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore,
issuing the INVLPG right after modifying the PTE bits is crucial (see also
r269050).
For the amd64 PMAP code, the order of instructions was already correct. The
above fact still is worth documenting, though.
1: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
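In its minimal form the constraint is simply "write the PTE, then INVLPG the
address" before any dependent access; the PTE type and helper names below are
illustrative:

    #include <stdint.h>

    typedef uint64_t pt_entry_t;

    static __inline void
    invlpg(uintptr_t va)
    {
        __asm __volatile("invlpg (%0)" : : "r" (va) : "memory");
    }

    /* Set bits in a PTE and invalidate the stale TLB entry immediately. */
    static void
    pte_set_bits(pt_entry_t *pte, pt_entry_t bits, uintptr_t va)
    {
        *pte |= bits;   /* e.g. emulate a dirty/accessed bit update */
        invlpg(va);     /* must directly follow the PTE modification */
    }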
Reviewed by: alc
Sponsored by: Bally Wulff Games & Entertainment GmbH
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.
A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
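The resulting shape of the shared injection path is sketched below; the two
backend functions are hypothetical stand-ins for whichever injection primitive
vmm.ko and bhyve each provide:

    #include <stdint.h>

    #ifdef _KERNEL
    void kernel_inject_exception(void *vm, int vcpuid, int vector,
        int errcode_valid, uint32_t errcode);
    #else
    void userspace_inject_exception(void *vmctx, int vcpuid, int vector,
        int errcode_valid, uint32_t errcode);
    #endif

    /* Shared emulation code only ever sees a 'void *'. */
    static void
    inject_fault_sketch(void *vm_or_ctx, int vcpuid, int vector,
        int errcode_valid, uint32_t errcode)
    {
    #ifdef _KERNEL
        kernel_inject_exception(vm_or_ctx, vcpuid, vector, errcode_valid,
            errcode);
    #else
        userspace_inject_exception(vm_or_ctx, vcpuid, vector, errcode_valid,
            errcode);
    #endif
        /* Caller also zeroes vmexit[vcpu].inst_length so %rip is not advanced. */
    }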
Setting PSE together with PAE or in long mode just makes the PSE bit
completely ignored, so don't set it.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.
vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.
vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
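The disposition logic can be sketched as follows. The exception-class test
follows the architectural rules (contributory exceptions are #DE, #TS, #NP,
#SS and #GP); the function names are illustrative rather than the exact vmm.c
code:

    #include <stdbool.h>
    #include <stdint.h>

    #define IDT_DF 8   /* double fault vector */
    #define IDT_PF 14  /* page fault vector */

    static bool
    exception_is_contributory(int vec)
    {
        return (vec == 0 || (vec >= 10 && vec <= 13)); /* #DE, #TS, #NP, #SS, #GP */
    }

    /*
     * Decide what to inject when 'second' is raised while delivering 'first':
     * shut down on a fault during double-fault delivery (triple fault),
     * collapse the architecturally defined pairs into #DF, otherwise deliver
     * the second exception serially.
     */
    static int
    nested_exception_disposition(int first, int second, bool *shutdown)
    {
        *shutdown = false;
        if (first == IDT_DF) {
            *shutdown = true;  /* triple fault */
            return (IDT_DF);
        }
        if ((exception_is_contributory(first) &&
            exception_is_contributory(second)) ||
            (first == IDT_PF &&
            (exception_is_contributory(second) || second == IDT_PF)))
            return (IDT_DF);
        return (second);
    }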
instruction emulation [1].
Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change the
emulation would end up modifying %spl, %bpl, %sil or %dil instead of
%ah, %bh, %ch or %dh.
Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.
Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].
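The high-byte fix boils down to the byte-register selection rule: without a
REX prefix, ModRM reg encodings 4-7 name %ah, %ch, %dh and %bh, i.e. bits 15:8
of rAX/rCX/rDX/rBX. A standalone sketch (the real code goes through the vie
register accessors rather than a raw array):

    #include <stdbool.h>
    #include <stdint.h>

    /* Write the byte destination of opcode 0x8A into a GPR file. */
    static void
    write_byte_reg(uint64_t gpr[16], int modrm_reg, bool rex_present, uint8_t val)
    {
        int idx = modrm_reg;
        int shift = 0;

        if (!rex_present && modrm_reg >= 4 && modrm_reg <= 7) {
            idx = modrm_reg - 4;  /* %ah/%ch/%dh/%bh live in rAX/rCX/rDX/rBX */
            shift = 8;            /* ...in bits 15:8 */
        }
        gpr[idx] = (gpr[idx] & ~((uint64_t)0xff << shift)) |
            ((uint64_t)val << shift);
    }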
Tested by: Leon Dang [1]
Reported by: Peter Grehan [2]
Sponsored by: Nahanni Systems
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.
This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.
Submitted by: Anton Rang <anton.rang@isilon.com> (original version)
Phabric: D95
ptrace_set_pc() use the correct return to userspace using iret.
The signal return and PT_CONTINUE (which in fact uses the signal return path)
already set the pcb flag. The setcontext(2) path enforces an iret return when
%rip is incorrect. Due to this, the change is redundant, but it is made
to ensure that no path which modifies the context forgets to set
PCB_FULL_IRET.
Inspired by: CVE-2014-4699
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
several reasons for this change:
pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page. We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page. The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap. However, non-wired
mappings may be reclaimed by the pmap at any time. (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.
pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page. Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time. This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.
Implementations for arm and powerpc will follow.
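For comparison, a sketch of the two interfaces side by side (the prototypes
are as described above; the wrappers are illustrative):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <vm/vm.h>
    #include <vm/pmap.h>

    /* Old way: one base page at a time, demoting any superpage mappings. */
    static void
    unwire_range_old(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
    {
        vm_offset_t va;

        for (va = sva; va < eva; va += PAGE_SIZE)
            pmap_change_wiring(pmap, va, FALSE);
    }

    /* New way: one ranged call; superpage mappings are left intact. */
    static void
    unwire_range_new(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
    {
        pmap_unwire(pmap, sva, eva);
    }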
Discussed with: jeff, marcel
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division