Commit Graph

6644 Commits

Author SHA1 Message Date
Dimitry Andric
f785676f2a Upgrade our copy of llvm/clang to 3.4 release. This version supports
all of the features in the current working draft of the upcoming C++
standard, provisionally named C++1y.

The code generator's performance is greatly increased, and the loop
auto-vectorizer is now enabled at -Os and -O2 in addition to -O3.  The
PowerPC backend has made several major improvements to code generation
quality and compile time, and the X86, SPARC, ARM32, Aarch64 and SystemZ
backends have all seen major feature work.

Release notes for llvm and clang can be found here:
<http://llvm.org/releases/3.4/docs/ReleaseNotes.html>
<http://llvm.org/releases/3.4/tools/clang/docs/ReleaseNotes.html>

MFC after:	1 month
2014-02-16 19:44:07 +00:00
Christian Brueffer
7f47cbd3ce Retire the nve(4) driver; nfe(4) has been the default driver for NVIDIA
nForce MCP adapters for a long time.

Yays:	jhb, remko, yongari
Nays:	none on the current and stable lists
2014-02-16 12:22:43 +00:00
Andriy Gapon
f25e50cf0b provide fast versions of ffsl and flsl for i386; ffsll and flsll for amd64
Reviewed by:	jhb
MFC after:	10 days
X-MFC note:	consider thirdparty modules depending on these symbols
Sponsored by:	HybridCluster
2014-02-14 15:18:37 +00:00
John Baldwin
4edef187b8 Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
  PCI domain/segment.  Host-PCI bridge drivers use this API to allocate
  bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
  their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
  full range of bus numbers from secbus to subbus from their parent bridge.
  The drivers also always program their primary bus register.  The bridge
  drivers also support growing their bus range by extending the bus resource
  and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
  used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.

Reviewed by:	imp
MFC after:	1 month
2014-02-12 04:30:37 +00:00
John Baldwin
4f67a8c5e9 Don't waste a page of KVA for the boot-time memory test on x86. For amd64,
reuse the first page of the crashdumpmap as CMAP1/CADDR1.  For i386,
remove CMAP1/CADDR1 entirely and reuse CMAP3/CADDR3 for the memory test.

Reviewed by:	alc, peter
MFC after:	2 weeks
2014-02-11 22:02:40 +00:00
John Baldwin
abb023fb95 Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and
XSAVE-enabled features like AVX.
- Store a per-cpu guest xcr0 register.  When switching to the guest FPU
  state, switch to the guest xcr0 value.  Note that the guest FPU state is
  saved and restored using the host's xcr0 value and xcr0 is saved/restored
  "inside" of saving/restoring the guest FPU state.
- Handle VM exits for the xsetbv instruction by updating the guest xcr0.
- Expose the XSAVE feature to the guest only if the host has enabled XSAVE,
  and only advertise XSAVE features enabled by the host to the guest.
  This ensures that the guest will only adjust FPU state that is a subset
  of the guest FPU state saved and restored by the host.

Reviewed by:	grehan
2014-02-08 16:37:54 +00:00
Neel Natu
bf73979dd9 Add a counter to differentiate between VM-exits due to nested paging faults
and instruction emulation faults.
2014-02-08 06:22:09 +00:00
Neel Natu
62fbd7c27a Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI).
If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the
CPU when the VM-exit is completed. No more NMIs will be recognized until
the execution of an "iret".

Prior to this change the NMI handler was dispatched via a software interrupt
with interrupts enabled. This meant that an interrupt could be recognized
by the processor before the NMI handler completed its execution. The "iret"
issued by the interrupt handler would then cause the "blocking by NMI" to
be cleared prematurely.

This is now fixed by handling the NMI with interrupts disabled in addition
to "blocking by NMI" already established by the VM-exit.
2014-02-08 05:04:34 +00:00
John Baldwin
00f3efe1bd Add support for FreeBSD/i386 guests under bhyve.
- Similar to the hack for bootinfo32.c in userboot, define
  _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot.
  This allows userboot to load 32-bit kernels and modules.
- Copy the SMAP generation code out of bootinfo64.c and into its own
  file so it can be shared with bootinfo32.c to pass an SMAP to the i386
  kernel.
- Use uint32_t instead of u_long when aligning module metadata in
  bootinfo32.c in userboot, as otherwise the metadata used 64-bit
  alignment which corrupted the layout.
- Populate the basemem and extmem members of the bootinfo struct passed
  to 32-bit kernels.
- Fix the 32-bit stack in userboot to start at the top of the stack
  instead of the bottom so that there is room to grow before the
  kernel switches to its own stack.
- Push a fake return address onto the 32-bit stack in addition to the
  arguments normally passed to exec() in the loader.  This return
  address is needed to convince recover_bootinfo() in the 32-bit
  locore code that it is being invoked from a "new" boot block.
- Add a routine to libvmmapi to setup a 32-bit flat mode register state
  including a GDT and TSS that is able to start the i386 kernel and
  update bhyveload to use it when booting an i386 kernel.
- Use the guest register state to determine the CPU's current instruction
  mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long
  mode) in the instruction emulation code.  Update the gla2gpa() routine
  used when fetching instructions to handle flat mode, 32-bit paging, and
  PAE paging in addition to long mode paging.  Don't look for a REX
  prefix when the CPU is in 32-bit mode, and use the detected mode to
  enable the existing 32-bit mode code when decoding the mod r/m byte.

Reviewed by:	grehan, neel
MFC after:	1 month
2014-02-05 04:39:03 +00:00
Tycho Nightingale
54e03e07b3 Add support for emulating the byte move and zero extend instructions:
"mov r/m8, r32" and "mov r/m8, r64".

Approved by:	neel (co-mentor)
2014-02-05 02:01:08 +00:00
Neel Natu
953c2c47eb Avoid doing unnecessary nested TLB invalidations.
Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu
and per-hostcpu. In the degenerate case where 'N' vcpus were sharing
a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations.
Since an 'invept' invalidates mappings for all VPIDs the first 'invept'
is sufficient.

Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'.

If it is known that an 'invept' is going to be done before entering the
guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED
counts the number of 'invvpid' invalidations that were avoided because
they were subsumed by an 'invept'.

Discussed with:	grehan
2014-02-04 02:45:08 +00:00
John Baldwin
3cbf3585cb Enhance the support for PCI legacy INTx interrupts and enable them in
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
  to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
  ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
  interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
  PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
  that share a slot attempt to distribute their INTx interrupts across
  the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
  and when the INTx DIS bit is set in a function's PCI command register.
  Either assert or deassert the associated I/O APIC pin when the
  state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.

Submitted by:	neel (7)
Reviewed by:	neel
MFC after:	2 weeks
2014-01-29 14:56:48 +00:00
John Baldwin
3d2ec11759 Add support for 'clac' and 'stac' to DDB's disassembler on amd64. 2014-01-27 18:53:18 +00:00
Neel Natu
30b94db8c0 Support level triggered interrupts with VT-x virtual interrupt delivery.
The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector.
If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger
an EOI-induced VM-exit when it is doing EOI virtualization.

The EOI-induced VM-exit results in the EOI being forwarded to the vioapic
so that level triggered interrupts can be properly handled.

Tested by:	Anish Gupta (akgupt3@gmail.com)
2014-01-25 20:58:05 +00:00
Peter Grehan
062eef4911 Change RWX to XWR in comments to match intent and bit patterns
in discussion of valid EPT pte protections.

Discussed with:	neel
MFC after:	3 days
2014-01-25 06:58:41 +00:00
John Baldwin
e07ef9b0f6 Move <machine/apicvar.h> to <x86/apicvar.h>. 2014-01-23 20:10:22 +00:00
Neel Natu
36736912b6 Set "Interrupt Window Exiting" in the case where there is a vector to be
injected into the vcpu but the VM-entry interruption information field
already has the valid bit set.

Pointed out by:	David Reed (david.reed@tidalscale.com)
2014-01-23 06:06:50 +00:00
Neel Natu
c308b23b7a Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler
via a software interrupt.

This is safe to do because the logical processor is already cognizant of the
NMI and further NMIs are blocked until the host's NMI handler executes "iret".
2014-01-22 04:03:11 +00:00
Neel Natu
51f45d0146 There is no need to initialize the IOMMU if no passthru devices have been
configured for bhyve to use.

Suggested by:	grehan@
2014-01-21 03:01:34 +00:00
Ed Maste
80f9f1580e Add VT kernel configuration to ease testing of vt(9), aka Newcons 2014-01-19 18:46:38 +00:00
Neel Natu
48b2d828a2 Some processor's don't allow NMI injection if the STI_BLOCKING bit is set in
the Guest Interruptibility-state field. However, there isn't any way to
figure out which processors have this requirement.

So, inject a pending NMI only if NMI_BLOCKING, MOVSS_BLOCKING, STI_BLOCKING
are all clear. If any of these bits are set then enable "NMI window exiting"
and inject the NMI in the VM-exit handler.
2014-01-18 21:47:12 +00:00
Bryan Venteicher
10c4018057 Add very simple virtio_random(4) driver to harvest entropy from host
Reviewed by:	markm (random bits only)
2014-01-18 06:14:38 +00:00
Neel Natu
e5a1d95089 If the guest exits due to a fault while it is executing IRET then restore
the state of "Virtual NMI blocking" in the guest's interruptibility-state
field before resuming the guest.
2014-01-18 02:20:10 +00:00
Neel Natu
160471d264 If a VM-exit happens during an NMI injection then clear the "NMI Blocking" bit
in the Guest Interruptibility-state VMCS field.

If we fail to do this then a subsequent VM-entry will fail because it is an
error to inject an NMI into the guest while "NMI Blocking" is turned on. This
is described in "Checks on Guest Non-Register State" in the Intel SDM.

Submitted by:	David Reed (david.reed@tidalscale.com)
2014-01-17 04:21:39 +00:00
Neel Natu
5b8a8cd1fe Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous
can be initiated in the context of a vcpu thread or from the bhyve(8) control
process.

The first use of this functionality is to update the vlapic trigger-mode
register when the IOAPIC pin configuration is changed.

Prior to this change we would update the TMR in the virtual-APIC page at
the time of interrupt delivery. But this doesn't work with Posted Interrupts
because there is no way to program the EOI_exit_bitmap[] in the VMCS of
the target at the time of interrupt delivery.

Discussed with:	grehan@
2014-01-14 01:55:58 +00:00
Gavin Atkinson
56c63f28ed Remove spaces from boot messages when we print the CPU ID/Family/Stepping
to match the rest of the CPU identification lines, and once again fit
into 80 columns in the usual case.
2014-01-11 22:41:10 +00:00
Neel Natu
176666c2c9 Enable "Posted Interrupt Processing" if supported by the CPU. This lets us
inject interrupts into the guest without causing a VM-exit.

This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir"
to "0".

The following sysctls provide information about this feature:
- hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled)
- hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification)

Tested on a Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.
2014-01-11 04:22:00 +00:00
Neel Natu
f7d4742540 Enable the "Acknowledge Interrupt on VM exit" VM-exit control.
This control is needed to enable "Posted Interrupts" and is present in all
the Intel VT-x implementations supported by bhyve so enable it as the default.

With this VM-exit control enabled the processor will acknowledge the APIC and
store the vector number in the "VM-Exit Interruption Information" field. We
now call the interrupt handler "by hand" through the IDT entry associated
with the vector.
2014-01-11 03:14:05 +00:00
Neel Natu
add611fd4c Don't expose 'vmm_ipinum' as a global. 2014-01-09 03:25:54 +00:00
Neel Natu
88c4b8d145 Use the 'Virtual Interrupt Delivery' feature of Intel VT-x if supported by
hardware. It is possible to turn this feature off and fall back to software
emulation of the APIC by setting the tunable hw.vmm.vmx.use_apic_vid to 0.

We now start handling two new types of VM-exits:

APIC-access: This is a fault-like VM-exit and is triggered when the APIC
register access is not accelerated (e.g. apic timer CCR). In response to
this we do emulate the instruction that triggered the APIC-access exit.

APIC-write: This is a trap-like VM-exit which does not require any instruction
emulation but it does require the hypervisor to emulate the access to the
specified register (e.g. icrlo register).

Introduce 'vlapic_ops' which are function pointers to vector the various
vlapic operations into processor-dependent code. The 'Virtual Interrupt
Delivery' feature installs 'ops' for setting the IRR bits in the virtual
APIC page and to return whether any interrupts are pending for this vcpu.

Tested on an "Intel Xeon E5-2620 v2" courtesy of Allan Jude at ScaleEngine.
2014-01-07 21:04:49 +00:00
Neel Natu
79c596309c Fix a bug introduced in r260167 related to VM-exit tracing.
Keep a copy of the 'rip' and the 'exit_reason' and use that when calling
vmx_exit_trace(). This is because both the 'rip' and 'exit_reason' can
be changed by 'vmx_exit_process()' and can lead to very misleading traces.
2014-01-07 18:53:14 +00:00
Neel Natu
4d1e82a88e Allow vlapic_set_intr_ready() to return a value that indicates whether or not
the vcpu should be kicked to process a pending interrupt. This will be useful
in the implementation of the Posted Interrupt APICv feature.

Change the return value of 'vlapic_pending_intr()' to indicate whether or not
an interrupt is available to be delivered to the vcpu depending on the value
of the PPR.

Add KTR tracepoints to debug guest IPI delivery.
2014-01-07 00:38:22 +00:00
Neel Natu
c847a5062c Split the VMCS setup between 'vmcs_init()' that does initialization and
'vmx_vminit()' that does customization.

This makes it easier to turn on optional features (e.g. APICv) without
having to keep adding new parameters to 'vmcs_set_defaults()'.

Reviewed by:	grehan@
2014-01-06 23:16:39 +00:00
Jens Schweikhardt
aa27ed4569 Correct a grammo in a comment; remove white space at EOL. 2014-01-06 17:23:22 +00:00
Neel Natu
5f8e2dfcb5 Use the same label name for ENTRY() and END() macros for 'vmx_enter_guest'.
Pointed out by:	rmh@
2014-01-03 19:29:33 +00:00
Neel Natu
0a9ae358fd Fix a bug in the HPET emulation where a timer interrupt could be lost when the
guest disables the HPET.

The HPET timer interrupt is triggered from the callout handler associated with
the timer. It is possible for the callout handler to be delayed before it gets
a chance to execute. If the guest disables the HPET during this window then the
handler never gets a chance to execute and the timer interrupt is lost.

This is now fixed by injecting a timer interrupt into the guest if the callout
time is detected to be in the past when the HPET is disabled.
2014-01-03 19:25:52 +00:00
Konstantin Belousov
27fd75d2c8 Update the description for pmap_remove_pages() to match the modern
times [1].  Assert that the pmap passed to pmap_remove_pages() is only
active on current CPU.

Submitted by:	alc [1]
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-02 18:50:52 +00:00
Konstantin Belousov
c0be75a58a Assert that accounting for the pmap resident pages does not underflow.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-02 18:49:05 +00:00
Neel Natu
0492757c70 Restructure the VMX code to enter and exit the guest. In large part this change
hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used
to enter guest context and vmx_exit_guest() is used to transition back into
host context.

Fix a longstanding race where a vcpu interrupt notification might be ignored
if it happens after vmx_inject_interrupts() but before host interrupts are
disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with
host interrupts disabled to prevent this.

Suggested by:	grehan@
2014-01-01 21:17:08 +00:00
Dimitry Andric
24da2fe3d0 In sys/amd64/amd64/pmap.c, remove static function pmap_is_current(),
which has been unused since r189415.

Reviewed by:	alc
MFC after:	3 days
2013-12-30 20:37:47 +00:00
Neel Natu
7c05bc3124 Modify handling of writes to the vlapic LVT registers.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

This also implies that we need to keep a snapshot of the last value written
to a LVT register. We can no longer rely on the LVT registers in the APIC
page to be "clean" because the guest can write anything to it before the
hypervisor has had a chance to sanitize it.
2013-12-28 00:20:55 +00:00
Neel Natu
fafe884473 Modify handling of writes to the vlapic ICR_TIMER, DCR_TIMER, ICRLO and ESR
registers.

The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

We can no longer rely on the value of 'icr_timer' on the APIC page
in the callout handler. With APIC register virtualization the value of
'icr_timer' will be updated by the processor in guest-context before an
APIC-write VM-exit.

Clear the 'delivery status' bit in the ICRLO register in the write handler.
With APIC register virtualization the write happens in guest-context and
we cannot prevent a (buggy) guest from setting this bit.
2013-12-27 20:18:19 +00:00
Dimitry Andric
6f0c167fe2 In sys/amd64/vmm/intel/vmx.c, silence a (incorrect) gcc warning about
regval possibly being used uninitialized.

Reviewed by:	neel
2013-12-27 12:15:53 +00:00
Neel Natu
2c52dcd9a8 Modify handling of write to the vlapic SVR register.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, mask all the LVT entries when the vlapic is software-disabled.
2013-12-27 07:01:42 +00:00
Neel Natu
3f0ddc7c5c Modify handling of writes to the vlapic ID, LDR and DFR registers.
The handlers are now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, we need to ensure that the value of these registers is always
correctly reflected in the virtual APIC page, because there is no VM exit
when the guest reads these registers with APIC register virtualization.
2013-12-26 19:58:30 +00:00
Neel Natu
de5ea6b65e vlapic code restructuring to make it easy to support hardware-assist for APIC
emulation.

The vlapic initialization and cleanup is done via processor specific vmm_ops.
This will allow the VT-x/SVM modules to layer any hardware-assist for APIC
emulation or virtual interrupt delivery on top of the vlapic device model.

Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic
interrupts versus other events (e.g. NMI). This provides an opportunity to
use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM)
to deliver an interrupt to a guest without causing a VM-exit.

Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the
vlapic_xxx() counterparts directly.

Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'.
The 'Apic Page' is intended to be referenced from the Intel VMCS as the
'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
2013-12-25 06:46:31 +00:00
John Baldwin
63e62d390d Add a resume hook for bhyve that runs a function on all CPUs during
resume.  For Intel CPUs, invoke vmxon for CPUs that were in VMX mode
at the time of suspend.

Reviewed by:	neel
2013-12-23 19:48:22 +00:00
John Baldwin
330baf58c6 Extend the support for local interrupts on the local APIC:
- Add a generic routine to trigger an LVT interrupt that supports both
  fixed and NMI delivery modes.
- Add an ioctl and bhyvectl command to trigger local interrupts inside a
  guest.  In particular, a global NMI similar to that raised by SERR# or
  PERR# can be simulated by asserting LINT1 on all vCPUs.
- Extend the LVT table in the vCPU local APIC to support CMCI.
- Flesh out the local APIC error reporting a bit to cache errors and
  report them via ESR when ESR is written to.  Add support for asserting
  the error LVT when an error occurs.  Raise illegal vector errors when
  attempting to signal an invalid vector for an interrupt or when sending
  an IPI.
- Ignore writes to reserved bits in LVT entries.
- Export table entries the MADT and MP Table advertising the stock x86
  config of LINT0 set to ExtInt and LINT1 wired to NMI.

Reviewed by:	neel (earlier version)
2013-12-23 19:29:07 +00:00
Neel Natu
f80330a820 Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE
state before the requested state transition. This guarantees that there is
exactly one ioctl() operating on a vcpu at any point in time and prevents
unintended state transitions.

More details available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html

Reviewed by:	grehan
Reported by:	Markiyan Kushnir (markiyan.kushnir at gmail.com)
MFC after:	3 days
2013-12-22 20:29:59 +00:00
Neel Natu
a783578566 Consolidate the virtual apic initialization in a single function: vlapic_reset() 2013-12-22 00:08:00 +00:00
Neel Natu
5515bb73e6 Re-arrange bits in the amd64/pmap 'pm_flags' field.
The least significant 8 bits of 'pm_flags' are now used for the IPI vector
to use for nested page table TLB shootdown.

Previously we used IPI_AST to interrupt the host cpu which is functionally
correct but could lead to misleading interrupt counts for AST handler. The
AST handler was also doing a lot more than what is required for the nested
page table TLB shootdown (EOI and IRET).
2013-12-20 05:50:22 +00:00
Neel Natu
3de8386283 Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite()
respectively. The vmcs_xxx() functions provide inline error checking of
all accesses to the VMCS.
2013-12-18 06:24:21 +00:00
Neel Natu
4f8be175d5 Add an API to deliver message signalled interrupts to vcpus. This allows
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by:	grehan@
2013-12-16 19:59:31 +00:00
Neel Natu
a83011d2e7 Fix typo when initializing the vlapic version register ('<<' instead of '<'). 2013-12-11 06:28:44 +00:00
Neel Natu
becd984900 Fix x2apic support in bhyve.
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.

Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.

Reviewed by:	grehan@
2013-12-10 22:56:51 +00:00
John Baldwin
316032ad20 Move constants for indices in the local APIC's local vector table from
apicvar.h to apicreg.h.
2013-12-09 21:08:52 +00:00
Neel Natu
fb03ca4e42 Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit.
This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is
now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'.

Discussed with:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 23:11:12 +00:00
Neel Natu
1c05219285 If a vcpu disables its local apic and then executes a 'HLT' then spin down the
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.

With this change "halt" from a Linux guest works correctly.

Reviewed by:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 22:18:36 +00:00
John Baldwin
5c79f1f9df Fix a typo. 2013-12-05 21:58:02 +00:00
Neel Natu
7a3c80aa55 The 'protection' field in the VM exit collateral for the PAGING exit is not
used - get rid of it.
2013-12-03 01:21:21 +00:00
Neel Natu
2282187475 Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.

Reviewed by:	grehan@
2013-12-03 00:43:31 +00:00
Eitan Adler
7a22215c53 Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this
shifts into the sign bit.  Instead use (1U << 31) which gets the
expected result.

This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.

A similar change was made in OpenBSD.

Discussed with:	-arch, rdivacky
Reviewed by:	cperciva
2013-11-30 22:17:27 +00:00
Pawel Jakub Dawidek
f2b525e6b9 Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after:	3 days
2013-11-30 15:08:35 +00:00
Neel Natu
b5b28fc9dc Add support for level triggered interrupt pins on the vioapic. Prior to this
commit level triggered interrupts would work as long as the pin was not shared
among multiple interrupt sources.

The vlapic now keeps track of level triggered interrupts in the trigger mode
register and will forward the EOI for a level triggered interrupt to the
vioapic. The vioapic in turn uses the EOI to sample the level on the pin and
re-inject the vector if the pin is still asserted.

The vhpet is the first consumer of level triggered interrupts and advertises
that it can generate interrupts on pins 20 through 23 of the vioapic.

Discussed with:	grehan@
2013-11-27 22:18:08 +00:00
Konstantin Belousov
291bfc8d24 Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32
compilation results in inclusion of the header, a confict arises due
to savefpu being union for i386, but used as struct in the pcb
definition.  The 32bit code should not need amd64 variant of the
struct pcb anyway.

For struct region_descriptor, use __uint64_t instead of unsigned long,
as the base type for bit-fields.  Unsigned long cannot have width 64
for -m32.

The changes allowed to use sys/sysctl.h for cc -m32.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-26 19:38:42 +00:00
Neel Natu
08e3ff329a Add HPET device emulation to bhyve.
bhyve supports a single timer block with 8 timers. The timers are all 32-bit
and capable of being operated in periodic mode. All timers support interrupt
delivery using MSI. Timers 0 and 1 also support legacy interrupt routing.

At the moment the timers are not connected to any ioapic pins but that will
be addressed in a subsequent commit.

This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
2013-11-25 19:04:51 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Neel Natu
ac7304a758 Add an ioctl to assert and deassert an ioapic pin atomically. This will be used
to inject edge triggered legacy interrupts into the guest.

Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-23 03:56:03 +00:00
Neel Natu
af480303a9 Eliminate redundant information about the host cpu in bhyve's KTR trace points.
This is always tracked by ktr(4) and can be displayed using the "-c" option
of ktrdump(8).

Discussed with:	grehan
2013-11-22 18:57:22 +00:00
Ed Maste
7b7d8599fe Don't abort SMAP processing after an entry of length 0
Length 0 is not special and should just be skipped.  This is the same
behaviour as i386.

Discussed with:	jhb@
Sponsored by:	The FreeBSD Foundation
2013-11-22 14:56:10 +00:00
Andreas Tobler
d2ef321a59 Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the
CONCAT macros in SYS.h.

Reviewed by:	bde, kib
2013-11-21 21:25:58 +00:00
Ed Maste
ff89f4778a Refactor amd64 startup SMAP parsing
Extracted from the projects/uefi branch, this change is a reasonable
cleanup and will reduce the diffs to review when bringing in the
UEFI work.

Reviewed by:	kib@
Sponsored by:	The FreeBSD Foundation
2013-11-21 19:20:08 +00:00
Ed Maste
aff122d6aa Disable amd64 boot time memory test by default
The page presence memory test takes a long time on large memory systems
and has little value on contemporary amd64 hardware.

Sponsored by:	The FreeBSD Foundation
2013-11-21 18:37:11 +00:00
Justin T. Gibbs
4fd76feafd Fix accounting for hw.realmem on the i386 and amd64 platforms.
sys/i386/i386/machdep.c:
sys/amd64/amd64/machdep.c:
	The value reported by FreeBSD as "real memory" when booting
	doesn't match what is later reported by sysctl as hw.realmem.
	This is due to the fact that the value printed during the
	boot process is fetched from smbios data (when possible),
	and accounts for holes in physical memory. On the other
	hand, the value of hw.realmem is unconditionally set to be
	one larger than the highest page of the physical address
	space.

	Fix this by setting hw.realmem to the same value printed
	during boot, this makes hw.realmem honour it's name and
	account properly for physical memory present in the system.

Submitted by:	Roger Pau Monné
Reviewed by:	gibbs
2013-11-15 16:05:55 +00:00
Ed Maste
3d271aaab0 x86: Allow users to change PSL_RF via ptrace(PT_SETREGS...)
Debuggers may need to change PSL_RF.  Note that tf_eflags is already stored
in the signal context during signal handling and PSL_RF previously could be
modified via sigreturn, so this change should not provide any new ability
to userspace.

For background see the thread at:
http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html

Reviewed by:	jhb, kib
Sponsored by:	DARPA, AFRL
2013-11-14 15:37:20 +00:00
Neel Natu
565bbb8698 Move the ioapic device model from userspace into vmm.ko. This is needed for
upcoming in-kernel device emulations like the HPET.

The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.

Discussed with:	grehan@
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-12 22:51:03 +00:00
Konstantin Belousov
6f8a44a5dd Add bits for the AMD features from CPUID function 0x80000001 ECX,
described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf.

Partially based on the patch submitted by:	Dmitry Luhtionov <dmitryluhtionov@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-08 16:32:30 +00:00
Alan Cox
c70af4875e As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other
words, every architecture is now auto-sizing the kmem arena.  This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.

Replace or eliminate all existing definitions of VM_KMEM_SIZE.  With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures.  Use VM_KMEM_SIZE_MIN for
clarity.

Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size.  Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct.  In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not.  Remedy this inconsistency.  Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.

Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena.  Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.

Reviewed by:	kib (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-11-08 16:25:00 +00:00
Neel Natu
03cd05011f Remove the 'vdev' abstraction that was meant to sit on top of device models
in the kernel. This abstraction was redundant because the only device emulated
inside vmm.ko is the local apic and it is always at a fixed guest physical
address.

Discussed with:	grehan
2013-11-04 23:25:07 +00:00
Neel Natu
513c8d338d Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these
tracepoints are vcpu-specific.

Add support for tracepoints that are global to the virtual machine - these
tracepoints are called VM_CTRx().
2013-10-31 05:20:11 +00:00
Mark Johnston
57170f49f2 Remove references to an unused fasttrap probe hook, and remove the
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).

Discussed with:	rpaulo
2013-10-31 02:35:00 +00:00
Neel Natu
e2f5d9a129 Remove unnecessary includes of <machine/pmap.h>
Requested by:	alc@
2013-10-29 02:25:18 +00:00
Gleb Smirnoff
69eb2b176c Include XEN and HyperV into amd64 LINT. 2013-10-28 21:11:28 +00:00
Konstantin Belousov
86be9f0dd5 Import the driver for VT-d DMAR hardware, as specified in the revision
1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture
Specification.  The Extended Context and PASIDs from the rev. 2.2 are
not supported, but I am not aware of any released hardware which
implements them.  Code does not use queued invalidation, see comments
for the reason, and does not provide interrupt remapping services.

Code implements the management of the guest address space per domain
and allows to establish and tear down arbitrary mappings, but not
partial unmapping.  The superpages are created as needed, but not
promoted.  Faults are recorded, fault records could be obtained
programmatically, and printed on the console.

Implement the busdma(9) using DMARs.  This busdma backend avoids
bouncing and provides security against misbehaving hardware and driver
bad programming, preventing leaks and corruption of the memory by wild
DMA accesses.

By default, the implementation is compiled into amd64 GENERIC kernel
but disabled; to enable, set hw.dmar.enable=1 loader tunable.  Code is
written to work on i386, but testing there was low priority, and
driver is not enabled in GENERIC.  Even with the DMAR turned on,
individual devices could be directed to use the bounce busdma with the
hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable.  If
DMARs are capable of the pass-through translations, it is used,
otherwise, an identity-mapping page table is constructed.

The driver was tested on Xeon 5400/5500 chipset legacy machine,
Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4),
ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices.  It also
works with em(4) and igb(4), but there some fixes are needed for
drivers, which are not committed yet.  Intel GPUs do not work with
DMAR (yet).

Many thanks to John Baldwin, who explained me the newbus integration;
Peter Holm, who did all testing and helped me to discover and
understand several incredible bugs; and to Jim Harris for the access
to the EDS and BWG and for listening when I have to explain my
findings to somebody.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2013-10-28 13:33:29 +00:00
Konstantin Belousov
e20f049b87 Several small fixes for the amd64 minidump code.
In report_progress(), use nitems(progress_track) instead of manually
hard-coding array size.  Wrap long line.

In blk_write(), code verifies that ptr and pa cannot be non-zero
simultaneously.  The later check for the page-alignment of the ptr
argument never triggers due to pa != 0 always implying ptr == NULL.  I
believe that the intent was to ensure that physicall address passed is
page-aligned, since the address is (temporary) mapped for the duration
of the page write.

Clear the progress_track.visited fields when starting minidump.  If
minidump is restarted or taken second time during the system lifetime,
progress is not printed otherwise, making operator suspectible to the
dump status.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-10-27 16:31:12 +00:00
Gleb Smirnoff
eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Neel Natu
49cc03da31 Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose
'invpcid' instruction to the guest. Currently bhyve will try to enable this
capability unconditionally if it is available.

Consolidate code in bhyve to set the capabilities so it is no longer
duplicated in BSP and AP bringup.

Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid'
instruction is available.

Reviewed by:	grehan
MFC after:	3 days
2013-10-16 18:20:27 +00:00
Neel Natu
d38cae4aad Fix the witness warning that warned against calling uiomove() while holding
the 'vmmdev_mtx' in vmmdev_rw().

Rely on the 'si_threadcount' accounting to ensure that we never destroy the
VM device node while it has operations in progress (e.g. ioctl, mmap etc).

Reported by:	grehan
Reviewed by:	grehan
2013-10-16 00:58:47 +00:00
Glen Barber
6b48eebec6 Document XENHVM and xenpci are mutually inclusive.
Submitted by:   gibbs
Approved by:    re (delphij)
Sponsored by:   The FreeBSD Foundation
2013-10-11 19:40:28 +00:00
Dimitry Andric
9cba9d0157 In sys/amd64/amd64/pmap.c, fix several gcc warnings about uninitialized
variables in reclaim_pv_chunk().

Approved by:	re (marius)
Reviewed by:	neel, kib
X-MFC-With:	r256072
2013-10-08 20:04:35 +00:00
Justin T. Gibbs
5fdd34ee20 Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id
field.  Perform vcpu enumeration for Xen PV and HVM environments
and convert all Xen drivers to use vcpu_id instead of a hard coded
assumption of the mapping algorithm (acpi or apic ID) in use.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)

amd64/include/pcpu.h:
i386/include/pcpu.h:
	Add vcpu_id to the amd64 and i386 pcpu structures.

dev/xen/timer/timer.c
x86/xen/xen_intr.c
	Use new vcpu_id instead of assuming acpi_id == vcpu_id.

i386/xen/mp_machdep.c:
i386/xen/mptable.c
x86/xen/hvm.c:
	Perform Xen HVM and Xen full PV vcpu_id mapping.

x86/xen/hvm.c:
x86/acpica/madt.c
	Change SYSINIT ordering of acpi CPU enumeration so that it
	is guaranteed to be available at the time of Xen HVM vcpu
	id mapping.
2013-10-05 23:11:01 +00:00
Neel Natu
318224bbe6 Merge projects/bhyve_npt_pmap into head.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.

Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.

pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.

The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.

Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.

An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
               Bit Position           Interpreted By
PG_V               52                 software (accessed bit emulation handler)
PG_RW              53                 software (dirty bit emulation handler)
PG_A               0                  hardware (aka EPT_PG_RD)
PG_M               1                  hardware (aka EPT_PG_WR)

The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).

The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.

TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.

Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.

PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.

Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.

Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.

Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.

Special thanks to Peter Holm for testing the patch on short notice.

Approved by:	re
Discussed with:	grehan
Reviewed by:	alc, kib
Tested by:	pho
2013-10-05 21:22:35 +00:00
John-Mark Gurney
29904f46d6 add aesni module to i386 and amd64 NOTES...
Approved by:	re (gjb)
2013-10-04 17:21:01 +00:00
Peter Grehan
e58d944482 Return 0 for a rdmsr of MSR_IA32_PLATFORM_ID. This
is enough to get Ubuntu 12.0.4/13.0.4 to boot.

Approved by:	re@ (blanket)
2013-09-27 14:55:59 +00:00
Konstantin Belousov
4cb8b041d1 In pmap_clear_modify(), initialize pvh even for fictitious managed
page, otherwise the small mappings loop would use uninitialized value.
Note that currently pmap_clear_modify() is not called for fictitious
pages.

Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-24 13:52:47 +00:00
Konstantin Belousov
fecfc089e4 Use the pv lists generation count to read-lock the pvh_global_lock in
pmap_clear_modify().

Noted and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-24 12:26:43 +00:00
Konstantin Belousov
75f50c53f1 Ensure that the ERESTART return from the syscall reloads the
registers, to make the restarted syscall instruction pass the correct
arguments.

PR:	kern/182161
Reported by:	Russ Cox <rsc@swtch.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Approved by:	re (marius)
2013-09-24 12:24:48 +00:00
Konstantin Belousov
ad43b98491 Free both KVA and backing pages when freeing TSS memory.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-23 20:14:15 +00:00
Glen Barber
91aff61084 Put 'device hyperv' back in amd64/GENERIC, incorrectly removed with
r255736.

Pointed out by:	gibbs
Approved by:	re (delphij)
Sponsored by:	The FreeBSD Foundation
2013-09-21 01:07:27 +00:00
Peter Grehan
36f23e3c20 Reorder/regroup the vmm ioctl api definitions to allow some
semblance of API stability and growth during the 10.* timeframe.

Userland/kernel bhyve will have to be recompiled after this.

Reviewed by:	neel
Approved by:	re@ (blanket)
2013-09-21 00:27:53 +00:00