6759 Commits

Author SHA1 Message Date
Neel Natu
0a9ae358fd Fix a bug in the HPET emulation where a timer interrupt could be lost when the
guest disables the HPET.

The HPET timer interrupt is triggered from the callout handler associated with
the timer. It is possible for the callout handler to be delayed before it gets
a chance to execute. If the guest disables the HPET during this window then the
handler never gets a chance to execute and the timer interrupt is lost.

This is now fixed by injecting a timer interrupt into the guest if the callout
time is detected to be in the past when the HPET is disabled.
2014-01-03 19:25:52 +00:00
Konstantin Belousov
27fd75d2c8 Update the description for pmap_remove_pages() to match the modern
times [1].  Assert that the pmap passed to pmap_remove_pages() is only
active on current CPU.

Submitted by:	alc [1]
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-02 18:50:52 +00:00
Konstantin Belousov
c0be75a58a Assert that accounting for the pmap resident pages does not underflow.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-02 18:49:05 +00:00
Neel Natu
0492757c70 Restructure the VMX code to enter and exit the guest. In large part this change
hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used
to enter guest context and vmx_exit_guest() is used to transition back into
host context.

Fix a longstanding race where a vcpu interrupt notification might be ignored
if it happens after vmx_inject_interrupts() but before host interrupts are
disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with
host interrupts disabled to prevent this.

Suggested by:	grehan@
2014-01-01 21:17:08 +00:00
Dimitry Andric
24da2fe3d0 In sys/amd64/amd64/pmap.c, remove static function pmap_is_current(),
which has been unused since r189415.

Reviewed by:	alc
MFC after:	3 days
2013-12-30 20:37:47 +00:00
Neel Natu
7c05bc3124 Modify handling of writes to the vlapic LVT registers.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

This also implies that we need to keep a snapshot of the last value written
to a LVT register. We can no longer rely on the LVT registers in the APIC
page to be "clean" because the guest can write anything to it before the
hypervisor has had a chance to sanitize it.
2013-12-28 00:20:55 +00:00
Neel Natu
fafe884473 Modify handling of writes to the vlapic ICR_TIMER, DCR_TIMER, ICRLO and ESR
registers.

The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

We can no longer rely on the value of 'icr_timer' on the APIC page
in the callout handler. With APIC register virtualization the value of
'icr_timer' will be updated by the processor in guest-context before an
APIC-write VM-exit.

Clear the 'delivery status' bit in the ICRLO register in the write handler.
With APIC register virtualization the write happens in guest-context and
we cannot prevent a (buggy) guest from setting this bit.
2013-12-27 20:18:19 +00:00
Dimitry Andric
6f0c167fe2 In sys/amd64/vmm/intel/vmx.c, silence a (incorrect) gcc warning about
regval possibly being used uninitialized.

Reviewed by:	neel
2013-12-27 12:15:53 +00:00
Neel Natu
2c52dcd9a8 Modify handling of write to the vlapic SVR register.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, mask all the LVT entries when the vlapic is software-disabled.
2013-12-27 07:01:42 +00:00
Neel Natu
3f0ddc7c5c Modify handling of writes to the vlapic ID, LDR and DFR registers.
The handlers are now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, we need to ensure that the value of these registers is always
correctly reflected in the virtual APIC page, because there is no VM exit
when the guest reads these registers with APIC register virtualization.
2013-12-26 19:58:30 +00:00
Neel Natu
de5ea6b65e vlapic code restructuring to make it easy to support hardware-assist for APIC
emulation.

The vlapic initialization and cleanup is done via processor specific vmm_ops.
This will allow the VT-x/SVM modules to layer any hardware-assist for APIC
emulation or virtual interrupt delivery on top of the vlapic device model.

Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic
interrupts versus other events (e.g. NMI). This provides an opportunity to
use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM)
to deliver an interrupt to a guest without causing a VM-exit.

Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the
vlapic_xxx() counterparts directly.

Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'.
The 'Apic Page' is intended to be referenced from the Intel VMCS as the
'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
2013-12-25 06:46:31 +00:00
John Baldwin
63e62d390d Add a resume hook for bhyve that runs a function on all CPUs during
resume.  For Intel CPUs, invoke vmxon for CPUs that were in VMX mode
at the time of suspend.

Reviewed by:	neel
2013-12-23 19:48:22 +00:00
John Baldwin
330baf58c6 Extend the support for local interrupts on the local APIC:
- Add a generic routine to trigger an LVT interrupt that supports both
  fixed and NMI delivery modes.
- Add an ioctl and bhyvectl command to trigger local interrupts inside a
  guest.  In particular, a global NMI similar to that raised by SERR# or
  PERR# can be simulated by asserting LINT1 on all vCPUs.
- Extend the LVT table in the vCPU local APIC to support CMCI.
- Flesh out the local APIC error reporting a bit to cache errors and
  report them via ESR when ESR is written to.  Add support for asserting
  the error LVT when an error occurs.  Raise illegal vector errors when
  attempting to signal an invalid vector for an interrupt or when sending
  an IPI.
- Ignore writes to reserved bits in LVT entries.
- Export table entries the MADT and MP Table advertising the stock x86
  config of LINT0 set to ExtInt and LINT1 wired to NMI.

Reviewed by:	neel (earlier version)
2013-12-23 19:29:07 +00:00
Neel Natu
f80330a820 Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE
state before the requested state transition. This guarantees that there is
exactly one ioctl() operating on a vcpu at any point in time and prevents
unintended state transitions.

More details available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html

Reviewed by:	grehan
Reported by:	Markiyan Kushnir (markiyan.kushnir at gmail.com)
MFC after:	3 days
2013-12-22 20:29:59 +00:00
Neel Natu
a783578566 Consolidate the virtual apic initialization in a single function: vlapic_reset() 2013-12-22 00:08:00 +00:00
Neel Natu
5515bb73e6 Re-arrange bits in the amd64/pmap 'pm_flags' field.
The least significant 8 bits of 'pm_flags' are now used for the IPI vector
to use for nested page table TLB shootdown.

Previously we used IPI_AST to interrupt the host cpu which is functionally
correct but could lead to misleading interrupt counts for AST handler. The
AST handler was also doing a lot more than what is required for the nested
page table TLB shootdown (EOI and IRET).
2013-12-20 05:50:22 +00:00
Neel Natu
3de8386283 Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite()
respectively. The vmcs_xxx() functions provide inline error checking of
all accesses to the VMCS.
2013-12-18 06:24:21 +00:00
Neel Natu
4f8be175d5 Add an API to deliver message signalled interrupts to vcpus. This allows
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by:	grehan@
2013-12-16 19:59:31 +00:00
Neel Natu
a83011d2e7 Fix typo when initializing the vlapic version register ('<<' instead of '<'). 2013-12-11 06:28:44 +00:00
Neel Natu
becd984900 Fix x2apic support in bhyve.
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.

Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.

Reviewed by:	grehan@
2013-12-10 22:56:51 +00:00
John Baldwin
316032ad20 Move constants for indices in the local APIC's local vector table from
apicvar.h to apicreg.h.
2013-12-09 21:08:52 +00:00
Neel Natu
fb03ca4e42 Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit.
This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is
now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'.

Discussed with:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 23:11:12 +00:00
Neel Natu
1c05219285 If a vcpu disables its local apic and then executes a 'HLT' then spin down the
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.

With this change "halt" from a Linux guest works correctly.

Reviewed by:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 22:18:36 +00:00
John Baldwin
5c79f1f9df Fix a typo. 2013-12-05 21:58:02 +00:00
Neel Natu
7a3c80aa55 The 'protection' field in the VM exit collateral for the PAGING exit is not
used - get rid of it.
2013-12-03 01:21:21 +00:00
Neel Natu
2282187475 Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.

Reviewed by:	grehan@
2013-12-03 00:43:31 +00:00
Eitan Adler
7a22215c53 Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this
shifts into the sign bit.  Instead use (1U << 31) which gets the
expected result.

This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.

A similar change was made in OpenBSD.

Discussed with:	-arch, rdivacky
Reviewed by:	cperciva
2013-11-30 22:17:27 +00:00
Pawel Jakub Dawidek
f2b525e6b9 Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after:	3 days
2013-11-30 15:08:35 +00:00
Neel Natu
b5b28fc9dc Add support for level triggered interrupt pins on the vioapic. Prior to this
commit level triggered interrupts would work as long as the pin was not shared
among multiple interrupt sources.

The vlapic now keeps track of level triggered interrupts in the trigger mode
register and will forward the EOI for a level triggered interrupt to the
vioapic. The vioapic in turn uses the EOI to sample the level on the pin and
re-inject the vector if the pin is still asserted.

The vhpet is the first consumer of level triggered interrupts and advertises
that it can generate interrupts on pins 20 through 23 of the vioapic.

Discussed with:	grehan@
2013-11-27 22:18:08 +00:00
Konstantin Belousov
291bfc8d24 Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32
compilation results in inclusion of the header, a confict arises due
to savefpu being union for i386, but used as struct in the pcb
definition.  The 32bit code should not need amd64 variant of the
struct pcb anyway.

For struct region_descriptor, use __uint64_t instead of unsigned long,
as the base type for bit-fields.  Unsigned long cannot have width 64
for -m32.

The changes allowed to use sys/sysctl.h for cc -m32.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-26 19:38:42 +00:00
Neel Natu
08e3ff329a Add HPET device emulation to bhyve.
bhyve supports a single timer block with 8 timers. The timers are all 32-bit
and capable of being operated in periodic mode. All timers support interrupt
delivery using MSI. Timers 0 and 1 also support legacy interrupt routing.

At the moment the timers are not connected to any ioapic pins but that will
be addressed in a subsequent commit.

This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
2013-11-25 19:04:51 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Neel Natu
ac7304a758 Add an ioctl to assert and deassert an ioapic pin atomically. This will be used
to inject edge triggered legacy interrupts into the guest.

Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-23 03:56:03 +00:00
Neel Natu
af480303a9 Eliminate redundant information about the host cpu in bhyve's KTR trace points.
This is always tracked by ktr(4) and can be displayed using the "-c" option
of ktrdump(8).

Discussed with:	grehan
2013-11-22 18:57:22 +00:00
Ed Maste
7b7d8599fe Don't abort SMAP processing after an entry of length 0
Length 0 is not special and should just be skipped.  This is the same
behaviour as i386.

Discussed with:	jhb@
Sponsored by:	The FreeBSD Foundation
2013-11-22 14:56:10 +00:00
Andreas Tobler
d2ef321a59 Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the
CONCAT macros in SYS.h.

Reviewed by:	bde, kib
2013-11-21 21:25:58 +00:00
Ed Maste
ff89f4778a Refactor amd64 startup SMAP parsing
Extracted from the projects/uefi branch, this change is a reasonable
cleanup and will reduce the diffs to review when bringing in the
UEFI work.

Reviewed by:	kib@
Sponsored by:	The FreeBSD Foundation
2013-11-21 19:20:08 +00:00
Ed Maste
aff122d6aa Disable amd64 boot time memory test by default
The page presence memory test takes a long time on large memory systems
and has little value on contemporary amd64 hardware.

Sponsored by:	The FreeBSD Foundation
2013-11-21 18:37:11 +00:00
Justin T. Gibbs
4fd76feafd Fix accounting for hw.realmem on the i386 and amd64 platforms.
sys/i386/i386/machdep.c:
sys/amd64/amd64/machdep.c:
	The value reported by FreeBSD as "real memory" when booting
	doesn't match what is later reported by sysctl as hw.realmem.
	This is due to the fact that the value printed during the
	boot process is fetched from smbios data (when possible),
	and accounts for holes in physical memory. On the other
	hand, the value of hw.realmem is unconditionally set to be
	one larger than the highest page of the physical address
	space.

	Fix this by setting hw.realmem to the same value printed
	during boot, this makes hw.realmem honour it's name and
	account properly for physical memory present in the system.

Submitted by:	Roger Pau Monné
Reviewed by:	gibbs
2013-11-15 16:05:55 +00:00
Ed Maste
3d271aaab0 x86: Allow users to change PSL_RF via ptrace(PT_SETREGS...)
Debuggers may need to change PSL_RF.  Note that tf_eflags is already stored
in the signal context during signal handling and PSL_RF previously could be
modified via sigreturn, so this change should not provide any new ability
to userspace.

For background see the thread at:
http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html

Reviewed by:	jhb, kib
Sponsored by:	DARPA, AFRL
2013-11-14 15:37:20 +00:00
Neel Natu
565bbb8698 Move the ioapic device model from userspace into vmm.ko. This is needed for
upcoming in-kernel device emulations like the HPET.

The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.

Discussed with:	grehan@
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-12 22:51:03 +00:00
Konstantin Belousov
6f8a44a5dd Add bits for the AMD features from CPUID function 0x80000001 ECX,
described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf.

Partially based on the patch submitted by:	Dmitry Luhtionov <dmitryluhtionov@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-08 16:32:30 +00:00
Alan Cox
c70af4875e As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other
words, every architecture is now auto-sizing the kmem arena.  This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.

Replace or eliminate all existing definitions of VM_KMEM_SIZE.  With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures.  Use VM_KMEM_SIZE_MIN for
clarity.

Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size.  Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct.  In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not.  Remedy this inconsistency.  Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.

Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena.  Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.

Reviewed by:	kib (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-11-08 16:25:00 +00:00
Neel Natu
03cd05011f Remove the 'vdev' abstraction that was meant to sit on top of device models
in the kernel. This abstraction was redundant because the only device emulated
inside vmm.ko is the local apic and it is always at a fixed guest physical
address.

Discussed with:	grehan
2013-11-04 23:25:07 +00:00
Neel Natu
513c8d338d Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these
tracepoints are vcpu-specific.

Add support for tracepoints that are global to the virtual machine - these
tracepoints are called VM_CTRx().
2013-10-31 05:20:11 +00:00
Mark Johnston
57170f49f2 Remove references to an unused fasttrap probe hook, and remove the
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).

Discussed with:	rpaulo
2013-10-31 02:35:00 +00:00
Neel Natu
e2f5d9a129 Remove unnecessary includes of <machine/pmap.h>
Requested by:	alc@
2013-10-29 02:25:18 +00:00
Gleb Smirnoff
69eb2b176c Include XEN and HyperV into amd64 LINT. 2013-10-28 21:11:28 +00:00
Konstantin Belousov
86be9f0dd5 Import the driver for VT-d DMAR hardware, as specified in the revision
1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture
Specification.  The Extended Context and PASIDs from the rev. 2.2 are
not supported, but I am not aware of any released hardware which
implements them.  Code does not use queued invalidation, see comments
for the reason, and does not provide interrupt remapping services.

Code implements the management of the guest address space per domain
and allows to establish and tear down arbitrary mappings, but not
partial unmapping.  The superpages are created as needed, but not
promoted.  Faults are recorded, fault records could be obtained
programmatically, and printed on the console.

Implement the busdma(9) using DMARs.  This busdma backend avoids
bouncing and provides security against misbehaving hardware and driver
bad programming, preventing leaks and corruption of the memory by wild
DMA accesses.

By default, the implementation is compiled into amd64 GENERIC kernel
but disabled; to enable, set hw.dmar.enable=1 loader tunable.  Code is
written to work on i386, but testing there was low priority, and
driver is not enabled in GENERIC.  Even with the DMAR turned on,
individual devices could be directed to use the bounce busdma with the
hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable.  If
DMARs are capable of the pass-through translations, it is used,
otherwise, an identity-mapping page table is constructed.

The driver was tested on Xeon 5400/5500 chipset legacy machine,
Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4),
ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices.  It also
works with em(4) and igb(4), but there some fixes are needed for
drivers, which are not committed yet.  Intel GPUs do not work with
DMAR (yet).

Many thanks to John Baldwin, who explained me the newbus integration;
Peter Holm, who did all testing and helped me to discover and
understand several incredible bugs; and to Jim Harris for the access
to the EDS and BWG and for listening when I have to explain my
findings to somebody.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2013-10-28 13:33:29 +00:00
Konstantin Belousov
e20f049b87 Several small fixes for the amd64 minidump code.
In report_progress(), use nitems(progress_track) instead of manually
hard-coding array size.  Wrap long line.

In blk_write(), code verifies that ptr and pa cannot be non-zero
simultaneously.  The later check for the page-alignment of the ptr
argument never triggers due to pa != 0 always implying ptr == NULL.  I
believe that the intent was to ensure that physicall address passed is
page-aligned, since the address is (temporary) mapped for the duration
of the page write.

Clear the progress_track.visited fields when starting minidump.  If
minidump is restarted or taken second time during the system lifetime,
progress is not printed otherwise, making operator suspectible to the
dump status.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-10-27 16:31:12 +00:00