Commit Graph

7242 Commits

Author SHA1 Message Date
neel
ca77633f85 Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.

Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID
leaf 0xb. This is intended for testing guest behavior when it falls back
on using CPUID leaf 0x4 to deduce CPU topology.

The default behavior is to advertise each vcpu as a core in a separate soket.
2014-08-24 01:10:06 +00:00
neel
33474e62d3 Fix a bug in the emulation of CPUID leaf 0x4 where bhyve was claiming that
the vcpu had no caches at all. This causes problems when executing applications
in the guest compiled with the Intel compiler.

Submitted by:	Mark Hill (mark.hill@tidalscale.com)
2014-08-23 22:44:31 +00:00
neel
ea3ac18997 Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot
find any unmasked pin with an interrupt asserted.

Reviewed by:	tychon
CR:		https://reviews.freebsd.org/D669
MFC after:	1 week
2014-08-23 21:16:26 +00:00
jhb
711e51996a Fix build of si(4) and enable it in LINT on amd64 and i386. 2014-08-20 16:07:17 +00:00
jhb
4b82aff648 Bump MAXCPU on amd64 from 64 to 256. In practice APIC only permits 255
CPUs (IDs 0 through 254).  Getting above that limit requires x2APIC.

MFC after:	1 month
2014-08-20 16:06:24 +00:00
kib
d9b5edee7d Increase max number of physical segments on amd64 to 63.
Eventually, the vmd_segs of the struct vm_domain should become bitset
instead of long, to allow arbitrary compile-time selected maximum.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-20 08:07:08 +00:00
alc
21285cbfcd There exists a possible sequence of page table page allocation failures
starting with a superpage demotion by pmap_enter() that could result in
a PV list lock being held when pmap_enter() is just about to return
KERN_RESOURCE_SHORTAGE.  Consequently, the KASSERT that no PV list locks
are held needs to be replaced with a conditional unlock.

Discussed with:	kib
X-MFC with:	r269728
Sponsored by:	EMC / Isilon Storage Division
2014-08-18 20:28:08 +00:00
gavin
0c913b9f59 Update i386/NOTES and amd64/NOTES files to contain the complete list of
firmwares for iwn(4) and sort them.

MFC after:	1 week
2014-08-14 18:29:55 +00:00
neel
69905f3734 Reword comment to match the interrupt mode names from the MPtable spec.
Reviewed by:	tychon
2014-08-14 18:03:38 +00:00
neel
53369652fd Use the max guest memory address when creating its iommu domain.
Also, assert that the GPA being mapped in the domain is less than its maxaddr.

Reviewed by:	grehan
Pointed out by:	Anish Gupta (akgupt3@gmail.com)
2014-08-14 05:00:45 +00:00
alc
6ec80ed21e Update the text of a KASSERT() to reflect the changes in r269728. 2014-08-09 17:13:02 +00:00
kib
094158b3f2 Change pmap_enter(9) interface to take flags parameter and superpage
mapping size (currently unused).  The flags includes the fault access
bits, wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep.

For powerpc aim both 32 and 64 bit, fix implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified, in particular, wait for the available memory required to
proceed.

In collaboration with:	alc
Tested by:	nwhitehorn (ppc aim32 and booke)
Sponsored by:	The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after:	2 weeks
2014-08-08 17:12:03 +00:00
neel
6e79e6c83b Support PCI extended config space in bhyve.
Add the ACPI MCFG table to advertise the extended config memory window.

Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.

Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.

CR:		https://phabric.freebsd.org/D505
Reviewed by:	jhb, grehan
Requested by:	tychon
2014-08-08 03:49:01 +00:00
glebius
991a1f338d Merge all MD sf_buf allocators into one MI, residing in kern/subr_sfbuf.c
The MD allocators were very common, however there were some minor
differencies. These differencies were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.

As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.

Tested by:	glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-05 09:44:10 +00:00
alc
38b6c535da Retire pmap_change_wiring(). We have never used it to wire virtual pages.
We continue to use pmap_enter() for that.  For unwiring virtual pages, we
now use pmap_unwire(), which unwires a range of virtual addresses instead
of a single virtual page.

Sponsored by:	EMC / Isilon Storage Division
2014-08-03 20:40:51 +00:00
jhb
05d21354c3 - Output a summary of optional VT-x features in dmesg similar to CPU
features.  If bootverbose is enabled, a detailed list is provided;
  otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
  under a new hw.vmm.vmx.cap node. Move a few exiting sysctls that
  indicate the presence of optional capabilities under this node.

CR:		https://phabric.freebsd.org/D498
Reviewed by:	grehan, neel
MFC after:	1 week
2014-07-30 00:00:12 +00:00
neel
20e3e8762f If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps
forever in vm_handle_hlt().

This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.

Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.

Reported by:	Leon Dang
Sponsored by:	Nahanni Systems
2014-07-26 02:53:51 +00:00
neel
62d591cec9 Don't return -1 from the push emulation handler. Negative return values are
interpreted specially on return from sys_ioctl() and may cause undesirable
side-effects like restarting the system call.
2014-07-26 02:51:46 +00:00
neel
563faffd70 Fix a couple of issues in the PUSH emulation:
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.

vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
2014-07-24 23:01:53 +00:00
marius
3179fa4baa Copying pages via temporary mappings in the !DMAP case of pmap_copy_pages()
involves updating the corresponding page tables followed by accesses to the
pages in question. This sequence is subject to the situation exactly described
in the "AMD64 Architecture Programmer's Manual Volume 2: System Programming"
rev. 3.23, "7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore,
issuing the INVLPG right after modifying the PTE bits is crucial (see also
r269050).
For the amd64 PMAP code, the order of instructions was already correct. The
above fact still is worth documenting, though.

1: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf

Reviewed by:	alc
Sponsored by:	Bally Wulff Games & Entertainment GmbH
2014-07-24 10:12:22 +00:00
neel
4535fa67c4 Fix fault injection in bhyve.
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.

A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
2014-07-24 01:38:11 +00:00
royger
0ee303bbe3 don't set CR4 PSE bit on amd64
Setting PSE together with PAE or in long mode just makes the PSE bit
completely ignored, so don't set it.

Sponsored by: Citrix Systems R&D
Reviewed by: kib
2014-07-23 15:53:29 +00:00
neel
e972917c13 Emulate instructions emitted by OpenBSD/i386 version 5.5:
- CMP REG, r/m
- MOV AX/EAX/RAX, moffset
- MOV moffset, AX/EAX/RAX
- PUSH r/m
2014-07-23 04:28:51 +00:00
emaste
56b12303b3 Don't pass null kmdp to preload_search_info
On Xen PVH guests kmdp == NULL.

Submitted by:	royger
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2014-07-22 13:58:33 +00:00
markj
07a0bd68a5 Fix the build when DTrace isn't enabled.
Reported by:	stefanf
X-MFC-With:	r268600
2014-07-20 18:44:56 +00:00
neel
7b1204cc02 Fix build without INVARIANTS defined by getting rid of unused variable 'exc'.
Reported by:	adrian, stefanf
2014-07-20 16:34:35 +00:00
neel
1f15eea2e0 Handle nested exceptions in bhyve.
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.

vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.

vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
2014-07-19 20:59:08 +00:00
markj
e18e12eeda Use a C wrapper for trap() instead of checking and calling the DTrace trap
hook in assembly.

Suggested by:	kib
Reviewed by:	kib (original version)
X-MFC-With:	r268600
2014-07-19 02:27:31 +00:00
neel
5046d9cb8a Add emulation for legacy x86 task switching mechanism.
FreeBSD/i386 uses task switching to handle double fault exceptions and this
change enables that to work.

Reported by:	glebius
2014-07-16 21:26:26 +00:00
neel
eb07e4ed55 Add support for operand size and address size override prefixes in bhyve's
instruction emulation [1].

Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.

Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.

Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].

Tested by:	Leon Dang [1]
Reported by:	Peter Grehan [2]
Sponsored by:	Nahanni Systems
2014-07-15 17:37:17 +00:00
kib
1ecfe30151 Make amd64 pmap_copy_pages() functional for pages not mapped by DMAP.
Requested and reviewed by:	royger
Tested by:	pho, royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-15 09:30:43 +00:00
markj
880dd1a983 Invoke the DTrace trap handler before calling trap() on amd64. This matches
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.

This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.

Submitted by:	Anton Rang <anton.rang@isilon.com> (original version)
Phabric:	D95
2014-07-14 04:38:17 +00:00
neel
307c44649f Use the correct offset when converting a logical address (segment:offset)
to a linear address.
2014-07-11 01:23:38 +00:00
kib
729061be23 For safety, ensure that any consumer of the set_regs() and
ptrace_set_pc() use the correct return to userspace using iret.

The signal return, PT_CONTINUE (which in fact uses signal return path)
set the pcb flag already.  The setcontext(2) enforces iret return when
%rip is incorrect.  Due to this, the change is redundand, but is made
to ensure that no path which modifies context, forgets to set
PCB_FULL_IRET.

Inspired by:	CVE-2014-4699
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-09 21:39:40 +00:00
neel
845f7be2e3 Accurately identify the vcpu's operating mode as 64-bit, compatibility,
protected or real.
2014-07-08 21:48:57 +00:00
neel
d5633f89da Invalidate guest TLB mappings as a side-effect of its CR3 being updated.
This is a pre-requisite for task switch emulation since the CR3 is loaded
from the new TSS.
2014-07-08 20:51:03 +00:00
kib
ae88c29379 Correct si_code for the SIGBUS signal generated by the alignment trap.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-08 08:05:42 +00:00
alc
d74e85dbb9 Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are
several reasons for this change:

pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page.  We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page.  The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap.  However, non-wired
mappings may be reclaimed by the pmap at any time.  (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.

pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page.  Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient.  Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time.  This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.

Implementations for arm and powerpc will follow.

Discussed with:	jeff, marcel
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-07-06 17:42:38 +00:00
emaste
9825a4c806 Prefer vt(4) for UEFI boot
The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism
for the startup routine to set the preferred console.  This change is
ugly because console init happens very early in the boot, making a
cleaner interface difficult.  This change is intended only to facilitate
the sc(4) / vt(4) transition, and can be reverted once vt(4) is the
default.
2014-07-02 13:24:21 +00:00
emaste
10d8b7a43b Add vt(4) devices and options to NOTES
Reviewed by:	marius (earlier version)
2014-07-01 00:22:54 +00:00
emaste
e6dbbf35ca Add vt(4) to GENERIC and retire the separate VT config
vt(4) and sc(4) can now coexist in the same kernel.  To choose the vt
driver, set the loader tunable kern.vty=vt .
2014-06-30 16:18:38 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
tychon
816d8c3faa Add support for emulating the move instruction: "mov r/m8, imm8".
Reviewed by:	neel
2014-06-26 17:15:41 +00:00
grehan
54db9f3822 Expose the amount of resident and wired memory from the guest's vmspace.
This is different than the amount shown for the process e.g. by
/usr/bin/top - that is the mappings faulted in by the mmap'd region
of guest memory.

The values can be fetched with bhyvectl

 # bhyvectl --get-stats --vm=myvm
 ...
 Resident memory                         	413749248
 Wired memory                            	0
 ...

vmm_stat.[ch] -
 Modify the counter code in bhyve to allow direct setting of a counter
as opposed to incrementing, and providing a callback to fetch a
counter's value.

Reviewed by:	neel
2014-06-25 22:13:35 +00:00
kib
fe547198b1 Add FPU_KERN_KTHR flag to fpu_kern_enter(9), which avoids saving FPU
context into memory for the kernel threads which called
fpu_kern_thread(9).  This allows the fpu_kern_enter() callers to not
check for is_fpu_kern_thread() to get the optimization.

Apply the flag to padlock(4) and aesni(4).  In aesni_cipher_process(),
do not leak FPU context state on error.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-23 07:37:54 +00:00
dchagin
dd6bed9dd2 Revert r266925 as it can lead to instant panic at fexecve():
To allow to run the interpreter itself add a new ELF branding type.

Pointed out by:	kib, mjg
2014-06-17 05:29:18 +00:00
tychon
bb415f07f0 Bring an overly enthusiastic KASSERT inline with the Intel SDM.
Reviewed by:	neel
2014-06-16 22:59:18 +00:00
attilio
2802c525ad - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
royger
7c7f3fb2d0 amd64/i386: introduce APIC hooks for different APIC implementations.
This is needed for Xen PV(H) guests, since there's no hardware lapic
available on this kind of domains. This commit should not change
functionality.

Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Approved by: gibbs

amd64/include/cpu.h:
amd64/amd64/mp_machdep.c:
i386/include/cpu.h:
i386/i386/mp_machdep.c:
 - Remove lapic_ipi_vectored hook from cpu_ops, since it's now
   implemented in the lapic hooks.

amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
 - Use lapic_ipi_vectored directly, since it's now an inline function
   that will call the appropiate hook.

x86/x86/local_apic.c:
 - Prefix bare metal public lapic functions with native_ and mark them
   as static.
 - Define default implementation of apic_ops.

x86/include/apicvar.h:
 - Declare the apic_ops structure and create inline functions to
   access the hooks, so the change is transparent to existing users of
   the lapic_ functions.

x86/xen/hvm.c:
 - Switch to use the new apic_ops.
2014-06-16 08:43:03 +00:00
neel
8c7e29c295 Disable global interrupts early so all the software state maintained by bhyve
is sampled "atomically". Any interrupts after this point will be held pending
by the CPU until the guest starts executing and will immediately trigger a
#VMEXIT.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-06-11 17:48:07 +00:00
tychon
e250a91c1d Replace enum forward declarations with complete definitions.
Reviewed by:	neel
2014-06-10 18:46:00 +00:00
neel
e48c89801a Add helper functions to populate VM exit information for rendezvous and
astpending exits. This is to reduce code duplication between VT-x and
SVM implementations.
2014-06-10 16:45:58 +00:00
neel
32f7809be1 Turn on interrupt window exiting unconditionally when an ExtINT is being
injected into the guest. This allows the hypervisor to inject another
ExtINT or APIC vector as soon as the guest is able to process interrupts.

This change is not to address any correctness issue but to guarantee that
any pending APIC vector that was preempted by the ExtINT will be injected
as soon as possible. Prior to this change such pending interrupts could be
delayed until the next VM exit.
2014-06-10 01:38:02 +00:00
grehan
adbd7deabf Temporary fix for guest idle detection.
Handle ExtINT injection for SVM. The HPET emulation
will inject a legacy interrupt at startup, and if this
isn't handled, will result in the HLT-exit code assuming
there are outstanding ExtINTs and return without sleeping.

svm_inj_interrupts() needs more changes to bring it up
to date with the VT-x version: these are forthcoming.

Reviewed by:	neel
2014-06-09 21:02:48 +00:00
neel
d4bb0b204a Add reserved bit checking when doing %CR8 emulation and inject #GP if required.
Pointed out by:	grehan
Reviewed by:	tychon
2014-06-09 20:51:08 +00:00
grehan
fe997346e0 Allow the TSC MSR to be accessed directly from the guest. 2014-06-07 23:08:06 +00:00
grehan
afc0a3433a Set the guest PAT MSR in the VMCB to power-on defaults.
Linux guests accept the values in this register, while *BSD
guests reprogram it. Default values of zero correspond to
PAT_UNCACHEABLE, resulting in glacial performance.

Thanks to Willem Jan Withagen for first reporting this and
helping out with the investigation.
2014-06-07 23:05:12 +00:00
neel
80a67d54c4 Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained
by vmm.ko. This allows the virtual machine to be restarted without having
to destroy it first.

Reviewed by:	grehan
2014-06-07 21:36:52 +00:00
alc
39548e640f Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
tychon
c04c953593 Support guest accesses to %cr8.
Reviewed by:	neel
2014-06-06 18:23:49 +00:00
imp
7694525189 Restore comments accidentally removed.
MFC after: 3 days
2014-06-06 04:08:55 +00:00
grehan
f1ed4b50ae ins/outs support for SVM. Modelled on the Intel VT-x code.
Remove CR2 save/restore - the guest restore/save is done
in hardware, and there is no need to save/restore the host
version (same as VT-x).

Submitted by:	neel (SVM segment descriptor 'P' bit code)
Reviewed by:	neel
2014-06-06 02:55:18 +00:00
grehan
39adc03910 Allow the guest's CR2 value to be read/written.
This is required for page-fault injection.
2014-06-05 06:29:18 +00:00
grehan
2374fa6276 Use API call when VM is detected as suspended. This fixes
the (harmless) error message on exit:

  vmexit_suspend: invalid reason 217645057

Reviewed by:	neel, Anish Gupta (akgupt3@gmail.com)
2014-06-03 22:26:46 +00:00
grehan
5e6423ee3b Bring (almost) up-to-date with HEAD.
- use the new virtual APIC page
- update to current bhyve APIs

Tested by Anish with multiple FreeBSD SMP VMs on a Phenom,
and verified by myself with light FreeBSD VM testing
on a Sempron 3850 APU.

The issues reported with Linux guests are very likely to still
be here, but this sync eliminates the skew between the
project branch and CURRENT, and should help to determine
the causes.

Some follow-on commits will fix minor cosmetic issues.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-06-03 06:56:54 +00:00
grehan
95f7c2f56c MFC @ r266724
An SVM update will follow this.
2014-06-03 02:34:21 +00:00
neel
9c2a942387 Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing
it implicitly in vmm.ko.

Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.

This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.
2014-05-31 23:37:34 +00:00
dchagin
538f396887 To allow to run the interpreter itself add a new ELF branding type.
Allow Linux ABI to run ELF interpreter.

MFC after:	3 days
2014-05-31 15:01:51 +00:00
tychon
61025dc75e If VMX isn't enabled so long as the lock bit isn't set yet in MSR
IA32_FEATURE_CONTROL it still can be.

Approved by:	grehan (co-mentor)
2014-05-30 23:37:31 +00:00
neel
0a0e9fcd5a Remove bogus check for kmem_malloc() failure even though M_WAITOK is set.
Requested by:	jkim
2014-05-30 20:58:32 +00:00
neel
aefe217075 Allocate a zeroed LDT.
Failing to do this might result in the LDT appearing to run out of free
descriptors because of random junk in the descriptor's 'sd_type' field.

http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html

Reviewed by:	kib
MFC after:	2 weeks
2014-05-30 18:59:37 +00:00
kib
7c98ae3376 When usermode loaded non-default segment selector into the %gs,
correctly prepare KGSBASE msr to restore the user descriptor base on
the last swapgs during return to usermode.

Reported and tested by:	peterj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-29 16:18:31 +00:00
markj
4c818572b7 Commit the rest of the changes that were intended to be part of r266826.
X-MFC-with:	r266826
2014-05-29 01:42:22 +00:00
jhb
e3d386d9fe - Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to the
guest for which the rules regarding xsetbv emulation are known.  In
  particular future extensions like AVX-512 have interdependencies among
  feature bits that could allow a guest to trigger a GP# in the host with
  the current approach of allowing anything the host supports.
- Add proper checking of Intel MPX and AVX-512 XSAVE features in the
  xsetbv emulation and allow these features to be exposed to the guest if
  they are enabled in the host.
- Expose a subset of known-safe features from leaf 0 of the structured
  extended features to guests if they are supported on the host including
  RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM.  Aside
  from AVX-512, these features are all new instructions available for use
  in ring 3 with no additional hypervisor changes needed.

Reviewed by:	neel
2014-05-27 19:04:38 +00:00
neel
4b40e47cf8 Add segment protection and limits violation checks in vie_calculate_gla()
for 32-bit x86 guests.

Tested using ins/outs executed in a FreeBSD/i386 guest.
2014-05-27 04:26:22 +00:00
neel
07a8a1c99a Remove restriction on insb/insw/insl emulation. These instructions are
properly emulated.
2014-05-25 02:05:23 +00:00
neel
ffc6a38259 Do the linear address calculation for the ins/outs emulation using a new
API function 'vie_calculate_gla()'.

While the current implementation is simplistic it forms the basis of doing
segmentation checks if the guest is in 32-bit protected mode.
2014-05-25 00:57:24 +00:00
neel
51a05acc08 Add libvmmapi functions vm_copyin() and vm_copyout() to copy into and out
of the guest linear address space. These APIs in turn use a new ioctl
'VM_GLA2GPA' to convert the guest linear address to guest physical.

Use the new copyin/copyout APIs when emulating ins/outs instruction in
bhyve(8).
2014-05-24 23:12:30 +00:00
neel
6a6e13c407 Consolidate all the information needed by the guest page table walker into
'struct vm_guest_paging'.

Check for canonical addressing in vmm_gla2gpa() and inject a protection
fault into the guest if a violation is detected.

If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to
point to the root of the page tables.
2014-05-24 20:26:57 +00:00
neel
52a4f11861 When injecting a page fault into the guest also update the guest's %cr2 to
indicate the faulting linear address.

If the guest PML4 entry has the PG_PS bit set then inject a page fault into
the guest with the PGEX_RSV bit set in the error_code.

Get rid of redundant checks for the PG_RW violations when walking the page
tables.
2014-05-24 19:13:25 +00:00
neel
2ccda87aca Check for alignment check violation when processing in/out string instructions. 2014-05-23 19:59:14 +00:00
neel
8f99933d82 Add emulation of the "outsb" instruction. NetBSD guests use this to write to
the UART FIFO.

The emulation is constrained in a number of ways: 64-bit only, doesn't check
for all exception conditions, limited to i/o ports emulated in userspace.

Some of these constraints will be relaxed in followup commits.

Requested by:	grehan
Reviewed by:	tychon (partially and a much earlier version)
2014-05-23 05:15:17 +00:00
neel
062bfb5ea3 A Centos 6.4 guest will write 0xff to the 8259 mask register before beginning
the proper ICWx initialization sequence. It assumes, probably correctly, that
the boot firmware has done the 8259 initialization.

Since grub-bhyve does not initialize the 8259 this write to the mask register
takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0).

Fix this by initializing 'error' at the start of the function.
2014-05-23 05:04:50 +00:00
jhb
9578fc0e8e Don't permit users to request a subset of the AVX512 or MPX xsave masks.
These masks are documented in the Intel Architecture Instruction Set
Extensions Programming Reference (March 2014).

Reviewed by:	kib
MFC after:	1 month
2014-05-22 18:22:02 +00:00
neel
f33a3d02f0 Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in the
VCPU_RUNNING state. This will let the VMX exit handler inspect the vcpu's
segment descriptors without having to exit the critical section.
2014-05-22 17:22:37 +00:00
jhibbits
445bd25136 imagact_binmisc builds for all supported architectures, so enable it for all.
Any bugs in execution will be dealt with as they crop up.

MFC after:	3 weeks
Relnotes:	Yes
2014-05-22 05:04:40 +00:00
neel
645d479a58 Inject page fault into the guest if the page table walker detects an invalid
translation for the guest linear address.
2014-05-22 03:14:54 +00:00
neel
6071e2741b Add PG_RW check when translating a guest linear to guest physical address.
Set the accessed and dirty bits in the page table entry. If it fails then
restart the page table walk from the beginning. This might happen if another
vcpu modifies the page tables simultaneously.

Reviewed by:	alc, kib
2014-05-20 20:30:28 +00:00
jhb
59c78787cb Add support for decoding the AMD SVM instructions. 2014-05-19 18:07:37 +00:00
neel
b0752c3683 Add PG_U (user/supervisor) checks when translating a guest linear address
to a guest physical address.

PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now
checked only in non-terminal paging entries.

Ignore the upper 32-bits of the CR3 for PAE paging.
2014-05-19 03:50:07 +00:00
grehan
9fa48763c0 Make the vmx asm code dtrace-fbt-friendly by
- inserting frame enter/leave sequences
 - restructuring the vmx_enter_guest routine so that it subsumes
   the vm_exit_guest block, which was the #vmexit RIP and not a
   callable routine.

Reviewed by:	neel
MFC after:	3 weeks
2014-05-18 03:50:17 +00:00
jhb
1c24706a80 Add support for decoding rdrand and rdseed. 2014-05-17 21:10:03 +00:00
jhb
db4e203198 Add definitions for more structured extended features as well as
XSAVE Extended Features for AVX512 and MPX (Memory Protection Extensions).

Obtained from:	Intel's Instruction Set Extensions Programming Reference
                (March 2014)
2014-05-16 17:45:09 +00:00
jhb
f558af85b7 Implement a PCI interrupt router to route PCI legacy INTx interrupts to
the legacy 8259A PICs.
- Implement an ICH-comptabile PCI interrupt router on the lpc device with
  8 steerable pins configured via config space access to byte-wide
  registers at 0x60-63 and 0x68-6b.
- For each configured PCI INTx interrupt, route it to both an I/O APIC
  pin and a PCI interrupt router pin.  When a PCI INTx interrupt is
  asserted, ensure that both pins are asserted.
- Provide an initial routing of PCI interrupt router (PIRQ) pins to
  8259A pins (ISA IRQs) and initialize the interrupt line config register
  for the corresponding PCI function with the ISA IRQ as this matches
  existing hardware.
- Add a global _PIC method for OSPM to select the desired interrupt routing
  configuration.
- Update the _PRT methods for PCI bridges to provide both APIC and legacy
  PRT tables and return the appropriate table based on the configured
  routing configuration.  Note that if the lpc device is not configured, no
  routing information is provided.
- When the lpc device is enabled, provide ACPI PCI link devices corresponding
  to each PIRQ pin.
- Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A
  pins via the ELCR.
- Mark the power management SCI as level triggered.
- Don't hardcode the number of elements in Packages in the source for
  the DSDT.  iasl(8) will fill in the actual number of elements, and
  this makes it simpler to generate a Package with a variable number of
  elements.

Reviewed by:	tycho
2014-05-15 14:16:55 +00:00
neel
5df866f4b1 Increase the TSS limit by one byte. The processor requires an additional byte
with all bits set to 1 beyond the I/O permission bitmap.

Prior to this change accessing I/O ports [0xFFF8-0xFFFF] would trigger a
#GP fault even though the I/O bitmap allowed access to those ports.

For more details see section "I/O Permission Bit Map" in the Intel SDM, Vol 1.

Reviewed by:	kib
2014-05-14 22:24:09 +00:00
neel
5fd692c3b5 Virtual machine halt detection is turned on by default. Allow it to be
disabled via the tunable 'hw.vmm.halt_detection'.
2014-05-05 16:19:24 +00:00
nwhitehorn
34465d9bbe Disable ACPI and P4TCC throttling by default, following discussion on
freebsd-current. These CPU speed control techniques are usually unhelpful
at best. For now, continue building the relevant code into GENERIC so that
it can trivially be re-enabled at runtime if anyone wants it.

MFC after:	1 month
2014-05-04 16:38:21 +00:00
ken
8f3f80c382 Bring in the mpr(4) driver for LSI's MPT3 12Gb SAS controllers.
This is derived from the mps(4) driver, but it supports only the 12Gb
IT and IR hardware including the SAS 3004, SAS 3008 and SAS 3108.

Some notes about this driver:
 o The 12Gb hardware can do "FastPath" I/O, and that capability is included in
   this driver.

 o WarpDrive functionality has been removed, since it isn't supported in
   the 12Gb driver interface.

 o The Scatter/Gather list handling code is significantly different between
   the 6Gb and 12Gb hardware.  The 12Gb boards support IEEE Scatter/Gather
   lists.

Thanks to LSI for developing and testing this driver for FreeBSD.

share/man/man4/mpr.4:
	mpr(4) man page.

sys/dev/mpr/*:
	mpr(4) driver files.

sys/modules/Makefile,
sys/modules/mpr/Makefile:
	Add a module Makefile for the mpr(4) driver.

sys/conf/files:
	Add the mpr(4) driver.

sys/amd64/conf/GENERIC,
sys/i386/conf/GENERIC,
sys/mips/conf/OCTEON1,
sys/sparc64/conf/GENERIC:
	Add the mpr(4) driver to all config files that currently
	have the mps(4) driver.

sys/ia64/conf/GENERIC:
	Add the mps(4) and mpr(4) drivers to the ia64 GENERIC
	config file.

sys/i386/conf/XEN:
	Exclude the mpr module from building here.

Submitted by:	Steve McConnell <Stephen.McConnell@lsi.com>
MFC after:	3 days
Tested by:	Chris Reeves <chrisr@spectralogic.com>
Sponsored by:	LSI, Spectra Logic
Relnotes:	LSI 12Gb SAS driver mpr(4) added
2014-05-02 20:25:09 +00:00
eadler
382c3dae47 lindev(4): finish the partial commit in r265212
lindev(4) was only used to provide /dev/full which is now a standard feature of
FreeBSD.  /dev/full was never linux-specific and provides a generally useful
feature.

Document this in UPDATING and bump __FreeBSD_version.  This will be documented
in the PH shortly.

Reported by:	jkim
2014-05-02 07:14:22 +00:00
neel
b735ae5b9a Add logic in the HLT exit handler to detect if the guest has put all vcpus
to sleep permanently by executing a HLT with interrupts disabled.

When this condition is detected the guest with be suspended with a reason of
VM_SUSPEND_HALT and the bhyve(8) process will exit.

Tested by executing "halt" inside a RHEL7-beta guest.

Discussed with:	grehan@
Reviewed by:	jhb@, tychon@
2014-05-02 00:33:56 +00:00
neel
0601994645 Ignore writes to microcode update MSR. This MSR is accessed by RHEL7 guest.
Add KTR tracepoints to annotate wrmsr and rdmsr VM exits.
2014-04-30 02:08:27 +00:00
neel
9c85092013 Some Linux guests will implement a 'halt' by disabling the APIC and executing
the 'HLT' instruction. This condition was detected by 'vm_handle_hlt()' and
converted into the SPINDOWN_CPU exitcode . The bhyve(8) process would exit
the vcpu thread in response to a SPINDOWN_CPU and when the last vcpu was
spun down it would reset the virtual machine via vm_suspend(VM_SUSPEND_RESET).

This functionality was broken in r263780 in a way that made it impossible
to kill the bhyve(8) process because it would loop forever in
vm_handle_suspend().

Unbreak this by removing the code to spindown vcpus. Thus a 'halt' from
a Linux guest will appear to be hung but this is consistent with the
behavior on bare metal. The guest can be rebooted by using the bhyvectl
options '--force-reset' or '--force-poweroff'.

Reviewed by:	grehan@
2014-04-29 18:42:56 +00:00
neel
b616a9a2e4 Allow a virtual machine to be forcibly reset or powered off. This is done
by adding an argument to the VM_SUSPEND ioctl that specifies how the virtual
machine should be suspended, viz. VM_SUSPEND_RESET or VM_SUSPEND_POWEROFF.

The disposition of VM_SUSPEND is also made available to the exit handler
via the 'u.suspended' member of 'struct vm_exit'.

This capability is exposed via the '--force-reset' and '--force-poweroff'
arguments to /usr/sbin/bhyvectl.

Discussed with:	grehan@
2014-04-28 22:06:40 +00:00
emaste
a864055bad Report boot method (BIOS/UEFI) via sysctl machdep.bootmethod
Sponsored by:	The FreeBSD Foundation
2014-04-27 15:14:59 +00:00
kib
d581b6a9ab Same as it was done in r263878 for invlrng_handler(), fix order of
checks for special pcid values in invlpg_pcid_handler().  Forst check
for special values, and only then do PCID-specific page invalidation.

Minor fix to the style compliance, declare local variable at the
function start.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-27 05:37:01 +00:00
nwhitehorn
e4853bbc44 Don't need this now. VT does the same thing, but better.
Submitted by:	gjb
2014-04-27 02:28:32 +00:00
nwhitehorn
d89a83edd0 Add vt_efifb to VT kernel configuration now that that actually works. This
kernel will now boot on both BIOS and EFI systems without modification.
Equivalent functionality in GENERIC requires making vt(9) the default console
driver, which is probably appropriate at this point.
2014-04-27 02:22:21 +00:00
neel
ebf51bf73a A VMCS is always inactive when it exits the vmx_run() loop.
Remove redundant code and the misleading comment that suggest otherwise.

Reviewed by:	grehan@
2014-04-26 22:37:56 +00:00
scottl
62a64f0d2b Retire smp_active. It was racey and caused demonstrated problems with
the cpufreq code.  Replace its use with smp_started.  There's at least
one userland tool that still looks at the kern.smp.active sysctl, so
preserve it but point it to smp_started as well.

Discussed with: peter, jhb
MFC after: 3 days
Obtained from: Netflix
2014-04-26 20:27:54 +00:00
gjb
69c3e6933b Add a UEFI kernel configuration to include the VT kernel, and
replace the vt_vga driver with vt_efifb.

This is intended to help with snapshot builds only.

There is no intention to MFC this commit.

Sponsored by:	The FreeBSD Foundation
2014-04-25 21:47:24 +00:00
royger
f6f9cb7a0f xen: fix copyright header
Some of the code in xen-locore.S was picked from Cherry G. Mathew
amd64 Xen PV branch, but I've failed to set the proper copyright, so
do it now.

Approved by: gibbs
2014-04-24 14:44:42 +00:00
grehan
4f6dd265e1 Allow the guest to read the TSC via MSR 0x10.
NetBSD/amd64 does this, as does Linux on AMD CPUs.

Reviewed by:	neel
MFC after:	3 weeks
2014-04-24 00:27:34 +00:00
neel
360d54aa50 Change the vlapic timer frequency to be in the ballpark of contemporary
hardware. This also decouples the vlapic emulation from the host's TSC
frequency.

Requested by:	grehan@
2014-04-23 16:50:40 +00:00
tychon
f44a06b5a1 Factor out common ioport handler code for better hygiene -- pointed
out by neel@.

Approved by:	neel (co-mentor)
2014-04-22 16:13:56 +00:00
tychon
d8c307b493 Add support for the PIT 'readback' command -- based on a patch by grehan@.
Approved by:	grehan (co-mentor)
2014-04-18 16:05:12 +00:00
tychon
2c52df9a16 Respect the destination operand size of the 'Input from Port' instruction.
Approved by:	grehan (co-mentor)
2014-04-18 15:22:56 +00:00
tychon
4d0de44f39 Add support for reading the PIT Counter 2 output signal via the NMI
Status and Control register at port 0x61.

Be more conservative about "catching up" callouts that were supposed
to fire in the past by skipping an interrupt if it was
scheduled too far in the past.

Restore the PIT ACPI DSDT entries and add an entry for NMISC too.

Approved by:	neel (co-mentor)
2014-04-18 00:02:06 +00:00
jhb
03a8cfa7d9 Don't spindown the BSP if it executes hlt with the APIC disabled. A
guest that doesn't use the APIC at all can trigger this, plus the BSP
always needs to execute as it should trigger a reset, etc.

Reviewed by:	tychon
2014-04-15 20:53:53 +00:00
tychon
bbe78c2d72 Local APIC access via 32-bit naturally-aligned loads is merely
suggested in the SDM.  Since some OSes have implemented otherwise
don't be too rigorous in enforcing it.

Approved by:	grehan (co-mentor)
2014-04-15 17:06:26 +00:00
tychon
04f26f5235 Add support for emulating the byte move and sign extend instructions:
"movsx r/m8, r32" and "movsx r/m8, r64".

Approved by:	grehan (co-mentor)
2014-04-15 15:11:10 +00:00
tychon
5906c6773b Add support for emulating the slave PIC.
Reviewed by:	grehan, jhb
Approved by:	grehan (co-mentor)
2014-04-14 19:00:20 +00:00
neel
335d93f16d There is no need to save and restore the host's return address in the
'struct vmxctx'. It is preserved on the host stack across a guest entry
and exit and just restoring the host's '%rsp' is sufficient.

Pointed out by:	grehan@
2014-04-11 20:15:53 +00:00
tychon
45ec65c336 Account for the "plus 1" encoding of the CPUID Function 4 reported
core per package and cache sharing values.

Approved by:	grehan (co-mentor)
2014-04-11 18:19:21 +00:00
grehan
2ac5c08506 Rework r264179.
- remove redundant code
- remove erroneous setting of the error return
  in vmmdev_ioctl()
- use style(9) initialization
- in vmx_inject_pir(), document the race condition
  that the final conditional statement was detecting,

Tested with both gcc and clang builds.

Reviewed by:	neel
2014-04-10 19:15:58 +00:00
sbruno
c5f634c8c7 Really, really, really only allow this option for amd64/i386 builds.
Submitted by:	imp@ and tinderbox
2014-04-09 18:44:54 +00:00
imp
e8da9ba992 Make the vmm code compile with gcc too. Not entirely sure things are
correct for the pirbase test (since I'd have thought we'd need to do
something even when the offset is 0 and that test looks like a
misguided attempt to not use an uninitialized variable), but it is at
least the same as today.
2014-04-05 22:43:23 +00:00
rstone
35a855d80f Re-write bhyve's I/O MMU handling in terms of PCI RID.
Reviewed by:	neel
MFC after:	2 months
Sponsored by:	Sandvine Inc.
2014-04-01 15:54:03 +00:00
rstone
120bf54d08 Revert PCI RID changes.
My PCI RID changes somehow got intermixed with my PCI ARI patch when I
committed it.  I may have accidentally applied a patch to a non-clean
working tree.  Revert everything while I figure out what went wrong.

Pointy hat to: rstone
2014-04-01 15:06:03 +00:00
rstone
4df7085933 Re-write bhyve's I/O MMU handling in terms of PCI RIDs
Reviewed by:	neel
Sponsored by:	Sandvine Inc
2014-04-01 14:54:43 +00:00
kib
120b1857f9 Clear the kernel grab of the FPU state on fork. The pcb_save pointer
is already correctly reset to the FPU user save area, only PCB_KERNFPU
flag might leak from old thread state into the new state.

For creation of the user-mode thread, the change is nop since
corresponding syscall code does not use FPU.  On the other hand,
creation of a kernel thread forks from a thread selected arbitrary
from proc0, which might use FPU.

Reported and tested by:	Chris Torek <torek@torek.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-29 11:56:33 +00:00
kib
a6e5d8b248 Several fixes for the PCID implementation:
- When clearing a bit for a cpuid in pmap->pm_save, ensure that the
  cpuid is not set in pm_active.  The pm_save indicates which CPUs may
  have cached translations for given PCID, which implies that a CPU
  executing with the given pmap active have the translations
  cached. [1]

- In smp_masked_invltlb(), pass pmap to smp_targeted_tlb_shootdown(). [1]

- In invlrng_handler(), check for the special values of pcid (0 and
  -1) and do corresponding global or total invalidations before
  checking for performing PCID-specific range invalidation with
  INVPCID_ADDR. [2]

- In invltlb_pcid_handler(), do not read %cr3 unless needed. [2]

- Do minor style tweaks. [2]

Submitted by:	Henrik Gulbrandsen <henrik@gulbra.net> [1]
Other parts sponsored by:	The FreeBSD Foundation [2]
Tested by:	Henrik Gulbrandsen, pho
MFC after:	1 week
2014-03-28 16:07:27 +00:00
emaste
d2c99117cd Update EFI framebuffer handoff from loader
Sponsored by:	The FreeBSD Foundation
2014-03-27 19:43:38 +00:00
emaste
4a841fdff4 amd64: Parse the EFI memory map if present
With this change (and loader.efi from the projects/uefi branch) we can now
boot under qemu using the OVMF UEFI firmware image with the limitation
that a serial console is required.

(This is largely r246337 from the projects/uefi branch.)

Sponsored by:	The FreeBSD Foundation
2014-03-27 18:23:02 +00:00
neel
3e49998fdf Add an ioctl to suspend a virtual machine (VM_SUSPEND). The ioctl can be called
from any context i.e., it is not required to be called from a vcpu thread. The
ioctl simply sets a state variable 'vm->suspend' to '1' and returns.

The vcpus inspect 'vm->suspend' in the run loop and if it is set to '1' the
vcpu breaks out of the loop with a reason of 'VM_EXITCODE_SUSPENDED'. The
suspend handler waits until all 'vm->active_cpus' have transitioned to
'vm->suspended_cpus' before returning to userspace.

Discussed with:	grehan
2014-03-26 23:34:27 +00:00
imp
bd031ca10c Rather than require a makeoptions DEBUG to get debug correct,
add it in kern.mk, but only if we're using clang. While this
option is supported by both clang and gcc, in the future there
may be changes to clang which change the defaults that require
a tweak to build our kernel such that other tools in our tree
will work. Set a good example by forcing -gdwarf-2 only for
clang builds, and only if the user hasn't specified another
dwarf level already. Update UPDATING to reflect the changed
state of affairs. This also keeps us from having to update
all the ARM kernels to add this, and also keeps us from
in the future having to update all the MIPS kernels and is
one less place the user will have to know to do something
special for clang and one less thing developers will need
to do when moving an architecture to clang.

Reviewed by:	ian@
MFC after:	1 week
2014-03-25 22:08:31 +00:00
tychon
58699bc5fc Move the atpit device model from userspace into vmm.ko for better
precision and lower latency.

Approved by:	grehan (co-mentor)
2014-03-25 19:20:34 +00:00
bdrewery
6fcf6199a4 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
kib
7390415c58 Add change forgotten in r263475. Make dmaplimit accessible outside
amd64/pmap.c.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 17:17:19 +00:00
kib
24c4e4a548 Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, for accesses to direct map region should check for the limit by
which direct map is instantiated.

Second, for accesses to the kernel map, success returned from the
kernacc(9) does not guarantee that consequent attempt to read or write
to the checked address succeed, since other thread might invalidate
the address meantime.  Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return error when fault happens on the
MAP_ENTRY_NOFAULT entry, instead of panicing.  The trap handler would
then see a page fault from access, and recover in normal way, making
/dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by:	clusteradm (gjb, sbruno)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 14:25:09 +00:00
imp
9f008568e7 Remove vestiges of knowing the ISA bus, which we gave up on around 20
years ago. Remove redunant copy of isaregs.h.
2014-03-19 21:03:04 +00:00
markj
b41aca8e4d Only invoke fasttrap hooks for traps from user mode, and ensure that they're
called with interrupts enabled. Calling fasttrap_pid_probe() with interrupts
disabled can lead to deadlock if fasttrap writes to the process' address
space.

Reviewed by:	rpaulo
MFC after:	3 weeks
2014-03-19 01:27:56 +00:00
imp
ea27b8b541 In kernel config files, it is supposed to be 'options<space><tab>' not
'options<tab><tab>', per long standing (but recently not so strictly
enforced) convention.
2014-03-18 14:41:18 +00:00
neel
3818d66305 When a vcpu is deactivated it must also unblock any rendezvous that may be
blocked on it.

This is done by issuing a wakeup after clearing the 'vcpuid' from 'active_cpus'.
Also, use CPU_CLR_ATOMIC() to guarantee visibility of the updated 'active_cpus'
across all host cpus.
2014-03-18 02:49:28 +00:00
neel
9e498dc116 Notify vcpus participating in the rendezvous of the pending event to ensure
that they execute the rendezvous function as soon as possible.
2014-03-17 23:30:38 +00:00
imp
47e104941c Align all comments in config files on same column. This consistency
helps when bits and pieces of GENERIC from i386 or amd64 are cut and
pasted into other architecture's config files (which in the case of
ARM had gotten rather akimbo).
2014-03-16 15:22:52 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
tychon
5460439295 Fix a race wherein the source of an interrupt vector is wrongly
attributed if an ExtINT arrives during interrupt injection.

Also, fix a spurious interrupt if the PIC tries to raise an interrupt
before the outstanding one is accepted.

Finally, improve the PIC interrupt latency when another interrupt is
raised immediately after the outstanding one is accepted by creating a
vmexit rather than waiting for one to occur by happenstance.

Approved by:	neel (co-mentor)
2014-03-15 23:09:34 +00:00
rwatson
e78b9db504 Revert a small portion of r263198 left over from local testing: don't
enable PCB groups and RSS by default [yet].
2014-03-15 00:59:23 +00:00
rwatson
f411704afc Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
glebius
80e85e32a5 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
glebius
d494babace Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
imp
bf13b5b908 Delete stray clause 3 (Advertising clause) and renumber while i'm
here.

Approved by: alc@
2014-03-11 23:41:35 +00:00
tychon
9affb68b8d Don't try to return a vector to a caller that only cares if a vector
is pending or not.

Approved by:	neel (co-mentor)
2014-03-11 22:12:12 +00:00
imp
c4c8568cd0 Remove clause 3 (the advertising clause), per the regent's letter. 2014-03-11 17:20:50 +00:00
tychon
25c8b61cfd Replace the userspace atpic stub with a more functional vmm.ko model.
New ioctls VM_ISA_ASSERT_IRQ, VM_ISA_DEASSERT_IRQ and VM_ISA_PULSE_IRQ
can be used to manipulate the pic, and optionally the ioapic, pin state.

Reviewed by:	jhb, neel
Approved by:	neel (co-mentor)
2014-03-11 16:56:00 +00:00
royger
446e208ee2 xen: add a hook to perform AP startup
AP startup on PVH follows the PV method, so we need to add a hook in
order to diverge from bare metal.

Approved by: gibbs
Sponsored by: Citrix Systems R&D

amd64/amd64/machdep.c:
 - Add hook for start_all_aps on native (using native_start_all_aps
   defined in mp_machdep).

amd64/amd64/mp_machdep.c:
 - Make some variables global because they will also be used by the
   Xen PVH AP startup code.
 - Use the start_all_aps hook to start APs.
 - Rename start_all_aps to native_start_all_aps.

amd64/include/smp.h:
 - Add declaration for native_start_all_aps.

x86/include/init.h:
 - Declare start_all_aps hook in init_ops.

x86/xen/pv.c:
 - Pick external declarations from mp_machdep.
 - Introduce Xen PV code to start APs on PVH.
 - Set start_all_aps init hook to use the Xen PVH implementation.
2014-03-11 10:27:57 +00:00
royger
419270d8a7 xen: add hook for AP bootstrap memory reservation
This hook will only be implemented for bare metal, Xen doesn't require
any bootstrap code since APs are started in long mode with paging
enabled.

Approved by: gibbs
Sponsored by: Citrix Systems R&D

amd64/amd64/machdep.c:
 - Set mp_bootaddress hook for bare metal.

x86/include/init.h:
 - Define mp_bootaddress in init_ops.
2014-03-11 10:26:16 +00:00
royger
6b1be12234 xen: use the same hypercall mechanism for XEN and XENHVM
Currently XEN (PV) and XENHVM (PVHVM) ports use different ways to
issue hypercalls, unify this by filling the hypercall_page under HVM
also.

Approved by: gibbs
Sponsored by: Citrix Systems R&D

amd64/include/xen/hypercall.h:
 - Unify Xen hypercall code by always using the PV way.

i386/i386/locore.s:
 - Define hypercall_page on i386 XENHVM.

x86/xen/hvm.c:
 - Fill hypercall_page on XENHVM kernels using the HVM method (only
   when running as an HVM guest).
2014-03-11 10:24:13 +00:00
royger
891131cb52 xen: implement hook to fetch and parse e820 memory map
e820 memory map is fetched using a hypercall under Xen PVH, so add a
hook to init_ops in oder to diverge from bare metal and implement a
Xen variant.

Approved by: gibbs
Sponsored by: Citrix Systems R&D

x86/include/init.h:
 - Add a parse_memmap hook to init_ops, that will be called to fetch
   and parse the memory map.

amd64/amd64/machdep.c:
 - Decouple the fetch and the parse of the memmap, so the parse
   function can be shared with Xen code.
 - Move code around in order to implement the parse_memmap hook.

amd64/include/pc/bios.h:
 - Declare bios_add_smap_entries (implemented in machdep.c).

x86/xen/pv.c:
 - Implement fetching of e820 memmap when running as a PVH guest by
   using the XENMEM_memory_map hypercall.
2014-03-11 10:23:03 +00:00
royger
467e743960 xen: implement an early timer for Xen PVH
When running as a PVH guest, there's no emulated i8254, so we need to
use the Xen PV timer as the early source for DELAY. This change allows
for different implementations of the early DELAY function and
implements a Xen variant for it.

Approved by: gibbs
Sponsored by: Citrix Systems R&D

dev/xen/timer/timer.c:
dev/xen/timer/timer.h:
 - Implement Xen early delay functions using the PV timer and declare
   them.

x86/include/init.h:
 - Add hooks for early clock source initialization and early delay
   functions.

i386/i386/machdep.c:
pc98/pc98/machdep.c:
amd64/amd64/machdep.c:
 - Set early delay hooks to use the i8254 on bare metal.
 - Use clock_init (that will in turn make use of init_ops) to
   initialize the early clock source.

amd64/include/clock.h:
i386/include/clock.h:
 - Declare i8254_delay and clock_init.

i386/xen/clock.c:
 - Rename DELAY to i8254_delay.

x86/isa/clock.c:
 - Introduce clock_init that will take care of initializing the early
   clock by making use of the init_ops hooks.
 - Move non ISA related delay functions to the newly introduced delay
   file.

x86/x86/delay.c:
 - Add moved delay related functions.
 - Implement generic DELAY function that will use the init_ops hooks.

x86/xen/pv.c:
 - Set PVH hooks for the early delay related functions in init_ops.

conf/files.amd64:
conf/files.i386:
conf/files.pc98:
 - Add delay.c to the kernel build.
2014-03-11 10:20:42 +00:00
royger
4df602a6bf amd64: introduce hook for custom preload metadata parsers
Add hooks to amd64 in order to have diverging implementations, since
on Xen PV the metadata is passed to the kernel in a different form.

Approbed by: gibbs
Sponsored by: Citrix Systems R&D

amd64/amd64/machdep.c:
 - Define init_ops for native.
 - Put native code inside of native_parse_preload_data hook.
 - Call the parse_preload_data in order to fill the metadata info.

x86/include/init.h:
 - Declare the init_ops struct.

x86/xen/pv.c:
 - Declare xen_init_ops that contains the Xen PV implementation of
   init_ops.
 - Implement the parse_preload_data for Xen PVH, the info is fetched
   from HYPERVISOR_start_info->cmd_line as provided by Xen.
2014-03-11 10:15:25 +00:00
royger
5dd05db7ff xen: add PV/PVH kernel entry point
Add the PV/PVH entry point and the low level functions for PVH
early initialization.

Approved by: gibbs
Sponsored by: Citrix Systems R&D

amd64/amd64/genassym.c:
 - Add __FreeBSD_version define to assym.s so it can be used for the
   Xen notes.

amd64/amd64/locore.S:
 - Make bootstack global so it can be used from Xen kernel entry
   point.

amd64/amd64/xen-locore.S:
 - Add Xen notes to the kernel.
 - Add the Xen PV entry point, that is going to call hammer_time_xen.

amd64/include/asmacros.h:
 - Add ELFNOTE macros.

i386/xen/xen_machdep.c:
 - Define HYPERVISOR_start_info for the XEN i386 PV port, which is
   going to be used in some shared code between PV and PVH.

x86/xen/hvm.c:
 - Define HYPERVISOR_start_info for the PVH port.

x86/xen/pv.c:
 - Introduce hammer_time_xen which is going to perform early setup for
   Xen PVH:
    - Setup shared Xen variables start_info, shared_info and
      xen_store.
    - Set guest type.
    - Create initial page tables as FreeBSD expects to find them.
    - Call into native init function (hammer_time).

xen/xen-os.h:
 - Declare HYPERVISOR_start_info.

conf/files.amd64:
 - Add amd64/amd64/locore.S and x86/xen/pv.c to the list of files.
2014-03-11 10:07:01 +00:00
royger
27026f4f2a amd64/i386: switch IPI handlers to C code.
Move asm IPIs handlers to C code, so both Xen and native IPI handlers
share the same code.

Reviewed by: jhb
Approved by: gibbs
Sponsored by: Citrix Systems R&D

amd64/amd64/apic_vector.S:
i386/i386/apic_vector.s:
 - Remove asm coded IPI handlers and instead call the newly introduced
   C variants.

amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
 - Add C coded clones to the asm IPI handlers (moved from
   x86/xen/hvm.c).

i386/include/smp.h:
amd64/include/smp.h:
 - Add prototypes for the C IPI handlers.

x86/xen/hvm.c:
 - Move the C IPI handlers to mp_machdep and call those in the Xen IPI
   handlers.

i386/xen/mp_machdep.c:
 - Add dummy IPI handlers to the i386 Xen PV port (this port doesn't
   support SMP).
2014-03-11 10:03:29 +00:00
emaste
2ab45c505b Disable amd64 TLB Context ID (pcid) by default for now
There are a number of reports of userspace application crashes that
are "solved" by setting vm.pmap.pcid_enabled=0, including Java and the
x11/mate-terminal port (PR ports/184362).

I originally planned to disable this only in stable/10 (in r262753), but
it has been pointed out that additional crash reports on HEAD are not
likely to provide new insight into the problem.  The feature can easily
be enabled for testing.
2014-03-05 01:34:10 +00:00
jkim
9b4d3b43ca Move fpusave() wrapper for suspend hander to sys/amd64/amd64/fpu.c.
Inspired by:	jhb
2014-03-04 21:35:57 +00:00
jkim
e6f1aee9e4 Revert accidentally committed changes in 262748. 2014-03-04 20:16:00 +00:00
jkim
74dcbf5843 Properly save and restore CR0.
MFC after:	3 days
2014-03-04 20:07:36 +00:00
jkim
373cea9476 Remove dead code since r230426, fix a comment, and tidy up.
Reported by:	jhb
MFC after:	3 days
2014-03-04 19:41:16 +00:00
neel
4e6374765e Fix a race between VMRUN() and vcpu_notify_event() due to 'vcpu->hostcpu'
being updated outside of the vcpu_lock(). The race is benign and could
potentially result in a missed notification about a pending interrupt to
a vcpu. The interrupt would not be lost but rather delayed until the next
VM exit.

The vcpu's hostcpu is now updated concurrently with the vcpu state change.
When the vcpu transitions to the RUNNING state the hostcpu is set to 'curcpu'.
It is set to 'NOCPU' in all other cases.

Reviewed by:	grehan
2014-03-01 03:17:58 +00:00
jhb
4905c0f870 Correct VMware capitalization.
Submitted by:	joeld
2014-02-28 21:33:40 +00:00
jhb
4ce54b93eb Workaround an apparent bug in VMWare Fusion's nested VT support where it
triggers a VM exit with the exit reason of an external interrupt but
without a valid interrupt set in the exit interrupt information.

Tested by:	Michael Dexter
Reviewed by:	neel
MFC after:	1 week
2014-02-28 19:07:55 +00:00
neel
e01c440dae Queue pending exceptions in the 'struct vcpu' instead of directly updating the
processor-specific VMCS or VMCB. The pending exception will be delivered right
before entering the guest.

The order of event injection into the guest is:
- hardware exception
- NMI
- maskable interrupt

In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt
window-exiting and inject it as soon as possible after the hardware exception
is injected. Also since interrupts are inherently asynchronous, injecting
them after the hardware exception should not affect correctness from the
guest perspective.

Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict
it to only deliver x86 hardware exceptions. This new ioctl is now used to
inject a protection fault when the guest accesses an unimplemented MSR.

Discussed with:	grehan, jhb
Reviewed by:	jhb
2014-02-26 00:52:05 +00:00
grehan
e0e0829e5e MFC @ r259635
This brings in the "-w" option from bhyve to ignore unknown MSRs.
It will make debugging Linux guests a bit easier.

Suggested by:	Willem Jan Withagen (wjw at digiware nl)
2014-02-25 06:29:56 +00:00
alc
ebe945ff9f When the kernel is running in a virtual machine, it cannot rely upon the
processor family to determine if the workaround for AMD Family 10h Erratum
383 should be enabled.  To enable virtual machine migration among a
heterogeneous collection of physical machines, the hypervisor may have
been configured to report an older processor family with a reduced feature
set.  Effectively, the reported processor family and its features are like
a "least common denominator" for the collection of machines.

Therefore, when the kernel is running in a virtual machine, instead of
relying upon the processor family, we now test for features that prove
that the underlying processor is not affected by the erratum.  (The
features that we test for are unlikely to ever be emulated in software
on an affected physical processor.)

PR:		186061
Tested by:	Simon Matter
Discussed with:	jhb, neel
MFC after:	2 weeks
2014-02-22 18:53:42 +00:00
neel
3e0732cf3e Add support for x2APIC virtualization assist in Intel VT-x.
The vlapic.ops handler 'enable_x2apic_mode' is called when the vlapic mode
is switched to x2APIC. The VT-x implementation of this handler turns off the
APIC-access virtualization and enables the x2APIC virtualization in the VMCS.

The x2APIC virtualization is done by allowing guest read access to a subset
of MSRs in the x2APIC range. In non-root operation the processor will satisfy
an 'rdmsr' access to these MSRs by reading from the virtual APIC page instead.

The guest is also given write access to TPR, EOI and SELF_IPI MSRs which
get special treatment in non-root operation. This is documented in the
Intel SDM section titled "Virtualizing MSR-Based APIC Accesses".

Enforce that APIC-write and APIC-access VM-exits are handled only if
APIC-access virtualization is enabled. The one exception to this is
SELF_IPI virtualization which may result in an APIC-write VM-exit.
2014-02-21 06:03:54 +00:00
neel
4626d164b8 Simplify APIC mode switching from MMIO to x2APIC. In part this is done to
simplify the implementation of the x2APIC virtualization assist in VT-x.

Prior to this change the vlapic allowed the guest to change its mode from
xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked
when the virtual machine is created. This is not very constraining because
operating systems already have to deal with BIOS setting up the APIC in
x2APIC mode at boot.

Fix a bug in the CPUID emulation where the x2APIC capability was leaking
from the host to the guest.

Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore
MSR accesses to the vlapic when it is in xAPIC mode.

The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8)
can be used to change the mode to x2APIC instead.

Discussed with:	grehan@
2014-02-20 01:48:25 +00:00
jhb
521737384d A first pass at adding support for injecting hardware exceptions for
emulated instructions.
- Add helper routines to inject interrupt information for a hardware
  exception from the VM exit callback routines.
- Use the new routines to inject GP and UD exceptions for invalid
  operations when emulating the xsetbv instruction.
- Don't directly manipulate the entry interrupt info when a user event
  is injected.  Instead, store the event info in the vmx state and
  only apply it during a VM entry if a hardware exception or NMI is
  not already pending.
- While here, use HANDLED/UNHANDLED instead of 1/0 in a couple of
  routines.

Reviewed by:	neel
2014-02-18 03:07:36 +00:00
neel
f9781635be Handle writes to the SELF_IPI MSR by the guest when the vlapic is configured
in x2apic mode. Reads to this MSR are currently ignored but should cause a
general proctection exception to be injected into the vcpu.

All accesses to the corresponding offset in xAPIC mode are ignored.

Also, do not panic the host if there is mismatch between the trigger mode
programmed in the TMR and the actual interrupt being delivered. Instead the
anomaly is logged to aid debugging and to prevent a misbehaving guest from
panicking the host.
2014-02-17 23:07:16 +00:00
neel
19c649a6fb Use spinlocks to lock accesses to the vioapic.
This is necessary because if the vlapic is configured in x2apic mode the
vioapic_process_eoi() function is called inside the critical section
established by vm_run().
2014-02-17 22:57:51 +00:00
dim
a8b6bed223 Upgrade our copy of llvm/clang to 3.4 release. This version supports
all of the features in the current working draft of the upcoming C++
standard, provisionally named C++1y.

The code generator's performance is greatly increased, and the loop
auto-vectorizer is now enabled at -Os and -O2 in addition to -O3.  The
PowerPC backend has made several major improvements to code generation
quality and compile time, and the X86, SPARC, ARM32, Aarch64 and SystemZ
backends have all seen major feature work.

Release notes for llvm and clang can be found here:
<http://llvm.org/releases/3.4/docs/ReleaseNotes.html>
<http://llvm.org/releases/3.4/tools/clang/docs/ReleaseNotes.html>

MFC after:	1 month
2014-02-16 19:44:07 +00:00
brueffer
4c9c4234e2 Retire the nve(4) driver; nfe(4) has been the default driver for NVIDIA
nForce MCP adapters for a long time.

Yays:	jhb, remko, yongari
Nays:	none on the current and stable lists
2014-02-16 12:22:43 +00:00
avg
84460fd88d provide fast versions of ffsl and flsl for i386; ffsll and flsll for amd64
Reviewed by:	jhb
MFC after:	10 days
X-MFC note:	consider thirdparty modules depending on these symbols
Sponsored by:	HybridCluster
2014-02-14 15:18:37 +00:00
jhb
6e6e271c34 Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
  PCI domain/segment.  Host-PCI bridge drivers use this API to allocate
  bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
  their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
  full range of bus numbers from secbus to subbus from their parent bridge.
  The drivers also always program their primary bus register.  The bridge
  drivers also support growing their bus range by extending the bus resource
  and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
  used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.

Reviewed by:	imp
MFC after:	1 month
2014-02-12 04:30:37 +00:00
jhb
22ac1edc7b Don't waste a page of KVA for the boot-time memory test on x86. For amd64,
reuse the first page of the crashdumpmap as CMAP1/CADDR1.  For i386,
remove CMAP1/CADDR1 entirely and reuse CMAP3/CADDR3 for the memory test.

Reviewed by:	alc, peter
MFC after:	2 weeks
2014-02-11 22:02:40 +00:00
jhb
0cdfc370bf Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and
XSAVE-enabled features like AVX.
- Store a per-cpu guest xcr0 register.  When switching to the guest FPU
  state, switch to the guest xcr0 value.  Note that the guest FPU state is
  saved and restored using the host's xcr0 value and xcr0 is saved/restored
  "inside" of saving/restoring the guest FPU state.
- Handle VM exits for the xsetbv instruction by updating the guest xcr0.
- Expose the XSAVE feature to the guest only if the host has enabled XSAVE,
  and only advertise XSAVE features enabled by the host to the guest.
  This ensures that the guest will only adjust FPU state that is a subset
  of the guest FPU state saved and restored by the host.

Reviewed by:	grehan
2014-02-08 16:37:54 +00:00
neel
5b2777fc37 Add a counter to differentiate between VM-exits due to nested paging faults
and instruction emulation faults.
2014-02-08 06:22:09 +00:00
neel
d79f5b61e6 Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI).
If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the
CPU when the VM-exit is completed. No more NMIs will be recognized until
the execution of an "iret".

Prior to this change the NMI handler was dispatched via a software interrupt
with interrupts enabled. This meant that an interrupt could be recognized
by the processor before the NMI handler completed its execution. The "iret"
issued by the interrupt handler would then cause the "blocking by NMI" to
be cleared prematurely.

This is now fixed by handling the NMI with interrupts disabled in addition
to "blocking by NMI" already established by the VM-exit.
2014-02-08 05:04:34 +00:00
jhb
f4e46bef98 Add support for FreeBSD/i386 guests under bhyve.
- Similar to the hack for bootinfo32.c in userboot, define
  _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot.
  This allows userboot to load 32-bit kernels and modules.
- Copy the SMAP generation code out of bootinfo64.c and into its own
  file so it can be shared with bootinfo32.c to pass an SMAP to the i386
  kernel.
- Use uint32_t instead of u_long when aligning module metadata in
  bootinfo32.c in userboot, as otherwise the metadata used 64-bit
  alignment which corrupted the layout.
- Populate the basemem and extmem members of the bootinfo struct passed
  to 32-bit kernels.
- Fix the 32-bit stack in userboot to start at the top of the stack
  instead of the bottom so that there is room to grow before the
  kernel switches to its own stack.
- Push a fake return address onto the 32-bit stack in addition to the
  arguments normally passed to exec() in the loader.  This return
  address is needed to convince recover_bootinfo() in the 32-bit
  locore code that it is being invoked from a "new" boot block.
- Add a routine to libvmmapi to setup a 32-bit flat mode register state
  including a GDT and TSS that is able to start the i386 kernel and
  update bhyveload to use it when booting an i386 kernel.
- Use the guest register state to determine the CPU's current instruction
  mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long
  mode) in the instruction emulation code.  Update the gla2gpa() routine
  used when fetching instructions to handle flat mode, 32-bit paging, and
  PAE paging in addition to long mode paging.  Don't look for a REX
  prefix when the CPU is in 32-bit mode, and use the detected mode to
  enable the existing 32-bit mode code when decoding the mod r/m byte.

Reviewed by:	grehan, neel
MFC after:	1 month
2014-02-05 04:39:03 +00:00
tychon
c9758b2e1c Add support for emulating the byte move and zero extend instructions:
"mov r/m8, r32" and "mov r/m8, r64".

Approved by:	neel (co-mentor)
2014-02-05 02:01:08 +00:00
grehan
9b01253af1 Changes to the SVM code to bring it up to r259205
- Convert VMM_CTR to VCPU_CTR KTR macros
 - Special handling of halt, save rflags for VMM layer to emulate
   halt for vcpu(sleep to be awakened by interrupt or stop it)
 - Cleanup of RVI exit handling code

Submitted by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2014-02-04 07:13:56 +00:00
grehan
db98868c2f MFC @ r259205 in preparation for some SVM updates. (for real this time) 2014-02-04 06:59:08 +00:00
grehan
f115890f37 Roll back botched partial MFC :( 2014-02-04 05:03:14 +00:00
neel
8f40ca632d Avoid doing unnecessary nested TLB invalidations.
Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu
and per-hostcpu. In the degenerate case where 'N' vcpus were sharing
a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations.
Since an 'invept' invalidates mappings for all VPIDs the first 'invept'
is sufficient.

Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'.

If it is known that an 'invept' is going to be done before entering the
guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED
counts the number of 'invvpid' invalidations that were avoided because
they were subsumed by an 'invept'.

Discussed with:	grehan
2014-02-04 02:45:08 +00:00
grehan
0ee348bd50 MFC @ r259205 in preparation for some SVM updates. 2014-02-04 02:41:54 +00:00
jhb
3f6ca218c6 Enhance the support for PCI legacy INTx interrupts and enable them in
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
  to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
  ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
  interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
  PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
  that share a slot attempt to distribute their INTx interrupts across
  the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
  and when the INTx DIS bit is set in a function's PCI command register.
  Either assert or deassert the associated I/O APIC pin when the
  state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.

Submitted by:	neel (7)
Reviewed by:	neel
MFC after:	2 weeks
2014-01-29 14:56:48 +00:00
jhb
0fa3283d5a Add support for 'clac' and 'stac' to DDB's disassembler on amd64. 2014-01-27 18:53:18 +00:00
neel
872f585846 Support level triggered interrupts with VT-x virtual interrupt delivery.
The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector.
If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger
an EOI-induced VM-exit when it is doing EOI virtualization.

The EOI-induced VM-exit results in the EOI being forwarded to the vioapic
so that level triggered interrupts can be properly handled.

Tested by:	Anish Gupta (akgupt3@gmail.com)
2014-01-25 20:58:05 +00:00
grehan
e0134ab4ff Change RWX to XWR in comments to match intent and bit patterns
in discussion of valid EPT pte protections.

Discussed with:	neel
MFC after:	3 days
2014-01-25 06:58:41 +00:00
jhb
b2533ec507 Move <machine/apicvar.h> to <x86/apicvar.h>. 2014-01-23 20:10:22 +00:00
neel
232e34343e Set "Interrupt Window Exiting" in the case where there is a vector to be
injected into the vcpu but the VM-entry interruption information field
already has the valid bit set.

Pointed out by:	David Reed (david.reed@tidalscale.com)
2014-01-23 06:06:50 +00:00
neel
e971318156 Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler
via a software interrupt.

This is safe to do because the logical processor is already cognizant of the
NMI and further NMIs are blocked until the host's NMI handler executes "iret".
2014-01-22 04:03:11 +00:00
neel
daa658a5dd There is no need to initialize the IOMMU if no passthru devices have been
configured for bhyve to use.

Suggested by:	grehan@
2014-01-21 03:01:34 +00:00
emaste
13c3304ef6 Add VT kernel configuration to ease testing of vt(9), aka Newcons 2014-01-19 18:46:38 +00:00
neel
e74998b34d Some processor's don't allow NMI injection if the STI_BLOCKING bit is set in
the Guest Interruptibility-state field. However, there isn't any way to
figure out which processors have this requirement.

So, inject a pending NMI only if NMI_BLOCKING, MOVSS_BLOCKING, STI_BLOCKING
are all clear. If any of these bits are set then enable "NMI window exiting"
and inject the NMI in the VM-exit handler.
2014-01-18 21:47:12 +00:00
bryanv
31dc7c36ca Add very simple virtio_random(4) driver to harvest entropy from host
Reviewed by:	markm (random bits only)
2014-01-18 06:14:38 +00:00
neel
4dec904890 If the guest exits due to a fault while it is executing IRET then restore
the state of "Virtual NMI blocking" in the guest's interruptibility-state
field before resuming the guest.
2014-01-18 02:20:10 +00:00
neel
77ef1cf997 If a VM-exit happens during an NMI injection then clear the "NMI Blocking" bit
in the Guest Interruptibility-state VMCS field.

If we fail to do this then a subsequent VM-entry will fail because it is an
error to inject an NMI into the guest while "NMI Blocking" is turned on. This
is described in "Checks on Guest Non-Register State" in the Intel SDM.

Submitted by:	David Reed (david.reed@tidalscale.com)
2014-01-17 04:21:39 +00:00
neel
0bd53a85fb Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous
can be initiated in the context of a vcpu thread or from the bhyve(8) control
process.

The first use of this functionality is to update the vlapic trigger-mode
register when the IOAPIC pin configuration is changed.

Prior to this change we would update the TMR in the virtual-APIC page at
the time of interrupt delivery. But this doesn't work with Posted Interrupts
because there is no way to program the EOI_exit_bitmap[] in the VMCS of
the target at the time of interrupt delivery.

Discussed with:	grehan@
2014-01-14 01:55:58 +00:00
gavin
6265cc52b1 Remove spaces from boot messages when we print the CPU ID/Family/Stepping
to match the rest of the CPU identification lines, and once again fit
into 80 columns in the usual case.
2014-01-11 22:41:10 +00:00
neel
ec09639132 Enable "Posted Interrupt Processing" if supported by the CPU. This lets us
inject interrupts into the guest without causing a VM-exit.

This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir"
to "0".

The following sysctls provide information about this feature:
- hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled)
- hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification)

Tested on a Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.
2014-01-11 04:22:00 +00:00
neel
41814122f1 Enable the "Acknowledge Interrupt on VM exit" VM-exit control.
This control is needed to enable "Posted Interrupts" and is present in all
the Intel VT-x implementations supported by bhyve so enable it as the default.

With this VM-exit control enabled the processor will acknowledge the APIC and
store the vector number in the "VM-Exit Interruption Information" field. We
now call the interrupt handler "by hand" through the IDT entry associated
with the vector.
2014-01-11 03:14:05 +00:00
neel
00a86f71de Don't expose 'vmm_ipinum' as a global. 2014-01-09 03:25:54 +00:00
neel
ab2de99290 Use the 'Virtual Interrupt Delivery' feature of Intel VT-x if supported by
hardware. It is possible to turn this feature off and fall back to software
emulation of the APIC by setting the tunable hw.vmm.vmx.use_apic_vid to 0.

We now start handling two new types of VM-exits:

APIC-access: This is a fault-like VM-exit and is triggered when the APIC
register access is not accelerated (e.g. apic timer CCR). In response to
this we do emulate the instruction that triggered the APIC-access exit.

APIC-write: This is a trap-like VM-exit which does not require any instruction
emulation but it does require the hypervisor to emulate the access to the
specified register (e.g. icrlo register).

Introduce 'vlapic_ops' which are function pointers to vector the various
vlapic operations into processor-dependent code. The 'Virtual Interrupt
Delivery' feature installs 'ops' for setting the IRR bits in the virtual
APIC page and to return whether any interrupts are pending for this vcpu.

Tested on an "Intel Xeon E5-2620 v2" courtesy of Allan Jude at ScaleEngine.
2014-01-07 21:04:49 +00:00
neel
23ea3a1c59 Fix a bug introduced in r260167 related to VM-exit tracing.
Keep a copy of the 'rip' and the 'exit_reason' and use that when calling
vmx_exit_trace(). This is because both the 'rip' and 'exit_reason' can
be changed by 'vmx_exit_process()' and can lead to very misleading traces.
2014-01-07 18:53:14 +00:00
neel
b47601c298 Allow vlapic_set_intr_ready() to return a value that indicates whether or not
the vcpu should be kicked to process a pending interrupt. This will be useful
in the implementation of the Posted Interrupt APICv feature.

Change the return value of 'vlapic_pending_intr()' to indicate whether or not
an interrupt is available to be delivered to the vcpu depending on the value
of the PPR.

Add KTR tracepoints to debug guest IPI delivery.
2014-01-07 00:38:22 +00:00
neel
35066c7a68 Split the VMCS setup between 'vmcs_init()' that does initialization and
'vmx_vminit()' that does customization.

This makes it easier to turn on optional features (e.g. APICv) without
having to keep adding new parameters to 'vmcs_set_defaults()'.

Reviewed by:	grehan@
2014-01-06 23:16:39 +00:00
schweikh
916cce9b6e Correct a grammo in a comment; remove white space at EOL. 2014-01-06 17:23:22 +00:00
neel
96b61bcd13 Use the same label name for ENTRY() and END() macros for 'vmx_enter_guest'.
Pointed out by:	rmh@
2014-01-03 19:29:33 +00:00
neel
141215c4ae Fix a bug in the HPET emulation where a timer interrupt could be lost when the
guest disables the HPET.

The HPET timer interrupt is triggered from the callout handler associated with
the timer. It is possible for the callout handler to be delayed before it gets
a chance to execute. If the guest disables the HPET during this window then the
handler never gets a chance to execute and the timer interrupt is lost.

This is now fixed by injecting a timer interrupt into the guest if the callout
time is detected to be in the past when the HPET is disabled.
2014-01-03 19:25:52 +00:00
kib
57c75d298a Update the description for pmap_remove_pages() to match the modern
times [1].  Assert that the pmap passed to pmap_remove_pages() is only
active on current CPU.

Submitted by:	alc [1]
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-02 18:50:52 +00:00
kib
d891f2b2cb Assert that accounting for the pmap resident pages does not underflow.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-02 18:49:05 +00:00
neel
e2a56f4497 Restructure the VMX code to enter and exit the guest. In large part this change
hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used
to enter guest context and vmx_exit_guest() is used to transition back into
host context.

Fix a longstanding race where a vcpu interrupt notification might be ignored
if it happens after vmx_inject_interrupts() but before host interrupts are
disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with
host interrupts disabled to prevent this.

Suggested by:	grehan@
2014-01-01 21:17:08 +00:00
dim
968059a1fa In sys/amd64/amd64/pmap.c, remove static function pmap_is_current(),
which has been unused since r189415.

Reviewed by:	alc
MFC after:	3 days
2013-12-30 20:37:47 +00:00
neel
d6864450cf Modify handling of writes to the vlapic LVT registers.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

This also implies that we need to keep a snapshot of the last value written
to a LVT register. We can no longer rely on the LVT registers in the APIC
page to be "clean" because the guest can write anything to it before the
hypervisor has had a chance to sanitize it.
2013-12-28 00:20:55 +00:00
neel
31726cba84 Modify handling of writes to the vlapic ICR_TIMER, DCR_TIMER, ICRLO and ESR
registers.

The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

We can no longer rely on the value of 'icr_timer' on the APIC page
in the callout handler. With APIC register virtualization the value of
'icr_timer' will be updated by the processor in guest-context before an
APIC-write VM-exit.

Clear the 'delivery status' bit in the ICRLO register in the write handler.
With APIC register virtualization the write happens in guest-context and
we cannot prevent a (buggy) guest from setting this bit.
2013-12-27 20:18:19 +00:00
dim
90c5ea9fad In sys/amd64/vmm/intel/vmx.c, silence a (incorrect) gcc warning about
regval possibly being used uninitialized.

Reviewed by:	neel
2013-12-27 12:15:53 +00:00
neel
a369a289ee Modify handling of write to the vlapic SVR register.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, mask all the LVT entries when the vlapic is software-disabled.
2013-12-27 07:01:42 +00:00
neel
4a551f35c9 Modify handling of writes to the vlapic ID, LDR and DFR registers.
The handlers are now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, we need to ensure that the value of these registers is always
correctly reflected in the virtual APIC page, because there is no VM exit
when the guest reads these registers with APIC register virtualization.
2013-12-26 19:58:30 +00:00
neel
0c91ef8145 vlapic code restructuring to make it easy to support hardware-assist for APIC
emulation.

The vlapic initialization and cleanup is done via processor specific vmm_ops.
This will allow the VT-x/SVM modules to layer any hardware-assist for APIC
emulation or virtual interrupt delivery on top of the vlapic device model.

Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic
interrupts versus other events (e.g. NMI). This provides an opportunity to
use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM)
to deliver an interrupt to a guest without causing a VM-exit.

Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the
vlapic_xxx() counterparts directly.

Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'.
The 'Apic Page' is intended to be referenced from the Intel VMCS as the
'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
2013-12-25 06:46:31 +00:00
jhb
63c019063a Add a resume hook for bhyve that runs a function on all CPUs during
resume.  For Intel CPUs, invoke vmxon for CPUs that were in VMX mode
at the time of suspend.

Reviewed by:	neel
2013-12-23 19:48:22 +00:00
jhb
8ab82a5fe1 Extend the support for local interrupts on the local APIC:
- Add a generic routine to trigger an LVT interrupt that supports both
  fixed and NMI delivery modes.
- Add an ioctl and bhyvectl command to trigger local interrupts inside a
  guest.  In particular, a global NMI similar to that raised by SERR# or
  PERR# can be simulated by asserting LINT1 on all vCPUs.
- Extend the LVT table in the vCPU local APIC to support CMCI.
- Flesh out the local APIC error reporting a bit to cache errors and
  report them via ESR when ESR is written to.  Add support for asserting
  the error LVT when an error occurs.  Raise illegal vector errors when
  attempting to signal an invalid vector for an interrupt or when sending
  an IPI.
- Ignore writes to reserved bits in LVT entries.
- Export table entries the MADT and MP Table advertising the stock x86
  config of LINT0 set to ExtInt and LINT1 wired to NMI.

Reviewed by:	neel (earlier version)
2013-12-23 19:29:07 +00:00
neel
e78d8c9833 Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE
state before the requested state transition. This guarantees that there is
exactly one ioctl() operating on a vcpu at any point in time and prevents
unintended state transitions.

More details available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html

Reviewed by:	grehan
Reported by:	Markiyan Kushnir (markiyan.kushnir at gmail.com)
MFC after:	3 days
2013-12-22 20:29:59 +00:00
neel
5e963f9c65 Consolidate the virtual apic initialization in a single function: vlapic_reset() 2013-12-22 00:08:00 +00:00
neel
fffd41bbf1 Re-arrange bits in the amd64/pmap 'pm_flags' field.
The least significant 8 bits of 'pm_flags' are now used for the IPI vector
to use for nested page table TLB shootdown.

Previously we used IPI_AST to interrupt the host cpu which is functionally
correct but could lead to misleading interrupt counts for AST handler. The
AST handler was also doing a lot more than what is required for the nested
page table TLB shootdown (EOI and IRET).
2013-12-20 05:50:22 +00:00
grehan
afd2340c15 Enable memory overcommit for AMD processors.
- No emulation of A/D bits is required since AMD-V RVI
supports A/D bits.
 - Enable pmap PT_RVI support(w/o PAT) which is required for
memory over-commit support.
 - Other minor fixes:
 * Make use of VMCB EXITINTINFO field. If a #VMEXIT happens while
delivering an interrupt, EXITINTINFO has all the details that bhyve
needs to inject the same interrupt.
 * SVM h/w decode assist code was incomplete - removed for now.
 * Some minor code clean-up (more coming).

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-12-18 23:39:42 +00:00
grehan
aacc216bbf MFC @ r256071
This is the change where the bhyve_npt_pmap branch was
merged in to head.

The SVM changes to work with this will be in a follow-on
submit.
2013-12-18 22:31:53 +00:00
neel
104d35e573 Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite()
respectively. The vmcs_xxx() functions provide inline error checking of
all accesses to the VMCS.
2013-12-18 06:24:21 +00:00
neel
e62c100b90 Add an API to deliver message signalled interrupts to vcpus. This allows
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by:	grehan@
2013-12-16 19:59:31 +00:00
neel
4741c4b7e2 Fix typo when initializing the vlapic version register ('<<' instead of '<'). 2013-12-11 06:28:44 +00:00
neel
3d4a180923 Fix x2apic support in bhyve.
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.

Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.

Reviewed by:	grehan@
2013-12-10 22:56:51 +00:00
jhb
8f255ea165 Move constants for indices in the local APIC's local vector table from
apicvar.h to apicreg.h.
2013-12-09 21:08:52 +00:00
neel
e7ebb9541a Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit.
This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is
now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'.

Discussed with:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 23:11:12 +00:00
neel
127d791c3e If a vcpu disables its local apic and then executes a 'HLT' then spin down the
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.

With this change "halt" from a Linux guest works correctly.

Reviewed by:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 22:18:36 +00:00
jhb
0f5dd6ee19 Fix a typo. 2013-12-05 21:58:02 +00:00
neel
d3a0e92020 The 'protection' field in the VM exit collateral for the PAGING exit is not
used - get rid of it.
2013-12-03 01:21:21 +00:00
neel
a581c446d6 Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.

Reviewed by:	grehan@
2013-12-03 00:43:31 +00:00
eadler
44c01df173 Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this
shifts into the sign bit.  Instead use (1U << 31) which gets the
expected result.

This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.

A similar change was made in OpenBSD.

Discussed with:	-arch, rdivacky
Reviewed by:	cperciva
2013-11-30 22:17:27 +00:00
pjd
4ac2e7d8d9 Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after:	3 days
2013-11-30 15:08:35 +00:00