Commit Graph

6865 Commits

Author SHA1 Message Date
Neel Natu
2591ee3e80 Remove a bogus check that flagged an error if the guest %rip was zero.
An AP begins execution with %rip set to 0 after a startup IPI.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:46:22 +00:00
Neel Natu
5e467bd098 Make the KTR tracepoints uniform and ensure that every VM-exit is logged.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-10 01:37:32 +00:00
Neel Natu
a2901ce7ad Allow guest read access to MSR_EFER without hypervisor intervention.
Dirty the VMCB_CACHE_CR state cache when MSR_EFER is modified.
2014-09-10 01:10:53 +00:00
Neel Natu
501f03eba2 Remove gratuitous forward declarations.
Remove tabs on empty lines.
2014-09-09 23:39:43 +00:00
Neel Natu
a268481428 Do proper ASID management for guest vcpus.
Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. The meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.

With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.

Discussed with:	grehan
Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-09-06 19:02:52 +00:00
John Baldwin
b1d735ba4c Create a separate structure for per-CPU state saved across suspend and
resume that is a superset of a pcb.  Move the FPU state out of the pcb and
into this new structure.  As part of this, move the FPU resume code on
amd64 into a C function.  This allows resumectx() to still operate only on
a pcb and more closely mirrors the i386 code.

Reviewed by:	kib (earlier version)
2014-09-06 15:23:28 +00:00
Neel Natu
3865879753 Merge svm_set_vmcb() and svm_init_vmcb() into a single function that is called
just once when a vcpu is initialized.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-05 03:33:16 +00:00
Pedro F. Giffuni
11db54f172 Apply known workarounds for modern MacBooks.
The legacy USB circuit tends to give trouble on MacBook.
While the original report covered MacBook, extend the fix
preemptively for the newer MacBookPro too.

PR:		191693
Reviewed by:	emaste
MFC after:	5 days
2014-09-05 01:06:45 +00:00
Mark Johnston
a58b4afa9f Add mrsas(4) to GENERIC for i386 and amd64.
Approved by:	ambrisko, kadesai
MFC after:	3 days
2014-09-04 21:06:33 +00:00
John Baldwin
33a50f1b0f Merge the amd64 and i386 identcpu.c into a single x86 implementation.
This brings the structured extended features mask and VT-x reporting to
i386 and Intel cache and TLB info (under bootverbose) to amd64.
2014-09-04 14:26:25 +00:00
Neel Natu
bf4993ba2a Remove unused header file.
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:07:32 +00:00
Neel Natu
fea6bd5cd3 Consolidate the code to restore the host TSS after a #VMEXIT into a single
function restore_host_tss().

Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.

Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-04 06:00:18 +00:00
John Baldwin
2b793beefd Remove trailing whitespace. 2014-09-04 01:56:15 +00:00
John Baldwin
7fb40488d6 - Move prototypes for various functions into out of C files and into
<machine/md_var.h>.
- Move some CPU-related variables out of i386/i386/identcpu.c to
  initcpu.c to match amd64.
- Move the declaration of has_f00f_hack out of identcpu.c to machdep.c.
- Remove a misleading comment from i386/i386/initcpu.c (locore zeros
  the BSS before it calls identify_cpu()) and remove explicit zero
  assignments to reduce the diff with amd64.
2014-09-04 01:46:06 +00:00
Neel Natu
246e7a2b64 IFC @r269962
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-09-02 04:22:42 +00:00
Alan Cox
ad2e88a14f Update a comment to reflect the changes in r213408.
MFC after:	5 days
2014-09-02 04:11:20 +00:00
Neel Natu
4c98655ece The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.

While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.

Update the status bits in %rflags when emulating AND and OR opcodes.

Reviewed by:	grehan
2014-08-30 19:59:42 +00:00
Pedro F. Giffuni
ec4a0b4408 Minor space/tab cleanups.
Most of them were ripped from the GSoC 2104
SMAP + kpatch project.
This is only a cosmetic change.

Taken from:	Oliver Pinter (op@)
MFC after:	5 days
2014-08-30 15:41:07 +00:00
John Baldwin
89871cdeb6 - Add a new structure type for the ACPI 3.0 SMAP entry that includes the
optional attributes field.
- Add a 'machdep.smap' sysctl that exports the SMAP table of the running
  system as an array of the ACPI 3.0 structure.  (On older systems, the
  attributes are given a value of zero.)  Note that the sysctl only
  exports the SMAP table if it is available via the metadata passed from
  the loader to the kernel.  If an SMAP is not available, an empty array
  is returned.
- Add a format handler for the ACPI 3.0 SMAP structure to the sysctl(8)
  binary to format the SMAP structures in a readable format similar to
  the format found in boot messages.

MFC after:	2 weeks
2014-08-29 21:25:47 +00:00
Peter Grehan
fc3dde9099 Implement the 0x2B SUB instruction, and the OR variant of 0x81.
Found with local APIC accesses from bitrig/amd64 bsd.rd, 07/15-snap.

Reviewed by:	neel
MFC after:	3 days
2014-08-27 00:53:56 +00:00
Neel Natu
48e8c2137a An exception is allowed to be injected even if the vcpu is in an interrupt
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.

Change return type of 'vmcb_eventinject()' to a void and convert all error
returns into KASSERTs.

Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.

Reviewed by:    Anish Gupta (akgupt3@gmail.com)
2014-08-25 00:58:20 +00:00
Peter Grehan
7f21538b6e Change __inline style to be consistent with FreeBSD usage,
and also fix gcc build (on STABLE, when MFCd).

PR:	192880
Reviewed by:	neel
Reported by:	ngie
MFC after:	1 day
2014-08-24 02:07:34 +00:00
Neel Natu
8bd3845d3c Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.

Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID
leaf 0xb. This is intended for testing guest behavior when it falls back
on using CPUID leaf 0x4 to deduce CPU topology.

The default behavior is to advertise each vcpu as a core in a separate soket.
2014-08-24 01:10:06 +00:00
Neel Natu
534dc967d7 Fix a bug in the emulation of CPUID leaf 0x4 where bhyve was claiming that
the vcpu had no caches at all. This causes problems when executing applications
in the guest compiled with the Intel compiler.

Submitted by:	Mark Hill (mark.hill@tidalscale.com)
2014-08-23 22:44:31 +00:00
Neel Natu
7a244722d1 Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot
find any unmasked pin with an interrupt asserted.

Reviewed by:	tychon
CR:		https://reviews.freebsd.org/D669
MFC after:	1 week
2014-08-23 21:16:26 +00:00
John Baldwin
669eac89c5 Fix build of si(4) and enable it in LINT on amd64 and i386. 2014-08-20 16:07:17 +00:00
John Baldwin
64d6de263b Bump MAXCPU on amd64 from 64 to 256. In practice APIC only permits 255
CPUs (IDs 0 through 254).  Getting above that limit requires x2APIC.

MFC after:	1 month
2014-08-20 16:06:24 +00:00
Konstantin Belousov
3165194c6b Increase max number of physical segments on amd64 to 63.
Eventually, the vmd_segs of the struct vm_domain should become bitset
instead of long, to allow arbitrary compile-time selected maximum.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-20 08:07:08 +00:00
Alan Cox
ada1ae623e There exists a possible sequence of page table page allocation failures
starting with a superpage demotion by pmap_enter() that could result in
a PV list lock being held when pmap_enter() is just about to return
KERN_RESOURCE_SHORTAGE.  Consequently, the KASSERT that no PV list locks
are held needs to be replaced with a conditional unlock.

Discussed with:	kib
X-MFC with:	r269728
Sponsored by:	EMC / Isilon Storage Division
2014-08-18 20:28:08 +00:00
Gavin Atkinson
67e3b91b31 Update i386/NOTES and amd64/NOTES files to contain the complete list of
firmwares for iwn(4) and sort them.

MFC after:	1 week
2014-08-14 18:29:55 +00:00
Neel Natu
4eec602102 Reword comment to match the interrupt mode names from the MPtable spec.
Reviewed by:	tychon
2014-08-14 18:03:38 +00:00
Neel Natu
477867a0e5 Use the max guest memory address when creating its iommu domain.
Also, assert that the GPA being mapped in the domain is less than its maxaddr.

Reviewed by:	grehan
Pointed out by:	Anish Gupta (akgupt3@gmail.com)
2014-08-14 05:00:45 +00:00
Alan Cox
4d33fe39e4 Update the text of a KASSERT() to reflect the changes in r269728. 2014-08-09 17:13:02 +00:00
Konstantin Belousov
39ffa8c138 Change pmap_enter(9) interface to take flags parameter and superpage
mapping size (currently unused).  The flags includes the fault access
bits, wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep.

For powerpc aim both 32 and 64 bit, fix implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified, in particular, wait for the available memory required to
proceed.

In collaboration with:	alc
Tested by:	nwhitehorn (ppc aim32 and booke)
Sponsored by:	The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after:	2 weeks
2014-08-08 17:12:03 +00:00
Neel Natu
12a6eb99a1 Support PCI extended config space in bhyve.
Add the ACPI MCFG table to advertise the extended config memory window.

Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.

Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.

CR:		https://phabric.freebsd.org/D505
Reviewed by:	jhb, grehan
Requested by:	tychon
2014-08-08 03:49:01 +00:00
Gleb Smirnoff
c8d2ffd6a7 Merge all MD sf_buf allocators into one MI, residing in kern/subr_sfbuf.c
The MD allocators were very common, however there were some minor
differencies. These differencies were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.

As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.

Tested by:	glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-05 09:44:10 +00:00
Alan Cox
a695d9b25b Retire pmap_change_wiring(). We have never used it to wire virtual pages.
We continue to use pmap_enter() for that.  For unwiring virtual pages, we
now use pmap_unwire(), which unwires a range of virtual addresses instead
of a single virtual page.

Sponsored by:	EMC / Isilon Storage Division
2014-08-03 20:40:51 +00:00
John Baldwin
06fc6db948 - Output a summary of optional VT-x features in dmesg similar to CPU
features.  If bootverbose is enabled, a detailed list is provided;
  otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
  under a new hw.vmm.vmx.cap node. Move a few exiting sysctls that
  indicate the presence of optional capabilities under this node.

CR:		https://phabric.freebsd.org/D498
Reviewed by:	grehan, neel
MFC after:	1 week
2014-07-30 00:00:12 +00:00
Neel Natu
f008d1571d If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps
forever in vm_handle_hlt().

This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.

Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.

Reported by:	Leon Dang
Sponsored by:	Nahanni Systems
2014-07-26 02:53:51 +00:00
Neel Natu
1edccd0f30 Don't return -1 from the push emulation handler. Negative return values are
interpreted specially on return from sys_ioctl() and may cause undesirable
side-effects like restarting the system call.
2014-07-26 02:51:46 +00:00
Neel Natu
830be8acb4 Fix a couple of issues in the PUSH emulation:
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.

vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
2014-07-24 23:01:53 +00:00
Marius Strobl
c615e6a8bf Copying pages via temporary mappings in the !DMAP case of pmap_copy_pages()
involves updating the corresponding page tables followed by accesses to the
pages in question. This sequence is subject to the situation exactly described
in the "AMD64 Architecture Programmer's Manual Volume 2: System Programming"
rev. 3.23, "7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore,
issuing the INVLPG right after modifying the PTE bits is crucial (see also
r269050).
For the amd64 PMAP code, the order of instructions was already correct. The
above fact still is worth documenting, though.

1: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf

Reviewed by:	alc
Sponsored by:	Bally Wulff Games & Entertainment GmbH
2014-07-24 10:12:22 +00:00
Neel Natu
d37f2adb38 Fix fault injection in bhyve.
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.

A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
2014-07-24 01:38:11 +00:00
Roger Pau Monné
f5417a03e3 don't set CR4 PSE bit on amd64
Setting PSE together with PAE or in long mode just makes the PSE bit
completely ignored, so don't set it.

Sponsored by: Citrix Systems R&D
Reviewed by: kib
2014-07-23 15:53:29 +00:00
Neel Natu
d665d229ce Emulate instructions emitted by OpenBSD/i386 version 5.5:
- CMP REG, r/m
- MOV AX/EAX/RAX, moffset
- MOV moffset, AX/EAX/RAX
- PUSH r/m
2014-07-23 04:28:51 +00:00
Ed Maste
b47228854f Don't pass null kmdp to preload_search_info
On Xen PVH guests kmdp == NULL.

Submitted by:	royger
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2014-07-22 13:58:33 +00:00
Mark Johnston
26cf239814 Fix the build when DTrace isn't enabled.
Reported by:	stefanf
X-MFC-With:	r268600
2014-07-20 18:44:56 +00:00
Neel Natu
019008ebf5 Fix build without INVARIANTS defined by getting rid of unused variable 'exc'.
Reported by:	adrian, stefanf
2014-07-20 16:34:35 +00:00
Neel Natu
091d453222 Handle nested exceptions in bhyve.
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.

vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.

vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
2014-07-19 20:59:08 +00:00
Mark Johnston
5a5f9d21dd Use a C wrapper for trap() instead of checking and calling the DTrace trap
hook in assembly.

Suggested by:	kib
Reviewed by:	kib (original version)
X-MFC-With:	r268600
2014-07-19 02:27:31 +00:00
Neel Natu
3d5444c864 Add emulation for legacy x86 task switching mechanism.
FreeBSD/i386 uses task switching to handle double fault exceptions and this
change enables that to work.

Reported by:	glebius
2014-07-16 21:26:26 +00:00
Neel Natu
f7a9f1784f Add support for operand size and address size override prefixes in bhyve's
instruction emulation [1].

Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.

Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.

Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].

Tested by:	Leon Dang [1]
Reported by:	Peter Grehan [2]
Sponsored by:	Nahanni Systems
2014-07-15 17:37:17 +00:00
Konstantin Belousov
5e351014e0 Make amd64 pmap_copy_pages() functional for pages not mapped by DMAP.
Requested and reviewed by:	royger
Tested by:	pho, royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-15 09:30:43 +00:00
Mark Johnston
291624fdf6 Invoke the DTrace trap handler before calling trap() on amd64. This matches
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.

This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.

Submitted by:	Anton Rang <anton.rang@isilon.com> (original version)
Phabric:	D95
2014-07-14 04:38:17 +00:00
Neel Natu
3ada6e07ac Use the correct offset when converting a logical address (segment:offset)
to a linear address.
2014-07-11 01:23:38 +00:00
Konstantin Belousov
fd815c0b8d For safety, ensure that any consumer of the set_regs() and
ptrace_set_pc() use the correct return to userspace using iret.

The signal return, PT_CONTINUE (which in fact uses signal return path)
set the pcb flag already.  The setcontext(2) enforces iret return when
%rip is incorrect.  Due to this, the change is redundand, but is made
to ensure that no path which modifies context, forgets to set
PCB_FULL_IRET.

Inspired by:	CVE-2014-4699
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-09 21:39:40 +00:00
Neel Natu
b301b9e28f Accurately identify the vcpu's operating mode as 64-bit, compatibility,
protected or real.
2014-07-08 21:48:57 +00:00
Neel Natu
3527963b26 Invalidate guest TLB mappings as a side-effect of its CR3 being updated.
This is a pre-requisite for task switch emulation since the CR3 is loaded
from the new TSS.
2014-07-08 20:51:03 +00:00
Konstantin Belousov
a5244bac2e Correct si_code for the SIGBUS signal generated by the alignment trap.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-08 08:05:42 +00:00
Alan Cox
09132ba6ac Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are
several reasons for this change:

pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page.  We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page.  The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap.  However, non-wired
mappings may be reclaimed by the pmap at any time.  (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.

pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page.  Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient.  Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time.  This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.

Implementations for arm and powerpc will follow.

Discussed with:	jeff, marcel
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-07-06 17:42:38 +00:00
Ed Maste
018147eef9 Prefer vt(4) for UEFI boot
The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism
for the startup routine to set the preferred console.  This change is
ugly because console init happens very early in the boot, making a
cleaner interface difficult.  This change is intended only to facilitate
the sc(4) / vt(4) transition, and can be reverted once vt(4) is the
default.
2014-07-02 13:24:21 +00:00
Ed Maste
ccbb7b5e19 Add vt(4) devices and options to NOTES
Reviewed by:	marius (earlier version)
2014-07-01 00:22:54 +00:00
Ed Maste
30dbb3eabc Add vt(4) to GENERIC and retire the separate VT config
vt(4) and sc(4) can now coexist in the same kernel.  To choose the vt
driver, set the loader tunable kern.vty=vt .
2014-06-30 16:18:38 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Tycho Nightingale
896d1f7723 Add support for emulating the move instruction: "mov r/m8, imm8".
Reviewed by:	neel
2014-06-26 17:15:41 +00:00
Peter Grehan
cf1d80d88c Expose the amount of resident and wired memory from the guest's vmspace.
This is different than the amount shown for the process e.g. by
/usr/bin/top - that is the mappings faulted in by the mmap'd region
of guest memory.

The values can be fetched with bhyvectl

 # bhyvectl --get-stats --vm=myvm
 ...
 Resident memory                         	413749248
 Wired memory                            	0
 ...

vmm_stat.[ch] -
 Modify the counter code in bhyve to allow direct setting of a counter
as opposed to incrementing, and providing a callback to fetch a
counter's value.

Reviewed by:	neel
2014-06-25 22:13:35 +00:00
Konstantin Belousov
633034fe0e Add FPU_KERN_KTHR flag to fpu_kern_enter(9), which avoids saving FPU
context into memory for the kernel threads which called
fpu_kern_thread(9).  This allows the fpu_kern_enter() callers to not
check for is_fpu_kern_thread() to get the optimization.

Apply the flag to padlock(4) and aesni(4).  In aesni_cipher_process(),
do not leak FPU context state on error.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-23 07:37:54 +00:00
Dmitry Chagin
2dedc1281a Revert r266925 as it can lead to instant panic at fexecve():
To allow to run the interpreter itself add a new ELF branding type.

Pointed out by:	kib, mjg
2014-06-17 05:29:18 +00:00
Tycho Nightingale
a026dc3fcb Bring an overly enthusiastic KASSERT inline with the Intel SDM.
Reviewed by:	neel
2014-06-16 22:59:18 +00:00
Attilio Rao
3ae10f7477 - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
Roger Pau Monné
ef409ede7b amd64/i386: introduce APIC hooks for different APIC implementations.
This is needed for Xen PV(H) guests, since there's no hardware lapic
available on this kind of domains. This commit should not change
functionality.

Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Approved by: gibbs

amd64/include/cpu.h:
amd64/amd64/mp_machdep.c:
i386/include/cpu.h:
i386/i386/mp_machdep.c:
 - Remove lapic_ipi_vectored hook from cpu_ops, since it's now
   implemented in the lapic hooks.

amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
 - Use lapic_ipi_vectored directly, since it's now an inline function
   that will call the appropiate hook.

x86/x86/local_apic.c:
 - Prefix bare metal public lapic functions with native_ and mark them
   as static.
 - Define default implementation of apic_ops.

x86/include/apicvar.h:
 - Declare the apic_ops structure and create inline functions to
   access the hooks, so the change is transparent to existing users of
   the lapic_ functions.

x86/xen/hvm.c:
 - Switch to use the new apic_ops.
2014-06-16 08:43:03 +00:00
Neel Natu
4e98fc9011 Disable global interrupts early so all the software state maintained by bhyve
is sampled "atomically". Any interrupts after this point will be held pending
by the CPU until the guest starts executing and will immediately trigger a
#VMEXIT.

Reviewed by:	Anish Gupta (akgupt3@gmail.com)
2014-06-11 17:48:07 +00:00
Tycho Nightingale
5ebc578ba6 Replace enum forward declarations with complete definitions.
Reviewed by:	neel
2014-06-10 18:46:00 +00:00
Neel Natu
404874659f Add helper functions to populate VM exit information for rendezvous and
astpending exits. This is to reduce code duplication between VT-x and
SVM implementations.
2014-06-10 16:45:58 +00:00
Neel Natu
0494cb1bcb Turn on interrupt window exiting unconditionally when an ExtINT is being
injected into the guest. This allows the hypervisor to inject another
ExtINT or APIC vector as soon as the guest is able to process interrupts.

This change is not to address any correctness issue but to guarantee that
any pending APIC vector that was preempted by the ExtINT will be injected
as soon as possible. Prior to this change such pending interrupts could be
delayed until the next VM exit.
2014-06-10 01:38:02 +00:00
Peter Grehan
3787148758 Temporary fix for guest idle detection.
Handle ExtINT injection for SVM. The HPET emulation
will inject a legacy interrupt at startup, and if this
isn't handled, will result in the HLT-exit code assuming
there are outstanding ExtINTs and return without sleeping.

svm_inj_interrupts() needs more changes to bring it up
to date with the VT-x version: these are forthcoming.

Reviewed by:	neel
2014-06-09 21:02:48 +00:00
Neel Natu
051f2bd19d Add reserved bit checking when doing %CR8 emulation and inject #GP if required.
Pointed out by:	grehan
Reviewed by:	tychon
2014-06-09 20:51:08 +00:00
Peter Grehan
1cc0e0eedb Allow the TSC MSR to be accessed directly from the guest. 2014-06-07 23:08:06 +00:00
Peter Grehan
dc6610d553 Set the guest PAT MSR in the VMCB to power-on defaults.
Linux guests accept the values in this register, while *BSD
guests reprogram it. Default values of zero correspond to
PAT_UNCACHEABLE, resulting in glacial performance.

Thanks to Willem Jan Withagen for first reporting this and
helping out with the investigation.
2014-06-07 23:05:12 +00:00
Neel Natu
5fcf252f41 Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained
by vmm.ko. This allows the virtual machine to be restarted without having
to destroy it first.

Reviewed by:	grehan
2014-06-07 21:36:52 +00:00
Alan Cox
dd05fa1945 Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
Tycho Nightingale
594db0024e Support guest accesses to %cr8.
Reviewed by:	neel
2014-06-06 18:23:49 +00:00
Warner Losh
3f1afabf09 Restore comments accidentally removed.
MFC after: 3 days
2014-06-06 04:08:55 +00:00
Peter Grehan
0df5b8cb8c ins/outs support for SVM. Modelled on the Intel VT-x code.
Remove CR2 save/restore - the guest restore/save is done
in hardware, and there is no need to save/restore the host
version (same as VT-x).

Submitted by:	neel (SVM segment descriptor 'P' bit code)
Reviewed by:	neel
2014-06-06 02:55:18 +00:00
Peter Grehan
72a458ccff Allow the guest's CR2 value to be read/written.
This is required for page-fault injection.
2014-06-05 06:29:18 +00:00
Peter Grehan
8c1da7e67b Use API call when VM is detected as suspended. This fixes
the (harmless) error message on exit:

  vmexit_suspend: invalid reason 217645057

Reviewed by:	neel, Anish Gupta (akgupt3@gmail.com)
2014-06-03 22:26:46 +00:00
Peter Grehan
eee8190aab Bring (almost) up-to-date with HEAD.
- use the new virtual APIC page
- update to current bhyve APIs

Tested by Anish with multiple FreeBSD SMP VMs on a Phenom,
and verified by myself with light FreeBSD VM testing
on a Sempron 3850 APU.

The issues reported with Linux guests are very likely to still
be here, but this sync eliminates the skew between the
project branch and CURRENT, and should help to determine
the causes.

Some follow-on commits will fix minor cosmetic issues.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-06-03 06:56:54 +00:00
Peter Grehan
6cec9cad76 MFC @ r266724
An SVM update will follow this.
2014-06-03 02:34:21 +00:00
Neel Natu
95ebc360ef Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing
it implicitly in vmm.ko.

Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.

This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.
2014-05-31 23:37:34 +00:00
Dmitry Chagin
5f56da1891 To allow to run the interpreter itself add a new ELF branding type.
Allow Linux ABI to run ELF interpreter.

MFC after:	3 days
2014-05-31 15:01:51 +00:00
Tycho Nightingale
11669a681c If VMX isn't enabled so long as the lock bit isn't set yet in MSR
IA32_FEATURE_CONTROL it still can be.

Approved by:	grehan (co-mentor)
2014-05-30 23:37:31 +00:00
Neel Natu
92754b1199 Remove bogus check for kmem_malloc() failure even though M_WAITOK is set.
Requested by:	jkim
2014-05-30 20:58:32 +00:00
Neel Natu
8e351f8a3c Allocate a zeroed LDT.
Failing to do this might result in the LDT appearing to run out of free
descriptors because of random junk in the descriptor's 'sd_type' field.

http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html

Reviewed by:	kib
MFC after:	2 weeks
2014-05-30 18:59:37 +00:00
Konstantin Belousov
64e9726555 When usermode loaded non-default segment selector into the %gs,
correctly prepare KGSBASE msr to restore the user descriptor base on
the last swapgs during return to usermode.

Reported and tested by:	peterj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-29 16:18:31 +00:00
Mark Johnston
f2789bd5c7 Commit the rest of the changes that were intended to be part of r266826.
X-MFC-with:	r266826
2014-05-29 01:42:22 +00:00
John Baldwin
44a68c4e40 - Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to the
guest for which the rules regarding xsetbv emulation are known.  In
  particular future extensions like AVX-512 have interdependencies among
  feature bits that could allow a guest to trigger a GP# in the host with
  the current approach of allowing anything the host supports.
- Add proper checking of Intel MPX and AVX-512 XSAVE features in the
  xsetbv emulation and allow these features to be exposed to the guest if
  they are enabled in the host.
- Expose a subset of known-safe features from leaf 0 of the structured
  extended features to guests if they are supported on the host including
  RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM.  Aside
  from AVX-512, these features are all new instructions available for use
  in ring 3 with no additional hypervisor changes needed.

Reviewed by:	neel
2014-05-27 19:04:38 +00:00
Neel Natu
65ffa035a7 Add segment protection and limits violation checks in vie_calculate_gla()
for 32-bit x86 guests.

Tested using ins/outs executed in a FreeBSD/i386 guest.
2014-05-27 04:26:22 +00:00
Neel Natu
ae0780bbf1 Remove restriction on insb/insw/insl emulation. These instructions are
properly emulated.
2014-05-25 02:05:23 +00:00