Commit Graph

7242 Commits

Author SHA1 Message Date
neel
5c373339e4 Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.

Submitted by:	Dmitry Luhtionov (dmitryluhtionov@gmail.com)
2015-01-24 00:35:49 +00:00
neel
e901ab485a MOVS instruction emulation.
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.

Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.

Tested by:	tychon
MFC after:	1 week
2015-01-19 06:53:31 +00:00
neel
d9f07f9841 Simplify instruction restart logic in bhyve.
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.

Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.

bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
  as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
  as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())

Differential Revision:	https://reviews.freebsd.org/D1526
Reviewed by:		grehan
MFC after:		2 weeks
2015-01-18 03:08:30 +00:00
np
337910a604 Plug cxgbe(4) back into !powerpc && !arm builds, instead of building it
on amd64 only.
2015-01-16 01:39:24 +00:00
royger
0c5b62d3d2 loader: implement multiboot support for Xen Dom0
Implement a subset of the multiboot specification in order to boot Xen
and a FreeBSD Dom0 from the FreeBSD bootloader. This multiboot
implementation is tailored to boot Xen and FreeBSD Dom0, and it will
most surely fail to boot any other multiboot compilant kernel.

In order to detect and boot the Xen microkernel, two new file formats
are added to the bootloader, multiboot and multiboot_obj. Multiboot
support must be tested before regular ELF support, since Xen is a
multiboot kernel that also uses ELF. After a multiboot kernel is
detected, all the other loaded kernels/modules are parsed by the
multiboot_obj format.

The layout of the loaded objects in memory is the following; first the
Xen kernel is loaded as a 32bit ELF into memory (Xen will switch to
long mode by itself), after that the FreeBSD kernel is loaded as a RAW
file (Xen will parse and load it using it's internal ELF loader), and
finally the metadata and the modules are loaded using the native
FreeBSD way. After everything is loaded we jump into Xen's entry point
using a small trampoline. The order of the multiboot modules passed to
Xen is the following, the first module is the RAW FreeBSD kernel, and
the second module is the metadata and the FreeBSD modules.

Since Xen will relocate the memory position of the second
multiboot module (the one that contains the metadata and native
FreeBSD modules), we need to stash the original modulep address inside
of the metadata itself in order to recalculate its position once
booted. This also means the metadata must come before the loaded
modules, so after loading the FreeBSD kernel a portion of memory is
reserved in order to place the metadata before booting.

In order to tell the loader to boot Xen and then the FreeBSD kernel the
following has to be added to the /boot/loader.conf file:

xen_cmdline="dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga"
xen_kernel="/boot/xen"

The first argument contains the command line that will be passed to the Xen
kernel, while the second argument is the path to the Xen kernel itself. This
can also be done manually from the loader command line, by for example
typing the following set of commands:

OK unload
OK load /boot/xen dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga
OK load kernel
OK load zfs
OK load if_tap
OK load ...
OK boot

Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D517

For the Forth bits:
Submitted by: Julien Grall <julien.grall AT citrix.com>
2015-01-15 16:27:20 +00:00
imp
66acb8032e New MINIMAL kernel config. The goal with this configuration is to
only compile in those options in GENERIC that cannot be loaded as
modules. ufs is still included because many of its options aren't
present in the kernel module. There's some other exceptions documented
in the file. This is part of some work to get more things
automatically loading in the hopes of obsoleting GENERIC one day.
2015-01-15 00:42:06 +00:00
neel
4091be74c6 Fix typo (missing comma).
MFC after:	3 days
2015-01-14 07:18:51 +00:00
neel
5c965bc583 'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.

Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.

Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.

MFC after:      1 week
2015-01-13 22:00:47 +00:00
kib
79db3369f9 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
kib
5e39e2c47c Revert r276600: PHYS_TO_DMAP_RAW() and DMAP_TO_PHYS_RAW() are no
longer used, remove them.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 07:50:55 +00:00
kib
6db369589b Fix several issues with /dev/mem and /dev/kmem devices on amd64.
For /dev/mem, when requested physical address is not accessible by the
direct map, do temporal remaping with the caching attribute
'uncached'.  Limit the accessible addresses by MAXPHYADDR, since the
architecture disallowes writing non-zero into reserved bits of ptes
(or setting garbage into NX).

For /dev/kmem, only access existing kernel mappings for direct map
region.  For all other addresses, obtain a physical address of the
mapping and fall back to the /dev/mem mechanism.  This ensures that
/dev/kmem i/o does not fault even if the accessed region is changed in
parallel, by using either direct map or temporal mapping.

For both devices, operate on one page by iteration.  Do not return
error if any bytes were moved around, return the (partial) bytes count
to userspace.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 07:48:22 +00:00
kib
11969484c8 For x86, read MAXPHYADDR, defined in SDM vol 3 4.1.4 Enumeration of Paging
Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 07:36:25 +00:00
markj
7e7e145818 Factor out duplicated code from dumpsys() on each architecture into generic
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.

PR:		193873
Differential Revision:	https://reviews.freebsd.org/D904
Submitted by:	Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by:	jhibbits (earlier version)
Sponsored by:	EMC / Isilon Storage Division
2015-01-07 01:01:39 +00:00
neel
e72e75f02d Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception. This matches the CPU behavior.

Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.

Reported by:	David Reed (david.reed@tidalscale.com)
MFC after:	2 weeks
2015-01-06 19:04:02 +00:00
jhb
c3d1954342 Remove "New" label from NFSCL/NFSD now that they are the only NFS
client/server.  While here, remove duplicate NFSCL from sys/conf/NOTES.

Approved by:	rmacklem
2015-01-06 16:15:57 +00:00
jhb
55d0376a65 On some Intel CPUs with a P-state but not C-state invariant TSC the TSC
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST).  Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.

PR:		192316
Differential Revision:	https://reviews.freebsd.org/D1441
No objection from:	jkim
MFC after:	1 week
2015-01-05 20:44:44 +00:00
kib
771035aa6c For /dev/mem and /dev/kmem accesses, avoid asserting that addresses
are within direct map.  We want to return error instead of panicing.

PR:	194995
Sponsored by:	The FreeBSD Foundation
2015-01-03 01:28:58 +00:00
scottl
9f03d3d21b Fix a missed comment from r276526. 2015-01-02 15:46:54 +00:00
kib
35b8c82991 Callers of pmap_kextract() cannot distinguish between failure and
physical address zero.  Assume that the lowest page is always mapped
by direct map.

This restores access to the page at zero through /dev/mem after
r263475.

Reported and tested by:	neel
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-02 01:05:08 +00:00
kib
a9168eed4e Actually remove GIANT_REQUIRED, declared but not done in r263475.
Style.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-02 01:00:38 +00:00
dchagin
4026f410fd Regen after r276508, r276509. 2015-01-01 18:43:31 +00:00
dchagin
4f605a7bfc Correct an argument status of wait4 syscall for Linuxulator.
MFC after:	1 week
2015-01-01 18:37:03 +00:00
np
6c2366689f Temporarily unplug cxgbe(4) from !amd64 builds. 2014-12-31 20:34:12 +00:00
alc
369e66acd7 The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges.  Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS).  However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time.  Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory.  This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.

On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB.  This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.

PR:		185727
Differential Revision:	https://reviews.freebsd.org/D1274
Reviewed by:	jhb, kib
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
neel
b9de0c114c Initialize all fields of 'struct vm_exception exception' before passing it to
vm_inject_exception(). This fixes the issue that 'exception.cpuid' is
uninitialized when calling 'vm_inject_exception()'.

However, in practice this change is a no-op because vm_inject_exception()
does not use 'exception.cpuid' for anything.

Reported by:    Coverity Scan
CID:            1261297
MFC after:      3 days
2014-12-30 23:38:31 +00:00
neel
7aa6460c48 Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.

Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.

The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D1385
MFC after:	2 weeks
2014-12-30 22:19:34 +00:00
neel
2908c65b8a Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on
an AMD/SVM host.

MFC after:	1 week
2014-12-30 02:44:33 +00:00
neel
baa3b938e5 Implement "special mask mode" in vatpic.
OpenBSD guests always enable "special mask mode" during boot. As a result of
r275952 this is flagged as an error and the guest cannot boot.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D1384
MFC after:	1 week
2014-12-28 00:53:52 +00:00
kib
d0d6e7c817 Change the way the lcall $7,$0 is reflected to usermode. Instead of
setting call gate, which must be 64 bit, put a code segment descriptor
into ldt slot 0.

This way, syscall shim does not switch temporary to 64bit trampoline,
and does not create a window where signal delivery interrupts 64 bit
mode (signal handler cannot return).  The cost is shim running with
non-zero based segment in %cs, which requires vfork() handling make
more assumptions.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-27 23:19:08 +00:00
phk
17ba4d442c Use compiled in default keymaps which are available both in syscons and vt. 2014-12-25 17:50:04 +00:00
markj
7ea63e4fb4 Restore the trap type argument to the DTrace trap hook, removed in r268600.
It's redundant at the moment since it can be obtained from the trapframe
on the architectures where DTrace is supported, but this won't be the case
with ARM.
2014-12-23 15:38:19 +00:00
neel
4d52b44708 Allow ktr(4) tracing of all guest exceptions via the tunable
"hw.vmm.trace_guest_exceptions".  To enable this feature set the tunable
to "1" before loading vmm.ko.

Tracing the guest exceptions can be useful when debugging guest triple faults.

Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.

Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".

Discussed with:	grehan
MFC after:	2 weeks
2014-12-23 02:14:49 +00:00
neel
9abef2383d Emulate writes to the IA32_MISC_ENABLE MSR.
PR:		196093
Reported by:	db
Tested by:	db
Discussed with:	grehan
MFC after:	1 week
2014-12-20 19:47:51 +00:00
neel
e4f07b01b4 Various 8259 device model improvements:
- implement 8259 "polled" mode.
- set 'atpic->sfn' if bit 4 in ICW4 is set during master initialization.
- report error if guest tries to enable the "special mask" mode.

Differential Revision:	https://reviews.freebsd.org/D1328
Reviewed by:		tychon
Reported by:		grehan
Tested by:		grehan
MFC after:		1 week
2014-12-20 04:57:45 +00:00
neel
36aadd747e Fix 8259 IRQ priority resolver.
Initialize the 8259 such that IRQ7 is the lowest priority.

Reviewed by:		tychon
Differential Revision:	https://reviews.freebsd.org/D1322
MFC after:		1 week
2014-12-17 03:04:43 +00:00
kib
42d5fa98d3 The iret instruction may generate #np and #ss fault, besides #gp.
When returning to usermode, the handler for that exceptions is also
executed with wrong gs base.  Handle all three possible faults in the
same way, checking for iret fault, and performing full iret.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-12-16 18:28:33 +00:00
neel
bdae65463e For level triggered interrupts clear the PIC IRR bit when the interrupt pin
is deasserted. Prior to this change each assertion on a level triggered irq
pin resulted in two interrupts being delivered to the CPU.

Differential Revision:	https://reviews.freebsd.org/D1310
Reviewed by:	tychon
MFC after:	1 week
2014-12-16 06:33:57 +00:00
gnn
d5b2a401e7 This configuration file removes several debugging options, including
WITNESS and INVARIANTS checking, which are known to have significant
performance impact on running systems.  When benchmarking new features
this kernel should be used instead of the standard GENERIC.
This kernel configuration should never appear outside of the HEAD
of the FreeBSD tree.
2014-12-02 19:55:43 +00:00
emaste
fda27c9937 Revert r274772: it is not valid on MIPS
Reported by:	sbruno
2014-11-25 03:50:31 +00:00
grehan
205a8351f4 Change the lower bound for guest vmspace allocation to 0 instead of
using the VM_MIN_ADDRESS constant.

HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in
bhyve VM startup failing. Guest memory is always assumed to start
at 0 so use the absolute value instead.

Reported by:	Shawn Webb, lattera at gmail com
Reviewed by:	neel, grehan
Obtained from:	Oliver Pinter via HardenedBSD
23bd719ce1
MFC after:	1 week
2014-11-23 23:07:21 +00:00
jhb
1671ac9155 Improve support for XSAVE with debuggers.
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
  to match what Linux does in that 1) it dumps the entire XSAVE area
  including the fxsave state, and 2) it stashes a copy of the current
  xsave mask in the unused padding between the fxsave state and the
  xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
  only the extra portion. This avoids having to always make two
  ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
  XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
  and the current XSAVE mask (%xcr0).

Differential Revision:	https://reviews.freebsd.org/D1193
Reviewed by:	kib
MFC after:	2 weeks
2014-11-21 20:53:17 +00:00
emaste
c7e313326d Use canonical __PIC__ flag
It is automatically set when -fPIC is passed to the compiler.

Reviewed by:	dim, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1179
2014-11-21 02:05:48 +00:00
alc
aeebd38e4b Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default
on i386 PAE.  Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings.  To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().

Discussed with:	Svatopluk Kraus
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-11-15 23:40:44 +00:00
kib
07e17a18ac Fix END()s for fueword and fueword64, match the name in END() with
entry.

Submitted by:	Jeroen Hofstee <jeroen@myspectrum.nl>
MFC after:	1 week
2014-11-15 21:25:17 +00:00
scottl
1b5c52bedf Extend earlier addition of stack frames to most of support.S. This makes
stack traces in KDB, HWPMC, and DTrace much more reliable and useful.

Reviewed by:	kan, kib
Obtained from:	Netflix, Inc.
MFC after:	5 days
2014-11-13 22:11:44 +00:00
emaste
1fcabed49c Add workaround for vt efifb's early use of PHYS_TO_DMAP
In vt_efifb_init the framebuffer's physaddr is passed to PHYS_TO_DMAP
before the DMAP is setup. The result is not actually accessed until
after the mapping is setup, though. Loosen the assertion in PHYS_TO_DMAP
for now, to allow use when dmaplimit == 0.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D1142
2014-11-11 14:59:46 +00:00
melifaro
b5d711d3a6 Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
glebius
013b888a89 Remove unused includes.
Reviewed by:	kib
2014-11-09 19:58:30 +00:00
kib
ac5c592f50 MFi386 r253328:
Create a proper stack frame for amd64 version of bcopy().  Note that
this also makes the stack properly aligned in the function, despite it
is not strictly needed.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-08 11:56:26 +00:00
gnn
c55c1b4d7b Add support for netmap in GENERIC by default. 2014-11-05 06:22:37 +00:00
bryanv
4b08d4e97f Add VirtIO console to the x86 NOTES and files
Requested by:	jhb
2014-11-03 22:37:10 +00:00
jhb
fdfced8ce8 MFamd64: Add support for extended FPU states on i386. This includes
support for AVX on i386.
- Similar to amd64, move the FPU save area out of the PCB and instead
  store saved FPU state in a variable-sized buffer after the PCB on the
  stack.
- To support the variable PCB location, alter the locore code to only use
  the bottom-most page of proc0stack for init386().  init386() returns
  the correct stack pointer to locore which adjusts the stack for thread0
  before calling mi_startup().
- Don't bother setting cr3 in thread0's pcb in locore before calling
  init386().  It wasn't used (init386() overwrote it at the end) and
  it doesn't work with the variable-sized FPU save area.
- Remove the new-bus attachment from npx.  This was only ever useful for
  external co-processors using IRQ13, but those have not been supported
  for several years.  npxinit() is now called much earlier during boot
  (init386()) similar to amd64.
- Implement PT_{GET,SET}XSTATE and I386_GET_XFPUSTATE.
- npxsave() is now only called from context switch contexts so it can
  use XSAVEOPT.

Differential Revision:	https://reviews.freebsd.org/D1058
Reviewed by:	kib
Tested on:	FreeBSD/i386 VM under bhyve on Intel i5-2520
2014-11-02 22:58:30 +00:00
jhb
d47eb7d2d4 Rework virtual machine hypervisor detection.
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
  the 0x40000000 leaf to determine the hypervisor vendor ID.  Export the
  vendor ID and the highest supported hypervisor CPUID leaf via
  hv_vendor[] and hv_high variables, respectively.  The hv_vendor[]
  array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
  identcpu.c.  Add a VM_GUEST_VMWARE to identify vmware and use that in
  the TSC code to identify VMWare.

Differential Revision:	https://reviews.freebsd.org/D1010
Reviewed by:	delphij, jkim, neel
2014-10-28 19:17:44 +00:00
kib
ad7bf17db7 Replace some calls to fuword() by fueword() with proper error checking.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:28:20 +00:00
kib
29a659ef8e Add fueword(9) and casueword(9) functions. They are like fuword(9)
and casuword(9), but do not mix value read and indication of fault.

I know (or remember) enough assembly to handle x86 and powerpc.  For
arm, mips and sparc64, implement fueword() and casueword() as wrappers
around fuword() and casuword(), which means that the functions cannot
distinguish between -1 and fault.

On architectures where fueword() and casueword() are native, implement
fuword() and casuword() using fueword() and casuword(), to reduce
assembly code duplication.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks (ia64 needs treating)
2014-10-28 15:22:13 +00:00
araujo
8a51fd3c94 Reported by: Coverity
CID:		1249760
Reviewed by:	neel
Approved by:	neel
Sponsored by:	QNAP Systems Inc.
2014-10-28 07:19:02 +00:00
grehan
4789422fd1 Remove bhyve SVM feature printf's now that they are available in the
general CPU feature detection code.

Reviewed by:	neel
2014-10-27 22:20:51 +00:00
neel
4ee18dab17 Change the type of the first argument to the I/O emulation handlers to
'struct vm *'. Previously it used to be a 'void *' but there is no reason
to hide the actual type from the handler.

Discussed with:	tychon
MFC after:	1 week
2014-10-26 19:03:06 +00:00
alc
dc84db712f By the time that pmap_init() runs, vm_phys_segs[] has been initialized. Obtaining
the end of memory address from vm_phys_segs[] is a little easier than obtaining it
from phys_avail[].

Discussed with:	Svatopluk Kraus
2014-10-26 17:56:47 +00:00
neel
eb58530f9b Move the ACPI PM timer emulation into vmm.ko.
This reduces variability during timer calibration by keeping the emulation
"close" to the guest. Additionally having all timer emulations in the kernel
will ease the transition to a per-VM clock source (as opposed to using the
host's uptime keep track of time).

Discussed with:	grehan
2014-10-26 04:44:28 +00:00
neel
a76d8ee4e9 Don't pass the 'error' return from an I/O port handler directly to vm_run().
Most I/O port handlers return -1 to signal an error. If this value is returned
without modification to vm_run() then it leads to incorrect behavior because
'-1' is interpreted as ERESTART at the system call level.

Fix this by always returning EIO to signal an error from an I/O port handler.

MFC after:	1 week
2014-10-26 03:03:41 +00:00
jhb
fcc57fff95 Add COMPAT_FREEBSD9 and COMPAT_FREEBSD10 options to wrap code that
provides compatability for FreeBSD 9.x and 10.x binaries.  Enable
these options in kernel configs that enable other COMPAT_FREEBSD<n>
options.
2014-10-24 19:58:24 +00:00
royger
0e3d9b8126 amd64: make uiomove_fromphys functional for pages not mapped by the DMAP
Place the code introduced in r268660 into a separate function that can be
called from uiomove_fromphys. Instead of pre-allocating two KVA pages use
vmem_alloc to allocate them on demand when needed. This prevents blocking if
a page fault is taken while physical addresses from outside the DMAP are
used, since the lock is now removed.

Also introduce a safety catch in PHYS_TO_DMAP and DMAP_TO_PHYS.

Sponsored by: Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D947

amd64/amd64/pmap.c:
 - Factor out the code to deal with non DMAP addresses from pmap_copy_pages
   and place it in pmap_map_io_transient.
 - Change the code to use vmem_alloc instead of a set of pre-allocated
   pages.
 - Use pmap_qenter and don't pin the thread if there can be page faults.

amd64/amd64/uio_machdep.c:
 - Use pmap_map_io_transient in order to correctly deal with physical
   addresses not covered by the DMAP.

amd64/include/pmap.h:
 - Add the prototypes for the new functions.

amd64/include/vmparam.h:
 - Add safety catches to make sure PHYS_TO_DMAP and DMAP_TO_PHYS are only
   used with addresses covered by the DMAP.
2014-10-24 09:48:58 +00:00
royger
919b7d8b7c xen: implement the privcmd user-space device
This device is only attached to priviledged domains, and allows the
toolstack to interact with Xen. The two functions of the privcmd
interface is to allow the execution of hypercalls from user-space, and
the mapping of foreign domain memory.

Sponsored by: Citrix Systems R&D

i386/include/xen/hypercall.h:
amd64/include/xen/hypercall.h:
 - Introduce a function to make generic hypercalls into Xen.

xen/interface/xen.h:
xen/interface/memory.h:
 - Import the new hypercall XENMEM_add_to_physmap_range used by
   auto-translated guests to map memory from foreign domains.

dev/xen/privcmd/privcmd.c:
 - This device has the following functions:
   - Allow user-space applications to make hypercalls into Xen.
   - Allow user-space applications to map memory from foreign domains,
     this is accomplished using the newly introduced hypercall
     (XENMEM_add_to_physmap_range).

xen/privcmd.h:
 - Public ioctl interface for the privcmd device.

x86/xen/hvm.c:
 - Remove declaration of hypercall_page, now it's declared in
   hypercall.h.

conf/files:
 - Add the privcmd device to the build process.
2014-10-22 17:07:20 +00:00
hselasky
49c137f7be Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
neel
4797036028 Merge projects/bhyve_svm into HEAD.
After this change bhyve supports AMD processors with the SVM/AMD-V hardware
extensions.

More details available here:
https://lists.freebsd.org/pipermail/freebsd-virtualization/2014-October/002905.html

Submitted by:	Anish Gupta (akgupt3@gmail.com)
Tested by:	Benjamin Perrault (ben.perrault@gmail.com)
Tested by:	Willem Jan Withagen (wjw@digiware.nl)
2014-10-21 07:10:43 +00:00
neel
fb464d4f98 Fix a race in pmap_emulate_accessed_dirty() that could trigger a EPT
misconfiguration VM-exit.

An EPT misconfiguration is triggered when the processor encounters a PTE
that is writable but not readable (WR=10). On processors that require A/D
bit emulation PG_M and PG_A map to EPT_PG_WRITE and EPT_PG_READ respectively.

If the PTE is updated as in the following code snippet:
	*pte |= PG_M;
	*pte |= PG_A;
then it is possible for another processor to observe the PTE after the PG_M
(aka EPT_PG_WRITE) bit is set but before PG_A (aka EPT_PG_READ) bit is set.

This will trigger an EPT misconfiguration VM-exit on the other processor.

Reported by:	rodrigc
Reviewed by:	grehan
MFC after:	3 days
2014-10-21 01:06:58 +00:00
neel
be8b3ca439 Merge from projects/bhyve_svm all the changes outside vmm.ko or bhyve utilities:
Add support for AMD's nested page tables in pmap.c:
- Provide the correct bit mask for various bit fields in a PTE (e.g. valid bit)
  for a pmap of type PT_RVI.
- Add a function 'pmap_type_guest(pmap)' that returns TRUE if the pmap is of
  type PT_EPT or PT_RVI.

Add CPU_SET_ATOMIC_ACQ(num, cpuset):
This is used when activating a vcpu in the nested pmap. Using the 'acquire'
variant guarantees that the load of the 'pm_eptgen' will happen only after
the vcpu is activated in 'pm_active'.

Add defines for various AMD-specific MSRs.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-10-20 18:09:33 +00:00
neel
dd2febd6f2 IFC @r273214 2014-10-20 02:57:30 +00:00
neel
53c23ba9a2 IFC @r273206 2014-10-19 23:05:18 +00:00
neel
13e9198693 Don't advertise the "OS visible workarounds" feature in cpuid.80000001H:ECX.
bhyve doesn't emulate the MSRs needed to support this feature at this time.

Don't expose any model-specific RAS and performance monitoring features in
cpuid leaf 80000007H.

Emulate a few more MSRs for AMD: TSEG base address, TSEG address mask and
BIOS signature and P-state related MSRs.

This eliminates all the unimplemented MSRs accessed by Linux/x86_64 kernels
2.6.32, 3.10.0 and 3.17.0.
2014-10-19 21:38:58 +00:00
neel
d5ca76422a Don't advertise support for the NodeID MSR since bhyve doesn't emulate it. 2014-10-18 05:39:32 +00:00
imp
ad9f7e3038 Fix build to not bogusly always rebuild vmm.ko.
Rename vmx_assym.s to vmx_assym.h to reflect that file's actual use
and update vmx_support.S's include to match. Add vmx_assym.h to the
SRCS to that it gets properly added to the dependency list. Add
vmx_support.S to SRCS as well, so it gets built and needs fewer
special-case goo. Remove now-redundant special-case goo. Finally,
vmx_genassym.o doesn't need to depend on a hand expanded ${_ILINKS}
explicitly, that's all taken care of by beforedepend.

With these items fixed, we no longer build vmm.ko every single time
through the modules on a KERNFAST build.

Sponsored by: Netflix
2014-10-17 13:20:49 +00:00
neel
170f49cd34 Don't advertise the Instruction Based Sampling feature because it requires
emulating a large number of MSRs.

Ignore writes to a couple more AMD-specific MSRs and return 0 on read.

This further reduces the unimplemented MSRs accessed by a Linux guest on boot.
2014-10-17 06:23:04 +00:00
neel
b194fc5a2b Hide extended PerfCtr MSRs on AMD processors by clearing bits 23, 24 and 28 in
CPUID.80000001H:ECX.

Handle accesses to PerfCtrX and PerfEvtSelX MSRs by ignoring writes and
returning 0 on reads.

This further reduces the number of unimplemented MSRs hit by a Linux guest
during boot.
2014-10-17 03:04:38 +00:00
neel
8bc5fc92e5 Use the correct fault type (VM_PROT_EXECUTE) for an instruction fetch. 2014-10-16 18:16:31 +00:00
neel
09fbd3f7d9 Fix topology enumeration issues exposed by AMD Bulldozer Family 15h processor.
Initialize CPUID.80000008H:ECX[7:0] with the number of logical processors in
the package. This fixes a panic during early boot in NetBSD 7.0 BETA.

Clear the Topology Extension feature bit from CPUID.80000001H:ECX since we
don't emulate leaves 0x8000001D and 0x8000001E. This fixes a divide by zero
panic in early boot in Centos 6.4.

Tested on an "AMD Opteron 6320" courtesy of Ben Perrault.

Reviewed by:	grehan
2014-10-16 18:13:10 +00:00
davide
e88bd26b3f Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
neel
c4ba9b78fd Actually hide the SVM capability by clearing CPUID.80000001H:ECX[bit 3]
after it has been initialized by cpuid_count().

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-10-15 04:29:03 +00:00
neel
268d74c73a Emulate "POP r/m".
This is needed to boot OpenBSD/i386 MP kernel in bhyve.

Reported by:	grehan
MFC after:	1 week
2014-10-14 21:02:33 +00:00
neel
651ecc1e21 Remove extraneous comments. 2014-10-11 04:57:17 +00:00
neel
0d5ab4a163 Get rid of unused headers.
Restrict scope of malloc types M_SVM and M_SVM_VLAPIC by making them static.
Replace ERR() with KASSERT().
style(9) cleanup.
2014-10-11 04:41:21 +00:00
neel
bbecfc3c42 Get rid of unused forward declaration of 'struct svm_softc'. 2014-10-11 03:21:33 +00:00
neel
3e5970c8ec style(9) fixes.
Get rid of unused headers.
2014-10-11 03:19:26 +00:00
neel
c09e1bd7e2 Use a consistent style for messages emitted when the module is loaded. 2014-10-11 03:09:34 +00:00
neel
bfa9f55e70 IFC @r272887 2014-10-10 23:52:56 +00:00
neel
ae1ff84ece Fix bhyvectl so it works correctly on AMD/SVM hosts. Also, add command line
options to display some key VMCB fields.

The set of valid options that can be passed to bhyvectl now depends on the
processor type. AMD-specific options are identified by a "--vmcb" or "--avic"
in the option name. Intel-specific options are identified by a "--vmcs" in
the option name.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2014-10-10 21:48:59 +00:00
neel
f443215307 Support Intel-specific MSRs that are accessed when booting up a linux in bhyve:
- MSR_PLATFORM_INFO
- MSR_TURBO_RATIO_LIMITx
- MSR_RAPL_POWER_UNIT

Reviewed by:	grehan
MFC after:	1 week
2014-10-09 19:13:33 +00:00
markj
0ebf86e1b1 Pass up the error status of minidumpsys() to its callers.
PR:		193761
Submitted by:	Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by:	EMC / Isilon Storage Division
2014-10-08 20:25:21 +00:00
kib
30a51a18f4 Add an argument to the x86 pmap_invalidate_cache_range() to request
forced invalidation of the cache range regardless of the presence of
self-snoop feature.  Some recent Intel GPUs in some modes are not
coherent, and dirty lines in CPU cache must be flushed before the
pages are transferred to GPU domain.

Reviewed by:	alc (previous version)
Tested by:	pho (amd64)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-08 16:48:03 +00:00
neel
ce319c48f4 Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.
The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting
CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to
be present anyways.

Discussed with:	grehan
2014-10-06 20:48:01 +00:00
neel
16cbb8793a IFC @r272481 2014-10-05 01:28:21 +00:00
neel
f6b1385c0e Get rid of code that dealt with the hardware not being able to save/restore
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.

All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.

Discussed with:	grehan
2014-10-02 05:32:29 +00:00
royger
c5a5f5947f msi: add Xen MSI implementation
This patch adds support for MSI interrupts when running on Xen. Apart
from adding the Xen related code needed in order to register MSI
interrupts this patch also makes the msi_init function a hook in
init_ops, so different MSI implementations can have different
initialization functions.

Sponsored by: Citrix Systems R&D

xen/interface/physdev.h:
 - Add the MAP_PIRQ_TYPE_MULTI_MSI to map multi-vector MSI to the Xen
   public interface.

x86/include/init.h:
 - Add a hook for setting custom msi_init methods.

amd64/amd64/machdep.c:
i386/i386/machdep.c:
 - Set the default msi_init hook to point to the native MSI
   initialization method.

x86/xen/pv.c:
 - Set the Xen MSI init hook when running as a Xen guest.

x86/x86/local_apic.c:
 - Call the msi_init hook instead of directly calling msi_init.

xen/xen_intr.h:
x86/xen/xen_intr.c:
 - Introduce support for registering/releasing MSI interrupts with
   Xen.
 - The MSI interrupts will use the same PIC as the IO APIC interrupts.

xen/xen_msi.h:
x86/xen/xen_msi.c:
 - Introduce a Xen MSI implementation.

x86/xen/xen_nexus.c:
 - Overwrite the default MSI hooks in the Xen Nexus to use the Xen MSI
   implementation.

x86/xen/xen_pci.c:
 - Introduce a Xen specific PCI bus that inherits from the ACPI PCI
   bus and overwrites the native MSI methods.
 - This is needed because when running under Xen the MSI messages used
   to configure MSI interrupts on PCI devices are written by Xen
   itself.

dev/acpica/acpi_pci.c:
 - Lower the quality of the ACPI PCI bus so the newly introduced Xen
   PCI bus can take over when needed.

conf/files.i386:
conf/files.amd64:
 - Add the newly created files to the build process.
2014-09-30 16:46:45 +00:00
neel
09ab9e2937 IFC @r272185 2014-09-27 22:15:50 +00:00
neel
11f176f814 Simplify register state save and restore across a VMRUN:
- Host registers are now stored on the stack instead of a per-cpu host context.

- Host %FS and %GS selectors are not saved and restored across VMRUN.
  - Restoring the %FS/%GS selectors was futile anyways since that only updates
    the low 32 bits of base address in the hidden descriptor state.
  - GS.base is properly updated via the MSR_GSBASE on return from svm_launch().
  - FS.base is not used while inside the kernel so it can be safely ignored.

- Add function prologue/epilogue so svm_launch() can be traced with Dtrace's
  FBT entry/exit probes. They also serve to save/restore the host %rbp across
  VMRUN.

Reviewed by:	grehan
Discussed with:	Anish Gupta (akgupt3@gmail.com)
2014-09-27 02:04:58 +00:00
grehan
8be950fc2c Allow the PIC's IMR register to be read before ICW initialisation.
As of git submit e179f6914152eca9, the Linux kernel does a simple
probe of the PIC by writing a pattern to the IMR and then reading it
back, prior to the init sequence of ICW words.

The bhyve PIC emulation wasn't allowing the IMR to be read until
the ICW sequence was complete. This limitation isn't required so
relax the test.

With this change, Linux kernels 3.15-rc2 and later won't hang
on boot when calibrating the local APIC.

Reviewed by:	tychon
MFC after:	3 days
2014-09-27 01:15:24 +00:00
royger
494dc32ba6 ddb: allow specifying the exact address of the symtab and strtab
When the FreeBSD kernel is loaded from Xen the symtab and strtab are
not loaded the same way as the native boot loader. This patch adds
three new global variables to ddb that can be used to specify the
exact position and size of those tables, so they can be directly used
as parameters to db_add_symbol_table. A new helper is introduced, so callers
that used to set ksym_start and ksym_end can use this helper to set the new
variables.

It also adds support for loading them from the Xen PVH port, that was
previously missing those tables.

Sponsored by: Citrix Systems R&D
Reviewed by:	kib

ddb/db_main.c:
 - Add three new global variables: ksymtab, kstrtab, ksymtab_size that
   can be used to specify the position and size of the symtab and
   strtab.
 - Use those new variables in db_init in order to call db_add_symbol_table.
 - Move the logic in db_init to db_fetch_symtab in order to set ksymtab,
   kstrtab, ksymtab_size from ksym_start and ksym_end.

ddb/ddb.h:
 - Add prototype for db_fetch_ksymtab.
 - Declate the extern variables ksymtab, kstrtab and ksymtab_size.

x86/xen/pv.c:
 - Add support for finding the symtab and strtab when booted as a Xen
   PVH guest. Since Xen loads the symtab and strtab as NetBSD expects
   to find them we have to adapt and use the same method.

amd64/amd64/machdep.c:
arm/arm/machdep.c:
i386/i386/machdep.c:
mips/mips/machdep.c:
pc98/pc98/machdep.c:
powerpc/aim/machdep.c:
powerpc/booke/machdep.c:
sparc64/sparc64/machdep.c:
 - Use the newly introduced db_fetch_ksymtab in order to set ksymtab,
   kstrtab and ksymtab_size.
2014-09-25 08:28:10 +00:00
bz
7df5be5e6e As per [1] Intel only supports this driver on 64bit platforms.
For now restrict it to amd64.  Other architectures might be
re-added later once tested.

Remove the drivers from the global NOTES and files files and move
them to the amd64 specifics.
Remove the drivers from the i386 modules build and only leave the
amd64 version.

Rather than depending on "inet" depend on "pci" and make sure that
ixl(4) and ixlv(4) can be compiled independently [2].  This also
allows the drivers to build properly on IPv4-only or IPv6-only
kernels.

PR:		193824 [2]
Reviewed by:	eric.joyner intel.com
MFC after:	3 days

References:
[1] http://lists.freebsd.org/pipermail/svn-src-all/2014-August/090470.html
2014-09-23 08:33:03 +00:00
neel
35aaa3ac11 Allow more VMCB fields to be cached:
- CR2
- CR0, CR3, CR4 and EFER
- GDT/IDT base/limit fields
- CS/DS/ES/SS selector/base/limit/attrib fields

The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'.

Restructure the code such that the fields above are only modified in a single
place. This makes it easy to invalidate the VMCB cache when any of these fields
is modified.
2014-09-21 23:42:54 +00:00