Commit Graph

466 Commits

Author SHA1 Message Date
ian
ae3d035140 Improve the performance of the hpet timer in bhyve guests by making the
timer frequency a power of two.  This changes the frequency from 10 to
16.7 MHz (2 ^ 24 HZ).  Using a power of two avoids roundoff errors when
doing arithmetic in sbintime_t units.

Testing shows this can fix erratic ntpd behavior in guests using the
hpet timer (which is the default for multicore guests).

Reported by:	bsam@
2017-10-29 20:50:03 +00:00
jhb
7db3e0e96c Rework pass through changes in r305485 to be safer.
Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated.  The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption.  Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.

As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.

PR:		222937
Tested by:	Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12661
2017-10-27 14:57:14 +00:00
kib
dd1b856d37 Save KGSBASE in pcb before overriding it with the guest value.
Reported by:	lwhsu, mjoras
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	18 days
2017-08-24 10:49:53 +00:00
rlibby
2160513c84 amd-vi: gcc build errors
amdvi_cmp_wait: gcc complained about a malformed string behind an ifdef.

struct amdvi_dte: widen the type of the first reserved bitfield so that
the packed representation would not cross an alignment boundary for that
type. Apparently that causes in-tree gcc (4.2) to insert padding
(despite packed, resulting in a wrong structure definition), and causes
more modern gcc to emit a warning.

ivrs_hdr_iterate_tbl: delete a misleading check about header length
being less than 0 (the type is unsigned) and replace it with a check
that the length doesn't exceed the table size.

Reviewed by:	anish, grehan
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11485
2017-07-07 06:37:19 +00:00
anish
6bb9bc6652 Add AMD IOMMU/AMD-Vi support in bhyve for passthrough/direct assignment to VMs. To enable AMD-Vi, set hw.vmm.amdvi.enable=1.
Reviewed by:bcr
Approved by:grehan
Tested by:rgrimes
Differential Revision:https://reviews.freebsd.org/D10049
2017-04-30 02:08:46 +00:00
avg
7a52acd8b3 revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
avg
04ec8ce247 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
grehan
071fd9390c Hide the AMD MONITORX/MWAITX capability.
Otherwise, recent Linux guests will use these instructions, resulting
in #UD exceptions since bhyve doesn't implement MONITOR/MWAIT exits.

This fixes boot-time hangs in recent Linux guests on Ryzen CPUs
(and probably Bulldozer aka AMD FX as well).

Reviewed by:	kib
MFC after:	1 week
2017-03-16 03:21:42 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
avg
fa73e5b5c7 vmm_dev: work around a bogus error with gcc 6.3.0
The error is:
vmm_dev.c: In function 'alloc_memseg':
vmm_dev.c:261:11: error: null argument where non-null required (argument 1) [-Werror=nonnull]

Apparently, the gcc is unable to figure out that if a ternary operator
produced a non-NULL value once, then the operator with exactly the same
operands would produce the same value again.

MFC after:	1 week
2017-01-20 13:21:27 +00:00
cem
6d240aee91 Fix a variety of cosmetic typos and misspellings
No functional change.

PR:		216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110
Reported by:	Bulat <bltsrc at mail.ru>
Sponsored by:	Dell EMC Isilon
2017-01-15 18:00:45 +00:00
bdrewery
30f99dbeef Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
avg
9127df4438 fix a syntax error in r308039 ...
that I somehow introduced between testing the change
iand committing it.

MFC after:	1 week
X-MFC with:	r307903
2016-10-28 15:57:55 +00:00
avg
6310254212 vmm: another take at maximmum address passed to contigmalloc
Just using vm_paddr_t value with all bits set.
That should work as long as the type is unsigned.

While there, fix a couple of whitespace issues nearby.

MFC after:	1 week
X-MFC with:	r307903
2016-10-28 14:38:01 +00:00
avg
a2af253a41 fix up r307903, use correct max address definition
MFC after:	1 week
X-MFC with:	r307903
2016-10-25 10:59:21 +00:00
avg
7d3b940604 vmm/svm: iopm_bitmap and msr_bitmap must be contiguous in physical memory
To achieve that the whole svm_softc is allocated with contigmalloc now.
It would be more effient to de-embed those arrays and allocate only them
with contigmalloc.

Previously, if malloc(9) used non-contiguous pages for the arrays, then
random bits in physical pages next to the first page would be used to
determine permissions for I/O port and MSR accesses.  That could result
in a guest dangerously modifying the host hardware configuration.

One example is that sometimes NMI watchdog driver in a Linux guest
would be able to configure a performance counter on a host system.
The counter would generate an interrupt and if hwpmc(4) driver is loaded
on the host, then the interrupt would be delivered as an NMI.

Discussed with:	jhb
Reviewed by:	grehan
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D8321
2016-10-25 10:34:14 +00:00
jhb
b83d0562bd Reset PCI pass through devices via PCI-e FLR during VM start and end.
Add routines to trigger a function level reset (FLR) of a PCI-express
device via the PCI-express device control register.  This also includes
support routines to wait for pending transactions to complete as well
as calculating the maximum completion timeout permitted by a device.

Change the ppt(4) driver to reset pass through devices before attaching
to a VM during startup and before detaching from a VM during shutdown.

Reviewed by:	imp, wblock (earlier version)
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7751
2016-09-06 21:15:35 +00:00
jhb
9b7bf59c96 Update the I/O MMU in bhyve when PCI devices are added and removed.
When the I/O MMU is active in bhyve, all PCI devices need valid entries
in the DMAR context tables. The I/O MMU code does a single enumeration
of the available PCI devices during initialization to add all existing
devices to a domain representing the host. The ppt(4) driver then moves
pass through devices in and out of domains for virtual machines as needed.
However, when new PCI devices were added at runtime either via SR-IOV or
HotPlug, the I/O MMU tables were not updated.

This change adds a new set of EVENTHANDLERS that are invoked when PCI
devices are added and deleted. The I/O MMU driver in bhyve installs
handlers for these events which it uses to add and remove devices to
the "host" domain.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7667
2016-09-06 20:17:54 +00:00
jhb
5f56f30076 Leave ppt devices in the host domain when they are not attached to a VM.
This allows a pass through device to be reset to a normal device driver
on the host and reused on the host.  ppt devices are now always active in
some I/O MMU domain when the I/O MMU is active, either the host domain
or the domain of a VM they are attached to.

Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7666
2016-09-06 18:53:17 +00:00
jhb
731da97ac4 Enable I/O MMU when PCI pass through is first used.
Rather than enabling the I/O MMU when the vmm module is loaded,
defer initialization until the first attempt to pass a PCI device
through to a guest.  If the I/O MMU fails to initialize or is not
present, than fail the attempt to pass a PCI device through to a
guest.

The hw.vmm.force_iommu tunable has been removed since the I/O MMU is
no longer enabled during boot.  However, the I/O MMU support can be
disabled by setting the hw.vmm.iommu.enable tunable to 0 to prevent
use of the I/O MMU on any systems where it is buggy.

Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D7448
2016-08-26 20:15:22 +00:00
jhb
b1de97ff2b Correct assertion on vcpuid argument to vm_gpa_hold().
PR:		208168
Submitted by:	Dave Cameron <daverabbitz@ihug.co.nz>
Reviewed by:	grehan
MFC after:	1 month
2016-08-03 15:20:10 +00:00
mav
480936abd1 Increase number of I/O APIC pins from 24 to 32 to give PCI up to 16 IRQs.
Move HPET to the top of the supported 0-31 range.

Proposed by:	jhb@, grehan@
2016-07-14 14:35:25 +00:00
eadler
156fd4834a Don't repeat the the word 'the'
(one manual change to fix grammar)

Confirmed With: db
Approved by: secteam (not really, but this is a comment typo fix)
2016-05-17 12:52:31 +00:00
pfg
826c10b2f3 vmm(4): Small spelling fixes.
Reviewed by:	grehan
2016-05-03 22:07:18 +00:00
anish
3d3fd1fdc9 Allow guest writes to AMD microcode update[0xc0010020] MSR without updating actual hardware MSR. This allows guest microcode update to go through which otherwise failing because wrmsr() was returning EINVAL.
Submitted by:Yamagi Burmeister
Approved by:grehan
MFC after:2 weeks
2016-04-11 05:09:43 +00:00
marcel
b1abdb4699 Bump VM_MAX_MEMSEGS from 2 to 3 to match the number of VM segment
identifiers present in vmmapi.h. In particular, it's now possible
to create a VM_FRAMEBUFFER segment.
2016-02-26 16:18:47 +00:00
skra
bad1d5e697 As <machine/vm.h> is included from <vm/vm.h>, there is no need to
include it explicitly when <vm/vm.h> is already included.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D5380
2016-02-22 09:10:23 +00:00
skra
ad68cf93b1 As <machine/vmparam.h> is included from <vm/vm_param.h>, there is no
need to include it explicitly when <vm/vm_param.h> is already included.

Suggested by:	alc
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D5379
2016-02-22 09:08:04 +00:00
skra
f4b6499ab5 As <machine/pmap.h> is included from <vm/pmap.h>, there is no need to
include it explicitly when <vm/pmap.h> is already included.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D5373
2016-02-22 09:02:20 +00:00
neel
e9213dd046 Move the 'devmem' device nodes from /dev/vmm to /dev/vmm.io
Some external tools just do a 'ls /dev/vmm' to figure out the bhyve virtual
machines on the host. These tools break if the devmem device nodes also
appear in /dev/vmm.

Requested by:	grehan
2015-07-06 19:41:43 +00:00
tychon
d4a6573433 verify_gla() needs to account for non-zero segment base addresses.
Reviewed by:	neel
2015-06-26 18:00:29 +00:00
neel
2c76978d5a Restore the host's GS.base before returning from 'svm_launch()'.
Previously this was done by the caller of 'svm_launch()' after it returned.
This works fine as long as no code is executed in the interim that depends
on pcpu data.

The dtrace probe 'fbt:vmm:svm_launch:return' broke this assumption because
it calls 'dtrace_probe()' which in turn relies on pcpu data.

Reported by:	avg
MFC after:	1 week
2015-06-23 02:17:23 +00:00
neel
8c70d6c7af Restructure memory allocation in bhyve to support "devmem".
devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.

devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.

Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D2762
MFC after:	4 weeks
2015-06-18 06:00:17 +00:00
tychon
3259f3f35f Support guest writes to the TSC by enabling the "use TSC offsetting"
execution control and writing the difference between the host TSC and
the guest TSC into the TSC offset in the VMCS upon encountering a
write.

Reviewed by:	neel
2015-06-09 00:14:47 +00:00
neel
954e6fdc3b The 'verify_gla()' function is used to ensure that the effective address
after decoding the instruction matches the one provided by hardware.

Prior to r283293 'vie->num_valid' used to contain the actual length of
the instruction whereas now it contains the maximum instruction length
possible. This introduced a bug when calculating a RIP-relative base address.

Fix this by using 'vie->num_processed' rather than 'vie->num_valid' as the
length of the emulated instruction.

Reported and tested by:	tychon
MFC after:	1 week
2015-06-05 21:22:26 +00:00
neel
0a2b315713 Use tunable 'hw.vmm.svm.features' to disable specific SVM features even
though they might be available in hardware.

Use tunable 'hw.vmm.svm.num_asids' to limit the number of ASIDs used by
the hypervisor.

MFC after:	1 week
2015-06-04 02:12:23 +00:00
neel
3f2b4fc770 Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state. This is done by forcing the vcpu to transition to "idle"
by returning to userspace with an exit code of VM_EXITCODE_REQIDLE.

MFC after:      2 weeks
2015-05-28 17:37:01 +00:00
neel
f031069b05 Exceptions don't deliver an error code in real mode.
MFC after:	1 week
2015-05-23 01:17:50 +00:00
neel
e6dc875ea5 Remove the verification of instruction length after instruction decode. The
check has been bogus since r273375.

MFC after:	1 week
2015-05-22 21:09:11 +00:00
neel
4d7716dd23 Don't rely on the 'VM-exit instruction length' field in the VMCS to always
have an accurate length on an EPT violation. This is not needed by the
instruction decoding code because it also has to work with AMD/SVM that
does not provide a valid instruction length on a Nested Page Fault.

In collaboration with:	Leon Dang (ldang@nahannisys.com)
Discussed with:		grehan
MFC after:		1 week
2015-05-22 17:34:22 +00:00
jkim
318c4f97e6 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
neel
a0ca6168e6 Emulate the "CMP r/m, reg" instruction (opcode 39H).
Reported and tested by:	Leon Dang (ldang@nahannisys.com)
MFC after:	1 week
2015-05-21 18:23:37 +00:00
neel
7776059e98 Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup().
Prior to this change both functions returned 0 for success, -1 for failure
and +1 to indicate that an exception was injected into the guest.

The numerical value of ERESTART also happens to be -1 so when these functions
returned -1 it had to be translated to a positive errno value to prevent the
VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce
bugs when writing emulation code.

Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if
an exception was delivered to the guest. The return value is 0 or EFAULT so
no additional translation is needed.

Reviewed by:	tychon
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D2428
2015-05-06 16:25:20 +00:00
neel
e5110c0076 Do a proper emulation of guest writes to MSR_EFER.
- Must-Be-Zero bits cannot be set.
- EFER_LME and EFER_LMA should respect the long mode consistency checks.
- EFER_NXE, EFER_FFXSR, EFER_TCE can be set if allowed by CPUID capabilities.
- Flag an error if guest tries to set EFER_LMSLE since bhyve doesn't enforce
  segment limits in 64-bit mode.

MFC after:	2 weeks
2015-05-06 05:40:20 +00:00
neel
46cb5cb412 Emulate the 'CMP r/m8, imm8' instruction encountered when booting a Windows
Vista guest.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	1 week
2015-05-04 04:27:23 +00:00
neel
ed97fa3273 Don't advertise the Intel SMX capability to the guest.
Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	1 week
2015-05-02 19:07:49 +00:00
neel
cd20ad9aa3 Emulate machine check related MSRs to allow guest OSes like Windows to boot.
Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-05-02 04:19:11 +00:00
neel
755b785420 r281630 relaxed the limits on the vectors that can be asserted in the IRRs.
Do the same when transitioning a vector from the IRR to the ISR and also
when extinguishing it from the ISR in response to an EOI.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-05-01 16:00:29 +00:00
neel
c782f2195b Emulate MSR_SYSCFG which is accessed by Linux on AMD cpus when MTRRs are
enabled.

MFC after:	2 weeks
2015-05-01 05:11:14 +00:00
neel
9cbb7c919e Don't require <sys/cpuset.h> to be always included before <machine/vmm.h>.
Only a subset of source files that include <machine/vmm.h> need to use the
APIs that require the inclusion of <sys/cpuset.h>.

MFC after:	1 week
2015-04-30 22:23:22 +00:00
neel
f57c0156d3 When an instruction cannot be decoded just return to userspace so bhyve(8)
can dump the instruction bytes.

Requested by:	grehan
MFC after:	1 week
2015-04-30 21:00:47 +00:00
neel
6641e0d4f2 Advertise the MTRR feature via CPUID and emulate the minimal set of MTRR MSRs.
This is required for booting Windows guests.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-30 19:23:50 +00:00
neel
acc3b0bbd7 Re-implement RTC current time calculation to eliminate the possibility of
losing time.

The problem with the earlier implementation was that the uptime value
used by 'vrtc_curtime()' could be different than the uptime value when
'vrtc_time_update()' actually updated 'base_uptime'.

Fix this by calculating and updating the (rtctime, uptime) tuple together.

MFC after:	2 weeks
2015-04-29 23:44:28 +00:00
neel
42e24b0f80 Emulate the 'bit test' instruction. Windows 7 uses 'bit test' to check the
'Delivery Status' bit in APIC ICR register.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-29 02:01:46 +00:00
neel
0151349b05 Implement the century byte in the RTC. Some guests require this field to be
properly set.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-28 23:44:47 +00:00
tychon
da94200043 STOS/STOSB/STOSW/STOSD/STOSQ instruction emulation.
Reviewed by:	neel
2015-04-25 19:02:06 +00:00
araujo
41717d52d1 Missing break in switch case.
Differential Revision:	D2342
Reviewed by:		neel
2015-04-23 02:50:06 +00:00
neel
4fe75f1860 Relax the check on which vectors can be delivered through the APIC. According
to the Intel SDM vectors 16 through 255 are allowed to be delivered via the
local APIC.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-16 22:44:51 +00:00
neel
6a373c232e Prefer 'vcpu_should_yield()' over checking 'curthread->td_flags' directly.
MFC after:	1 week
2015-04-16 20:15:47 +00:00
tychon
9079cdda46 Enhance the support for Group 1 Extended opcodes:
* Implemement the 0x81 and 0x83 CMP instructions.
  * Implemement the 0x83 AND instruction.
  * Implemement the 0x81 OR instruction.

Reviewed by:	neel
2015-04-06 12:22:41 +00:00
tychon
b3c521e852 Fix "MOVS" instruction memory to MMIO emulation. Currently updates to
%rdi, %rsi, etc are inadvertently bypassed along with the check to
see if the instruction needs to be repeated per the 'rep' prefix.

Add "MOVS" instruction support for the 'MMIO to MMIO' case.

Reviewed by:	neel
2015-04-01 00:15:31 +00:00
neel
4eebcc1fbb Fix the RTC device model to operate correctly in 12-hour mode. The following
table documents the values in the RTC 'hour' field in the two modes:

Hour-of-the-day		12-hour mode	24-hour mode
12	AM		12		0
[1-11]	AM		[1-11]		[1-11]
12	PM		0x80 | 12	12
[1-11]	PM		0x80 | [1-11]	[13-23]

Reported by:	Julian Hsiao (madoka@nyanisore.net)
MFC after:	1 week
2015-03-28 02:55:16 +00:00
tychon
b925086de0 When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.

Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.

Reviewed by:	neel
2015-03-24 17:12:36 +00:00
mav
af4c17529a Report ARAT (APIC-Timer-always-running) feature for virtual CPU.
This makes FreeBSD guest to not avoid using LAPIC timer, preferring HPET
due to worries about non-existing for virtual CPUs deep sleep states.

Benchmarks of usleep(1) on guest and host show such extra latencies:
 - 51us for virtual HPET,
 - 22us for virtual LAPIC timer,
 - 22us for host HPET and
 - 3us for host LAPIC timer.

MFC after:	2 weeks
2015-03-16 11:57:03 +00:00
neel
e16d18eeb8 Use lapic_ipi_alloc() to dynamically allocate IPI slots needed by bhyve when
vmm.ko is loaded.

Also relocate the 'justreturn' IPI handler to be alongside all other handlers.

Requested by:	kib
2015-03-14 02:32:08 +00:00
tychon
eedd9cfb16 When ICW1 is issued the edge sense circuit is reset which means that
following an initialization a low-to-high transistion is necesary to
generate an interrupt.

Reviewed by:	neel
2015-03-06 02:05:45 +00:00
neel
e549774cf3 Fix warnings/errors when building vmm.ko with gcc:
- fix warning about comparison of 'uint8_t v_tpr >= 0' always being true.

- fix error triggered by an empty clobber list in the inline assembly for
  "clgi" and "stgi"

- fix error when compiling "vmload %rax", "vmrun %rax" and "vmsave %rax". The
  gcc assembler does not like the explicit operand "%rax" while the clang
  assembler requires specifying the operand "%rax". Fix this by encoding the
  instructions using the ".byte" directive.

Reported by:	julian
MFC after:	1 week
2015-03-02 20:13:49 +00:00
rstone
29fedc4c2c Allow passthrough devices to be hinted.
Allow the ppt driver to attach to devices that were hinted to be
passthrough devices by the PCI code creating them with a driver
name of "ppt".

Add a tunable that allows the IOMMU to be forced to be used.  With
SR-IOV passthrough devices the VFs may be created after vmm.ko is
loaded.  The current code will not initialize the IOMMU in that
case, meaning that the passthrough devices can't actually be used.

Differential Revision:	https://reviews.freebsd.org/D73
Reviewed by:		neel
MFC after: 		1 month
Sponsored by:		Sandvine Inc.
2015-03-01 00:39:48 +00:00
neel
f458d8769b Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore
capability of VT-x. This lets bhyve run nested in older VMware versions that
don't support the PAT save/restore capability.

Note that the actual value programmed by the guest in MSR_PAT is irrelevant
because bhyve sets the 'Ignore PAT' bit in the nested PTE.

Reported by:	marcel
Tested by:	Leon Dang (ldang@nahannisys.com)
Sponsored by:	Nahanni Systems
MFC after:	2 weeks
2015-02-24 05:35:15 +00:00
kib
0754f0eac9 Add x2APIC support. Enable it by default if CPU is capable. The
hw.x2apic_enable tunable allows disabling it from the loader prompt.

To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications.  This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered.  This
may be changed later.

In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.

Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.

Reviewed by:	neel
Tested by:	pho (real hardware), neel (on bhyve)
Discussed with:	jhb, grehan
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2015-02-09 21:00:56 +00:00
neel
5c373339e4 Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.

Submitted by:	Dmitry Luhtionov (dmitryluhtionov@gmail.com)
2015-01-24 00:35:49 +00:00
neel
e901ab485a MOVS instruction emulation.
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.

Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.

Tested by:	tychon
MFC after:	1 week
2015-01-19 06:53:31 +00:00
neel
d9f07f9841 Simplify instruction restart logic in bhyve.
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.

Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.

bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
  as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
  as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())

Differential Revision:	https://reviews.freebsd.org/D1526
Reviewed by:		grehan
MFC after:		2 weeks
2015-01-18 03:08:30 +00:00
neel
4091be74c6 Fix typo (missing comma).
MFC after:	3 days
2015-01-14 07:18:51 +00:00
neel
5c965bc583 'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.

Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.

Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.

MFC after:      1 week
2015-01-13 22:00:47 +00:00
neel
e72e75f02d Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception. This matches the CPU behavior.

Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.

Reported by:	David Reed (david.reed@tidalscale.com)
MFC after:	2 weeks
2015-01-06 19:04:02 +00:00
neel
b9de0c114c Initialize all fields of 'struct vm_exception exception' before passing it to
vm_inject_exception(). This fixes the issue that 'exception.cpuid' is
uninitialized when calling 'vm_inject_exception()'.

However, in practice this change is a no-op because vm_inject_exception()
does not use 'exception.cpuid' for anything.

Reported by:    Coverity Scan
CID:            1261297
MFC after:      3 days
2014-12-30 23:38:31 +00:00
neel
7aa6460c48 Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.

Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.

The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D1385
MFC after:	2 weeks
2014-12-30 22:19:34 +00:00
neel
2908c65b8a Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on
an AMD/SVM host.

MFC after:	1 week
2014-12-30 02:44:33 +00:00
neel
baa3b938e5 Implement "special mask mode" in vatpic.
OpenBSD guests always enable "special mask mode" during boot. As a result of
r275952 this is flagged as an error and the guest cannot boot.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D1384
MFC after:	1 week
2014-12-28 00:53:52 +00:00
neel
4d52b44708 Allow ktr(4) tracing of all guest exceptions via the tunable
"hw.vmm.trace_guest_exceptions".  To enable this feature set the tunable
to "1" before loading vmm.ko.

Tracing the guest exceptions can be useful when debugging guest triple faults.

Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.

Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".

Discussed with:	grehan
MFC after:	2 weeks
2014-12-23 02:14:49 +00:00
neel
9abef2383d Emulate writes to the IA32_MISC_ENABLE MSR.
PR:		196093
Reported by:	db
Tested by:	db
Discussed with:	grehan
MFC after:	1 week
2014-12-20 19:47:51 +00:00
neel
e4f07b01b4 Various 8259 device model improvements:
- implement 8259 "polled" mode.
- set 'atpic->sfn' if bit 4 in ICW4 is set during master initialization.
- report error if guest tries to enable the "special mask" mode.

Differential Revision:	https://reviews.freebsd.org/D1328
Reviewed by:		tychon
Reported by:		grehan
Tested by:		grehan
MFC after:		1 week
2014-12-20 04:57:45 +00:00
neel
36aadd747e Fix 8259 IRQ priority resolver.
Initialize the 8259 such that IRQ7 is the lowest priority.

Reviewed by:		tychon
Differential Revision:	https://reviews.freebsd.org/D1322
MFC after:		1 week
2014-12-17 03:04:43 +00:00
neel
bdae65463e For level triggered interrupts clear the PIC IRR bit when the interrupt pin
is deasserted. Prior to this change each assertion on a level triggered irq
pin resulted in two interrupts being delivered to the CPU.

Differential Revision:	https://reviews.freebsd.org/D1310
Reviewed by:	tychon
MFC after:	1 week
2014-12-16 06:33:57 +00:00
grehan
205a8351f4 Change the lower bound for guest vmspace allocation to 0 instead of
using the VM_MIN_ADDRESS constant.

HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in
bhyve VM startup failing. Guest memory is always assumed to start
at 0 so use the absolute value instead.

Reported by:	Shawn Webb, lattera at gmail com
Reviewed by:	neel, grehan
Obtained from:	Oliver Pinter via HardenedBSD
23bd719ce1
MFC after:	1 week
2014-11-23 23:07:21 +00:00
araujo
8a51fd3c94 Reported by: Coverity
CID:		1249760
Reviewed by:	neel
Approved by:	neel
Sponsored by:	QNAP Systems Inc.
2014-10-28 07:19:02 +00:00
grehan
4789422fd1 Remove bhyve SVM feature printf's now that they are available in the
general CPU feature detection code.

Reviewed by:	neel
2014-10-27 22:20:51 +00:00
neel
4ee18dab17 Change the type of the first argument to the I/O emulation handlers to
'struct vm *'. Previously it used to be a 'void *' but there is no reason
to hide the actual type from the handler.

Discussed with:	tychon
MFC after:	1 week
2014-10-26 19:03:06 +00:00
neel
eb58530f9b Move the ACPI PM timer emulation into vmm.ko.
This reduces variability during timer calibration by keeping the emulation
"close" to the guest. Additionally having all timer emulations in the kernel
will ease the transition to a per-VM clock source (as opposed to using the
host's uptime keep track of time).

Discussed with:	grehan
2014-10-26 04:44:28 +00:00
neel
a76d8ee4e9 Don't pass the 'error' return from an I/O port handler directly to vm_run().
Most I/O port handlers return -1 to signal an error. If this value is returned
without modification to vm_run() then it leads to incorrect behavior because
'-1' is interpreted as ERESTART at the system call level.

Fix this by always returning EIO to signal an error from an I/O port handler.

MFC after:	1 week
2014-10-26 03:03:41 +00:00
neel
dd2febd6f2 IFC @r273214 2014-10-20 02:57:30 +00:00
neel
53c23ba9a2 IFC @r273206 2014-10-19 23:05:18 +00:00
neel
13e9198693 Don't advertise the "OS visible workarounds" feature in cpuid.80000001H:ECX.
bhyve doesn't emulate the MSRs needed to support this feature at this time.

Don't expose any model-specific RAS and performance monitoring features in
cpuid leaf 80000007H.

Emulate a few more MSRs for AMD: TSEG base address, TSEG address mask and
BIOS signature and P-state related MSRs.

This eliminates all the unimplemented MSRs accessed by Linux/x86_64 kernels
2.6.32, 3.10.0 and 3.17.0.
2014-10-19 21:38:58 +00:00
neel
d5ca76422a Don't advertise support for the NodeID MSR since bhyve doesn't emulate it. 2014-10-18 05:39:32 +00:00
neel
170f49cd34 Don't advertise the Instruction Based Sampling feature because it requires
emulating a large number of MSRs.

Ignore writes to a couple more AMD-specific MSRs and return 0 on read.

This further reduces the unimplemented MSRs accessed by a Linux guest on boot.
2014-10-17 06:23:04 +00:00
neel
b194fc5a2b Hide extended PerfCtr MSRs on AMD processors by clearing bits 23, 24 and 28 in
CPUID.80000001H:ECX.

Handle accesses to PerfCtrX and PerfEvtSelX MSRs by ignoring writes and
returning 0 on reads.

This further reduces the number of unimplemented MSRs hit by a Linux guest
during boot.
2014-10-17 03:04:38 +00:00
neel
8bc5fc92e5 Use the correct fault type (VM_PROT_EXECUTE) for an instruction fetch. 2014-10-16 18:16:31 +00:00
neel
09fbd3f7d9 Fix topology enumeration issues exposed by AMD Bulldozer Family 15h processor.
Initialize CPUID.80000008H:ECX[7:0] with the number of logical processors in
the package. This fixes a panic during early boot in NetBSD 7.0 BETA.

Clear the Topology Extension feature bit from CPUID.80000001H:ECX since we
don't emulate leaves 0x8000001D and 0x8000001E. This fixes a divide by zero
panic in early boot in Centos 6.4.

Tested on an "AMD Opteron 6320" courtesy of Ben Perrault.

Reviewed by:	grehan
2014-10-16 18:13:10 +00:00
davide
e88bd26b3f Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00