Commit Graph

151 Commits

Author SHA1 Message Date
Bjoern A. Zeeb
246c398145 bhyve: Do not remove guest physical addresses from IOMMU host domain
This permits I/O devices on the host to directly access wired memory
dedicated to guests using passthru devices.  Note that wired memory
belonging to guests that do not use passthru devices has always been
accessible by I/O devices on the host.

bhyve maps guest physical addresses into the user address space of
the bhyve process by mmap'ing /dev/vmm/<vmname>.  Device models pass
pointers derived from this mapping directly to system calls such as
preadv() to minimize copies when emulating DMA.  If the backing store
for a device model is a raw host device (e.g. when exporting a raw disk
device such as /dev/ada<n> as a drive in the guest), the host device
driver (e.g. ahci for /dev/ada<n>) can itself use DMA on the host
directly to the guest's memory.  However, if the guest's memory is
not present in the host IOMMU domain, these DMA requests by the host
device will fail without raising an error visible to the host device
driver or to the guest resulting in non-working I/O in the guest.

It is unclear why guest addresses were removed from the IOMMU host domain
initially, especially only for VM's with a passthru device as the
host IOMMU domain does not affect the permissions of passthru devices,
only devices on the host.

A considered alternative was using bounce buffers instead (D34535
is a proof of concept), but that adds additional overhead for unclear
benefit.

This solves a long-standing problem when using passthru devices and
physical disks in the same VM.

Thanks to:	grehan (patience and help)
Thanks to:	jhb (for improving the commit message)
PR:		260178
Reviewed by:	grehan, jhb
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D34607
2022-03-24 15:21:24 +00:00
Corvin Köhne
e47fe3183e bhyve: add ROM emulation
Some PCI devices especially GPUs require a ROM to work properly.
The ROM is executed by boot firmware to initialize the device.
To add a ROM to a device use the new ROM option for passthru device
(e.g. -s passthru,0/2/0,rom=<path>/<to>/<rom>).

It's necessary that the ROM is executed by the boot firmware.
It won't be executed by any OS.
Additionally, the boot firmware should be configured to execute the
ROM file.
For that reason, it's only possible to use a ROM when using
OVMF with enabled bus enumeration.

Differential Revision:	https://reviews.freebsd.org/D33129
Sponsored by:   Beckhoff Automation GmbH & Co. KG
MFC after:      1 month
2022-03-10 12:30:37 +01:00
Robert Wing
73505a1076 vmm: fix "set but not used" warnings 2022-02-28 14:46:08 -09:00
Stefan Eßer
e2650af157 Make CPU_SET macros compliant with other implementations
The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.

Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.

The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).

The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.

This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.

One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.

Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.

The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.

Reviewed by:	kib
MFC after:	2 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D33451
2021-12-30 12:20:32 +01:00
Ka Ho Ng
df95cc76af vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1
In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372
2021-08-02 17:55:08 +08:00
D Scott Phillips
f8a6ec2d57 bhyve: support relocating fbuf and passthru data BARs
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: https://github.com/freebsd/uefi-edk2/pull/9/

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066
2021-03-19 11:04:36 +08:00
Konstantin Belousov
3c48106aaa bhyve: limit max GPA to VM_MAXUSER_ADDRESS_LA48.
We use 4-level EPT pages, correct the upper bound.

Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27402
2020-11-29 10:32:38 +00:00
Peter Grehan
15add60d37 Convert vmm_ops calls to IFUNC
There is no need for these to be function pointers since they are
never modified post-module load.

Rename AMD/Intel ops to be more consistent.

Submitted by:	adam_fenn.io
Reviewed by:	markj, grehan
Approved by:	grehan (bhyve)
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D27375
2020-11-28 01:16:59 +00:00
Mateusz Guzik
543769bf83 amd64: clean up empty lines in .c and .h files 2020-09-01 21:16:54 +00:00
Peter Grehan
46567b4f5e Allow guest device MMIO access from bootmem memory segments.
Recent versions of UEFI have moved local APIC timer initialization into
the early SEC phase which runs out of ROM, prior to self-relocating
into RAM. This results in a hypervisor exit.

Currently bhyve prevents instruction emulation from segments that aren't
marked as "sysmem" aka guest RAM, with the vm_gpa_hold() routine failing.
However, there is no reason for this restriction: the hypervisor already
controls whether EPT mappings are marked as executable.

Fix by dropping the redundant check of sysmem.

MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D25955
2020-08-18 07:08:17 +00:00
John Baldwin
483d953a86 Initial support for bhyve save and restore.
Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed.  In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken).  A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.

To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.

While the current implementation is useful for several uses cases, it
has a few limitations.  The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system).  In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions.  The file format also does not currently support
versioning of individual chunks of state.  As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files.  The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility.  As a result, the current implementation is not enabled
by default.  It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.

Submitted by:	Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by:	Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes:	yes
Sponsored by:	University Politehnica of Bucharest
Sponsored by:	Matthew Grooms (student scholarships)
Sponsored by:	iXsystems
Differential Revision:	https://reviews.freebsd.org/D19495
2020-05-05 00:02:04 +00:00
Conrad Meyer
00d3723fb4 vmm(4): Bump VM_MAX_MEMMAPS for vmgenid
As a short term solution for the problem reported by Shawn Webb re: r359950,
bump the maximum number of memmaps per VM. This structure is 40 bytes, and the
additional four (fixed array embedded in the struct vm) members increase the
size of struct vm by 3%.

(The vast majority of struct vm is the embedded struct vcpu array, which
accounts for 84% of the size -- over 4 kB.)

Reported by:	Shawn Webb <shawn.webb AT hardenedbsd.org>
Reviewed by:	grehan
X-MFC-With:	r359950
Differential Revision:	https://reviews.freebsd.org/D24507
2020-04-19 23:53:47 +00:00
Pawel Biernacki
b40598c539 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (4 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Reviewed by:	kib
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23625
X-Generally looks fine:	jhb
2020-02-15 18:57:49 +00:00
Konstantin Belousov
caab504277 vmm: Add Hygon Dhyana support.
Submitted by:	Pu Wen <puwen@hygon.cn>
Discussed with:	grehan
Reviewed by:	jhb (previous version)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23553
2020-02-13 19:03:12 +00:00
Konstantin Belousov
b837dadd87 bhyve: terminate waiting loops if thread suspension is requested.
PR:	242724
Reviewed by:	markj
Reported and tested by:	Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
	 (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22881
2020-01-02 22:37:04 +00:00
Andriy Gapon
869dbab7ba vmm: remove a wmb() call
After removing wmb(), vm_set_rendezvous_func() became super trivial, so
there was no point in keeping it.

The wmb (sfence on amd64, lock nop on i386) was not needed.  This can be
explained from several points of view.

First, wmb() is used for store-store ordering (although, the primitive
is undocumented).  There was no obvious subsequent store that needed the
barrier.

Second, x86 has a memory model with strong ordering including total
store order.  An explicit store barrier may be needed only when working
with special memory (device, special caching mode) or using special
instructions (non-temporal stores).  That was not the case for this
code.

Third, I believe that there is a misconception that sfence "flushes" the
store buffer in a sense that it speeds up the propagation of stores from
the store buffer to the global visibility.  I think that such
propagation always happens as fast as possible.  sfence only makes
subsequent stores wait for that propagation to complete.  So, sfence is
only useful for ordering of stores and only in the situations described
above.

Reviewed by:	jhb
MFC after:	23 days
Differential Revision: https://reviews.freebsd.org/D21978
2019-10-19 07:10:15 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Ed Maste
490d56c527 vmx: use C99 bool, not boolean_t
Bhyve's vmm is a self-contained modern component and thus a good
candidate for use of C99 types.

Reviewed by:	jhb, kib, markj, Patrick Mooney
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21036
2019-08-01 02:16:48 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Mark Johnston
54a3a11421 Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes.  User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2).  Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks.  In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process.  The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity.  There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded.  For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM.  Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by:	kib, ngie (mlock() test changes)
Tested by:	pho (earlier version)
MFC after:	45 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
Rodney W. Grimes
a488c9c99a Add accessor function for vm->maxcpus
Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.

This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298

Submitted by:		Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by:		Patrick Mooney <patrick.mooney@joyent.com>
Approved by:		bde (mentor), jhb (maintainer)
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D18755
2019-04-25 22:51:36 +00:00
Andrew Turner
27d2645787 Handle a guest executing a vm instruction by trapping and raising an
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.

Found with:	syzkaller
Reviewed by:	araujo, tychon (previous version)
Approved by:	re (kib)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17192
2018-09-27 11:16:19 +00:00
Antoine Brodin
147d12a7d3 vmmdev: return EFAULT when trying to read beyond VM system memory max address
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.

Reviewed by:	imp, grehan, anish
Approved by:	grehan
Differential Revision:	https://reviews.freebsd.org/D15156
2018-05-15 17:20:58 +00:00
Tycho Nightingale
6ac73777ea Add SDT probes to vmexit on Intel.
Submitted by:	domagoj.stolfa_gmail.com
Reviewed by:	grehan, tychon
Sponsored by:	DARPA/AFRL
Differential Revision:	https://reviews.freebsd.org/D14656
2018-04-13 17:23:05 +00:00
Rodney W. Grimes
01d822d33b Add the ability to control the CPU topology of created VMs
from userland without the need to use sysctls, it allows the old
sysctls to continue to function, but deprecates them at
FreeBSD_version 1200060 (Relnotes for deprecate).

The command line of bhyve is maintained in a backwards compatible way.
The API of libvmmapi is maintained in a backwards compatible way.
The sysctl's are maintained in a backwards compatible way.

Added command option looks like:
bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n]
The optional parts can be specified in any order, but only a single
integer invokes the backwards compatible parse.  [,maxcpus=n] is
hidden by #ifdef until kernel support is added, though the api
is put in place.

bhyvectl --get-cpu-topology option added.

Reviewed by:	grehan (maintainer, earlier version),
Reviewed by:	bcr (manpages)
Approved by:	bde (mentor), phk (mentor)
Tested by:	Oleg Ginzburg <olevole@olevole.ru> (cbsd)
MFC after:	1 week
Relnotes:	Y
Differential Revision:	https://reviews.freebsd.org/D9930
2018-04-08 19:24:49 +00:00
John Baldwin
fc276d92ae Add a way to temporarily suspend and resume virtual CPUs.
This is used as part of implementing run control in bhyve's debug
server.  The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor.  Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).

The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server).  The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs.  The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.

Reviewed by:	avg, grehan
Tested on:	Intel (jhb), AMD (avg)
Differential Revision:	https://reviews.freebsd.org/D14466
2018-04-06 22:03:43 +00:00
Konstantin Belousov
bd50262f70 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
Pedro F. Giffuni
c49761dd57 sys/amd64: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:03:07 +00:00
Andriy Gapon
978f3da16f revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
Andriy Gapon
a7b4c009e1 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
John Baldwin
ffe1b10d95 Enable I/O MMU when PCI pass through is first used.
Rather than enabling the I/O MMU when the vmm module is loaded,
defer initialization until the first attempt to pass a PCI device
through to a guest.  If the I/O MMU fails to initialize or is not
present, than fail the attempt to pass a PCI device through to a
guest.

The hw.vmm.force_iommu tunable has been removed since the I/O MMU is
no longer enabled during boot.  However, the I/O MMU support can be
disabled by setting the hw.vmm.iommu.enable tunable to 0 to prevent
use of the I/O MMU on any systems where it is buggy.

Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D7448
2016-08-26 20:15:22 +00:00
John Baldwin
2de70600fa Correct assertion on vcpuid argument to vm_gpa_hold().
PR:		208168
Submitted by:	Dave Cameron <daverabbitz@ihug.co.nz>
Reviewed by:	grehan
MFC after:	1 month
2016-08-03 15:20:10 +00:00
Marcel Moolenaar
6bcf245ebc Bump VM_MAX_MEMSEGS from 2 to 3 to match the number of VM segment
identifiers present in vmmapi.h. In particular, it's now possible
to create a VM_FRAMEBUFFER segment.
2016-02-26 16:18:47 +00:00
Svatopluk Kraus
b352b10400 As <machine/vm.h> is included from <vm/vm.h>, there is no need to
include it explicitly when <vm/vm.h> is already included.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D5380
2016-02-22 09:10:23 +00:00
Svatopluk Kraus
35a0bc1260 As <machine/vmparam.h> is included from <vm/vm_param.h>, there is no
need to include it explicitly when <vm/vm_param.h> is already included.

Suggested by:	alc
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D5379
2016-02-22 09:08:04 +00:00
Neel Natu
9b1aa8d622 Restructure memory allocation in bhyve to support "devmem".
devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.

devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.

Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D2762
MFC after:	4 weeks
2015-06-18 06:00:17 +00:00
Neel Natu
248e6799e9 Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state. This is done by forcing the vcpu to transition to "idle"
by returning to userspace with an exit code of VM_EXITCODE_REQIDLE.

MFC after:      2 weeks
2015-05-28 17:37:01 +00:00
Neel Natu
47b9935d9b Exceptions don't deliver an error code in real mode.
MFC after:	1 week
2015-05-23 01:17:50 +00:00
Neel Natu
1c73ea3ef8 Don't rely on the 'VM-exit instruction length' field in the VMCS to always
have an accurate length on an EPT violation. This is not needed by the
instruction decoding code because it also has to work with AMD/SVM that
does not provide a valid instruction length on a Nested Page Fault.

In collaboration with:	Leon Dang (ldang@nahannisys.com)
Discussed with:		grehan
MFC after:		1 week
2015-05-22 17:34:22 +00:00
Neel Natu
9c4d547896 Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup().
Prior to this change both functions returned 0 for success, -1 for failure
and +1 to indicate that an exception was injected into the guest.

The numerical value of ERESTART also happens to be -1 so when these functions
returned -1 it had to be translated to a positive errno value to prevent the
VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce
bugs when writing emulation code.

Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if
an exception was delivered to the guest. The return value is 0 or EFAULT so
no additional translation is needed.

Reviewed by:	tychon
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D2428
2015-05-06 16:25:20 +00:00
Neel Natu
c07a0648ec When an instruction cannot be decoded just return to userspace so bhyve(8)
can dump the instruction bytes.

Requested by:	grehan
MFC after:	1 week
2015-04-30 21:00:47 +00:00
Tycho Nightingale
ef7c2a82ed Fix "MOVS" instruction memory to MMIO emulation. Currently updates to
%rdi, %rsi, etc are inadvertently bypassed along with the check to
see if the instruction needs to be repeated per the 'rep' prefix.

Add "MOVS" instruction support for the 'MMIO to MMIO' case.

Reviewed by:	neel
2015-04-01 00:15:31 +00:00
Tycho Nightingale
e4f605ee81 When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.

Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.

Reviewed by:	neel
2015-03-24 17:12:36 +00:00
Neel Natu
18a2b08e65 Use lapic_ipi_alloc() to dynamically allocate IPI slots needed by bhyve when
vmm.ko is loaded.

Also relocate the 'justreturn' IPI handler to be alongside all other handlers.

Requested by:	kib
2015-03-14 02:32:08 +00:00
Ryan Stone
a15f820a27 Allow passthrough devices to be hinted.
Allow the ppt driver to attach to devices that were hinted to be
passthrough devices by the PCI code creating them with a driver
name of "ppt".

Add a tunable that allows the IOMMU to be forced to be used.  With
SR-IOV passthrough devices the VFs may be created after vmm.ko is
loaded.  The current code will not initialize the IOMMU in that
case, meaning that the passthrough devices can't actually be used.

Differential Revision:	https://reviews.freebsd.org/D73
Reviewed by:		neel
MFC after: 		1 month
Sponsored by:		Sandvine Inc.
2015-03-01 00:39:48 +00:00
Neel Natu
d087a39935 Simplify instruction restart logic in bhyve.
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.

Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.

bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
  as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
  as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())

Differential Revision:	https://reviews.freebsd.org/D1526
Reviewed by:		grehan
MFC after:		2 weeks
2015-01-18 03:08:30 +00:00
Neel Natu
c9c75df48c 'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.

Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.

Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.

MFC after:      1 week
2015-01-13 22:00:47 +00:00
Neel Natu
2ce1242309 Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception. This matches the CPU behavior.

Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.

Reported by:	David Reed (david.reed@tidalscale.com)
MFC after:	2 weeks
2015-01-06 19:04:02 +00:00
Neel Natu
0dafa5cd4b Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.

Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.

The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D1385
MFC after:	2 weeks
2014-12-30 22:19:34 +00:00