Commit Graph

504 Commits

Author SHA1 Message Date
John Baldwin
483d953a86 Initial support for bhyve save and restore.
Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed.  In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken).  A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.

To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.

While the current implementation is useful for several uses cases, it
has a few limitations.  The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system).  In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions.  The file format also does not currently support
versioning of individual chunks of state.  As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files.  The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility.  As a result, the current implementation is not enabled
by default.  It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.

Submitted by:	Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by:	Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes:	yes
Sponsored by:	University Politehnica of Bucharest
Sponsored by:	Matthew Grooms (student scholarships)
Sponsored by:	iXsystems
Differential Revision:	https://reviews.freebsd.org/D19495
2020-05-05 00:02:04 +00:00
Conrad Meyer
47332982bc vmm(4): Decode and emulate BEXTR
Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads,
apparently.  Decode and emulate the instruction.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D24463
2020-04-21 21:34:24 +00:00
Conrad Meyer
cfdea69d24 vmm(4): Decode 3-byte VEX-prefixed instructions
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D24462
2020-04-21 21:33:06 +00:00
Conrad Meyer
00d3723fb4 vmm(4): Bump VM_MAX_MEMMAPS for vmgenid
As a short term solution for the problem reported by Shawn Webb re: r359950,
bump the maximum number of memmaps per VM. This structure is 40 bytes, and the
additional four (fixed array embedded in the struct vm) members increase the
size of struct vm by 3%.

(The vast majority of struct vm is the embedded struct vcpu array, which
accounts for 84% of the size -- over 4 kB.)

Reported by:	Shawn Webb <shawn.webb AT hardenedbsd.org>
Reviewed by:	grehan
X-MFC-With:	r359950
Differential Revision:	https://reviews.freebsd.org/D24507
2020-04-19 23:53:47 +00:00
Conrad Meyer
b645fd4531 vmm(4): Expose instruction decode to userspace build
Permit instruction decoding logic to be compiled outside of the kernel for
rapid iteration and validation.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D24439
2020-04-16 16:50:33 +00:00
Jung-uk Kim
3ee58df503 Merge ACPICA 20200326. 2020-03-27 00:29:33 +00:00
Michael Reifenberger
1bc51bad2b Untangle TPR shadowing and APIC virtualization.
This speeds up Windows guests tremendously.

The patch does:
Add a new tuneable 'hw.vmm.vmx.use_tpr_shadowing' to disable TLP shadowing.
Also add 'hw.vmm.vmx.cap.tpr_shadowing' to be able to query if TPR shadowing is used.

Detach the initialization of TPR shadowing from the initialization of APIC virtualization.
APIC virtualization still needs TPR shadowing, but not vice versa.
Any CPU that supports APIC virtualization should also support TPR shadowing.

When TPR shadowing is used, the APIC page of each vCPU is written to the VMCS_VIRTUAL_APIC field of the VMCS
so that the CPU can write directly to the page without intercept.

On vm exit, vlapic_update_ppr() is called to update the PPR.

Submitted by:	Yamagi Burmeister
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22942
2020-03-10 16:53:49 +00:00
Pawel Biernacki
b40598c539 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (4 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Reviewed by:	kib
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23625
X-Generally looks fine:	jhb
2020-02-15 18:57:49 +00:00
Konstantin Belousov
caab504277 vmm: Add Hygon Dhyana support.
Submitted by:	Pu Wen <puwen@hygon.cn>
Discussed with:	grehan
Reviewed by:	jhb (previous version)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23553
2020-02-13 19:03:12 +00:00
Konstantin Belousov
b837dadd87 bhyve: terminate waiting loops if thread suspension is requested.
PR:	242724
Reviewed by:	markj
Reported and tested by:	Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
	 (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22881
2020-01-02 22:37:04 +00:00
John Baldwin
cbd03a9df2 Support software breakpoints in the debug server on Intel CPUs.
- Allow the userland hypervisor to intercept breakpoint exceptions
  (BP#) in the guest.  A new capability (VM_CAP_BPT_EXIT) is used to
  enable this feature.  These exceptions are reported to userland via
  a new VM_EXITCODE_BPT that includes the length of the original
  breakpoint instruction.  If userland wishes to pass the exception
  through to the guest, it must be explicitly re-injected via
  vm_inject_exception().

- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
  pseudo-register.  Injecting a BP# on Intel requires setting this to
  the length of the breakpoint instruction.  AMD SVM currently ignores
  writes to this register (but reports success) and fails to read it.

- Rework the per-vCPU state tracked by the debug server.  Rather than
  a single 'stepping_vcpu' global, add a structure for each vCPU that
  tracks state about that vCPU ('stepping', 'stepped', and
  'hit_swbreak').  A global 'stopped_vcpu' tracks which vCPU is
  currently reporting an event.  Event handlers for MTRAP and
  breakpoint exits loop until the associated event is reported to the
  debugger.

  Breakpoint events are discarded if the breakpoint is not present
  when a vCPU resumes in the breakpoint handler to retry submitting
  the breakpoint event.

- Maintain a linked-list of active breakpoints in response to the GDB
  'Z0' and 'z0' packets.

Reviewed by:	markj (earlier version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D20309
2019-12-13 19:21:58 +00:00
Anish Gupta
84474332d3 bhyve amd: amdvi_dump_cmds() log the command for which the command completion failed. Completion is checked in poll mode although it can be done using interrupts.
No need to log all the commands in command ring but only the last one for which completion failed.

Reported by: np@freebsd.org
Reviewed by: np, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22566
2019-12-01 04:00:08 +00:00
Konstantin Belousov
a7af4a3e7d amd64: move GDT into PCPU area.
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22302
2019-11-12 15:51:47 +00:00
Andriy Gapon
869dbab7ba vmm: remove a wmb() call
After removing wmb(), vm_set_rendezvous_func() became super trivial, so
there was no point in keeping it.

The wmb (sfence on amd64, lock nop on i386) was not needed.  This can be
explained from several points of view.

First, wmb() is used for store-store ordering (although, the primitive
is undocumented).  There was no obvious subsequent store that needed the
barrier.

Second, x86 has a memory model with strong ordering including total
store order.  An explicit store barrier may be needed only when working
with special memory (device, special caching mode) or using special
instructions (non-temporal stores).  That was not the case for this
code.

Third, I believe that there is a misconception that sfence "flushes" the
store buffer in a sense that it speeds up the propagation of stores from
the store buffer to the global visibility.  I think that such
propagation always happens as fast as possible.  sfence only makes
subsequent stores wait for that propagation to complete.  So, sfence is
only useful for ordering of stores and only in the situations described
above.

Reviewed by:	jhb
MFC after:	23 days
Differential Revision: https://reviews.freebsd.org/D21978
2019-10-19 07:10:15 +00:00
Mark Johnston
d3588766e1 Correct the scope of several global variables.
They are accessed from multiple compilation units.  No functional change
intended.

MFC after:	1 week
Sponsored by:	Netflix
2019-09-27 21:04:33 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
John Baldwin
6a1e1c2c48 Simplify bhyve vlapic ESR logic.
The bhyve virtual local APIC uses an instance-global flag to indicate
when an error LVT is being delivered to prevent infinite recursion.
Use a function argument instead to reduce the amount of instance-global
state.

This was inspired by reviewing the bhyve save/restore work, which
saves a copy of the instance-global state for each vlapic.

Smart OS bug:	https://smartos.org/bugview/OS-7777
Submitted by:	Patrick Mooney
Reviewed by:	markj, rgrimes
Obtained from:	SmartOS / Joyent
Differential Revision:	https://reviews.freebsd.org/D20365
2019-08-29 18:23:38 +00:00
John Baldwin
e08087ee43 Use get_pcpu() to fetch the current CPU's pcpu pointer.
This avoids encoding knowledge about how pcpu objects are allocated and is
also a few instructions shorter.

MFC after:	2 weeks
2019-08-28 23:40:57 +00:00
Ed Maste
ba084c18de sys/{x86,amd64}: remove one of doubled ;s
MFC after:	1 week
2019-08-13 19:39:36 +00:00
Mark Johnston
13a7c4d478 Use designated initializers for vmm_ops.
MFC after:	3 days
2019-08-07 19:45:44 +00:00
Konstantin Belousov
e550631697 bhyve: Ignore MSI/MSI-X interrupts sent to non-active vCPUs in
physical destination mode.

This is mostly a nop, because the vmm initializes all vCPUs up to
vm_maxcpus, so even if the target CPU is not active, lapic/vlapic code
still has the valid data to use.  As John notes, dropping such
interrupts more closely matches the real harware, which ignores all
interrupts for not started APs.

Reviewed by:	jhb
admbugs:	837
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-08-03 16:57:14 +00:00
Ed Maste
490d56c527 vmx: use C99 bool, not boolean_t
Bhyve's vmm is a self-contained modern component and thus a good
candidate for use of C99 types.

Reviewed by:	jhb, kib, markj, Patrick Mooney
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21036
2019-08-01 02:16:48 +00:00
John Baldwin
87c39157c6 Improve the precision of bhyve's vPIT.
Use 'struct bintime' instead of 'sbintime_t' to manage times in vPIT
to postpone rounding to final results rather than intermediate
results.  In tests performed by Joyent, this reduced the error measured
by Linux guests by 59 ppm.

Smart OS bug:	https://smartos.org/bugview/OS-6923
Submitted by:	Patrick Mooney
Reviewed by:	rgrimes
Obtained from:	SmartOS / Joyent
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20335
2019-07-20 15:59:49 +00:00
Konstantin Belousov
026e450262 Fix syntax.
Nod from:	jhb
Sponsored by:	The FreeBSD Foundation
2019-07-12 19:14:52 +00:00
Scott Long
422a8a4d3a Tie the name limit of a VM to SPECNAMELEN from devfs instead of a
hard-coded value. Don't allocate space for it from the kernel stack.
Account for prefix, suffix, and separator space in the name. This
takes the effective length up to 229 bytes on 13-current, and 37 bytes
on 12-stable. 37 bytes is enough to hold a full GUID string.

PR:		234134
MFC after:	1 week
Differential Revision:	http://reviews.freebsd.org/D20924
2019-07-12 18:37:56 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Rodney W. Grimes
e4da41f932 Emulate the "TEST r/m{16,32,64}, imm{16,32,32}" instructions (opcode F7H).
This adds emulation for:
	test r/m16, imm16
	test r/m32, imm32
	test r/m64, imm32 sign-extended to 64

OpenBSD guests compiled with clang 8.0.0 use TEST directly against a
Local APIC register instead of separate read via MOV followed by a
TEST against the register.

PR:		238794
Submitted by:	jhb
Reported by:	Jason Tubnor jason@tubnor.net
Tested by:	Jason Tubnor jason@tubnor.net
Reviewed by:	markj, Patrick Mooney patrick.mooney@joyent.com
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D20755
2019-06-26 21:19:43 +00:00
Scott Long
da761f3b1f Implement VT-d capability detection on chipsets that have multiple
translation units with differing capabilities

From the author via Bugzilla:
---
When an attempt is made to passthrough a PCI device to a bhyve VM
(causing initialisation of IOMMU) on certain Intel chipsets using
VT-d the PCI bus stops working entirely. This issue occurs on the
E3-1275 v5 processor on C236 chipset and has also been encountered
by others on the forums with different hardware in the Skylake
series.

The chipset has two VT-d translation units. The issue is caused by
an attempt to use the VT-d device-IOTLB capability that is
supported by only the first unit for devices attached to the
second unit which lacks that capability. Only the capabilities of
the first unit are checked and are assumed to be the same for all
units.

Attached is a patch to rectify this issue by determining which
unit is responsible for the device being added to a domain and
then checking that unit's device-IOTLB capability. In addition to
this a few fixes have been made to other instances where the first
unit's capabilities are assumed for all units for domains they
share. In these cases a mutual set of capabilities is determined.
The patch should hopefully fix any bugs for current/future
hardware with multiple translation units supporting different
capabilities.

A description is on the forums at
https://forums.freebsd.org/threads/pci-passthrough-bhyve-usb-xhci.65235
The thread includes observations by other users of the bug
occurring, and description as well as confirmation of the fix.
I'd also like to thank Ordoban for their help.

---
Personally tested on a Skylake laptop, Skylake Xeon server, and
a Xeon-D-1541, passing through XHCI and NVMe functions.  Passthru
is hit-or-miss to the point of being unusable without this
patch.

PR: 229852
Submitted by: callum@aitchison.org
MFC after: 1 week
2019-06-19 06:41:07 +00:00
John Baldwin
0d1fd6e541 Support MSI-X for passthrough devices with a separate PBA BAR.
pci_alloc_msix() requires both the table and PBA BARs to be allocated
by the driver.  ppt was only allocating the table BAR so would fail
for devices with the PBA in a separate BAR.  Fix this by allocating
the PBA BAR before pci_alloc_msix() if it is stored in a separate BAR.

While here, release BARs after calling pci_release_msi() instead of
before.  Also, don't call bus_teardown_intr() in error handling code
if bus_setup_intr() has just failed.

Reported by:	gallatin
Tested by:	gallatin
Reviewed by:	rgrimes, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20525
2019-06-05 19:30:32 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
John Baldwin
e519cee307 Expose the MD_CLEAR capability used by Intel MDS mitigations to guests.
Submitted by:	Patrick Mooney <pmooney@pfmooney.com>
Reviewed by:	kib
Tested by:	Patrick on SmartOS with Linux and Windows guests
Obtained from:	Joyent
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D20296
2019-05-18 21:20:38 +00:00
Mark Johnston
54a3a11421 Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes.  User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2).  Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks.  In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process.  The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity.  There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded.  For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM.  Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by:	kib, ngie (mlock() test changes)
Tested by:	pho (earlier version)
MFC after:	45 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
Conrad Meyer
fce2d624ea vmm(4): Pass through RDSEED feature bit to guests
Reviewed by:	jhb
Approved by:	#bhyve (jhb)
MFC after:	2 leapseconds
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20194
2019-05-08 00:40:08 +00:00
John Baldwin
c2b4cedd78 Emulate the "ADD reg, r/m" instruction (opcode 03H).
OVMF's flash variable storage is using add instructions when indexing
the variable store bootrom location.

Submitted by:	D Scott Phillips <d.scott.phillips@intel.com>
Reviewed by:	rgrimes
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19975
2019-05-03 21:48:42 +00:00
Rodney W. Grimes
a488c9c99a Add accessor function for vm->maxcpus
Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.

This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298

Submitted by:		Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by:		Patrick Mooney <patrick.mooney@joyent.com>
Approved by:		bde (mentor), jhb (maintainer)
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D18755
2019-04-25 22:51:36 +00:00
Konstantin Belousov
5db2a4a812 Implement resets for PCI buses and PCIe bridges.
For PCI device (i.e. child of a PCI bus), reset tries FLR if
implemented and worked, and falls to power reset otherwise.

For PCIe bus (child of a PCIe bridge or root port), reset
disables PCIe link and then re-trains it, performing what is known as
link-level reset.

Reviewed by:	imp (previous version), jhb (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D19646
2019-04-05 19:25:26 +00:00
John Baldwin
2c352feb3b Fix missed posted interrupts in VT-x in bhyve.
When a vCPU is HLTed, interrupts with a priority below the processor
priority (PPR) should not resume the vCPU while interrupts at or above
the PPR should.  With posted interrupts, bhyve maintains a bitmap of
pending interrupts in PIR descriptor along with a single 'pending'
bit.  This bit is checked by a CPU running in guest mode at various
places to determine if it should be checked.  In addition, another CPU
can force a CPU in guest mode to check for pending interrupts by
sending an IPI to a special IDT vector reserved for this purpose.

bhyve had a bug in that it would only notify a guest vCPU of an
interrupt (e.g. by sending the special IPI or by resuming it if it was
idle due to HLT) if an interrupt arrived that was higher priority than
PPR and no interrupts were currently pending.  This assumed that if
the 'pending' bit was set, any needed notification was already in
progress.  However, if the first interrupt sent to a HLTed vCPU was
lower priority than PPR and the second was higher than PPR, the first
interrupt would set 'pending' but not notify the vCPU, and the second
interrupt would not notify the vCPU because 'pending' was already set.
To fix this, track the priority of pending interrupts in a separate
per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that
is above PPR and higher than any previously-received interrupt.

This was found and debugged in the bhyve port to SmartOS maintained by
Joyent.  Relevant SmartOS bugs with more background:

https://smartos.org/bugview/OS-6829
https://smartos.org/bugview/OS-6930
https://smartos.org/bugview/OS-7354

Submitted by:	Patrick Mooney <pmooney@pfmooney.com>
Reviewed by:	tychon, rgrimes
Obtained from:	SmartOS / Joyent
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19299
2019-03-01 20:43:48 +00:00
Conrad Meyer
d0c7cde53e vmm(4): Mask Spectre feature bits on AMD hosts
For parity with Intel hosts, which already mask out the CPUID feature
bits that indicate the presence of the SPEC_CTRL MSR, do the same on
AMD.

Eventually we may want to have a better support story for guests, but
for now, limit the damage of incorrectly indicating an MSR we do not yet
support.

Eventually, we may want a generic CPUID override system for
administrators, or for minimum supported feature set in heterogenous
environments with failover.  That is a much larger scope effort than
this bug fix.

PR:		235010
Reported by:	Rys Sommefeldt <rys AT sommefeldt.com>
Sponsored by:	Dell EMC Isilon
2019-01-18 23:54:51 +00:00
Konstantin Belousov
f1dc49f33a Trim whitespace at EoL, use tabs instead of spaces for indent.
PR:  235004
Submitted by:	Jose Luis Duran <jlduran@gmail.com>
MFC after:	3 days
2019-01-17 05:15:25 +00:00
Conrad Meyer
15b7da10ac vmm(4): Take steps towards multicore bhyve AMD support
vmm's CPUID emulation presented Intel topology information to the guest, but
disabled AMD topology information and in some cases passed through garbage.
I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but
guest CPUs can migrate between host threads, so the information presented
was not consistent.  This could easily be observed with 'cpucontrol -i 0xfoo
/dev/cpuctl0'.

Slightly improve this situation by enabling the AMD topology feature flag
and presenting at least the CPUID fields used by FreeBSD itself to probe
topology on more modern AMD64 hardware (Family 15h+).  Older stuff is
probably less interesting.  I have not been able to empirically confirm it
is sufficient, but it should not regress anything either.

Reviewed by:	araujo (previous version)
Relnotes:	sure
2019-01-16 02:19:04 +00:00
Konstantin Belousov
2343757338 Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068.
SDM rev. 068 was released yesterday and it contains the description of
the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for
all bits present in the document, and decode them in the CPU
identification lines printed on boot.

But also, the document defines SSB_NO as bit 4, while FreeBSD used but
2 to detect the need to work-around Speculative Store Bypass
issue.  Change code to use the bit from SDM.

Similarly, the document describes bit 3 as an indicator that L1TF
issue is not present, in particular, no L1D flush is needed on
VMENTRY.  We used RDCL_NO to avoid flushing, and again I changed the
code to follow new spec from SDM.

In fact my Apollo Lake machine with latest ucode shows this:
    IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO>

Reviewed by:	bwidawsk
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18006
2018-11-16 21:27:11 +00:00
Marcelo Araujo
ec9e3fb095 Merge cases with upper block.
This is a cosmetic change only to simplify code.

Reported by:	anish
Sponsored by:	iXsystems Inc.
2018-10-31 01:27:44 +00:00
Marcelo Araujo
5bae7542d4 Emulate machine check related MSR_EXTFEATURES to allow guest OSes to
boot on AMD FX Series.

PR:		224476
Submitted by:	Keita Uchida <m@jgz.jp>
Reviewed by:	rgrimes
Sponsored by:	iXsystems Inc.
Differential Revision:	https://reviews.freebsd.org/D17713
2018-10-30 10:02:23 +00:00
Yuri Pankov
8d56c80545 Provide basic descriptions for VMX exit reason (from "Intel 64 and IA-32
Architectures Software Developer’s Manual Volume 3").  Add the document
to SEE ALSO in bhyve.8 (and pet manlint here a bit).

Reviewed by:	jhb, rgrimes, 0mp
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D17531
2018-10-27 21:24:28 +00:00
John Baldwin
de679f6efa Reload the LDT selector after an AMD-v #VMEXIT.
cpu_switch() always reloads the LDT, so this can only affect the
hypervisor process itself.  Fix this by explicitly reloading the host
LDT selector after each #VMEXIT.  The stock bhyve process on FreeBSD
never uses a custom LDT, so this change is cosmetic.

Reviewed by:	kib
Tested by:	Mike Tancsa <mike@sentex.net>
Approved by:	re (gjb)
MFC after:	2 weeks
2018-10-15 18:12:25 +00:00
Konstantin Belousov
78a3652794 bhyve: emulate CLFLUSH and CLFLUSHOPT.
Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR.
I do not see that anything useful can be done except emulating page
faults on invalid addresses.

Due to the instruction encoding pecularity, also emulate SFENCE.

PR:	232081
Reported by:	phk
Reviewed by:	araujo, avg, jhb (all: previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17482
2018-10-12 15:30:15 +00:00
John Baldwin
b843f9be5e Fully restore the GDTR, IDTR, and LDTR after VT-x VM exits.
The VT-x VMCS only stores the base address of the GDTR and IDTR.  As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR losing the smaller limits set in when the initial GDT is loaded
on each CPU during boot.  Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.

Similarly, explicitly save and restore the LDT selector.  VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.

PR:		230773
Reported by:	John Levon <levon@movementarian.org>
Reviewed by:	kib
Tested by:	araujo
Approved by:	re (rgrimes)
MFC after:	1 week
2018-10-11 18:27:19 +00:00
Andrew Turner
27d2645787 Handle a guest executing a vm instruction by trapping and raising an
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.

Found with:	syzkaller
Reviewed by:	araujo, tychon (previous version)
Approved by:	re (kib)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17192
2018-09-27 11:16:19 +00:00
Konstantin Belousov
c1141fba00 Update L1TF workaround to sustain L1D pollution from NMI.
Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D.  If NMI occurs after either of
methods is completed, but before VM entry, L1D becomes polluted with
the cache lines touched by NMI handlers.  There is no interesting data
which NMI accesses, but something sensitive might be co-located on the
same cache line, and then L1TF exposes that to a rogue guest.

Use VM entry MSR load list to ensure atomicity of L1D cache and VM
entry if updated microcode was loaded.  If only software flush method
is available, try to help the bhyve sw flusher by also flushing L1D on
NMI exit to kernel mode.

Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16790
2018-08-19 18:47:16 +00:00