freebsd-dev

Author	SHA1	Message	Date
John Baldwin	483d953a86	Initial support for bhyve save and restore. Save and restore (also known as suspend and resume) permits a snapshot to be taken of a guest's state that can later be resumed. In the current implementation, bhyve(8) creates a UNIX domain socket that is used by bhyvectl(8) to send a request to save a snapshot (and optionally exit after the snapshot has been taken). A snapshot currently consists of two files: the first holds a copy of guest RAM, and the second file holds other guest state such as vCPU register values and device model state. To resume a guest, bhyve(8) must be started with a matching pair of command line arguments to instantiate the same set of device models as well as a pointer to the saved snapshot. While the current implementation is useful for several use cases, it has a few limitations. The file format for saving the guest state is tied to the ABI of internal bhyve structures and is not self-describing (in that it does not communicate the set of device models present in the system). In addition, the state saved for some device models closely matches the internal data structures which might prove a challenge for compatibility of snapshot files across a range of bhyve versions. The file format also does not currently support versioning of individual chunks of state. As a result, the current file format is not a fixed binary format and future revisions to save and restore will break binary compatibility of snapshot files. The goal is to move to a more flexible format that adds versioning, etc. and at that point to commit to providing a reasonable level of compatibility. As a result, the current implementation is not enabled by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option for userland builds, and the kernel option BHYVE_SNAPSHOT. 
Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz Relnotes: yes Sponsored by: University Politehnica of Bucharest Sponsored by: Matthew Grooms (student scholarships) Sponsored by: iXsystems Differential Revision: https://reviews.freebsd.org/D19495	2020-05-05 00:02:04 +00:00
Conrad Meyer	47332982bc	vmm(4): Decode and emulate BEXTR Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads, apparently. Decode and emulate the instruction. Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D24463	2020-04-21 21:34:24 +00:00
Conrad Meyer	cfdea69d24	vmm(4): Decode 3-byte VEX-prefixed instructions Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D24462	2020-04-21 21:33:06 +00:00
Conrad Meyer	00d3723fb4	vmm(4): Bump VM_MAX_MEMMAPS for vmgenid As a short term solution for the problem reported by Shawn Webb re: r359950, bump the maximum number of memmaps per VM. This structure is 40 bytes, and the additional four (fixed array embedded in the struct vm) members increase the size of struct vm by 3%. (The vast majority of struct vm is the embedded struct vcpu array, which accounts for 84% of the size -- over 4 kB.) Reported by: Shawn Webb <shawn.webb AT hardenedbsd.org> Reviewed by: grehan X-MFC-With: r359950 Differential Revision: https://reviews.freebsd.org/D24507	2020-04-19 23:53:47 +00:00
Conrad Meyer	b645fd4531	vmm(4): Expose instruction decode to userspace build Permit instruction decoding logic to be compiled outside of the kernel for rapid iteration and validation. Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D24439	2020-04-16 16:50:33 +00:00
Jung-uk Kim	3ee58df503	Merge ACPICA 20200326.	2020-03-27 00:29:33 +00:00
Michael Reifenberger	1bc51bad2b	Untangle TPR shadowing and APIC virtualization. This speeds up Windows guests tremendously. The patch does: Add a new tunable 'hw.vmm.vmx.use_tpr_shadowing' to disable TPR shadowing. Also add 'hw.vmm.vmx.cap.tpr_shadowing' to be able to query if TPR shadowing is used. Detach the initialization of TPR shadowing from the initialization of APIC virtualization. APIC virtualization still needs TPR shadowing, but not vice versa. Any CPU that supports APIC virtualization should also support TPR shadowing. When TPR shadowing is used, the APIC page of each vCPU is written to the VMCS_VIRTUAL_APIC field of the VMCS so that the CPU can write directly to the page without intercept. On vm exit, vlapic_update_ppr() is called to update the PPR. Submitted by: Yamagi Burmeister MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22942	2020-03-10 16:53:49 +00:00
Pawel Biernacki	b40598c539	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (4 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Reviewed by: kib Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23625 X-Generally looks fine: jhb	2020-02-15 18:57:49 +00:00
Konstantin Belousov	caab504277	vmm: Add Hygon Dhyana support. Submitted by: Pu Wen <puwen@hygon.cn> Discussed with: grehan Reviewed by: jhb (previous version) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23553	2020-02-13 19:03:12 +00:00
Konstantin Belousov	b837dadd87	bhyve: terminate waiting loops if thread suspension is requested. PR: 242724 Reviewed by: markj Reported and tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com> (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22881	2020-01-02 22:37:04 +00:00
John Baldwin	cbd03a9df2	Support software breakpoints in the debug server on Intel CPUs. - Allow the userland hypervisor to intercept breakpoint exceptions (BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to enable this feature. These exceptions are reported to userland via a new VM_EXITCODE_BPT that includes the length of the original breakpoint instruction. If userland wishes to pass the exception through to the guest, it must be explicitly re-injected via vm_inject_exception(). - Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH pseudo-register. Injecting a BP# on Intel requires setting this to the length of the breakpoint instruction. AMD SVM currently ignores writes to this register (but reports success) and fails to read it. - Rework the per-vCPU state tracked by the debug server. Rather than a single 'stepping_vcpu' global, add a structure for each vCPU that tracks state about that vCPU ('stepping', 'stepped', and 'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is currently reporting an event. Event handlers for MTRAP and breakpoint exits loop until the associated event is reported to the debugger. Breakpoint events are discarded if the breakpoint is not present when a vCPU resumes in the breakpoint handler to retry submitting the breakpoint event. - Maintain a linked-list of active breakpoints in response to the GDB 'Z0' and 'z0' packets. Reviewed by: markj (earlier version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D20309	2019-12-13 19:21:58 +00:00
Anish Gupta	84474332d3	bhyve amd: amdvi_dump_cmds() log the command for which the command completion failed. Completion is checked in poll mode although it can be done using interrupts. No need to log all the commands in command ring but only the last one for which completion failed. Reported by: np@freebsd.org Reviewed by: np, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22566	2019-12-01 04:00:08 +00:00
Konstantin Belousov	a7af4a3e7d	amd64: move GDT into PCPU area. Reviewed by: jhb, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22302	2019-11-12 15:51:47 +00:00
Andriy Gapon	869dbab7ba	vmm: remove a wmb() call After removing wmb(), vm_set_rendezvous_func() became super trivial, so there was no point in keeping it. The wmb (sfence on amd64, lock nop on i386) was not needed. This can be explained from several points of view. First, wmb() is used for store-store ordering (although, the primitive is undocumented). There was no obvious subsequent store that needed the barrier. Second, x86 has a memory model with strong ordering including total store order. An explicit store barrier may be needed only when working with special memory (device, special caching mode) or using special instructions (non-temporal stores). That was not the case for this code. Third, I believe that there is a misconception that sfence "flushes" the store buffer in a sense that it speeds up the propagation of stores from the store buffer to the global visibility. I think that such propagation always happens as fast as possible. sfence only makes subsequent stores wait for that propagation to complete. So, sfence is only useful for ordering of stores and only in the situations described above. Reviewed by: jhb MFC after: 23 days Differential Revision: https://reviews.freebsd.org/D21978	2019-10-19 07:10:15 +00:00
Mark Johnston	d3588766e1	Correct the scope of several global variables. They are accessed from multiple compilation units. No functional change intended. MFC after: 1 week Sponsored by: Netflix	2019-09-27 21:04:33 +00:00
Konstantin Belousov	df08823d07	Improve MD page fault handlers. Centralize calculation of signal and ucode delivered on unhandled page fault in new function vm_fault_trap(). MD trap_pfault() now almost always uses the signal numbers and error codes calculated in consistent MI way. This introduces the protection fault compatibility sysctls to all non-x86 architectures which did not have that bug, but apparently they were already much more wrong in selecting delivered signals on protection violations. Change the delivered signal for accesses to mapped area after the backing object was truncated. According to POSIX description for mmap(2): The system shall always zero-fill any partial page at the end of an object. Further, the system shall never write out any modified portions of the last page of an object which are beyond its end. References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. An implementation may generate SIGBUS signals when a reference would cause an error in the mapped object, such as out-of-space condition. Adjust according to the description, keeping the existing compatibility code for SIGSEGV/SIGBUS on protection failures. For situations where kernel cannot handle page fault due to resource limit enforcement, SIGBUS with a new error code BUS_OBJERR is delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on amd64 due to protection key access violation. vm_fault_hold() is renamed to vm_fault(). Fixed some nits in trap_pfault()s like mis-interpreting Mach errors as errnos. Removed unneeded truncations of the fault addresses reported by hardware. PR: 211924 Reviewed by: alc Discussed with: jilles, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21566	2019-09-27 18:43:36 +00:00
Mark Johnston	fee2a2fa39	Change synchronization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficient as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). 
In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accommodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
John Baldwin	6a1e1c2c48	Simplify bhyve vlapic ESR logic. The bhyve virtual local APIC uses an instance-global flag to indicate when an error LVT is being delivered to prevent infinite recursion. Use a function argument instead to reduce the amount of instance-global state. This was inspired by reviewing the bhyve save/restore work, which saves a copy of the instance-global state for each vlapic. Smart OS bug: https://smartos.org/bugview/OS-7777 Submitted by: Patrick Mooney Reviewed by: markj, rgrimes Obtained from: SmartOS / Joyent Differential Revision: https://reviews.freebsd.org/D20365	2019-08-29 18:23:38 +00:00
John Baldwin	e08087ee43	Use get_pcpu() to fetch the current CPU's pcpu pointer. This avoids encoding knowledge about how pcpu objects are allocated and is also a few instructions shorter. MFC after: 2 weeks	2019-08-28 23:40:57 +00:00
Ed Maste	ba084c18de	sys/{x86,amd64}: remove one of doubled ;s MFC after: 1 week	2019-08-13 19:39:36 +00:00
Mark Johnston	13a7c4d478	Use designated initializers for vmm_ops. MFC after: 3 days	2019-08-07 19:45:44 +00:00
Konstantin Belousov	e550631697	bhyve: Ignore MSI/MSI-X interrupts sent to non-active vCPUs in physical destination mode. This is mostly a nop, because the vmm initializes all vCPUs up to vm_maxcpus, so even if the target CPU is not active, lapic/vlapic code still has the valid data to use. As John notes, dropping such interrupts more closely matches the real hardware, which ignores all interrupts for not started APs. Reviewed by: jhb admbugs: 837 MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-08-03 16:57:14 +00:00
Ed Maste	490d56c527	vmx: use C99 bool, not boolean_t Bhyve's vmm is a self-contained modern component and thus a good candidate for use of C99 types. Reviewed by: jhb, kib, markj, Patrick Mooney MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21036	2019-08-01 02:16:48 +00:00
John Baldwin	87c39157c6	Improve the precision of bhyve's vPIT. Use 'struct bintime' instead of 'sbintime_t' to manage times in vPIT to postpone rounding to final results rather than intermediate results. In tests performed by Joyent, this reduced the error measured by Linux guests by 59 ppm. Smart OS bug: https://smartos.org/bugview/OS-6923 Submitted by: Patrick Mooney Reviewed by: rgrimes Obtained from: SmartOS / Joyent MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20335	2019-07-20 15:59:49 +00:00
Konstantin Belousov	026e450262	Fix syntax. Nod from: jhb Sponsored by: The FreeBSD Foundation	2019-07-12 19:14:52 +00:00
Scott Long	422a8a4d3a	Tie the name limit of a VM to SPECNAMELEN from devfs instead of a hard-coded value. Don't allocate space for it from the kernel stack. Account for prefix, suffix, and separator space in the name. This takes the effective length up to 229 bytes on 13-current, and 37 bytes on 12-stable. 37 bytes is enough to hold a full GUID string. PR: 234134 MFC after: 1 week Differential Revision: http://reviews.freebsd.org/D20924	2019-07-12 18:37:56 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
Rodney W. Grimes	e4da41f932	Emulate the "TEST r/m{16,32,64}, imm{16,32,32}" instructions (opcode F7H). This adds emulation for: test r/m16, imm16 test r/m32, imm32 test r/m64, imm32 sign-extended to 64 OpenBSD guests compiled with clang 8.0.0 use TEST directly against a Local APIC register instead of separate read via MOV followed by a TEST against the register. PR: 238794 Submitted by: jhb Reported by: Jason Tubnor jason@tubnor.net Tested by: Jason Tubnor jason@tubnor.net Reviewed by: markj, Patrick Mooney patrick.mooney@joyent.com MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D20755	2019-06-26 21:19:43 +00:00
Scott Long	da761f3b1f	Implement VT-d capability detection on chipsets that have multiple translation units with differing capabilities From the author via Bugzilla: --- When an attempt is made to passthrough a PCI device to a bhyve VM (causing initialisation of IOMMU) on certain Intel chipsets using VT-d the PCI bus stops working entirely. This issue occurs on the E3-1275 v5 processor on C236 chipset and has also been encountered by others on the forums with different hardware in the Skylake series. The chipset has two VT-d translation units. The issue is caused by an attempt to use the VT-d device-IOTLB capability that is supported by only the first unit for devices attached to the second unit which lacks that capability. Only the capabilities of the first unit are checked and are assumed to be the same for all units. Attached is a patch to rectify this issue by determining which unit is responsible for the device being added to a domain and then checking that unit's device-IOTLB capability. In addition to this a few fixes have been made to other instances where the first unit's capabilities are assumed for all units for domains they share. In these cases a mutual set of capabilities is determined. The patch should hopefully fix any bugs for current/future hardware with multiple translation units supporting different capabilities. A description is on the forums at https://forums.freebsd.org/threads/pci-passthrough-bhyve-usb-xhci.65235 The thread includes observations by other users of the bug occurring, and description as well as confirmation of the fix. I'd also like to thank Ordoban for their help. --- Personally tested on a Skylake laptop, Skylake Xeon server, and a Xeon-D-1541, passing through XHCI and NVMe functions. Passthru is hit-or-miss to the point of being unusable without this patch. PR: 229852 Submitted by: callum@aitchison.org MFC after: 1 week	2019-06-19 06:41:07 +00:00
John Baldwin	0d1fd6e541	Support MSI-X for passthrough devices with a separate PBA BAR. pci_alloc_msix() requires both the table and PBA BARs to be allocated by the driver. ppt was only allocating the table BAR so would fail for devices with the PBA in a separate BAR. Fix this by allocating the PBA BAR before pci_alloc_msix() if it is stored in a separate BAR. While here, release BARs after calling pci_release_msi() instead of before. Also, don't call bus_teardown_intr() in error handling code if bus_setup_intr() has just failed. Reported by: gallatin Tested by: gallatin Reviewed by: rgrimes, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20525	2019-06-05 19:30:32 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
John Baldwin	e519cee307	Expose the MD_CLEAR capability used by Intel MDS mitigations to guests. Submitted by: Patrick Mooney <pmooney@pfmooney.com> Reviewed by: kib Tested by: Patrick on SmartOS with Linux and Windows guests Obtained from: Joyent MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D20296	2019-05-18 21:20:38 +00:00
Mark Johnston	54a3a11421	Provide separate accounting for user-wired pages. Historically we have not distinguished between kernel wirings and user wirings for accounting purposes. User wirings (via mlock(2)) were subject to a global limit on the number of wired pages, so if large swaths of physical memory were wired by the kernel, as happens with the ZFS ARC among other things, the limit could be exceeded, causing user wirings to fail. The change adds a new counter, v_user_wire_count, which counts the number of virtual pages wired by user processes via mlock(2) and mlockall(2). Only user-wired pages are subject to the system-wide limit which helps provide some safety against deadlocks. In particular, while sources of kernel wirings typically support some backpressure mechanism, there is no way to reclaim user-wired pages short of killing the wiring process. The limit is exported as vm.max_user_wired, renamed from vm.max_wired, and changed from u_int to u_long. The choice to count virtual user-wired pages rather than physical pages was done for simplicity. There are mechanisms that can cause user-wired mappings to be destroyed while maintaining a wiring of the backing physical page; these make it difficult to accurately track user wirings at the physical page layer. The change also closes some holes which allowed user wirings to succeed even when they would cause the system limit to be exceeded. For instance, mmap() may now fail with ENOMEM in a process that has called mlockall(MCL_FUTURE) if the new mapping would cause the user wiring limit to be exceeded. Note that bhyve -S is subject to the user wiring limit, which defaults to 1/3 of physical RAM. Users that wish to exceed the limit must tune vm.max_user_wired. Reviewed by: kib, ngie (mlock() test changes) Tested by: pho (earlier version) MFC after: 45 days Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19908	2019-05-13 16:38:48 +00:00
Conrad Meyer	fce2d624ea	vmm(4): Pass through RDSEED feature bit to guests Reviewed by: jhb Approved by: #bhyve (jhb) MFC after: 2 leapseconds Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20194	2019-05-08 00:40:08 +00:00
John Baldwin	c2b4cedd78	Emulate the "ADD reg, r/m" instruction (opcode 03H). OVMF's flash variable storage is using add instructions when indexing the variable store bootrom location. Submitted by: D Scott Phillips <d.scott.phillips@intel.com> Reviewed by: rgrimes MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19975	2019-05-03 21:48:42 +00:00
Rodney W. Grimes	a488c9c99a	Add accessor function for vm->maxcpus Replace most VM_MAXCPU constant uses with an accessor function to vm->maxcpus which for now is initialized and kept at the value of VM_MAXCPUS. This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de) work from D10070 to adjust it for the cpu topology changes that occurred in r332298 Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de) Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Approved by: bde (mentor), jhb (maintainer) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D18755	2019-04-25 22:51:36 +00:00
Konstantin Belousov	5db2a4a812	Implement resets for PCI buses and PCIe bridges. For PCI device (i.e. child of a PCI bus), reset tries FLR if implemented and worked, and falls to power reset otherwise. For PCIe bus (child of a PCIe bridge or root port), reset disables PCIe link and then re-trains it, performing what is known as link-level reset. Reviewed by: imp (previous version), jhb (previous version) Sponsored by: Mellanox Technologies MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D19646	2019-04-05 19:25:26 +00:00
John Baldwin	2c352feb3b	Fix missed posted interrupts in VT-x in bhyve. When a vCPU is HLTed, interrupts with a priority below the processor priority (PPR) should not resume the vCPU while interrupts at or above the PPR should. With posted interrupts, bhyve maintains a bitmap of pending interrupts in PIR descriptor along with a single 'pending' bit. This bit is checked by a CPU running in guest mode at various places to determine if it should be checked. In addition, another CPU can force a CPU in guest mode to check for pending interrupts by sending an IPI to a special IDT vector reserved for this purpose. bhyve had a bug in that it would only notify a guest vCPU of an interrupt (e.g. by sending the special IPI or by resuming it if it was idle due to HLT) if an interrupt arrived that was higher priority than PPR and no interrupts were currently pending. This assumed that if the 'pending' bit was set, any needed notification was already in progress. However, if the first interrupt sent to a HLTed vCPU was lower priority than PPR and the second was higher than PPR, the first interrupt would set 'pending' but not notify the vCPU, and the second interrupt would not notify the vCPU because 'pending' was already set. To fix this, track the priority of pending interrupts in a separate per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that is above PPR and higher than any previously-received interrupt. This was found and debugged in the bhyve port to SmartOS maintained by Joyent. Relevant SmartOS bugs with more background: https://smartos.org/bugview/OS-6829 https://smartos.org/bugview/OS-6930 https://smartos.org/bugview/OS-7354 Submitted by: Patrick Mooney <pmooney@pfmooney.com> Reviewed by: tychon, rgrimes Obtained from: SmartOS / Joyent MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19299	2019-03-01 20:43:48 +00:00
Conrad Meyer	d0c7cde53e	vmm(4): Mask Spectre feature bits on AMD hosts For parity with Intel hosts, which already mask out the CPUID feature bits that indicate the presence of the SPEC_CTRL MSR, do the same on AMD. Eventually we may want to have a better support story for guests, but for now, limit the damage of incorrectly indicating an MSR we do not yet support. Eventually, we may want a generic CPUID override system for administrators, or for minimum supported feature set in heterogenous environments with failover. That is a much larger scope effort than this bug fix. PR: 235010 Reported by: Rys Sommefeldt <rys AT sommefeldt.com> Sponsored by: Dell EMC Isilon	2019-01-18 23:54:51 +00:00
Konstantin Belousov	f1dc49f33a	Trim whitespace at EoL, use tabs instead of spaces for indent. PR: 235004 Submitted by: Jose Luis Duran <jlduran@gmail.com> MFC after: 3 days	2019-01-17 05:15:25 +00:00
Conrad Meyer	15b7da10ac	vmm(4): Take steps towards multicore bhyve AMD support vmm's CPUID emulation presented Intel topology information to the guest, but disabled AMD topology information and in some cases passed through garbage. I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but guest CPUs can migrate between host threads, so the information presented was not consistent. This could easily be observed with 'cpucontrol -i 0xfoo /dev/cpuctl0'. Slightly improve this situation by enabling the AMD topology feature flag and presenting at least the CPUID fields used by FreeBSD itself to probe topology on more modern AMD64 hardware (Family 15h+). Older stuff is probably less interesting. I have not been able to empirically confirm it is sufficient, but it should not regress anything either. Reviewed by: araujo (previous version) Relnotes: sure	2019-01-16 02:19:04 +00:00
Konstantin Belousov	2343757338	Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068. SDM rev. 068 was released yesterday and it contains the description of the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for all bits present in the document, and decode them in the CPU identification lines printed on boot. But also, the document defines SSB_NO as bit 4, while FreeBSD used bit 2 to detect the need to work-around Speculative Store Bypass issue. Change code to use the bit from SDM. Similarly, the document describes bit 3 as an indicator that L1TF issue is not present, in particular, no L1D flush is needed on VMENTRY. We used RDCL_NO to avoid flushing, and again I changed the code to follow new spec from SDM. In fact my Apollo Lake machine with latest ucode shows this: IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO> Reviewed by: bwidawsk Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D18006	2018-11-16 21:27:11 +00:00
Marcelo Araujo	ec9e3fb095	Merge cases with upper block. This is a cosmetic change only to simplify code. Reported by: anish Sponsored by: iXsystems Inc.	2018-10-31 01:27:44 +00:00
Marcelo Araujo	5bae7542d4	Emulate machine check related MSR_EXTFEATURES to allow guest OSes to boot on AMD FX Series. PR: 224476 Submitted by: Keita Uchida <m@jgz.jp> Reviewed by: rgrimes Sponsored by: iXsystems Inc. Differential Revision: https://reviews.freebsd.org/D17713	2018-10-30 10:02:23 +00:00
Yuri Pankov	8d56c80545	Provide basic descriptions for VMX exit reason (from "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3"). Add the document to SEE ALSO in bhyve.8 (and pet manlint here a bit). Reviewed by: jhb, rgrimes, 0mp Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D17531	2018-10-27 21:24:28 +00:00
John Baldwin	de679f6efa	Reload the LDT selector after an AMD-v #VMEXIT. cpu_switch() always reloads the LDT, so this can only affect the hypervisor process itself. Fix this by explicitly reloading the host LDT selector after each #VMEXIT. The stock bhyve process on FreeBSD never uses a custom LDT, so this change is cosmetic. Reviewed by: kib Tested by: Mike Tancsa <mike@sentex.net> Approved by: re (gjb) MFC after: 2 weeks	2018-10-15 18:12:25 +00:00
Konstantin Belousov	78a3652794	bhyve: emulate CLFLUSH and CLFLUSHOPT. Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR. I do not see that anything useful can be done except emulating page faults on invalid addresses. Due to the instruction encoding peculiarity, also emulate SFENCE. PR: 232081 Reported by: phk Reviewed by: araujo, avg, jhb (all: previous version) Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17482	2018-10-12 15:30:15 +00:00
John Baldwin	b843f9be5e	Fully restore the GDTR, IDTR, and LDTR after VT-x VM exits. The VT-x VMCS only stores the base address of the GDTR and IDTR. As a result, VM exits use a fixed limit of 0xffff for the host GDTR and IDTR, losing the smaller limits set when the initial GDT is loaded on each CPU during boot. Explicitly save and restore the full GDTR and IDTR contents around VM entries and exits to restore the correct limit. Similarly, explicitly save and restore the LDT selector. VM exits always clear the host LDTR as if the LDT was loaded with a NULL selector and a userspace hypervisor is probably using a NULL selector anyway, but save and restore the LDT explicitly just to be safe. PR: 230773 Reported by: John Levon <levon@movementarian.org> Reviewed by: kib Tested by: araujo Approved by: re (rgrimes) MFC after: 1 week	2018-10-11 18:27:19 +00:00
Andrew Turner	27d2645787	Handle a guest executing a vm instruction by trapping and raising an undefined instruction exception. Previously we would exit the guest, however an unprivileged user could execute these. Found with: syzkaller Reviewed by: araujo, tychon (previous version) Approved by: re (kib) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17192	2018-09-27 11:16:19 +00:00
Konstantin Belousov	c1141fba00	Update L1TF workaround to sustain L1D pollution from NMI. Current mitigation for L1TF in bhyve flushes L1D either by an explicit WRMSR command, or by software reading enough uninteresting data to fully populate all lines of L1D. If NMI occurs after either of methods is completed, but before VM entry, L1D becomes polluted with the cache lines touched by NMI handlers. There is no interesting data which NMI accesses, but something sensitive might be co-located on the same cache line, and then L1TF exposes that to a rogue guest. Use VM entry MSR load list to ensure atomicity of L1D cache and VM entry if updated microcode was loaded. If only software flush method is available, try to help the bhyve sw flusher by also flushing L1D on NMI exit to kernel mode. Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16790	2018-08-19 18:47:16 +00:00
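
The notification rule described in commit 2c352feb3b ("Fix missed posted interrupts in VT-x in bhyve") can be illustrated with a minimal C sketch. This is not the code from that commit: the struct, field, and function names below are hypothetical, and the sketch assumes that "above PPR" means a strictly higher priority class (the upper four bits of the vector). It shows only the decision the message describes: notify the vCPU whenever an arriving interrupt is above the PPR and higher than any interrupt already recorded as pending, regardless of whether the single 'pending' bit is already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-vCPU bookkeeping; the struct and field names are
 * illustrative and are not taken from the bhyve sources.
 */
struct pir_sketch {
	uint16_t pending_prio;	/* bit n set: a class-n interrupt is pending */
	bool	 pending;	/* the single 'pending' bit from the commit text */
};

/* An x86 interrupt vector's priority class is its upper four bits. */
static unsigned
prio_class(uint8_t v)
{
	return (v >> 4);
}

/*
 * Record an arriving interrupt and report whether the target vCPU
 * needs a notification (an IPI, or a wakeup if it is halted).
 */
static bool
post_interrupt(struct pir_sketch *s, uint8_t vector, uint8_t ppr)
{
	unsigned prio = prio_class(vector);
	bool notify;

	assert(s != NULL);

	/*
	 * Notify only if the vector is above the PPR class (assumed
	 * strict) and no interrupt of an equal or higher class has
	 * been recorded already.
	 */
	notify = prio > prio_class(ppr) && (s->pending_prio >> prio) == 0;

	s->pending_prio |= (uint16_t)(1u << prio);
	s->pending = true;
	return (notify);
}
```

With this rule, posting vector 0x30 while the PPR is 0x40 sets 'pending' without notifying, and a later vector 0x50 still triggers a notification, which is the sequence the commit message says was previously missed. The sketch leaves out clearing the bitmask once the vCPU has consumed its pending interrupts.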


504 Commits