allocated from exec_map. If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space. Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
the Vahalia' "Unix Internals" section 15.12 "Other TLB Consistency
Algorithms". The same algorithm is already utilized by the MIPS pmap
to handle ASIDs.
The PCID for the address space is now allocated per-cpu during context
switch to the thread using pmap, when no PCID on the cpu was ever
allocated, or the current PCID is invalidated. If the PCID is reused,
bit 63 of %cr3 can be set to avoid TLB flush.
Each cpu has PCID' algorithm generation count, which is saved in the
pmap pcpu block when pcpu PCID is allocated. On invalidation, the
pmap generation count is zeroed, which signals the context switch code
that already allocated PCID is no longer valid. The implication is
the TLB shootdown for the given cpu/address space, due to the
allocation of new PCID.
The pm_save mask is no longer has to be tracked, which (significantly)
reduces the targets of the TLB shootdown IPIs. Previously, pm_save
was reset only on pmap_invalidate_all(), which made it accumulate the
cpuids of all processors on which the thread was scheduled between
full TLB shootdowns.
Besides reducing the amount of TLB shootdowns and removing atomics to
update pm_saves in the context switch code, the algorithm is much
simpler than the maintanence of pm_save and selection of the right
address space in the shootdown IPI handler.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
interacts with interrupts, query ACPI and use MWAIT for entrance into
Cx sleep states. Support C1 "I/O then halt" mode. See Intel'
document 302223-007 "Intelб╝ Processor Vendor-Specific ACPI Interface
Specification" for description.
Move the acpi_cpu_c1() function into x86/cpu_machdep.c and use
it instead of inlining "sti; hlt" sequence in several places.
In the acpi(4) man page, besides documenting the dev.cpu.N.cx_methods
sysctl, correct the names for dev.cpu.N.{cx_usage,cx_lowest,cx_supported}
sysctls.
Both jkim and avg have some other patches implementing the mwait
functionality; this work is unrelated. Linux does not rely on the
ACPI to provide correct tables describing Cx modes. Instead, the
driver has pre-defined knowledge of the CPU models, it was supplied by
Intel.
Tested by: pho (previous versions)
Sponsored by: The FreeBSD Foundation
This is done explicitly because a vcpu thread can be in a critical section
for the entire time slice alloted to it. This in turn can delay the handling
of the 'td_owepreempt'.
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D2430
Prior to this change both functions returned 0 for success, -1 for failure
and +1 to indicate that an exception was injected into the guest.
The numerical value of ERESTART also happens to be -1 so when these functions
returned -1 it had to be translated to a positive errno value to prevent the
VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce
bugs when writing emulation code.
Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if
an exception was delivered to the guest. The return value is 0 or EFAULT so
no additional translation is needed.
Reviewed by: tychon
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D2428
- Must-Be-Zero bits cannot be set.
- EFER_LME and EFER_LMA should respect the long mode consistency checks.
- EFER_NXE, EFER_FFXSR, EFER_TCE can be set if allowed by CPUID capabilities.
- Flag an error if guest tries to set EFER_LMSLE since bhyve doesn't enforce
segment limits in 64-bit mode.
MFC after: 2 weeks
Do the same when transitioning a vector from the IRR to the ISR and also
when extinguishing it from the ISR in response to an EOI.
Reported by: Leon Dang (ldang@nahannisys.com)
MFC after: 2 weeks
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.
Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes
losing time.
The problem with the earlier implementation was that the uptime value
used by 'vrtc_curtime()' could be different than the uptime value when
'vrtc_time_update()' actually updated 'base_uptime'.
Fix this by calculating and updating the (rtctime, uptime) tuple together.
MFC after: 2 weeks
- Vmbus multi channel support.
- Vector interrupt support.
- Signal optimization.
- Storvsc driver performance improvement.
- Scatter and gather support for storvsc driver.
- Minor bug fix for KVP driver.
Thanks royger, jhb and delphij from FreeBSD community for the reviews
and comments. Also thanks Hovy Xu from NetApp for the contributions to
the storvsc driver.
PR: 195238
Submitted by: whu
Reviewed by: royger, jhb, delphij
Approved by: royger
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Microsoft OSTC
sys/amd64/amd64/mp_machdep.c, to the new common x86 source
sys/x86/x86/mp_x86.c.
Proposed and reviewed by: jhb
Review: https://reviews.freebsd.org/D2347
Sponsored by: The FreeBSD Foundation
sys/i386/i386/machdep.c to new file sys/x86/x86/cpu_machdep.c. Most
of the code is related to the idle handling.
Discussed with: pluknet
Sponsored by: The FreeBSD Foundation
shows no difference with the code removed.
On both amd64 and i386, assert that a released pmap is not active.
Proposed and reviewed by: alc
Discussed with: Svatopluk Kraus <onwahe@gmail.com>, peter
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
to the Intel SDM vectors 16 through 255 are allowed to be delivered via the
local APIC.
Reported by: Leon Dang (ldang@nahannisys.com)
MFC after: 2 weeks
unroll the loop in ENTRY(pagezero)
acc' to the submitter this results in a reproducible 1% perf
improvement under buildworld like workload
I validated correctness and run-testing, but not performance impact
Submitted by: lidl@pix.net
Reviewed by: adrian
PR: 199151
MFC After: 1 month
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.
Note to self: When this is MFCed, sparc64 needs the same fix.
Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks
%rdi, %rsi, etc are inadvertently bypassed along with the check to
see if the instruction needs to be repeated per the 'rep' prefix.
Add "MOVS" instruction support for the 'MMIO to MMIO' case.
Reviewed by: neel
on Intel processors. Clear spurious dependency by explicitely xoring
the destination register of popcnt.
Use bitcount64() instead of re-implementing SWAR locally, for
processors without popcnt instruction.
Reviewed by: jhb
Discussed with: jilles (previous version)
Sponsored by: The FreeBSD Foundation
rather than 20. The MP 1.4 specification states in Appendix B.2:
"A period of 20 microseconds should be sufficient for IPI dispatch to
complete under normal operating conditions".
(Note that this appears to be separate from the 10 millisecond (INIT) and
200 microsecond (STARTUP) waits after the IPIs are dispatched.) The
Intel SDM is silent on this issue as far as I can tell.
At least some hardware requires 60 microseconds as noted in the PR, so
bump this to 100 to be on the safe side.
PR: 197756
Reported by: zaphod@berentweb.com
MFC after: 1 week
originated from the return to usermode. #ss must be handled same as
#np.
Reported by: Andrew Lutomirski through secteam
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
code segment base address.
Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.
Reviewed by: neel
translation. In particular, despite IO-APICs only take 8bit apic id,
IR translation structures accept 32bit APIC Id, which allows x2APIC
mode to function properly. Extend msi_cpu of struct msi_intrsrc and
io_cpu of ioapic_intsrc to full int from one byte.
KPI of IR is isolated into the x86/iommu/iommu_intrmap.h, to avoid
bringing all dmar headers into interrupt code. The non-PCI(e) devices
which generate message interrupts on FSB require special handling. The
HPET FSB interrupts are remapped, while DMAR interrupts are not.
For each msi and ioapic interrupt source, the iommu cookie is added,
which is in fact index of the IRE (interrupt remap entry) in the IR
table. Cookie is made at the source allocation time, and then used at
the map time to fill both IRE and device registers. The MSI
address/data registers and IO-APIC redirection registers are
programmed with the special values which are recognized by IR and used
to restore the IRE index, to find proper delivery mode and target.
Map all MSI interrupts in the block when msi_map() is called.
Since an interrupt source setup and dismantle code are done in the
non-sleepable context, flushing interrupt entries cache in the IR
hardware, which is done async and ideally waits for the interrupt,
requires busy-wait for queue to drain. The dmar_qi_wait_for_seq() is
modified to take a boolean argument requesting busy-wait for the
written sequence number instead of waiting for interrupt.
Some interrupts are configured before IR is initialized, e.g. ACPI
SCI. Add intr_reprogram() function to reprogram all already
configured interrupts, and call it immediately before an IR unit is
enabled. There is still a small window after the IO-APIC redirection
entry is reprogrammed with cookie but before the unit is enabled, but
to fix this properly, IR must be started much earlier.
Add workarounds for 5500 and X58 northbridges, some revisions of which
have severe flaws in handling IR. Use the same identification methods
as employed by Linux.
Review: https://reviews.freebsd.org/D1892
Reviewed by: neel
Discussed with: jhb
Tested by: glebius, pho (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
- Split the driver into independent pf and vf loadables. This is
in preparation for SRIOV support which will be following shortly.
This also allows us to keep a seperate revision control over the
two parts, making for easier sustaining.
- Make the TX/RX code a shared/seperated file, in the old code base
the ixv code would miss fixes that went into ixgbe, this model
will eliminate that problem.
- The driver loadables will now match the device names, something that
has been requested for some time.
- Rather than a modules/ixgbe there is now modules/ix and modules/ixv
- It will also be possible to make your static kernel with only one
or the other for streamlined installs, or both.
Enjoy!
Submitted by: jfv and erj
This makes FreeBSD guest to not avoid using LAPIC timer, preferring HPET
due to worries about non-existing for virtual CPUs deep sleep states.
Benchmarks of usleep(1) on guest and host show such extra latencies:
- 51us for virtual HPET,
- 22us for virtual LAPIC timer,
- 22us for host HPET and
- 3us for host LAPIC timer.
MFC after: 2 weeks
- fix warning about comparison of 'uint8_t v_tpr >= 0' always being true.
- fix error triggered by an empty clobber list in the inline assembly for
"clgi" and "stgi"
- fix error when compiling "vmload %rax", "vmrun %rax" and "vmsave %rax". The
gcc assembler does not like the explicit operand "%rax" while the clang
assembler requires specifying the operand "%rax". Fix this by encoding the
instructions using the ".byte" directive.
Reported by: julian
MFC after: 1 week
Implement the interace to create SR-IOV Virtual Functions (VFs).
When a driver registers that they support SR-IOV by calling
pci_setup_iov(), the SR-IOV code creates a new node in /dev/iov
for that device. An ioctl can be invoked on that device to
create VFs and have the driver initialize them.
At this point, allocating memory I/O windows (BARs) is not
supported.
Differential Revision: https://reviews.freebsd.org/D76
Reviewed by: jhb
MFC after: 1 month
Sponsored by: Sandvine Inc.
Allow the ppt driver to attach to devices that were hinted to be
passthrough devices by the PCI code creating them with a driver
name of "ppt".
Add a tunable that allows the IOMMU to be forced to be used. With
SR-IOV passthrough devices the VFs may be created after vmm.ko is
loaded. The current code will not initialize the IOMMU in that
case, meaning that the passthrough devices can't actually be used.
Differential Revision: https://reviews.freebsd.org/D73
Reviewed by: neel
MFC after: 1 month
Sponsored by: Sandvine Inc.
x2APIC mode is detected and enabled. Current theory is that switching
the APIC mode while an IPI is in flight might be the issue.
Postpone switching to x2APIC mode until we are guaranteed that all
starting IPIs are already send and aknowledged. Use aps_ready signal
as an indication that the BSP is done with us.
Tested by: adrian
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
capability of VT-x. This lets bhyve run nested in older VMware versions that
don't support the PAT save/restore capability.
Note that the actual value programmed by the guest in MSR_PAT is irrelevant
because bhyve sets the 'Ignore PAT' bit in the nested PTE.
Reported by: marcel
Tested by: Leon Dang (ldang@nahannisys.com)
Sponsored by: Nahanni Systems
MFC after: 2 weeks
FPU state to avoid passing a negative length to fpusetregs() / npxsetregs().
Differential Revision: https://reviews.freebsd.org/D1861
Reviewed by: kib, emaste
Remove unneeded disable of LAPIC in the native_lapic_xapic_mode(). We
attempt to send wakeup IPI on the resume path right after BSP wakeup,
so disabling is wrong.
Reported and tested by: glebius, "Ranjan1018 ." <214748mv@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
hw.x2apic_enable tunable allows disabling it from the loader prompt.
To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications. This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered. This
may be changed later.
In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.
Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.
Reviewed by: neel
Tested by: pho (real hardware), neel (on bhyve)
Discussed with: jhb, grehan
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
Intel Multiprocessor Specification v1.4. The Intel SDM claims that
the INIT IPIs here are invalid, but other systems follow the MP
spec instead.
While here, fix the IPI wait routine to accept a timeout in microseconds
instead of a raw spin count, and don't spin forever during AP startup.
Instead, panic if a STARTUP IPI is not delivered after 20 us.
PR: 196542
Differential Revision: https://reviews.freebsd.org/D1719
MFC after: 2 weeks
KVM clock shares the same data structures between the guest and the host
as Xen so it makes sense to just have a single copy of this code.
Differential Revision: https://reviews.freebsd.org/D1429
Reviewed by: royger (eariler version)
MFC after: 1 month
const. On x86, even after the machine context is supposedly read into
the struct ucontext, lazy FPU state save code might only mark the FPU
data as hardware-owned. Later, set_fpcontext() needs to fetch the
state from hardware, modifying the *mcp.
The set_mcontext(9) is called from sigreturn(2) and setcontext(2)
implementations and old create_thread(2) interface, which throw the
*mcp out after the set_mcontext() call.
Reported by: dim
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Current code requires that the first physical memory segment starts at 0,
but this is not really needed. We only need to make sure the bootstrap code
and page tables for APs are allocated below 4GB.
This patch removes this requirement and allows booting a Dell R710 from
UEFI, where the first physical memory segment starts at 0x10000.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D1417
each GB of RAM tested so people watching the console can see that
the machine is making progress and not hung.
PR: 196650
Submitted by: Ravi Pokala <rpokala@panasas.com>
Suggestions from: Eric van Gyzen <eric@vangyzen.net>
MFC after: 2 weeks
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.
Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.
Tested by: tychon
MFC after: 1 week
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.
Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.
bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())
Differential Revision: https://reviews.freebsd.org/D1526
Reviewed by: grehan
MFC after: 2 weeks
Implement a subset of the multiboot specification in order to boot Xen
and a FreeBSD Dom0 from the FreeBSD bootloader. This multiboot
implementation is tailored to boot Xen and FreeBSD Dom0, and it will
most surely fail to boot any other multiboot compilant kernel.
In order to detect and boot the Xen microkernel, two new file formats
are added to the bootloader, multiboot and multiboot_obj. Multiboot
support must be tested before regular ELF support, since Xen is a
multiboot kernel that also uses ELF. After a multiboot kernel is
detected, all the other loaded kernels/modules are parsed by the
multiboot_obj format.
The layout of the loaded objects in memory is the following; first the
Xen kernel is loaded as a 32bit ELF into memory (Xen will switch to
long mode by itself), after that the FreeBSD kernel is loaded as a RAW
file (Xen will parse and load it using it's internal ELF loader), and
finally the metadata and the modules are loaded using the native
FreeBSD way. After everything is loaded we jump into Xen's entry point
using a small trampoline. The order of the multiboot modules passed to
Xen is the following, the first module is the RAW FreeBSD kernel, and
the second module is the metadata and the FreeBSD modules.
Since Xen will relocate the memory position of the second
multiboot module (the one that contains the metadata and native
FreeBSD modules), we need to stash the original modulep address inside
of the metadata itself in order to recalculate its position once
booted. This also means the metadata must come before the loaded
modules, so after loading the FreeBSD kernel a portion of memory is
reserved in order to place the metadata before booting.
In order to tell the loader to boot Xen and then the FreeBSD kernel the
following has to be added to the /boot/loader.conf file:
xen_cmdline="dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga"
xen_kernel="/boot/xen"
The first argument contains the command line that will be passed to the Xen
kernel, while the second argument is the path to the Xen kernel itself. This
can also be done manually from the loader command line, by for example
typing the following set of commands:
OK unload
OK load /boot/xen dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga
OK load kernel
OK load zfs
OK load if_tap
OK load ...
OK boot
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D517
For the Forth bits:
Submitted by: Julien Grall <julien.grall AT citrix.com>
only compile in those options in GENERIC that cannot be loaded as
modules. ufs is still included because many of its options aren't
present in the kernel module. There's some other exceptions documented
in the file. This is part of some work to get more things
automatically loading in the hopes of obsoleting GENERIC one day.
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.
Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.
Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.
MFC after: 1 week
For /dev/mem, when requested physical address is not accessible by the
direct map, do temporal remaping with the caching attribute
'uncached'. Limit the accessible addresses by MAXPHYADDR, since the
architecture disallowes writing non-zero into reserved bits of ptes
(or setting garbage into NX).
For /dev/kmem, only access existing kernel mappings for direct map
region. For all other addresses, obtain a physical address of the
mapping and fall back to the /dev/mem mechanism. This ensures that
/dev/kmem i/o does not fault even if the accessed region is changed in
parallel, by using either direct map or temporal mapping.
For both devices, operate on one page by iteration. Do not return
error if any bytes were moved around, return the (partial) bytes count
to userspace.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.
PR: 193873
Differential Revision: https://reviews.freebsd.org/D904
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by: jhibbits (earlier version)
Sponsored by: EMC / Isilon Storage Division
emulated or when the vcpu incurs an exception. This matches the CPU behavior.
Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.
Reported by: David Reed (david.reed@tidalscale.com)
MFC after: 2 weeks
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST). Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.
PR: 192316
Differential Revision: https://reviews.freebsd.org/D1441
No objection from: jkim
MFC after: 1 week
physical address zero. Assume that the lowest page is always mapped
by direct map.
This restores access to the page at zero through /dev/mem after
r263475.
Reported and tested by: neel
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
vm_inject_exception(). This fixes the issue that 'exception.cpuid' is
uninitialized when calling 'vm_inject_exception()'.
However, in practice this change is a no-op because vm_inject_exception()
does not use 'exception.cpuid' for anything.
Reported by: Coverity Scan
CID: 1261297
MFC after: 3 days
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.
Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.
The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>
Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D1385
MFC after: 2 weeks
OpenBSD guests always enable "special mask mode" during boot. As a result of
r275952 this is flagged as an error and the guest cannot boot.
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D1384
MFC after: 1 week
setting call gate, which must be 64 bit, put a code segment descriptor
into ldt slot 0.
This way, syscall shim does not switch temporary to 64bit trampoline,
and does not create a window where signal delivery interrupts 64 bit
mode (signal handler cannot return). The cost is shim running with
non-zero based segment in %cs, which requires vfork() handling make
more assumptions.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It's redundant at the moment since it can be obtained from the trapframe
on the architectures where DTrace is supported, but this won't be the case
with ARM.
"hw.vmm.trace_guest_exceptions". To enable this feature set the tunable
to "1" before loading vmm.ko.
Tracing the guest exceptions can be useful when debugging guest triple faults.
Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.
Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".
Discussed with: grehan
MFC after: 2 weeks
- implement 8259 "polled" mode.
- set 'atpic->sfn' if bit 4 in ICW4 is set during master initialization.
- report error if guest tries to enable the "special mask" mode.
Differential Revision: https://reviews.freebsd.org/D1328
Reviewed by: tychon
Reported by: grehan
Tested by: grehan
MFC after: 1 week
Initialize the 8259 such that IRQ7 is the lowest priority.
Reviewed by: tychon
Differential Revision: https://reviews.freebsd.org/D1322
MFC after: 1 week
When returning to usermode, the handler for that exceptions is also
executed with wrong gs base. Handle all three possible faults in the
same way, checking for iret fault, and performing full iret.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
is deasserted. Prior to this change each assertion on a level triggered irq
pin resulted in two interrupts being delivered to the CPU.
Differential Revision: https://reviews.freebsd.org/D1310
Reviewed by: tychon
MFC after: 1 week
WITNESS and INVARIANTS checking, which are known to have significant
performance impact on running systems. When benchmarking new features
this kernel should be used instead of the standard GENERIC.
This kernel configuration should never appear outside of the HEAD
of the FreeBSD tree.
using the VM_MIN_ADDRESS constant.
HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in
bhyve VM startup failing. Guest memory is always assumed to start
at 0 so use the absolute value instead.
Reported by: Shawn Webb, lattera at gmail com
Reviewed by: neel, grehan
Obtained from: Oliver Pinter via HardenedBSD
23bd719ce1
MFC after: 1 week
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
to match what Linux does in that 1) it dumps the entire XSAVE area
including the fxsave state, and 2) it stashes a copy of the current
xsave mask in the unused padding between the fxsave state and the
xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
only the extra portion. This avoids having to always make two
ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
and the current XSAVE mask (%xcr0).
Differential Revision: https://reviews.freebsd.org/D1193
Reviewed by: kib
MFC after: 2 weeks
It is automatically set when -fPIC is passed to the compiler.
Reviewed by: dim, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1179
on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings. To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().
Discussed with: Svatopluk Kraus
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
In vt_efifb_init the framebuffer's physaddr is passed to PHYS_TO_DMAP
before the DMAP is setup. The result is not actually accessed until
after the mapping is setup, though. Loosen the assertion in PHYS_TO_DMAP
for now, to allow use when dmaplimit == 0.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1142
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.
No objections from: net@
Create a proper stack frame for amd64 version of bcopy(). Note that
this also makes the stack properly aligned in the function, despite it
is not strictly needed.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
support for AVX on i386.
- Similar to amd64, move the FPU save area out of the PCB and instead
store saved FPU state in a variable-sized buffer after the PCB on the
stack.
- To support the variable PCB location, alter the locore code to only use
the bottom-most page of proc0stack for init386(). init386() returns
the correct stack pointer to locore which adjusts the stack for thread0
before calling mi_startup().
- Don't bother setting cr3 in thread0's pcb in locore before calling
init386(). It wasn't used (init386() overwrote it at the end) and
it doesn't work with the variable-sized FPU save area.
- Remove the new-bus attachment from npx. This was only ever useful for
external co-processors using IRQ13, but those have not been supported
for several years. npxinit() is now called much earlier during boot
(init386()) similar to amd64.
- Implement PT_{GET,SET}XSTATE and I386_GET_XFPUSTATE.
- npxsave() is now only called from context switch contexts so it can
use XSAVEOPT.
Differential Revision: https://reviews.freebsd.org/D1058
Reviewed by: kib
Tested on: FreeBSD/i386 VM under bhyve on Intel i5-2520
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
the 0x40000000 leaf to determine the hypervisor vendor ID. Export the
vendor ID and the highest supported hypervisor CPUID leaf via
hv_vendor[] and hv_high variables, respectively. The hv_vendor[]
array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
identcpu.c. Add a VM_GUEST_VMWARE to identify vmware and use that in
the TSC code to identify VMWare.
Differential Revision: https://reviews.freebsd.org/D1010
Reviewed by: delphij, jkim, neel
and casuword(9), but do not mix value read and indication of fault.
I know (or remember) enough assembly to handle x86 and powerpc. For
arm, mips and sparc64, implement fueword() and casueword() as wrappers
around fuword() and casuword(), which means that the functions cannot
distinguish between -1 and fault.
On architectures where fueword() and casueword() are native, implement
fuword() and casuword() using fueword() and casuword(), to reduce
assembly code duplication.
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks (ia64 needs treating)
'struct vm *'. Previously it used to be a 'void *' but there is no reason
to hide the actual type from the handler.
Discussed with: tychon
MFC after: 1 week
This reduces variability during timer calibration by keeping the emulation
"close" to the guest. Additionally having all timer emulations in the kernel
will ease the transition to a per-VM clock source (as opposed to using the
host's uptime keep track of time).
Discussed with: grehan
Most I/O port handlers return -1 to signal an error. If this value is returned
without modification to vm_run() then it leads to incorrect behavior because
'-1' is interpreted as ERESTART at the system call level.
Fix this by always returning EIO to signal an error from an I/O port handler.
MFC after: 1 week
Place the code introduced in r268660 into a separate function that can be
called from uiomove_fromphys. Instead of pre-allocating two KVA pages use
vmem_alloc to allocate them on demand when needed. This prevents blocking if
a page fault is taken while physical addresses from outside the DMAP are
used, since the lock is now removed.
Also introduce a safety catch in PHYS_TO_DMAP and DMAP_TO_PHYS.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D947
amd64/amd64/pmap.c:
- Factor out the code to deal with non DMAP addresses from pmap_copy_pages
and place it in pmap_map_io_transient.
- Change the code to use vmem_alloc instead of a set of pre-allocated
pages.
- Use pmap_qenter and don't pin the thread if there can be page faults.
amd64/amd64/uio_machdep.c:
- Use pmap_map_io_transient in order to correctly deal with physical
addresses not covered by the DMAP.
amd64/include/pmap.h:
- Add the prototypes for the new functions.
amd64/include/vmparam.h:
- Add safety catches to make sure PHYS_TO_DMAP and DMAP_TO_PHYS are only
used with addresses covered by the DMAP.
This device is only attached to priviledged domains, and allows the
toolstack to interact with Xen. The two functions of the privcmd
interface is to allow the execution of hypercalls from user-space, and
the mapping of foreign domain memory.
Sponsored by: Citrix Systems R&D
i386/include/xen/hypercall.h:
amd64/include/xen/hypercall.h:
- Introduce a function to make generic hypercalls into Xen.
xen/interface/xen.h:
xen/interface/memory.h:
- Import the new hypercall XENMEM_add_to_physmap_range used by
auto-translated guests to map memory from foreign domains.
dev/xen/privcmd/privcmd.c:
- This device has the following functions:
- Allow user-space applications to make hypercalls into Xen.
- Allow user-space applications to map memory from foreign domains,
this is accomplished using the newly introduced hypercall
(XENMEM_add_to_physmap_range).
xen/privcmd.h:
- Public ioctl interface for the privcmd device.
x86/xen/hvm.c:
- Remove declaration of hypercall_page, now it's declared in
hypercall.h.
conf/files:
- Add the privcmd device to the build process.
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
misconfiguration VM-exit.
An EPT misconfiguration is triggered when the processor encounters a PTE
that is writable but not readable (WR=10). On processors that require A/D
bit emulation PG_M and PG_A map to EPT_PG_WRITE and EPT_PG_READ respectively.
If the PTE is updated as in the following code snippet:
*pte |= PG_M;
*pte |= PG_A;
then it is possible for another processor to observe the PTE after the PG_M
(aka EPT_PG_WRITE) bit is set but before PG_A (aka EPT_PG_READ) bit is set.
This will trigger an EPT misconfiguration VM-exit on the other processor.
Reported by: rodrigc
Reviewed by: grehan
MFC after: 3 days
Add support for AMD's nested page tables in pmap.c:
- Provide the correct bit mask for various bit fields in a PTE (e.g. valid bit)
for a pmap of type PT_RVI.
- Add a function 'pmap_type_guest(pmap)' that returns TRUE if the pmap is of
type PT_EPT or PT_RVI.
Add CPU_SET_ATOMIC_ACQ(num, cpuset):
This is used when activating a vcpu in the nested pmap. Using the 'acquire'
variant guarantees that the load of the 'pm_eptgen' will happen only after
the vcpu is activated in 'pm_active'.
Add defines for various AMD-specific MSRs.
Submitted by: Anish Gupta (akgupt3@gmail.com)
bhyve doesn't emulate the MSRs needed to support this feature at this time.
Don't expose any model-specific RAS and performance monitoring features in
cpuid leaf 80000007H.
Emulate a few more MSRs for AMD: TSEG base address, TSEG address mask and
BIOS signature and P-state related MSRs.
This eliminates all the unimplemented MSRs accessed by Linux/x86_64 kernels
2.6.32, 3.10.0 and 3.17.0.
Rename vmx_assym.s to vmx_assym.h to reflect that file's actual use
and update vmx_support.S's include to match. Add vmx_assym.h to the
SRCS to that it gets properly added to the dependency list. Add
vmx_support.S to SRCS as well, so it gets built and needs fewer
special-case goo. Remove now-redundant special-case goo. Finally,
vmx_genassym.o doesn't need to depend on a hand expanded ${_ILINKS}
explicitly, that's all taken care of by beforedepend.
With these items fixed, we no longer build vmm.ko every single time
through the modules on a KERNFAST build.
Sponsored by: Netflix
emulating a large number of MSRs.
Ignore writes to a couple more AMD-specific MSRs and return 0 on read.
This further reduces the unimplemented MSRs accessed by a Linux guest on boot.
CPUID.80000001H:ECX.
Handle accesses to PerfCtrX and PerfEvtSelX MSRs by ignoring writes and
returning 0 on reads.
This further reduces the number of unimplemented MSRs hit by a Linux guest
during boot.
Initialize CPUID.80000008H:ECX[7:0] with the number of logical processors in
the package. This fixes a panic during early boot in NetBSD 7.0 BETA.
Clear the Topology Extension feature bit from CPUID.80000001H:ECX since we
don't emulate leaves 0x8000001D and 0x8000001E. This fixes a divide by zero
panic in early boot in Centos 6.4.
Tested on an "AMD Opteron 6320" courtesy of Ben Perrault.
Reviewed by: grehan
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
options to display some key VMCB fields.
The set of valid options that can be passed to bhyvectl now depends on the
processor type. AMD-specific options are identified by a "--vmcb" or "--avic"
in the option name. Intel-specific options are identified by a "--vmcs" in
the option name.
Submitted by: Anish Gupta (akgupt3@gmail.com)
forced invalidation of the cache range regardless of the presence of
self-snoop feature. Some recent Intel GPUs in some modes are not
coherent, and dirty lines in CPU cache must be flushed before the
pages are transferred to GPU domain.
Reviewed by: alc (previous version)
Tested by: pho (amd64)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting
CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to
be present anyways.
Discussed with: grehan
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.
All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.
Discussed with: grehan
This patch adds support for MSI interrupts when running on Xen. Apart
from adding the Xen related code needed in order to register MSI
interrupts this patch also makes the msi_init function a hook in
init_ops, so different MSI implementations can have different
initialization functions.
Sponsored by: Citrix Systems R&D
xen/interface/physdev.h:
- Add the MAP_PIRQ_TYPE_MULTI_MSI to map multi-vector MSI to the Xen
public interface.
x86/include/init.h:
- Add a hook for setting custom msi_init methods.
amd64/amd64/machdep.c:
i386/i386/machdep.c:
- Set the default msi_init hook to point to the native MSI
initialization method.
x86/xen/pv.c:
- Set the Xen MSI init hook when running as a Xen guest.
x86/x86/local_apic.c:
- Call the msi_init hook instead of directly calling msi_init.
xen/xen_intr.h:
x86/xen/xen_intr.c:
- Introduce support for registering/releasing MSI interrupts with
Xen.
- The MSI interrupts will use the same PIC as the IO APIC interrupts.
xen/xen_msi.h:
x86/xen/xen_msi.c:
- Introduce a Xen MSI implementation.
x86/xen/xen_nexus.c:
- Overwrite the default MSI hooks in the Xen Nexus to use the Xen MSI
implementation.
x86/xen/xen_pci.c:
- Introduce a Xen specific PCI bus that inherits from the ACPI PCI
bus and overwrites the native MSI methods.
- This is needed because when running under Xen the MSI messages used
to configure MSI interrupts on PCI devices are written by Xen
itself.
dev/acpica/acpi_pci.c:
- Lower the quality of the ACPI PCI bus so the newly introduced Xen
PCI bus can take over when needed.
conf/files.i386:
conf/files.amd64:
- Add the newly created files to the build process.
- Host registers are now stored on the stack instead of a per-cpu host context.
- Host %FS and %GS selectors are not saved and restored across VMRUN.
- Restoring the %FS/%GS selectors was futile anyways since that only updates
the low 32 bits of base address in the hidden descriptor state.
- GS.base is properly updated via the MSR_GSBASE on return from svm_launch().
- FS.base is not used while inside the kernel so it can be safely ignored.
- Add function prologue/epilogue so svm_launch() can be traced with Dtrace's
FBT entry/exit probes. They also serve to save/restore the host %rbp across
VMRUN.
Reviewed by: grehan
Discussed with: Anish Gupta (akgupt3@gmail.com)
As of git submit e179f6914152eca9, the Linux kernel does a simple
probe of the PIC by writing a pattern to the IMR and then reading it
back, prior to the init sequence of ICW words.
The bhyve PIC emulation wasn't allowing the IMR to be read until
the ICW sequence was complete. This limitation isn't required so
relax the test.
With this change, Linux kernels 3.15-rc2 and later won't hang
on boot when calibrating the local APIC.
Reviewed by: tychon
MFC after: 3 days
When the FreeBSD kernel is loaded from Xen the symtab and strtab are
not loaded the same way as the native boot loader. This patch adds
three new global variables to ddb that can be used to specify the
exact position and size of those tables, so they can be directly used
as parameters to db_add_symbol_table. A new helper is introduced, so callers
that used to set ksym_start and ksym_end can use this helper to set the new
variables.
It also adds support for loading them from the Xen PVH port, that was
previously missing those tables.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
ddb/db_main.c:
- Add three new global variables: ksymtab, kstrtab, ksymtab_size that
can be used to specify the position and size of the symtab and
strtab.
- Use those new variables in db_init in order to call db_add_symbol_table.
- Move the logic in db_init to db_fetch_symtab in order to set ksymtab,
kstrtab, ksymtab_size from ksym_start and ksym_end.
ddb/ddb.h:
- Add prototype for db_fetch_ksymtab.
- Declate the extern variables ksymtab, kstrtab and ksymtab_size.
x86/xen/pv.c:
- Add support for finding the symtab and strtab when booted as a Xen
PVH guest. Since Xen loads the symtab and strtab as NetBSD expects
to find them we have to adapt and use the same method.
amd64/amd64/machdep.c:
arm/arm/machdep.c:
i386/i386/machdep.c:
mips/mips/machdep.c:
pc98/pc98/machdep.c:
powerpc/aim/machdep.c:
powerpc/booke/machdep.c:
sparc64/sparc64/machdep.c:
- Use the newly introduced db_fetch_ksymtab in order to set ksymtab,
kstrtab and ksymtab_size.
For now restrict it to amd64. Other architectures might be
re-added later once tested.
Remove the drivers from the global NOTES and files files and move
them to the amd64 specifics.
Remove the drivers from the i386 modules build and only leave the
amd64 version.
Rather than depending on "inet" depend on "pci" and make sure that
ixl(4) and ixlv(4) can be compiled independently [2]. This also
allows the drivers to build properly on IPv4-only or IPv6-only
kernels.
PR: 193824 [2]
Reviewed by: eric.joyner intel.com
MFC after: 3 days
References:
[1] http://lists.freebsd.org/pipermail/svn-src-all/2014-August/090470.html
- CR2
- CR0, CR3, CR4 and EFER
- GDT/IDT base/limit fields
- CS/DS/ES/SS selector/base/limit/attrib fields
The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'.
Restructure the code such that the fields above are only modified in a single
place. This makes it easy to invalidate the VMCB cache when any of these fields
is modified.
behavior was changed in r271888 so update the comment block to reflect this.
MSR_KGSBASE is accessible from the guest without triggering a VM-exit. The
permission bitmap for MSR_KGSBASE is modified by vmx_msr_guest_init() so get
rid of redundant code in vmx_vminit().
code. There are only a handful of MSRs common between the two so there isn't
too much duplicate functionality.
The VT-x code has the following types of MSRs:
- MSRs that are unconditionally saved/restored on every guest/host context
switch (e.g., MSR_GSBASE).
- MSRs that are restored to guest values on entry to vmx_run() and saved
before returning. This is an optimization for MSRs that are not used in
host kernel context (e.g., MSR_KGSBASE).
- MSRs that are emulated and every access by the guest causes a trap into
the hypervisor (e.g., MSR_IA32_MISC_ENABLE).
Reviewed by: grehan
- Note the quirk with the interrupt enabled state of the dna handler.
- Use just panic() instead of printf() and panic(). Print tid instead
of pid, the fpu state is per-thread.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
for amd64/linux32. Fix the entirely bogus (untested) version from
r161310 for i386/linux using the same shared code in compat/linux.
It is unclear to me if we could support more clock mappings but
the current set allows me to successfully run commercial
32bit linux software under linuxolator on amd64.
Reviewed by: jhb
Differential Revision: D784
MFC after: 3 days
Sponsored by: DARPA, AFRL
Keep track of NMI blocking by enabling the IRET intercept on a successful
vNMI injection. The NMI blocking condition is cleared when the handler
executes an IRET and traps back into the hypervisor.
Don't inject NMI if the processor is in an interrupt shadow to preserve the
atomic nature of "STI;HLT". Take advantage of this and artificially set the
interrupt shadow to prevent NMI injection when restarting the "iret".
Reviewed by: Anish Gupta (akgupt3@gmail.com), grehan
Get rid of unused 'svm_feature' from the softc.
Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check
in vmm.c against 'vm->active_cpus' before the AMD-specific code is called.
Submitted by: Anish Gupta (akgupt3@gmail.com)
processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector
and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes
care of injecting this vector when the guest is able to receive it.
Legacy PIC interrupts are still delivered via the event injection mechanism.
This is because the vector injected by the PIC must reflect the state of its
pins at the time the CPU is ready to accept the interrupt.
Accesses to the TPR via %CR8 are handled entirely in hardware. This requires
that the emulated TPR must be synced to V_TPR after a #VMEXIT.
The guest can also modify the TPR via the memory mapped APIC. This requires
that the V_TPR must be synced with the emulated TPR before a VMRUN.
Reviewed by: Anish Gupta (akgupt3@gmail.com)
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.
Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.
Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.
Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan
- Don't enable the HLT intercept by default. It will be enabled by bhyve(8)
if required. Prior to this change HLT exiting was always enabled making
the "-H" option to bhyve(8) meaningless.
- Recognize a VM exit triggered by a non-maskable interrupt. Prior to this
change the exit would be punted to userspace and the virtual machine would
terminate.
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.
vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.
The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.
The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Discussed with: grehan
by explicitly moving it out of the interrupt shadow. The hypervisor is done
"executing" the HLT and by definition this moves the vcpu out of the
1-instruction interrupt shadow.
Prior to this change the interrupt would be held pending because the VMCS
guest-interruptibility-state would indicate that "blocking by STI" was in
effect. This resulted in an unnecessary round trip into the guest before
the pending interrupt could be injected.
Reviewed by: grehan
window exiting. This simply involves setting V_IRQ and enabling the VINTR
intercept. This instructs the CPU to trap back into the hypervisor as soon
as an interrupt can be injected into the guest. The pending interrupt is
then injected via the traditional event injection mechanism.
Rework vcpu interrupt injection so that Linux guests now idle with host cpu
utilization close to 0%.
Reviewed by: Anish Gupta (earlier version)
Discussed with: grehan
AP startup and AP resume (it was already used for BSP startup and BSP
resume).
- Split code to do one-time probing of cache properties out of
initializecpu() and into initializecpucache(). This is called once on
the BSP during boot.
- Move enable_sse() into initializecpu().
- Call initializecpu() for AP startup instead of enable_sse() and
manually frobbing MSR_EFER to enable PG_NX.
- Call initializecpu() when an AP resumes. In theory this will now
properly re-enable PG_NX in MSR_EFER when resuming a PAE kernel on
APs.
Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete
VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated
when intercepts are modified.
Each intercept is identified as a (index,bitmask) tuple. For e.g., the
VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR).
The first 20 bytes in control area that are used to enable intercepts
are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'.
Modify svm_setcap() and svm_getcap() to use the new APIs.
Discussed with: Anish Gupta (akgupt3@gmail.com)
Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. The meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.
With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.
Discussed with: grehan
Reviewed by: Anish Gupta (akgupt3@gmail.com)
resume that is a superset of a pcb. Move the FPU state out of the pcb and
into this new structure. As part of this, move the FPU resume code on
amd64 into a C function. This allows resumectx() to still operate only on
a pcb and more closely mirrors the i386 code.
Reviewed by: kib (earlier version)
The legacy USB circuit tends to give trouble on MacBook.
While the original report covered MacBook, extend the fix
preemptively for the newer MacBookPro too.
PR: 191693
Reviewed by: emaste
MFC after: 5 days
function restore_host_tss().
Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.
Discussed with: Anish Gupta (akgupt3@gmail.com)
<machine/md_var.h>.
- Move some CPU-related variables out of i386/i386/identcpu.c to
initcpu.c to match amd64.
- Move the declaration of has_f00f_hack out of identcpu.c to machdep.c.
- Remove a misleading comment from i386/i386/initcpu.c (locore zeros
the BSS before it calls identify_cpu()) and remove explicit zero
assignments to reduce the diff with amd64.
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.
While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.
Update the status bits in %rflags when emulating AND and OR opcodes.
Reviewed by: grehan
optional attributes field.
- Add a 'machdep.smap' sysctl that exports the SMAP table of the running
system as an array of the ACPI 3.0 structure. (On older systems, the
attributes are given a value of zero.) Note that the sysctl only
exports the SMAP table if it is available via the metadata passed from
the loader to the kernel. If an SMAP is not available, an empty array
is returned.
- Add a format handler for the ACPI 3.0 SMAP structure to the sysctl(8)
binary to format the SMAP structures in a readable format similar to
the format found in boot messages.
MFC after: 2 weeks
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.
Change return type of 'vmcb_eventinject()' to a void and convert all error
returns into KASSERTs.
Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.
Reviewed by: Anish Gupta (akgupt3@gmail.com)
tunables to modify the default cpu topology advertised by bhyve.
Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID
leaf 0xb. This is intended for testing guest behavior when it falls back
on using CPUID leaf 0x4 to deduce CPU topology.
The default behavior is to advertise each vcpu as a core in a separate soket.
the vcpu had no caches at all. This causes problems when executing applications
in the guest compiled with the Intel compiler.
Submitted by: Mark Hill (mark.hill@tidalscale.com)
Eventually, the vmd_segs of the struct vm_domain should become bitset
instead of long, to allow arbitrary compile-time selected maximum.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
starting with a superpage demotion by pmap_enter() that could result in
a PV list lock being held when pmap_enter() is just about to return
KERN_RESOURCE_SHORTAGE. Consequently, the KASSERT that no PV list locks
are held needs to be replaced with a conditional unlock.
Discussed with: kib
X-MFC with: r269728
Sponsored by: EMC / Isilon Storage Division
mapping size (currently unused). The flags includes the fault access
bits, wired flag as PMAP_ENTER_WIRED, and a new flag
PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep.
For powerpc aim both 32 and 64 bit, fix implementation to ensure that
the requested mapping is created when PMAP_ENTER_NOSLEEP is not
specified, in particular, wait for the available memory required to
proceed.
In collaboration with: alc
Tested by: nwhitehorn (ppc aim32 and booke)
Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division
MFC after: 2 weeks
Add the ACPI MCFG table to advertise the extended config memory window.
Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.
Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.
CR: https://phabric.freebsd.org/D505
Reviewed by: jhb, grehan
Requested by: tychon
The MD allocators were very common, however there were some minor
differencies. These differencies were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.
As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.
Tested by: glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
We continue to use pmap_enter() for that. For unwiring virtual pages, we
now use pmap_unwire(), which unwires a range of virtual addresses instead
of a single virtual page.
Sponsored by: EMC / Isilon Storage Division
features. If bootverbose is enabled, a detailed list is provided;
otherwise, a single-line summary is displayed.
- Add read-only sysctls for optional VT-x capabilities used by bhyve
under a new hw.vmm.vmx.cap node. Move a few exiting sysctls that
indicate the presence of optional capabilities under this node.
CR: https://phabric.freebsd.org/D498
Reviewed by: grehan, neel
MFC after: 1 week
forever in vm_handle_hlt().
This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.
Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.
Reported by: Leon Dang
Sponsored by: Nahanni Systems
It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The
default operand size for PUSH is 64-bits and the operand size override prefix
changes that to 16-bits.
vm_copy_setup() can return '1' if it encounters a fault when walking the
guest page tables. This is a guest issue and is now handled properly by
resuming the guest to handle the fault.
involves updating the corresponding page tables followed by accesses to the
pages in question. This sequence is subject to the situation exactly described
in the "AMD64 Architecture Programmer's Manual Volume 2: System Programming"
rev. 3.23, "7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore,
issuing the INVLPG right after modifying the PTE bits is crucial (see also
r269050).
For the amd64 PMAP code, the order of instructions was already correct. The
above fact still is worth documenting, though.
1: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
Reviewed by: alc
Sponsored by: Bally Wulff Games & Entertainment GmbH
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.
A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
Setting PSE together with PAE or in long mode just makes the PSE bit
completely ignored, so don't set it.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.
vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.
vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
instruction emulation [1].
Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.
Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.
Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].
Tested by: Leon Dang [1]
Reported by: Peter Grehan [2]
Sponsored by: Nahanni Systems
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.
This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.
Submitted by: Anton Rang <anton.rang@isilon.com> (original version)
Phabric: D95
ptrace_set_pc() use the correct return to userspace using iret.
The signal return, PT_CONTINUE (which in fact uses signal return path)
set the pcb flag already. The setcontext(2) enforces iret return when
%rip is incorrect. Due to this, the change is redundand, but is made
to ensure that no path which modifies context, forgets to set
PCB_FULL_IRET.
Inspired by: CVE-2014-4699
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
several reasons for this change:
pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page. We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page. The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap. However, non-wired
mappings may be reclaimed by the pmap at any time. (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.
pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page. Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time. This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.
Implementations for arm and powerpc will follow.
Discussed with: jeff, marcel
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism
for the startup routine to set the preferred console. This change is
ugly because console init happens very early in the boot, making a
cleaner interface difficult. This change is intended only to facilitate
the sc(4) / vt(4) transition, and can be reverted once vt(4) is the
default.
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
This is different than the amount shown for the process e.g. by
/usr/bin/top - that is the mappings faulted in by the mmap'd region
of guest memory.
The values can be fetched with bhyvectl
# bhyvectl --get-stats --vm=myvm
...
Resident memory 413749248
Wired memory 0
...
vmm_stat.[ch] -
Modify the counter code in bhyve to allow direct setting of a counter
as opposed to incrementing, and providing a callback to fetch a
counter's value.
Reviewed by: neel
context into memory for the kernel threads which called
fpu_kern_thread(9). This allows the fpu_kern_enter() callers to not
check for is_fpu_kern_thread() to get the optimization.
Apply the flag to padlock(4) and aesni(4). In aesni_cipher_process(),
do not leak FPU context state on error.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
This is needed for Xen PV(H) guests, since there's no hardware lapic
available on this kind of domains. This commit should not change
functionality.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Approved by: gibbs
amd64/include/cpu.h:
amd64/amd64/mp_machdep.c:
i386/include/cpu.h:
i386/i386/mp_machdep.c:
- Remove lapic_ipi_vectored hook from cpu_ops, since it's now
implemented in the lapic hooks.
amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
- Use lapic_ipi_vectored directly, since it's now an inline function
that will call the appropiate hook.
x86/x86/local_apic.c:
- Prefix bare metal public lapic functions with native_ and mark them
as static.
- Define default implementation of apic_ops.
x86/include/apicvar.h:
- Declare the apic_ops structure and create inline functions to
access the hooks, so the change is transparent to existing users of
the lapic_ functions.
x86/xen/hvm.c:
- Switch to use the new apic_ops.
is sampled "atomically". Any interrupts after this point will be held pending
by the CPU until the guest starts executing and will immediately trigger a
#VMEXIT.
Reviewed by: Anish Gupta (akgupt3@gmail.com)
injected into the guest. This allows the hypervisor to inject another
ExtINT or APIC vector as soon as the guest is able to process interrupts.
This change is not to address any correctness issue but to guarantee that
any pending APIC vector that was preempted by the ExtINT will be injected
as soon as possible. Prior to this change such pending interrupts could be
delayed until the next VM exit.
Handle ExtINT injection for SVM. The HPET emulation
will inject a legacy interrupt at startup, and if this
isn't handled, will result in the HLT-exit code assuming
there are outstanding ExtINTs and return without sleeping.
svm_inj_interrupts() needs more changes to bring it up
to date with the VT-x version: these are forthcoming.
Reviewed by: neel
Linux guests accept the values in this register, while *BSD
guests reprogram it. Default values of zero correspond to
PAT_UNCACHEABLE, resulting in glacial performance.
Thanks to Willem Jan Withagen for first reporting this and
helping out with the investigation.
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.
Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.
On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping. For
example, both kinds of mappings entail the creation of a single PTE and PV
entry. With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages. Now, it will
create up to 96 base page or superpage mappings.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Remove CR2 save/restore - the guest restore/save is done
in hardware, and there is no need to save/restore the host
version (same as VT-x).
Submitted by: neel (SVM segment descriptor 'P' bit code)
Reviewed by: neel
- use the new virtual APIC page
- update to current bhyve APIs
Tested by Anish with multiple FreeBSD SMP VMs on a Phenom,
and verified by myself with light FreeBSD VM testing
on a Sempron 3850 APU.
The issues reported with Linux guests are very likely to still
be here, but this sync eliminates the skew between the
project branch and CURRENT, and should help to determine
the causes.
Some follow-on commits will fix minor cosmetic issues.
Submitted by: Anish Gupta (akgupt3@gmail.com)
it implicitly in vmm.ko.
Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.
This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.
correctly prepare KGSBASE msr to restore the user descriptor base on
the last swapgs during return to usermode.
Reported and tested by: peterj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
guest for which the rules regarding xsetbv emulation are known. In
particular future extensions like AVX-512 have interdependencies among
feature bits that could allow a guest to trigger a GP# in the host with
the current approach of allowing anything the host supports.
- Add proper checking of Intel MPX and AVX-512 XSAVE features in the
xsetbv emulation and allow these features to be exposed to the guest if
they are enabled in the host.
- Expose a subset of known-safe features from leaf 0 of the structured
extended features to guests if they are supported on the host including
RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM. Aside
from AVX-512, these features are all new instructions available for use
in ring 3 with no additional hypervisor changes needed.
Reviewed by: neel
API function 'vie_calculate_gla()'.
While the current implementation is simplistic it forms the basis of doing
segmentation checks if the guest is in 32-bit protected mode.
of the guest linear address space. These APIs in turn use a new ioctl
'VM_GLA2GPA' to convert the guest linear address to guest physical.
Use the new copyin/copyout APIs when emulating ins/outs instruction in
bhyve(8).
'struct vm_guest_paging'.
Check for canonical addressing in vmm_gla2gpa() and inject a protection
fault into the guest if a violation is detected.
If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to
point to the root of the page tables.
indicate the faulting linear address.
If the guest PML4 entry has the PG_PS bit set then inject a page fault into
the guest with the PGEX_RSV bit set in the error_code.
Get rid of redundant checks for the PG_RW violations when walking the page
tables.
the UART FIFO.
The emulation is constrained in a number of ways: 64-bit only, doesn't check
for all exception conditions, limited to i/o ports emulated in userspace.
Some of these constraints will be relaxed in followup commits.
Requested by: grehan
Reviewed by: tychon (partially and a much earlier version)
the proper ICWx initialization sequence. It assumes, probably correctly, that
the boot firmware has done the 8259 initialization.
Since grub-bhyve does not initialize the 8259 this write to the mask register
takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0).
Fix this by initializing 'error' at the start of the function.
These masks are documented in the Intel Architecture Instruction Set
Extensions Programming Reference (March 2014).
Reviewed by: kib
MFC after: 1 month