Give AHCI disk serial based on backing file path same as for virtio block.
It is still not good that they may intersect on different hosts, but that
is better then intersecting on the same host.
Rewrite virtio block device driver to work asynchronously and use the block
I/O interface.
Asynchronous operation, based on r280026 change, allows to not block virtual
CPU during I/O processing, that on slow/busy storage can take seconds.
Use of recently improved block I/O interface allows to process multiple
requests same time, that improves random I/O performance on wide storages.
Benchmarks of virtual disk, backed by ZVOL on RAID10 pool of 4 HDDs, show
~3.5 times random read performance improvements, while no degradation on
linear I/O. Guest CPU usage during test dropped from 100% to almost zero.
Modify virtqueue helpers added in r253440 to allow queuing.
Original virtqueue design allows queued and out-of-order processing, but
helpers added in r253440 suppose only direct blocking in-order one.
It could be fine for network, etc., but it is a huge limitation for storage
devices.
On parallel random I/O this allows better utilize wide storage pools.
To not confuse prefetcher on linear I/O, consecutive requests are executed
sequentially, following the same logic as was earlier implemented in CTL.
Benchmarks of virtual AHCI disk, backed by ZVOL on RAID10 pool of 4 HDDs,
show ~3.5 times random read performance improvements, while no degradation
on linear I/O.
It works only for virtual disks backed by ZVOLs and raw devices supporting
BIO_DELETE. Virtual disks backed by files won't report this capability.
Relnotes: yes
Add support for TOPOLOGY feature of virtio block device.
Passing through physical block size/offset from underlying storage allows
guest to manage proper data and I/O alignment to improve performance.
Move the ACPI PM timer emulation into vmm.ko.
MFC r273706
Change the type of the first argument to the I/O emulation handlers to
'struct vm *'.
MFC r273710
Add a comment explaining the intent behind the I/O reservation [0x72-0x77].
MFC r273744
Add foo_genassym.c files to DPSRCS so dependencies for them are generated.
This ensures these objects are rebuilt to generate an updated header of
assembly constants if needed.
MFC r274045
If the start bit, PxCMD.ST, is cleared and nothing is in-flight then
PxCI, PxSACT, PxCMD.CCS and PxCMD.CR should be 0.
MFC r274076
Improve the ability to cancel an in-flight request by using an interrupt,
via SIGCONT, to force the read or write system call to return prematurely.
MFC r274330
To allow a request to be submitted from within the callback routine of
a completing one increase the total by 1 but don't advertise it.
MFC r274931
Change the lower bound for guest vmspace allocation to 0 instead of using
the VM_MIN_ADDRESS constant.
MFC r275817
For level triggered interrupts clear the PIC IRR bit when the interrupt pin
is deasserted.
MFC r275850
Fix 8259 IRQ priority resolver.
MFC r275952
Various 8259 device model improvements.
MFC r275965
Emulate writes to the IA32_MISC_ENABLE MSR.
Add support AMD processors with the SVM/AMD-V hardware extensions.
MFC r273749
Remove bhyve SVM feature printf's now that they are available in the general
CPU feature detection code.
MFC r273766
Add missing 'break' pointed out by Coverity CID 1249760.
MFC r276098
Allow ktr(4) tracing of all guest exceptions via the tunable "hw.vmm.trace_guest_exceptions"
MFC r276392
Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on an
AMD/SVM host.
MFC r276402
Remove "svn:mergeinfo" property that was dragged along when these files were
svn copied in r273375.
Fix a recursive lock acquisition in vi_reset_dev().
MFC r270434
Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot find
any unmasked pin with an interrupt asserted.
MFC r270436
Fix a bug in the emulation of CPUID leaf 0x4.
MFC r270437
Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.
MFC r270855
Set the 'inst_length' to '0' early on before any error conditions are detected
in the emulation of the task switch. If any exceptions are triggered then the
guest %rip should point to instruction that caused the task switch as opposed
to the one after it.
MFC r270857
The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.
While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.
Update the status bits in %rflags when emulating AND and OR opcodes.
MFC r271439
Initialize 'bc_rdonly' to the right value.
MFC r271451
Optimize the common case of injecting an interrupt into a vcpu after a HLT
by explicitly moving it out of the interrupt shadow.
MFC r271888
Restructure the MSR handling so it is entirely handled by processor-specific
code.
MFC r271890
MSR_KGSBASE is no longer saved and restored from the guest MSR save area. This
behavior was changed in r271888 so update the comment block to reflect this.
MFC r271891
Add some more KTR events to help debugging.
MFC r272197
mmap(2) requires either MAP_PRIVATE or MAP_SHARED for non-anonymous mappings.
MFC r272395
Get rid of code that dealt with the hardware not being able to save/restore
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.
All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.
MFC r272670
Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.
MFC r272710
Implement the FLUSH operation in the virtio-block emulation.
MFC r272838
iasl(8) expects integer fields in data tables to be specified as hexadecimal
values. Therefore the bit width of the "PM Timer Block" was actually being
interpreted as 50-bits instead of the expected 32-bit.
This eliminates an error message emitted by a Linux 3.17 guest during boot:
"Invalid length for FADT/PmTimerBlock: 50, using default 32"
MFC r272839
Support Intel-specific MSRs that are accessed when booting up a linux in bhyve:
- MSR_PLATFORM_INFO
- MSR_TURBO_RATIO_LIMITx
- MSR_RAPL_POWER_UNIT
MFC r273108
Emulate "POP r/m". This is needed to boot OpenBSD/i386 MP kernel in bhyve.
MFC r273212
Support stopping and restarting the AHCI command list via toggling PxCMD.ST
from '1' to '0' and back. This allows the driver a chance to recover if
for instance a timeout occurred due to activity on the host.
Correct display of bhyve SMBIOS UUIDs with dmidecode by bumping the version.
The mixed little/big-endianness of SMBIOS UUIDs was clarified in v2.6
of the SMBIOS spec. dmidecode uses the reported version of SMBIOS to
determine the layout and what to byte-swap.
bhyve's SMBIOS reported as 2.4 though it implemented the 2.6-style of
memory layout. This resulted in dmidecode reporting a different
UUID than one passed in via the -U option.
Fix by exporting a version of 2.6.
Approved by: re (gjb)
Re-tested with NetBSD/amd64 5.2.2, 6.1.4 and 7-beta.
r271299:
Add a callback to be notified about negotiated features.
r271338:
Allow vtnet operation without merged rx buffers.
NetBSD's virtio-net implementation doesn't negotiate
the merged rx-buffers feature. To support this, check
to see if the feature was negotiated, and then adjust
the operation of the receive path accordingly by using
a larger iovec, and a smaller rx header.
In addition, ignore writes to the (read-only) status byte.
Approved by: re (glebius)
Obtained from: Vincenzo Maffione, Universita` di Pisa (r271299)
r268427, r268428, r268521, r268638, r268639, r268701, r268777,
r268889, r268922, r269008, r269042, r269043, r269080, r269094,
r269108, r269109, r269281, r269317, r269700, r269896, r269962,
r269989.
Catch bhyve up to CURRENT.
Lightly tested with FreeBSD i386/amd64, Linux i386/amd64, and
OpenBSD/amd64. Still resolving an issue with OpenBSD/i386.
Many thanks to jhb@ for all the hard work on the prior MFCs !
r267921 - support the "mov r/m8, imm8" instruction
r267934 - document options
r267949 - set DMI vers/date to fixed values
r267959 - doc: sort cmd flags
r267966 - EPT misconf post-mortem info
r268202 - use correct flag for event index
r268276 - 64-bit virtio capability api
r268427 - invalidate guest TLB when cr3 is updated, needed for TSS
r268428 - identify vcpu's operating mode
r268521 - use correct offset in guest logical-to-linear translation
r268638 - chs value
r268639 - chs fake values
r268701 - instr emul operand/address size override prefix support
r268777 - emulation for legacy x86 task switching
r268889 - nested exception support
r268922 - fix INVARIANTS build
r269008 - emulate instructions found in the OpenBSD/i386 5.5 kernel
r269042 - fix fault injection
r269043 - Reduce VMEXIT_RESTARTs in task_switch.c
r269080 - fix issues in PUSH emulation
r269094 - simplify return values from the inout handlers
r269108 - don't return -1 from the push emulation handler
r269109 - avoid permanent sleep in vm_handle_hlt()
r269281 - list VT-x features in base kernel dmesg
r269317 - Mark AHCI fatal errors as not completed
r269700 - Support PCI extended config space in bhyve
r269896 - Minor cleanup
r269962 - use max guest memory when creating IOMMU domain
r269989 - fix interrupt mode names
Turn on interrupt window exiting unconditionally when an ExtINT is being
injected into the guest.
Add helper functions to populate VM exit information for rendezvous and
astpending exits.
Provide APIs to directly get 'lowmem' and 'highmem' size directly.
Expose the amount of resident and wired memory from the guest's vmspace
266708,266724,266934,266935,268521:
Emulation of the "ins" and "outs" instructions.
Various fixes for translating guest linear addresses to guest physical
addresses.
265941,265951,266390,266550,266910:
Various bhyve fixes:
- Don't save host's return address in 'struct vmxctx'.
- Permit non-32-bit accesses to local APIC registers.
- Factor out common ioport handler code.
- Use calloc() in favor of malloc + memset.
- Change the vlapic timer frequency to be in the ballpark of contemporary
hardware.
- Allow the guest to read the TSC via MSR 0x10.
- A VMCS is always inactive when it exits the vmx_run() loop. Remove
redundant code and the misleading comment that suggest otherwise.
- Ignore writes to microcode update MSR. This MSR is accessed by RHEL7
guest.
Add KTR tracepoints to annotate wrmsr and rdmsr VM exits.
- Provide an alias for the userboot console and name it 'comconsole'.
- Use EV_ADD to create an mevent and EV_ENABLE to enable it.
- abort(3) the process in response to a VMEXIT_ABORT.
- Don't include the guest memory segments in the bhyve(8) process core dump.
- Make the vmx asm code dtrace-fbt-friendly.
- Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in
the VCPU_RUNNING state.
- Enable VMX in the IA32_FEATURE_CONTROL MSR if it not enabled and the MSR
isn't locked.
Add an ioctl to suspend a virtual machine (VM_SUSPEND).
Add logic in the HLT exit handler to detect if the guest has put all vcpus
to sleep permanently by executing a HLT with interrupts disabled.
When this condition is detected the guest with be suspended with a reason of
VM_SUSPEND_HALT and the bhyve(8) process will exit.
This logic can be disabled via the tunable 'hw.vmm.halt_detection'.
- Add a very simple virtio_random(4) driver for FreeBSD guests to harvest
entropy from hypervisors.
- Add support to bhyve for the virtio RNG entropy-source device to provide
entry to bhyve guests.
Fixes for vcpu management in bhyve:
- Use 'cpuset_t' to represent the vcpus active in a virtual machine.
- Modify the "-p" option to be more flexible when associating a 'vcpu' with
a 'hostcpu'.
Various uart fixes:
- Open the uart emulation's backing tty in non-blocking mode.
- Support 16-bit register access.
- Disable the 'uart_drain()' callback when the emulated receive FIFO
is full.
Various PCI fixes:
- Allow PCI devices to be configured on all valid bus numbers from 0 to 255.
- Tweak the handling of PCI capabilities in emulated devices to remove
the non-standard zero capability list terminator.
- Add a check to validate that memory BARs of passthru devices are 4KB
aligned.
- Respect and track the enable bit in the PCI configuration address word.
- Handle quad-word access to 32-bit register pairs.
Handle single-byte reads from the bvmcons port (0x220) by returning
0xff. Some guests may attempt to read from this port to identify
psuedo-PNP ISA devices. (The ie(4) driver in FreeBSD/i386 is one
example.)
Various x2APIC fixes and enhancements:
- Use spinlocks for the vioapic.
- Handle the SELF_IPI MSR.
- Simplify the APIC mode switching between MMIO and x2APIC. The guest is
no longer allowed to switch modes at runtime. Instead, the desired mode
is set when the virtual machine is created.
- Disallow MMIO access in x2APIC mode and MSR access in xAPIC mode.
- Add support for x2APIC virtualization assist in Intel VT-x.
Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and
XSAVE-enabled features like AVX.
- Store a per-cpu guest xcr0 register and handle xsetbv VM exits by emulating
the instruction.
- Only expose XSAVE to guests if XSAVE is enabled in the host. Only expose
a subset of XSAVE features currently supported by the guest and for which
the proper emulation of xsetbv is known. Currently this includes X87, SSE,
AVX, AVX-512, and Intel MPX.
- Add support for injecting hardware exceptions into the guest and use this
to trigger exceptions in the guest for invalid xsetbv operations instead
of potentially faulting in the host.
- Queue pending exceptions in the 'struct vcpu' instead of directly updating
the processor-specific VMCS or VMCB. The pending exception will be delivered
right before entering the guest.
- Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict
it to only deliver x86 hardware exceptions. This new ioctl is now used to
inject a protection fault when the guest accesses an unimplemented MSR.
- Expose a subset of known-safe features from leaf 0 of the structured
extended features to guests if they are supported on the host including
RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM. Aside
from AVX-512, these features are all new instructions available for use
in ring 3 with no additional hypervisor changes needed.
Expand the support for PCI INTx interrupts including providing interrupt
routing information for INTx interrupts to I/O APIC pins and enabling
INTx interrupts in the virtio and AHCI backends.