MFC r284918:
Add helper fill_based_sd(9).
MFC r284919:
Add x86 PT_GETFSBASE, PT_GETGSBASE machine-depended ptrace requests to
obtain the thread %fs and %gs bases. Add x86 PT_SETFSBASE and
PT_SETGSBASE requests to set the bases from debuggers. The set
requests, similarly to the sysarch({I386,AMD64}_SET_FSBASE), override
the corresponding segment registers.
MFC r284965:
Document x86 machine-specific ptrace(2) requests.
MFC r285011:
Disallow a debugger on 64bit system to set fs/gs bases of the 32bit
process beyond the end of the process address space.
MFC r285104:
Grammar and language fixes.
Set the initial system time to a sane (as in: not end of 21st century)
value when booting on a PC with CMOS clock set to a year before 2000.
This uses 1980 (instead of 1970 as in the initial patch) as pivot year as
suggested by imp in the PR followup.
PR: 195703
Submitted by: cs@soi.spb.ru
Reviewed by: imp
Approved by: re (gjb)
Emulate the 'bit test' instruction.
MFC r282259:
Re-implement RTC current time calculation to eliminate the possibility of
losing time.
MFC r282281:
Advertise the MTRR feature via CPUID and emulate the minimal set of MTRR MSRs.
MFC r282284:
When an instruction cannot be decoded just return to userspace so bhyve(8)
can dump the instruction bytes.
MFC r282287:
Don't require <sys/cpuset.h> to be always included before <machine/vmm.h>.
MFC r282296:
Emulate MSR_SYSCFG which is accessed by Linux on AMD cpus when MTRRs are
enabled.
MFC r282301:
Relax limits when transitioning a vector from the IRR to the ISR and also
when extinguishing it from the ISR in response to an EOI.
MFC r282335:
Advertise an additional memory BAR in the "dummy" device emulation.
MFC r282336:
Emulate machine check related MSRs to allow guest OSes like Windows to boot.
MFC r282351:
Don't advertise the Intel SMX capability to the guest.
MFC r282407:
Emulate the 'CMP r/m8, imm8' instruction.
MFC r282519:
Add macros for AMD-specific bits in MSR_EFER: LMSLE, FFXSR and TCE.
MFC r282520:
Emulate guest writes to EFER_MSR properly.
MFC r282558:
Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup().
MFC r282571:
Check 'td_owepreempt' and yield the vcpu thread if it is set.
MFC r282595:
Allow byte reads of AHCI registers.
MFC r282784:
Handling indirect descriptors is a capability of the host and not one that
needs to be negotiated. Use the host capabilities field and not the negotiated
field when verifying that indirect descriptors are supported.
MFC r282788:
Allow configuration of the sector size advertised to the guest.
MFC r282865:
Set the subvendor field in config space to the vendor ID. This is required
by the Windows virtio drivers to correctly match a device.
MFC r282922:
Bump the size of the blockif scatter-gather list to 67.
MFC r283075:
Fix off-by-one in array index bounds check. bhyveload would allow you to
create 33 entries on an array that only has 32 slots
MFC r283168:
Temporarily revert r282922 which bumped the max descriptors.
MFC r283255:
Emulate the "CMP r/m, reg" instruction (opcode 39H).
MFC r283256:
Add an option "--get-vmcs-exit-inst-length" to display the instruction length
of the instruction that caused the VM-exit.
MFC r283264:
Change the header type of the emulated host-bridge from type 1 to type 0.
MFC r283293:
Don't rely on the 'VM-exit instruction length' field in the VMCS to always
have an accurate length on an EPT violation.
MFC r283299:
Remove bogus verification of instruction length after instruction decode.
MFC r283308:
Exceptions don't deliver an error code in real mode.
MFC r283657:
Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state.
MFC r283973:
Use tunable 'hw.vmm.svm.features' to disable specific SVM features even
though they might be available in hardware. Use tunable 'hw.vmm.svm.num_asids'
to limit the number of ASIDs used by the hypervisor.
MFC r284046:
Fix regression in 'verify_gla()' with the RIP-relative addressing mode.
MFC r284174:
Support guest writes to the TSC by enabling the "use TSC offsetting"
execution control.
Move the 32-bit compatible procfs types from freebsd32.h to <sys/procfs.h>
and export them to userland.
- Define __HAVE_REG32 on platforms that define a reg32 structure and check
for this in <sys/procfs.h> to control when to export prstatus32, etc.
- Add prstatus32_t and prpsinfo32_t typedefs for the 32-bit structures.
libbfd looks for these types, and having them fixes 'gcore' in gdb of a
32-bit process on a 64-bit platform.
- Use the structure definitions from <sys/procfs.h> in gcore's elf32 core
dump code instead of duplicating the definitions.
The add_bounce_page() function can be called when loading physical
pages which pass a NULL virtual address. If the BUS_DMA_KEEP_PG_OFFSET
flag is set, use the physical address to compute the page offset
instead. The physical address should always be valid when adding
bounce pages and should contain the same page offset like the virtual
address.
Submitted by: Svatopluk Kraus <onwahe@gmail.com>
Reviewed by: jhb@
Add config option PAE_TABLES for the i386 kernel. It switches pmap to
use PAE format for the page tables, but does not incur other
consequences of the full PAE config. In particular, vm_paddr_t and
bus_addr_t are left 32bit, and max supported memory is still limited
by 4GB.
The option allows to have nx permissions for memory mappings on i386
kernel, while keeping the usual i386 KBI and avoiding the kernel data
sizing problems typical for the PAE config.
Revert the IPI startup sequence to match what is described in the
Intel Multiprocessor Specification v1.4. The Intel SDM claims that
278325:
Revert the IPI startup sequence to match what is described in the
Intel Multiprocessor Specification v1.4. The Intel SDM claims that
the INIT IPIs here are invalid, but other systems follow the MP
spec instead.
While here, fix the IPI wait routine to accept a timeout in microseconds
instead of a raw spin count, and don't spin forever during AP startup.
Instead, panic if a STARTUP IPI is not delivered after 20 us.
280866:
Wait 100 microseconds for a local APIC to dispatch each startup-related IPI
rather than 20. The MP 1.4 specification states in Appendix B.2:
"A period of 20 microseconds should be sufficient for IPI dispatch to
complete under normal operating conditions".
(Note that this appears to be separate from the 10 millisecond (INIT) and
200 microsecond (STARTUP) waits after the IPIs are dispatched.) The
Intel SDM is silent on this issue as far as I can tell.
At least some hardware requires 60 microseconds as noted in the PR, so
bump this to 100 to be on the safe side.
PR: 196542, 197756
On some Intel CPUs with a P-state but not C-state invariant TSC the TSC
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST). Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.
PR: 192316
Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
PCI domain/segment. Host-PCI bridge drivers use this API to allocate
bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
full range of bus numbers from secbus to subbus from their parent bridge.
The drivers also always program their primary bus register. The bridge
drivers also support growing their bus range by extending the bus resource
and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.
PR: 197076
- Reuse legacy_pcib_(read|write)_config() methods in the QPI pcib driver.
- Reuse legacy_pcib_alloc_msi{,x}() methods in the QPI and mptable pcib
drivers.
When mapping an allocated entry, use the entry size, instead of the
requested size. If tag restrictions caused split entry, its size is
less then requsted.
When inserting new entry into the address map, ensure that not only
next entry does not intersect with the tail of the new entry, but also
that previous entry is also before new entry start.
(only to ease merging of r279117).
MFC r279117:
Revert r276949 and redo the fix for PCIe/PCI bridges, which do not
follow specification and do not provide PCIe capability.
Fix DMAR context allocations for the devices behind PCIe->PCI bridges
after dmar driver was converted to use rids. The bus component to
calculate context page must be taken from the requestor rid, which is
a bridge, and not from the device bus number.
MFC support for PCI Alternate RID Interpretation. ARI is an optional PCIe
feature that allows PCI devices to present up to 256 functions on a bus.
This is effectively a prerequisite for PCI SR-IOV support.
r264007:
Add a method to get the PCI RID for a device.
Reviewed by: kib
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264008:
Re-implement the DMAR I/O MMU code in terms of PCI RIDs
Under the hood the VT-d spec is really implemented in terms of
PCI RIDs instead of bus/slot/function, even though the spec makes
pains to convert back to bus/slot/function in examples. However
working with bus/slot/function is not correct when PCI ARI is
in use, so convert to using RIDs in most cases. bus/slot/function
will only be used when reporting errors to a user.
Reviewed by: kib
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264009:
Re-write bhyve's I/O MMU handling in terms of PCI RID.
Reviewed by: neel
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264011:
Add support for PCIe ARI
PCIe Alternate RID Interpretation (ARI) is an optional feature that
allows devices to have up to 256 different functions. It is
implemented by always setting the PCI slot number to 0 and
re-purposing the 5 bits used to encode the slot number to instead
contain the function number. Combined with the original 3 bits
allocated for the function number, this allows for 256 functions.
This is enabled by default, but it's expected to be a no-op on currently
supported hardware. It's a prerequisite for supporting PCI SR-IOV, and
I want the ARI support to go in early to help shake out any bugs in it.
ARI can be disabled by setting the tunable hw.pci.enable_ari=0.
Reviewed by: kib
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264012:
Print status of ARI capability in pciconf -c
Teach pciconf how to print out the status (enabled/disabled) of the ARI
capability on PCI Root Complexes and Downstream Ports.
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264013:
Add missing copyright date.
MFC after: 2 months
Improve support for XSAVE with debuggers.
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
to match what Linux does in that 1) it dumps the entire XSAVE area
including the fxsave state, and 2) it stashes a copy of the current
xsave mask in the unused padding between the fxsave state and the
xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
only the extra portion. This avoids having to always make two
ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
and the current XSAVE mask (%xcr0).
Rework virtual machine hypervisor detection.
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
the 0x40000000 leaf to determine the hypervisor vendor ID. Export the
vendor ID and the highest supported hypervisor CPUID leaf via
hv_vendor[] and hv_high variables, respectively. The hv_vendor[]
array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
identcpu.c. Add a VM_GUEST_VMWARE to identify vmware and use that in
the TSC code to identify VMWare.
Output a summary of optional SVM features in dmesg similar to CPU features.
If bootverbose is enabled, a detailed list is provided; otherwise, a
single-line summary is displayed.
Requested by: jhb
Rename the AMD MSR_PERFCTR[0-3] so the Pentium Pro MSR_PERFCTR[0-1] aren't
redefined.
MFC r273214
Fix build to not bogusly always rebuild vmm.ko.
MFC r273338
Add support for AMD's nested page tables in pmap.c:
- Provide the correct bit mask for various bit fields in a PTE (e.g. valid bit)
for a pmap of type PT_RVI.
- Add a function 'pmap_type_guest(pmap)' that returns TRUE if the pmap is of
type PT_EPT or PT_RVI.
Add CPU_SET_ATOMIC_ACQ(num, cpuset):
This is used when activating a vcpu in the nested pmap. Using the 'acquire'
variant guarantees that the load of the 'pm_eptgen' will happen only after
the vcpu is activated in 'pm_active'.
Add defines for various AMD-specific MSRs.
Discussed with: kib (r261321)
Fix a recursive lock acquisition in vi_reset_dev().
MFC r270434
Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot find
any unmasked pin with an interrupt asserted.
MFC r270436
Fix a bug in the emulation of CPUID leaf 0x4.
MFC r270437
Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.
MFC r270855
Set the 'inst_length' to '0' early on before any error conditions are detected
in the emulation of the task switch. If any exceptions are triggered then the
guest %rip should point to instruction that caused the task switch as opposed
to the one after it.
MFC r270857
The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.
While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.
Update the status bits in %rflags when emulating AND and OR opcodes.
MFC r271439
Initialize 'bc_rdonly' to the right value.
MFC r271451
Optimize the common case of injecting an interrupt into a vcpu after a HLT
by explicitly moving it out of the interrupt shadow.
MFC r271888
Restructure the MSR handling so it is entirely handled by processor-specific
code.
MFC r271890
MSR_KGSBASE is no longer saved and restored from the guest MSR save area. This
behavior was changed in r271888 so update the comment block to reflect this.
MFC r271891
Add some more KTR events to help debugging.
MFC r272197
mmap(2) requires either MAP_PRIVATE or MAP_SHARED for non-anonymous mappings.
MFC r272395
Get rid of code that dealt with the hardware not being able to save/restore
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.
All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.
MFC r272670
Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.
MFC r272710
Implement the FLUSH operation in the virtio-block emulation.
MFC r272838
iasl(8) expects integer fields in data tables to be specified as hexadecimal
values. Therefore the bit width of the "PM Timer Block" was actually being
interpreted as 50-bits instead of the expected 32-bit.
This eliminates an error message emitted by a Linux 3.17 guest during boot:
"Invalid length for FADT/PmTimerBlock: 50, using default 32"
MFC r272839
Support Intel-specific MSRs that are accessed when booting up a linux in bhyve:
- MSR_PLATFORM_INFO
- MSR_TURBO_RATIO_LIMITx
- MSR_RAPL_POWER_UNIT
MFC r273108
Emulate "POP r/m". This is needed to boot OpenBSD/i386 MP kernel in bhyve.
MFC r273212
Support stopping and restarting the AHCI command list via toggling PxCMD.ST
from '1' to '0' and back. This allows the driver a chance to recover if
for instance a timeout occurred due to activity on the host.
- Remove spaces from boot messages when we print the CPU ID/Family/Stepping
- Move prototypes for various functions into out of C files and into
<machine/md_var.h>.
- Reduce diffs between i386 and amd64 initcpu.c and identcpu.c files.
- Move blacklists of broken TSCs out of the printcpuinfo() function
and into the TSC probe routine.
- Merge the amd64 and i386 identcpu.c into a single x86 implementation.
Save and restore FPU state across suspend and resume on i386.
- Create a separate structure for per-CPU state saved across suspend and
resume that is a superset of a pcb.
- Store the FPU state for suspend and resume in the new structure
(for amd64, this moves it out of the PCB)
- On both i386 and amd64, all of the FPU suspend/resume handling is now
done in C.
Approved by: re (hrs)
Change default logic to CONFORM because this routine is shared
with SCI polarity setting.
Reviewed by: jhb
MFC r269184:
Add missing newline to output dmesg properly.