- Return 2 x 16-bit registers in the correct byte order
for a 4-byte read that spans the CMD/STATUS register.
This reversal was hiding the capabilities-list, which prevented
the MSI capability from being found for XHCI passthru.
- Reorganize MSI/MSI-x config writes so that a 4-byte write at the
capability offset would have the read-only portion skipped.
This prevented MSI interrupts from being enabled.
Reported and extensively tested by Anatoli (me at anatoli dot ws)
PR: 245392
Reported by: Anatoli (me at anatoli dot ws)
Reviewed by: jhb (bhyve)
Approved by: jhb, bz (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D24951
Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed. In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken). A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.
To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.
While the current implementation is useful for several uses cases, it
has a few limitations. The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system). In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions. The file format also does not currently support
versioning of individual chunks of state. As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files. The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility. As a result, the current implementation is not enabled
by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.
Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes: yes
Sponsored by: University Politehnica of Bucharest
Sponsored by: Matthew Grooms (student scholarships)
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D19495
This ensures that bhyve properly recognizes when decoding is disabled
for BARs on passthru devices. To properly handle writes to the
register, export a pci_emul_cmd_changed function from pci_emul.c that
the pass through device model invokes for config writes that change
PCIR_COMMAND.
Reviewed by: rgrimes
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20531
For tools that uses bhyve such like libvirt, it is important to be able to
probe what features are supported by the given bhyve binary.
To give more context, libvirt probes bhyve's capabilities in a not very
effective way:
- Running 'bhyve -h' and parsing output.
- To detect devices, it runs 'bhyve -s 0,dev' for every each device and
parses error output to identify if the device is supported or not.
PR: 2101111
Submitted by: novel
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems Inc.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
It was useless before, but may improve performance now if multiple devices
are configured and guest supports this feature.
Sponsored by: iXsystems, Inc.
If the PBA shares a page with the MSI-X table, map the shared page via
/dev/mem and emulate accesses to the portion of the PBA in the shared
page by accessing the mapped page.
Reviewed by: grehan
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D5919
Add the ACPI MCFG table to advertise the extended config memory window.
Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted
or moved in the guest's address space. The PCI extended config space is an
example of an immutable memory range.
Add emulation for the "movzw" instruction. This instruction is used by FreeBSD
to read a 16-bit extended config space register.
CR: https://phabric.freebsd.org/D505
Reviewed by: jhb, grehan
Requested by: tychon
the legacy 8259A PICs.
- Implement an ICH-comptabile PCI interrupt router on the lpc device with
8 steerable pins configured via config space access to byte-wide
registers at 0x60-63 and 0x68-6b.
- For each configured PCI INTx interrupt, route it to both an I/O APIC
pin and a PCI interrupt router pin. When a PCI INTx interrupt is
asserted, ensure that both pins are asserted.
- Provide an initial routing of PCI interrupt router (PIRQ) pins to
8259A pins (ISA IRQs) and initialize the interrupt line config register
for the corresponding PCI function with the ISA IRQ as this matches
existing hardware.
- Add a global _PIC method for OSPM to select the desired interrupt routing
configuration.
- Update the _PRT methods for PCI bridges to provide both APIC and legacy
PRT tables and return the appropriate table based on the configured
routing configuration. Note that if the lpc device is not configured, no
routing information is provided.
- When the lpc device is enabled, provide ACPI PCI link devices corresponding
to each PIRQ pin.
- Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A
pins via the ELCR.
- Mark the power management SCI as level triggered.
- Don't hardcode the number of elements in Packages in the source for
the DSDT. iasl(8) will fill in the actual number of elements, and
this makes it simpler to generate a Package with a variable number of
elements.
Reviewed by: tycho
because there isn't a standard way to relay this information to the guest OS.
Add a command line option "-Y" to bhyve(8) to inhibit MPtable generation.
If the virtual machine is using PCI devices on buses other than 0 then it can
still use ACPI tables to convey this information to the guest.
Discussed with: grehan@
the non-standard zero capability list terminator. Instead, track
the start and end of the most recently added capability and use that
to adjust the previous capability's next pointer when a capability is
added and to determine the range of config registers belonging to
PCI capability registers.
Reviewed by: neel
This is done by representing each bus as root PCI device in ACPI. The device
implements the _BBN method to return the PCI bus number to the guest OS.
Each PCI bus keeps track of the resources that is decodes for devices
configured on the bus: i/o, mmio (32-bit) and mmio (64-bit). These windows
are advertised to the guest via the _CRS object of the root device.
Bus 0 is treated specially since it consumes the I/O ports to access the
PCI config space [0xcf8-0xcff]. It also decodes the legacy I/O ports that
are consumed by devices on the LPC bus. For this reason the LPC bridge can
be configured only on bus 0.
The bus number can be specified using the following command line option
to bhyve(8): "-s <bus>:<slot>:<func>,<emul>[,<config>]"
Discussed with: grehan@
Reviewed by: jhb@
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
that share a slot attempt to distribute their INTx interrupts across
the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
and when the INTx DIS bit is set in a function's PCI command register.
Either assert or deassert the associated I/O APIC pin when the
state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.
Submitted by: neel (7)
Reviewed by: neel
MFC after: 2 weeks
LPC devices. Among other things, the LPC serial ports now appear as
ACPI devices.
- Move the info for the top-level PCI bus into the PCI emulation code and
add ResourceProducer entries for the memory ranges decoded by the bus
for memory BARs.
- Add a framework to allow each PCI emulation driver to optionally write
an entry into the DSDT under the \_SB_.PCI0 namespace. The LPC driver
uses this to write a node for the LPC bus (\_SB_.PCI0.ISA).
- Add a linker set to allow any LPC devices to write entries into the
DSDT below the LPC node.
- Move the existing DSDT block for the RTC to the RTC driver.
- Add DSDT nodes for the AT PIC, the 8254 ISA timer, and the LPC UART
devices.
- Add a "SuperIO" device under the LPC node to claim "system resources"
aling with a linker set to allow various drivers to add IO or memory
ranges that should be claimed as a system resource.
- Add system resource entries for the extended RTC IO range, the registers
used for ACPI power management, the ELCR, PCI interrupt routing register,
and post data register.
- Add various helper routines for generating DSDT entries.
Reviewed by: neel (earlier version)
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by: grehan@
to inject edge triggered legacy interrupts into the guest.
Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
to a virtual machine then we implicitly create COM1 and COM2 ISA devices.
Prior to this change the only way of attaching a COM port to the virtual
machine was by presenting it as a PCI device that is mapped at the legacy
I/O address 0x3F8 or 0x2F8.
There were some issues with the original approach:
- It did not work at all with UEFI because UEFI will reprogram the PCI device
BARs and remap the COM1/COM2 ports at non-legacy addresses.
- OpenBSD GENERIC kernel does not create a /dev/console because it expects
the uart device at the legacy 0x3F8/0x2F8 address to be an ISA device.
- It was functional with a FreeBSD guest but caused the console to appear
on /dev/ttyu2 which was not intuitive.
The uart emulation is now independent of the bus on which it resides. Thus it
is possible to have uart devices on the PCI bus in addition to the legacy
COM1/COM2 devices behind the LPC bus.
The command line option to attach ISA COM1/COM2 ports to a virtual machine is
"-s <bus>,lpc -l com1,stdio".
The command line option to create a PCI-attached uart device is:
"-s <bus>,uart[,stdio]"
The command line option to create PCI-attached COM1/COM2 device is:
"-S <bus>,uart[,stdio]". This style of creating COM ports is deprecated.
Discussed with: grehan
Reviewed by: grehan
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
M share/examples/bhyve/vmrun.sh
AM usr.sbin/bhyve/legacy_irq.c
AM usr.sbin/bhyve/legacy_irq.h
M usr.sbin/bhyve/Makefile
AM usr.sbin/bhyve/uart_emul.c
M usr.sbin/bhyve/bhyverun.c
AM usr.sbin/bhyve/uart_emul.h
M usr.sbin/bhyve/pci_uart.c
M usr.sbin/bhyve/pci_emul.c
M usr.sbin/bhyve/inout.c
M usr.sbin/bhyve/pci_emul.h
M usr.sbin/bhyve/inout.h
AM usr.sbin/bhyve/pci_lpc.c
AM usr.sbin/bhyve/pci_lpc.h
users to set the MAC address for a device.
Clean up some obsolete code in pci_virtio_net.c
Allow an error return from a PCI device emulation's init routine
to be propagated all the way back to the top-level and result in
the process exiting.
Submitted by: Dinakar Medavaram dinnu sun at gmail (original version)
silently overwriting the previous assignment.
Gripe if the emulation is not recognized instead of silently ignoring the
emulated device.
If an error is detected by pci_parse_slot() then exit from the command line
parsing loop in main().
Submitted by (initial version): Chris Torek (chris.torek@gmail.com)
devices are MSI-X capable. This in turn would lead it to treat bar 0 as
the MSI-X table bar even if the underlying device did not support MSI-X.
Fix this by providing an API to query the MSI-X table index of the emulated
device. If the underlying device does not support MSI-X then this API will
return -1.
Obtained from: NetApp
the default.
The current behavior of advertising a single MSI vector can be requested by
setting the environment variable "BHYVE_USE_MSI" to "true". The use of MSI
is not compliant with the virtio specification and will be eventually phased
out.
Submitted by: Gopakumar T
Obtained from: NetApp
statically. In most cases the number of table entries will be far less than
the maximum of 2048 allowed by the PCI specification.
Reuse macros from pcireg.h to interpret the MSI-X capability instead of rolling
our own.
Obtained from: NetApp
Firmware tables require too much knowledge of system configuration,
and it's difficult to pass that information in general terms to a library.
The upcoming ACPI work exposed this - it will also livein bhyve.
Also, remove code specific to NetApp from the mptable name, and remove
the -n option from bhyve.
Reviewed by: neel
Obtained from: NetApp
- New memory region interface. An RB tree holds the regions,
with a last-found per-vCPU cache to deal with the common case
of repeated guest accesses to MMIO registers in the same page.
- Support memory-mapped BARs in PCI emulation.
mem.c/h - memory region interface
instruction_emul.c/h - remove old region interface.
Use gpa from EPT exit to avoid a tablewalk to
determine operand address. Determine operand size
and use when calling through to region handler.
fbsdrun.c - call into region interface on paging
exit. Distinguish between instruction emul error
and region not found
pci_emul.c/h - implement new BAR callback api.
Split BAR alloc routine into routines that
require/don't require the BAR phys address.
ioapic.c
pci_passthru.c
pci_virtio_block.c
pci_virtio_net.c
pci_uart.c - update to new BAR callback i/f
Reviewed by: neel
Obtained from: NetApp
be activated as part of the slot config options.
The syntax is:
-s <slotnum>,uart[,stdio]
The stdio parameter instructs the code to perform i/o using
stdin/stdout. It can only be used for one instance.
To allow legacy i/o ports/irqs to be used, a new variant of
the slot command, -S, is introduced. When used to specify a
slot, the device will use legacy resources if it supports
them; otherwise it will be treated the same as the '-s' option.
Specifying the -S option with the uart will first use the 0x3f8/irq 4
config, and the second -S will use 0x2F8/irq 3.
Interrupt delivery is awaiting the arrival of the i/o apic code,
but this works fine in uart(4)'s polled mode.
This code was written by Cynthia Lu @ MIT while an intern at NetApp,
with further work from neel@ and grehan@.
Obtained from: NetApp
Includes instruction emulation for memory r/w access. This
opens the door for io-apic, local apic, hpet timer, and
legacy device emulation.
Submitted by: ryan dot berryhill at sandvine dot com
Reviewed by: grehan
Obtained from: Sandvine
vmm.ko - kernel module for VT-x, VT-d and hypervisor control
bhyve - user-space sequencer and i/o emulation
vmmctl - dump of hypervisor register state
libvmm - front-end to vmm.ko chardev interface
bhyve was designed and implemented by Neel Natu.
Thanks to the following folk from NetApp who helped to make this available:
Joe CaraDonna
Peter Snyder
Jeff Heller
Sandeep Mann
Steve Miller
Brian Pawlowski