Commit Graph

38 Commits

Author SHA1 Message Date
Neel Natu
7a902ec0ec Add a check to validate that memory BARs of passthru devices are 4KB aligned.
Also, the MSI-x table offset is not required to be 4KB aligned so take this
into account when computing the pages occupied by the MSI-x tables.
2014-02-18 19:00:15 +00:00
John Baldwin
a96b8b801a Tweak the handling of PCI capabilities in emulated devices to remove
the non-standard zero capability list terminator.   Instead, track
the start and end of the most recently added capability and use that
to adjust the previous capability's next pointer when a capability is
added and to determine the range of config registers belonging to
PCI capability registers.

Reviewed by:	neel
2014-02-18 03:00:20 +00:00
Neel Natu
d84882ca8f Allow PCI devices to be configured on all valid bus numbers from 0 to 255.
This is done by representing each bus as root PCI device in ACPI. The device
implements the _BBN method to return the PCI bus number to the guest OS.

Each PCI bus keeps track of the resources that is decodes for devices
configured on the bus: i/o, mmio (32-bit) and mmio (64-bit). These windows
are advertised to the guest via the _CRS object of the root device.

Bus 0 is treated specially since it consumes the I/O ports to access the
PCI config space [0xcf8-0xcff]. It also decodes the legacy I/O ports that
are consumed by devices on the LPC bus. For this reason the LPC bridge can
be configured only on bus 0.

The bus number can be specified using the following command line option
to bhyve(8): "-s <bus>:<slot>:<func>,<emul>[,<config>]"

Discussed with:	grehan@
Reviewed by:	jhb@
2014-02-14 21:34:08 +00:00
John Baldwin
3cbf3585cb Enhance the support for PCI legacy INTx interrupts and enable them in
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
  to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
  ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
  interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
  PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
  that share a slot attempt to distribute their INTx interrupts across
  the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
  and when the INTx DIS bit is set in a function's PCI command register.
  Either assert or deassert the associated I/O APIC pin when the
  state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.

Submitted by:	neel (7)
Reviewed by:	neel
MFC after:	2 weeks
2014-01-29 14:56:48 +00:00
John Baldwin
d2bc4816c5 Remove support for legacy PCI devices. These haven't been needed since
support for LPC uart devices was added and it conflicts with upcoming
patches to add PCI INTx support.

Reviewed by:	neel
2014-01-27 22:26:15 +00:00
John Baldwin
e6c8bc291a Rework the DSDT generation code a bit to generate more accurate info about
LPC devices.  Among other things, the LPC serial ports now appear as
ACPI devices.
- Move the info for the top-level PCI bus into the PCI emulation code and
  add ResourceProducer entries for the memory ranges decoded by the bus
  for memory BARs.
- Add a framework to allow each PCI emulation driver to optionally write
  an entry into the DSDT under the \_SB_.PCI0 namespace.  The LPC driver
  uses this to write a node for the LPC bus (\_SB_.PCI0.ISA).
- Add a linker set to allow any LPC devices to write entries into the
  DSDT below the LPC node.
- Move the existing DSDT block for the RTC to the RTC driver.
- Add DSDT nodes for the AT PIC, the 8254 ISA timer, and the LPC UART
  devices.
- Add a "SuperIO" device under the LPC node to claim "system resources"
  aling with a linker set to allow various drivers to add IO or memory
  ranges that should be claimed as a system resource.
- Add system resource entries for the extended RTC IO range, the registers
  used for ACPI power management, the ELCR, PCI interrupt routing register,
  and post data register.
- Add various helper routines for generating DSDT entries.

Reviewed by:	neel (earlier version)
2014-01-02 21:26:59 +00:00
Neel Natu
4f8be175d5 Add an API to deliver message signalled interrupts to vcpus. This allows
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by:	grehan@
2013-12-16 19:59:31 +00:00
Neel Natu
ac7304a758 Add an ioctl to assert and deassert an ioapic pin atomically. This will be used
to inject edge triggered legacy interrupts into the guest.

Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-23 03:56:03 +00:00
Neel Natu
565bbb8698 Move the ioapic device model from userspace into vmm.ko. This is needed for
upcoming in-kernel device emulations like the HPET.

The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.

Discussed with:	grehan@
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-12 22:51:03 +00:00
Neel Natu
c8afb9bc3f Fix an off-by-one error when iterating over the emulated PCI BARs.
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-06 22:35:52 +00:00
Neel Natu
ea7f1c8cd2 Add support for PCI-to-ISA LPC bridge emulation. If the LPC bus is attached
to a virtual machine then we implicitly create COM1 and COM2 ISA devices.

Prior to this change the only way of attaching a COM port to the virtual
machine was by presenting it as a PCI device that is mapped at the legacy
I/O address 0x3F8 or 0x2F8.

There were some issues with the original approach:
- It did not work at all with UEFI because UEFI will reprogram the PCI device
  BARs and remap the COM1/COM2 ports at non-legacy addresses.
- OpenBSD GENERIC kernel does not create a /dev/console because it expects
  the uart device at the legacy 0x3F8/0x2F8 address to be an ISA device.
- It was functional with a FreeBSD guest but caused the console to appear
  on /dev/ttyu2 which was not intuitive.

The uart emulation is now independent of the bus on which it resides. Thus it
is possible to have uart devices on the PCI bus in addition to the legacy
COM1/COM2 devices behind the LPC bus.

The command line option to attach ISA COM1/COM2 ports to a virtual machine is
"-s <bus>,lpc -l com1,stdio".

The command line option to create a PCI-attached uart device is:
"-s <bus>,uart[,stdio]"

The command line option to create PCI-attached COM1/COM2 device is:
"-S <bus>,uart[,stdio]". This style of creating COM ports is deprecated.

Discussed with:	grehan
Reviewed by:	grehan
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)

M    share/examples/bhyve/vmrun.sh
AM   usr.sbin/bhyve/legacy_irq.c
AM   usr.sbin/bhyve/legacy_irq.h
M    usr.sbin/bhyve/Makefile
AM   usr.sbin/bhyve/uart_emul.c
M    usr.sbin/bhyve/bhyverun.c
AM   usr.sbin/bhyve/uart_emul.h
M    usr.sbin/bhyve/pci_uart.c
M    usr.sbin/bhyve/pci_emul.c
M    usr.sbin/bhyve/inout.c
M    usr.sbin/bhyve/pci_emul.h
M    usr.sbin/bhyve/inout.h
AM   usr.sbin/bhyve/pci_lpc.c
AM   usr.sbin/bhyve/pci_lpc.h
2013-10-29 00:18:11 +00:00
Peter Grehan
2a8d400a2e Allow a 4-byte write to PCI config space to overlap
the 2 read-only bytes at the start of a PCI capability.
This is the sequence that OpenBSD uses when enabling
MSI interrupts, and works fine on real h/w.

In bhyve, convert the 4 byte write to a 2-byte write to
the r/w area past the first 2 r/o bytes of a capability.

Reviewed by:	neel
Approved by:	re@ (blanket)
2013-10-09 23:53:21 +00:00
Neel Natu
318224bbe6 Merge projects/bhyve_npt_pmap into head.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.

Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.

pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.

The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.

Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.

An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
               Bit Position           Interpreted By
PG_V               52                 software (accessed bit emulation handler)
PG_RW              53                 software (dirty bit emulation handler)
PG_A               0                  hardware (aka EPT_PG_RD)
PG_M               1                  hardware (aka EPT_PG_WR)

The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).

The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.

TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.

Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.

PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.

Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.

Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.

Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.

Special thanks to Peter Holm for testing the patch on short notice.

Approved by:	re
Discussed with:	grehan
Reviewed by:	alc, kib
Tested by:	pho
2013-10-05 21:22:35 +00:00
Neel Natu
6a52209f9c Allow single byte reads of the emulated MSI-X tables. This is not required
by the PCI specification but needed to dump MMIO space from "ddb" in the
guest.
2013-08-27 16:50:48 +00:00
Peter Grehan
50dc0db3f0 Fix ordering of legacy IRQ reservations.
Submitted by:	Jeremiah Lott   jlott at averesystems dot com
2013-08-16 00:35:20 +00:00
Peter Grehan
a38e2a64dc Support an optional "mac=" parameter to virtio-net config, to allow
users to set the MAC address for a device.

Clean up some obsolete code in pci_virtio_net.c

Allow an error return from a PCI device emulation's init routine
to be propagated all the way back to the top-level and result in
the process exiting.

Submitted by:	Dinakar Medavaram    dinnu sun at gmail (original version)
2013-07-04 05:35:56 +00:00
Peter Grehan
34d244edb2 Fix up option parsing to allow a colon in the config section.
Clean up some other unnecessary code.

Submitted by:	Dinakar Medavaram    dinnu sun at gmail
Reviewed by:	neel
2013-07-01 23:53:22 +00:00
Peter Grehan
7554303627 Allow the PCI config address register to be read. The Linux
kernel does this. Also remove an unused header file.

Submitted by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
2013-06-28 05:01:25 +00:00
Neel Natu
b05c77ff84 Gripe if some <slot,function> tuple is specified more than once instead of
silently overwriting the previous assignment.

Gripe if the emulation is not recognized instead of silently ignoring the
emulated device.

If an error is detected by pci_parse_slot() then exit from the command line
parsing loop in main().

Submitted by (initial version):	Chris Torek (chris.torek@gmail.com)
2013-04-26 02:24:50 +00:00
Neel Natu
9f08548d20 Setup accesses to the memory hole below 4GB to return all 1's on read and
consume all writes without any side effects.

Obtained from:	NetApp
2013-04-17 02:03:12 +00:00
Neel Natu
028d9311cd Improve PCI BAR emulation:
- Respect the MEMEN and PORTEN bits in the command register
- Allow the guest to reprogram the address decoded by the BAR

Submitted by:	Gopakumar T
Obtained from:	NetApp
2013-04-10 02:12:39 +00:00
Neel Natu
b060ba5024 Simplify the assignment of memory to virtual machines by requiring a single
command line option "-m <memsize in MB>" to specify the memory size.

Prior to this change the user needed to explicitly specify the amount of
memory allocated below 4G (-m <lowmem>) and the amount above 4G (-M <highmem>).

The "-M" option is no longer supported by 'bhyveload' and 'bhyve'.

The start of the PCI hole is fixed at 3GB and cannot be directly changed
using command line options. However it is still possible to change this in
special circumstances via the 'vm_set_lowmem_limit()' API provided by
libvmmapi.

Submitted by:	Dinakar Medavaram (initial version)
Reviewed by:	grehan
Obtained from:	NetApp
2013-03-18 22:38:30 +00:00
Peter Grehan
0ab13648f5 Add the ability to have a 'fallback' search for memory ranges.
These set of ranges will be looked at if a standard memory
range isn't found, and won't be installed in the cache.
Use this to implement the memory behaviour of the PCI hole on
x86 systems, where writes are ignored and reads always return -1.
This allows breakpoints to be set when issuing a 'boot -d', which
has the side effect of accessing the PCI hole when changing the
PTE protection on kernel code, since the pmap layer hasn't been
initialized (a bug, but present in existing FreeBSD releases so
has to be handled).

Reviewed by:	neel
Obtained from:	NetApp
2013-02-22 00:46:32 +00:00
Neel Natu
74f80b236d Advertise PCI-E capability in the hostbridge device presented to the guest.
FreeBSD wants to see this capability in at least one device in the PCI
hierarchy before it allows use of MSI or MSI-X.

Obtained from:	NetApp
2013-02-15 18:41:36 +00:00
Neel Natu
aa12663f49 Fix a bug in the passthru implementation where it would assume that all
devices are MSI-X capable. This in turn would lead it to treat bar 0 as
the MSI-X table bar even if the underlying device did not support MSI-X.

Fix this by providing an API to query the MSI-X table index of the emulated
device. If the underlying device does not support MSI-X then this API will
return -1.

Obtained from:	NetApp
2013-02-01 02:41:47 +00:00
Neel Natu
c9b4e98754 Add support for MSI-X interrupts in the virtio network device and make that
the default.

The current behavior of advertising a single MSI vector can be requested by
setting the environment variable "BHYVE_USE_MSI" to "true". The use of MSI
is not compliant with the virtio specification and will be eventually phased
out.

Submitted by:	Gopakumar T
Obtained from:	NetApp
2013-01-30 04:30:36 +00:00
Peter Grehan
e285ef8d28 Rename fbsdrun.* -> bhyverun.*
bhyve is intended to be a generic hypervisor, and not FreeBSD-specific.

(renaming internal routines will come later)

Reviewed by:	neel
Obtained from:	NetApp
2012-12-13 01:58:11 +00:00
Neel Natu
90415e0b4e Ignore PCI configuration accesses to all bus numbers other than PCI bus 0.
Obtained from:	NetApp
2012-10-27 02:39:08 +00:00
Peter Grehan
fbfc1c763c Remove mptable generation code from libvmmapi and move it to bhyve.
Firmware tables require too much knowledge of system configuration,
and it's difficult to pass that information in general terms to a library.
The upcoming ACPI work exposed this - it will also livein bhyve.

Also, remove code specific to NetApp from the mptable name, and remove
the -n option from bhyve.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-26 13:40:12 +00:00
Peter Grehan
4d1e669cad Rework how guest MMIO regions are dealt with.
- New memory region interface. An RB tree holds the regions,
with a last-found per-vCPU cache to deal with the common case
of repeated guest accesses to MMIO registers in the same page.

- Support memory-mapped BARs in PCI emulation.

 mem.c/h - memory region interface

 instruction_emul.c/h - remove old region interface.
 Use gpa from EPT exit to avoid a tablewalk to
 determine operand address. Determine operand size
 and use when calling through to region handler.

 fbsdrun.c - call into region interface on paging
  exit. Distinguish between instruction emul error
  and region not found

 pci_emul.c/h - implement new BAR callback api.
 Split BAR alloc routine into routines that
 require/don't require the BAR phys address.

 ioapic.c
 pci_passthru.c
 pci_virtio_block.c
 pci_virtio_net.c
 pci_uart.c  - update to new BAR callback i/f

Reviewed by:	neel
Obtained from:	NetApp
2012-10-19 18:11:17 +00:00
Neel Natu
25d4944e06 Fix a bug in how a 64-bit bar in a pci passthru device would be presented to
the guest. Prior to the fix it was possible for such a bar to appear as a
32-bit bar as long as it was allocated from the region below 4GB.

This had the potential to confuse some drivers that were particular about
the size of the bars.

Obtained from:	NetApp
2012-08-06 07:20:25 +00:00
Neel Natu
99d653892b Add support for emulating PCI multi-function devices.
These function number is specified by an optional [:<func>] after the slot
number: -s 1:0,virtio-net,tap0

Ditto for the mptable naming: -n 1:0,e0a

Obtained from:	NetApp
2012-08-06 06:51:27 +00:00
Neel Natu
b0b53d3af9 Device model for ioapic emulation.
With this change the uart emulation is entirely interrupt driven.

Obtained from: NetApp
2012-08-05 00:00:52 +00:00
Neel Natu
308f907720 Use the correct variable to index into the 'lirq[]' array to check the legacy
IRQ ownership.
2012-08-04 04:26:17 +00:00
Peter Grehan
0038ee9891 Add 16550 uart emulation as a PCI device. This allows it to
be activated as part of the slot config options.
  The syntax is:

     -s <slotnum>,uart[,stdio]

  The stdio parameter instructs the code to perform i/o using
stdin/stdout. It can only be used for one instance.
  To allow legacy i/o ports/irqs to be used, a new variant of
the slot command, -S, is introduced. When used to specify a
slot, the device will use legacy resources if it supports
them; otherwise it will be treated the same as the '-s' option.
  Specifying the -S option with the uart will first use the 0x3f8/irq 4
config, and the second -S will use 0x2F8/irq 3.

  Interrupt delivery is awaiting the arrival of the i/o apic code,
but this works fine in uart(4)'s polled mode.

  This code was written by Cynthia Lu @ MIT while an intern at NetApp,
with further work from neel@ and grehan@.

Obtained from:	NetApp
2012-05-03 03:11:27 +00:00
Peter Grehan
cd942e0f25 MSI-x interrupt support for PCI pass-thru devices.
Includes instruction emulation for memory r/w access. This
opens the door for io-apic, local apic, hpet timer, and
legacy device emulation.

Submitted by:	ryan dot berryhill at sandvine dot com
Reviewed by:	grehan
Obtained from:	Sandvine
2012-04-28 16:28:00 +00:00
John Baldwin
b67e81db43 First cut to port bhyve, vmmctl, and libvmmapi to HEAD. 2011-05-15 04:03:11 +00:00
Peter Grehan
366f60834f Import of bhyve hypervisor and utilities, part 1.
vmm.ko - kernel module for VT-x, VT-d and hypervisor control
  bhyve  - user-space sequencer and i/o emulation
  vmmctl - dump of hypervisor register state
  libvmm - front-end to vmm.ko chardev interface

bhyve was designed and implemented by Neel Natu.

Thanks to the following folk from NetApp who helped to make this available:
	Joe CaraDonna
	Peter Snyder
	Jeff Heller
	Sandeep Mann
	Steve Miller
	Brian Pawlowski
2011-05-13 04:54:01 +00:00