rather than opt-out.
Prior to this change if the "-g" option was not specified then a listening
socket for tunneling gdb packets would be opened at port 6466. If a second
virtual machine is fired up, also without the "-g" option, then that would
fail because there is already a listener on port 6466.
After this change if a gdb tunnel port needs to be created it needs to be
explicitly specified with a "-g <portnum>" command line option.
Reviewed by: grehan@
Approved by: re@ (blanket)
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.
Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.
pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.
The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.
Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.
An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
Bit Position Interpreted By
PG_V 52 software (accessed bit emulation handler)
PG_RW 53 software (dirty bit emulation handler)
PG_A 0 hardware (aka EPT_PG_RD)
PG_M 1 hardware (aka EPT_PG_WR)
The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).
The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.
TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.
Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.
PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.
Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.
Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.
Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.
Special thanks to Peter Holm for testing the patch on short notice.
Approved by: re
Discussed with: grehan
Reviewed by: alc, kib
Tested by: pho
these.
The mux-vcpus option may return at some point, given it's utility
in finding bhyve (and FreeBSD) bugs.
Approved by: re@ (blanket)
Discussed with: neel@
https://wiki.freebsd.org/SummerOfCode2013/bhyveAHCI
This provides ICH8 SATA disk and ATAPI ports, selectable
via the bhyve slot command-line parameter:
SATA
-s <slot>,ahci-hd,<image-file>
ATAPI
-s <slot>,ahci-cd,<image-file>
Slight modifications by: grehan@
Approved by: re@ (blanket)
Obtained from: FreeBSD GSoC'13
timer support. This should be enough for the emulation of
h/w periodic timers (and no more) e.g. some of the 8254's
more esoteric modes that happen to be used by non-FreeBSD o/s's.
Approved by: re@ (blanket)
This should be sufficient for 10.0 and will do
until forthcoming work to avoid limitations
in this area is complete.
Thanks to Bela Lubkin at tidalscale for the
headsup on the apic/cpu id/io apic ASL parameters
that are actually hex values and broke when
written as decimal when 11 vCPUs were configured.
Approved by: re@
users to set the MAC address for a device.
Clean up some obsolete code in pci_virtio_net.c
Allow an error return from a PCI device emulation's init routine
to be propagated all the way back to the top-level and result in
the process exiting.
Submitted by: Dinakar Medavaram dinnu sun at gmail (original version)
If this capability is negotiated by the guest then the device will
generate an interrupt when it runs out of available tx/rx descriptors.
Reviewed by: grehan
Obtained from: NetApp
drop any frames that arrive while the device is starved for receive buffers.
This makes the receive path to only execute in context of the receive thread
and allows for further simplification.
Reviewed by: grehan
than blocking the vCPU thread. This improves bulk data performance
by ~30-40% and doesn't harm req/resp time for stock netperf runs.
Future work will use a thread pool rather than a thread per tx queue.
Submitted by: Dinakar Medavaram
Reviewed by: neel, grehan
Obtained from: NetApp
silently overwriting the previous assignment.
Gripe if the emulation is not recognized instead of silently ignoring the
emulated device.
If an error is detected by pci_parse_slot() then exit from the command line
parsing loop in main().
Submitted by (initial version): Chris Torek (chris.torek@gmail.com)
descriptors. Prior to this change the device would only work with guests
that chose to use indirect descriptors.
Modify the device reset callback to actually reset the device state.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
This was working by accident because:
- the RB_HEADs were being initialized to zero as part of BSS
- the pthread_rwlock functions were implicitly initializing the lock object
Obtained from: NetApp
- use clock_gettime(2) as the time base for the emulated ACPI timer instead
of directly using rdtsc().
- don't advertise the invariant TSC capability to the guest to discourage it
from using the TSC as its time base.
Discussed with: jhb@ (about making 'smp_tsc' a global)
Reported by: Dan Mack on freebsd-virtualization@
Obtained from: NetApp
- Respect the MEMEN and PORTEN bits in the command register
- Allow the guest to reprogram the address decoded by the BAR
Submitted by: Gopakumar T
Obtained from: NetApp
command line option "-m <memsize in MB>" to specify the memory size.
Prior to this change the user needed to explicitly specify the amount of
memory allocated below 4G (-m <lowmem>) and the amount above 4G (-M <highmem>).
The "-M" option is no longer supported by 'bhyveload' and 'bhyve'.
The start of the PCI hole is fixed at 3GB and cannot be directly changed
using command line options. However it is still possible to change this in
special circumstances via the 'vm_set_lowmem_limit()' API provided by
libvmmapi.
Submitted by: Dinakar Medavaram (initial version)
Reviewed by: grehan
Obtained from: NetApp
into the MSI-X table before using it to calculate the table index.
In the common case where the MSI-X table is located at the begining of the
BAR these two offsets are identical and thus the code was working by accident.
This change will fix the case where the MSI-X table is located in the middle
or at the end of the BAR that contains it.
Obtained from: NetApp
This seems prudent to do in its own right but it also opens up the possibility
of not having to mmap the entire guest address space in the 'bhyve' process
context.
Discussed with: grehan
Obtained from: NetApp
These set of ranges will be looked at if a standard memory
range isn't found, and won't be installed in the cache.
Use this to implement the memory behaviour of the PCI hole on
x86 systems, where writes are ignored and reads always return -1.
This allows breakpoints to be set when issuing a 'boot -d', which
has the side effect of accessing the PCI hole when changing the
PTE protection on kernel code, since the pmap layer hasn't been
initialized (a bug, but present in existing FreeBSD releases so
has to be handled).
Reviewed by: neel
Obtained from: NetApp
Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING)
that called 'sched_bind()' on behalf of the user thread.
The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn
runs afoul of the assertion '(td_pinned == 0)' in userret().
Using the cpuset affinity to implement pinning of the vcpu threads works with
both 4BSD and ULE schedulers and has the happy side-effect of getting rid
of a bunch of code in vmm.ko.
Discussed with: grehan
the default.
The current behavior of advertising a single MSI vector can be requested by
setting the environment variable "BHYVE_USE_MSI" to "yes". The use of MSI
is not compliant with the virtio specification and will be eventually phased
out.
Submitted by: Gopakumar T
Obtained from: NetApp
can only be located at the beginning or the end of the BAR.
If the MSI-table is located in the middle of a BAR then we will split the
BAR into two and create two mappings - one before the table and one after
the table - leaving a hole in place of the table so accesses to it can be
trapped and emulated.
Obtained from: NetApp
devices are MSI-X capable. This in turn would lead it to treat bar 0 as
the MSI-X table bar even if the underlying device did not support MSI-X.
Fix this by providing an API to query the MSI-X table index of the emulated
device. If the underlying device does not support MSI-X then this API will
return -1.
Obtained from: NetApp
the default.
The current behavior of advertising a single MSI vector can be requested by
setting the environment variable "BHYVE_USE_MSI" to "true". The use of MSI
is not compliant with the virtio specification and will be eventually phased
out.
Submitted by: Gopakumar T
Obtained from: NetApp
statically. In most cases the number of table entries will be far less than
the maximum of 2048 allowed by the PCI specification.
Reuse macros from pcireg.h to interpret the MSI-X capability instead of rolling
our own.
Obtained from: NetApp
fill up to the uart's rx fifo size, and leave any remaining input
for when the rx fifo is read. This allows cut'n'paste of long lines
to be done into the bhyve console without truncation.
Also, introduce a mutex since the file input will run in the mevent
thread context and may corrupt state accessed by a vCPU thread.
Reviewed by: neel
Approved by: NetApp
With this change, dbench with >= 4 processes runs without getting
weird jumps forward in time when the APCI pmtimer is the default
timecounter.
Obtained from: NetApp
the guest to execute real or unpaged protected mode code - bhyve relies on
this feature to execute the AP bootstrap code.
Get rid of the hack that allowed bhyve to support SMP guests on processors
that do not have the "unrestricted guest" capability. This hack was entirely
FreeBSD-specific and would not work with any other guest OS.
Instead, limit the number of vcpus to 1 when executing on processors without
"unrestricted guest" capability.
Suggested by: grehan
Obtained from: NetApp
bhyve is intended to be a generic hypervisor, and not FreeBSD-specific.
(renaming internal routines will come later)
Reviewed by: neel
Obtained from: NetApp
On a nested page table fault the hypervisor will:
- fetch the instruction using the guest %rip and %cr3
- decode the instruction in 'struct vie'
- emulate the instruction in host kernel context for local apic accesses
- any other type of mmio access is punted up to user-space (e.g. ioapic)
The decoded instruction is passed as collateral to the user-space process
that is handling the PAGING exit.
The emulation code is fleshed out to include more addressing modes (e.g. SIB)
and more types of operands (e.g. imm8). The source code is unified into a
single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well
as /usr/sbin/bhyve.
Reviewed by: grehan
Obtained from: NetApp
The -A option will create the minimal set of required ACPI tables in
guest memory. Since ACPI mandates an IOAPIC, the -I option must also
be used.
Template ASL files are created, and then passed to the iasl compiler
to generate AML files. These are then loaded into guest physical mem.
In support of this, the ACPI PM timer is implemented, in 32-bit mode.
Tested on 7.4/8.*/9.*/10-CURRENT.
Reviewed by: neel
Obtained from: NetApp
Discussed with: jhb (a long while back)
than waiting until AP bringup detects an out-of-range vCPU.
While here, fix all error output to use fprintf(stderr, ...
Reviewed by: neel
Reported by: @allanjude
Firmware tables require too much knowledge of system configuration,
and it's difficult to pass that information in general terms to a library.
The upcoming ACPI work exposed this - it will also livein bhyve.
Also, remove code specific to NetApp from the mptable name, and remove
the -n option from bhyve.
Reviewed by: neel
Obtained from: NetApp
- New memory region interface. An RB tree holds the regions,
with a last-found per-vCPU cache to deal with the common case
of repeated guest accesses to MMIO registers in the same page.
- Support memory-mapped BARs in PCI emulation.
mem.c/h - memory region interface
instruction_emul.c/h - remove old region interface.
Use gpa from EPT exit to avoid a tablewalk to
determine operand address. Determine operand size
and use when calling through to region handler.
fbsdrun.c - call into region interface on paging
exit. Distinguish between instruction emul error
and region not found
pci_emul.c/h - implement new BAR callback api.
Split BAR alloc routine into routines that
require/don't require the BAR phys address.
ioapic.c
pci_passthru.c
pci_virtio_block.c
pci_virtio_net.c
pci_uart.c - update to new BAR callback i/f
Reviewed by: neel
Obtained from: NetApp
AP needs to be activated by spinning up an execution context for it.
The local apic emulation is now completely done in the hypervisor and it will
detect writes to the ICR_LO register that try to bring up the AP. In response
to such writes it will return to userspace with an exit code of SPINUP_AP.
Reviewed by: grehan
the guest. Prior to the fix it was possible for such a bar to appear as a
32-bit bar as long as it was allocated from the region below 4GB.
This had the potential to confuse some drivers that were particular about
the size of the bars.
Obtained from: NetApp
These function number is specified by an optional [:<func>] after the slot
number: -s 1:0,virtio-net,tap0
Ditto for the mptable naming: -n 1:0,e0a
Obtained from: NetApp
or 32-bit signed integer.
Simplify the handling of indirect addressing with displacement by
unconditionally adding the 'instruction->disp' to the target address.
This is alright since 'instruction->disp' is non-zero only for the
addressing modes that specify a displacement.
Obtained from: NetApp
be activated as part of the slot config options.
The syntax is:
-s <slotnum>,uart[,stdio]
The stdio parameter instructs the code to perform i/o using
stdin/stdout. It can only be used for one instance.
To allow legacy i/o ports/irqs to be used, a new variant of
the slot command, -S, is introduced. When used to specify a
slot, the device will use legacy resources if it supports
them; otherwise it will be treated the same as the '-s' option.
Specifying the -S option with the uart will first use the 0x3f8/irq 4
config, and the second -S will use 0x2F8/irq 3.
Interrupt delivery is awaiting the arrival of the i/o apic code,
but this works fine in uart(4)'s polled mode.
This code was written by Cynthia Lu @ MIT while an intern at NetApp,
with further work from neel@ and grehan@.
Obtained from: NetApp
Includes instruction emulation for memory r/w access. This
opens the door for io-apic, local apic, hpet timer, and
legacy device emulation.
Submitted by: ryan dot berryhill at sandvine dot com
Reviewed by: grehan
Obtained from: Sandvine
run as a 1/2 CPU guest on an 8.1 bhyve host.
bhyve/inout.c
inout.h
fbsdrun.c
- Rather than exiting on accesses to unhandled i/o ports, emulate
hardware by returning -1 on reads and ignoring writes to unhandled
ports. Support the previous mode by allowing a 'strict' parameter
to be set from the command line.
The 8.1 guest kernel was vastly cut down from GENERIC and had no
ISA devices. Booting GENERIC exposes a massive amount of random
touching of i/o ports (hello syscons/vga/atkbdc).
bhyve/consport.c
dev/bvm/bvm_console.c
- implement a simplistic signature for the bvm console by returning
'bv' for an inw on the port. Also, set the priority of the console
to CN_REMOTE if the signature was returned. This works better in
an environment where multiple consoles are in the kernel (hello syscons)
bhyve/rtc.c
- return 0 for the access to RTC_EQUIPMENT (yes, you syscons)
amd64/vmm/x86.c
x86.h
- hide a bunch more CPUID leaf 1 bits from the guest to prevent
cpufreq drivers from probing.
The next step will be to move CPUID handling completely into
user-space. This will allow the full spectrum of changes from
presenting a lowest-common-denominator CPU type/feature set, to
exposing (almost) everything that the host can support.
Reviewed by: neel
Obtained from: NetApp