Commit Graph

13 Commits

Author SHA1 Message Date
Neel Natu
318224bbe6 Merge projects/bhyve_npt_pmap into head.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.

Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.

pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.

The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.

Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.

An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
               Bit Position           Interpreted By
PG_V               52                 software (accessed bit emulation handler)
PG_RW              53                 software (dirty bit emulation handler)
PG_A               0                  hardware (aka EPT_PG_RD)
PG_M               1                  hardware (aka EPT_PG_WR)

The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).

The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.

TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.

Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.

PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.

Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.

Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.

Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.

Special thanks to Peter Holm for testing the patch on short notice.

Approved by:	re
Discussed with:	grehan
Reviewed by:	alc, kib
Tested by:	pho
2013-10-05 21:22:35 +00:00
Peter Grehan
94c3b3bffc Remove obsolete cmd-line options and code associated with
these.
 The mux-vcpus option may return at some point, given it's utility
in finding bhyve (and FreeBSD) bugs.

Approved by:	re@ (blanket)
Discussed with:	neel@
2013-10-04 23:29:07 +00:00
Peter Grehan
8b271170d1 Sanity-check the vm exitcode, and exit the process if it's out-of-bounds
or there is no registered handler.

Submitted by:	Bela Lubkin   bela dot lubkin at tidalscale dot com
2013-07-18 18:40:54 +00:00
Peter Grehan
9d6be09f8a Implement RTC CMOS nvram. Init some fields that are used
by FreeBSD and UEFI.
Tested with nvram(4).

Reviewed by:	neel
2013-07-11 03:54:35 +00:00
Peter Grehan
a38e2a64dc Support an optional "mac=" parameter to virtio-net config, to allow
users to set the MAC address for a device.

Clean up some obsolete code in pci_virtio_net.c

Allow an error return from a PCI device emulation's init routine
to be propagated all the way back to the top-level and result in
the process exiting.

Submitted by:	Dinakar Medavaram    dinnu sun at gmail (original version)
2013-07-04 05:35:56 +00:00
Neel Natu
b05c77ff84 Gripe if some <slot,function> tuple is specified more than once instead of
silently overwriting the previous assignment.

Gripe if the emulation is not recognized instead of silently ignoring the
emulated device.

If an error is detected by pci_parse_slot() then exit from the command line
parsing loop in main().

Submitted by (initial version):	Chris Torek (chris.torek@gmail.com)
2013-04-26 02:24:50 +00:00
Neel Natu
0e2ca4e625 Need to call init_mem() to really initialize the MMIO range lookups.
This was working by accident because:
- the RB_HEADs were being initialized to zero as part of BSS
- the pthread_rwlock functions were implicitly initializing the lock object

Obtained from:	NetApp
2013-04-10 18:59:20 +00:00
Neel Natu
b060ba5024 Simplify the assignment of memory to virtual machines by requiring a single
command line option "-m <memsize in MB>" to specify the memory size.

Prior to this change the user needed to explicitly specify the amount of
memory allocated below 4G (-m <lowmem>) and the amount above 4G (-M <highmem>).

The "-M" option is no longer supported by 'bhyveload' and 'bhyve'.

The start of the PCI hole is fixed at 3GB and cannot be directly changed
using command line options. However it is still possible to change this in
special circumstances via the 'vm_set_lowmem_limit()' API provided by
libvmmapi.

Submitted by:	Dinakar Medavaram (initial version)
Reviewed by:	grehan
Obtained from:	NetApp
2013-03-18 22:38:30 +00:00
Neel Natu
91039bb268 Specify the length of the mapping requested from 'paddr_guest2host()'.
This seems prudent to do in its own right but it also opens up the possibility
of not having to mmap the entire guest address space in the 'bhyve' process
context.

Discussed with:	grehan
Obtained from:	NetApp
2013-03-01 02:26:28 +00:00
Neel Natu
485b3300cc Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'.
Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING)
that called 'sched_bind()' on behalf of the user thread.

The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn
runs afoul of the assertion '(td_pinned == 0)' in userret().

Using the cpuset affinity to implement pinning of the vcpu threads works with
both 4BSD and ULE schedulers and has the happy side-effect of getting rid
of a bunch of code in vmm.ko.

Discussed with:	grehan
2013-02-11 20:36:07 +00:00
Neel Natu
24be8623c6 Use <vmname> in a consistent manner in usage messages output by 'bhyve',
'bhyveload' and 'bhyvectl'.

Pointed out by:	joel@
2013-01-20 03:47:13 +00:00
Neel Natu
5f0677d392 The "unrestricted guest" capability is a feature of Intel VT-x that allows
the guest to execute real or unpaged protected mode code - bhyve relies on
this feature to execute the AP bootstrap code.

Get rid of the hack that allowed bhyve to support SMP guests on processors
that do not have the "unrestricted guest" capability. This hack was entirely
FreeBSD-specific and would not work with any other guest OS.

Instead, limit the number of vcpus to 1 when executing on processors without
"unrestricted guest" capability.

Suggested by:	grehan
Obtained from:	NetApp
2013-01-04 02:04:41 +00:00
Peter Grehan
e285ef8d28 Rename fbsdrun.* -> bhyverun.*
bhyve is intended to be a generic hypervisor, and not FreeBSD-specific.

(renaming internal routines will come later)

Reviewed by:	neel
Obtained from:	NetApp
2012-12-13 01:58:11 +00:00