bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two values are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also AND
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
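The set semantics described above can be sketched as a small userland model. This is not the kernel implementation: the type model_cpuset_t, the struct model_bus and the function model_get_cpus() are hypothetical names chosen for illustration, and a CPU set is reduced to a 64-bit mask.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t model_cpuset_t;

enum cpu_set_kind { LOCAL_CPUS, INTR_CPUS };

struct model_bus {
	int		numa_enabled;	/* models DEVICE_NUMA plus a valid _PXM */
	model_cpuset_t	nexus_intr;	/* global INTR_CPUS set from the nexus */
	model_cpuset_t	domain_cpus;	/* CPUs in the device's domain */
};

/* Returns 0 and fills *out, or EINVAL when the set is unavailable. */
int
model_get_cpus(const struct model_bus *bus, enum cpu_set_kind kind,
    model_cpuset_t *out)
{
	assert(bus != NULL && out != NULL);
	switch (kind) {
	case LOCAL_CPUS:
		if (!bus->numa_enabled)
			return (EINVAL);
		*out = bus->domain_cpus;
		return (0);
	case INTR_CPUS:
		/* AND the global set with the domain set when one exists. */
		*out = bus->numa_enabled ?
		    (bus->nexus_intr & bus->domain_cpus) : bus->nexus_intr;
		return (0);
	}
	return (EINVAL);
}
```

In this model INTR_CPUS always yields a usable set, while LOCAL_CPUS can fail, which is why a driver wanting per-CPU interrupts would ask for INTR_CPUS.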
Compared to r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
Reviewed by: wblock (manpage)
Differential Revision: https://reviews.freebsd.org/D5519
Simplify and unify placeholder type definitions.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5771
r286837, r286838, r288470, r288522, r288524, r288826,
r289001
Pull in bhyve bug fixes and changes to allow UEFI booting.
This provides Windows support.
Tested on Intel and AMD with:
- Arch Linux i386+amd64 (kernel 4.3.3)
- Ubuntu 15.10 server 64-bit
- FreeBSD-CURRENT/amd64 20160127 snap
- FreeBSD 10.2 i386+amd64
- OpenBSD 5.8 i386+amd64
- SmartOS latest
- Windows 10 build 1511
Huge thanks to Yamagi Burmeister who submitted the patch
and did the majority of the testing.
r284539 - bootrom mem allocation support
r284630 - Add SO_REUSEADDR when starting debug port
r284688 - Fix a regression in "movs" emulation
r284877 - verify_gla() non-zero segment base fix
r285217 - Always assert DCD and DSR in the uart
r285218 - devmem nodes moved to /dev/vmm.io/
r286837 - Add define for SATA Check-Power-Mode
r286838 - Add simple (no-op) SATA cmd emulations
r288470 - Increase virtio-blk indirect descs
r288522 - Firmware guest query interface
r288524 - Fix post-test typo
r288826 - Clean up SATA unimplemented cmd msg
r289001 - Add -l option to specify userboot path
Submitted by: Yamagi Burmeister
Approved by: re (kib)
Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.
One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.
A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.
The 'pcb_size' variable is used to index into the stoppcbs[] array.
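As a rough illustration of how a debugger might use 'pcb_size', the sketch below computes the address of one CPU's entry in stoppcbs[]. The function name stoppcb_address() and the parameter types are assumptions made for this example; only 'pcb_size' and stoppcbs[] come from the text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of the saved PCB for 'cpu'. 'stoppcbs_base' is the address of
 * stoppcbs[0] in the core and 'pcb_size' is the exported element size.
 */
uint64_t
stoppcb_address(uint64_t stoppcbs_base, uint32_t pcb_size, unsigned cpu)
{
	assert(pcb_size != 0);
	return (stoppcbs_base + (uint64_t)cpu * pcb_size);
}
```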
The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.
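The pointer heuristic above can be sketched as a single comparison. The function name looks_like_kernel_pointer() is hypothetical, and the real boundary value would be read from the kernel image or core rather than hard-coded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Cheap heuristic: anything at or above the end of user address space is
 * treated as a kernel pointer; small integers such as PIDs and TIDs are not.
 */
bool
looks_like_kernel_pointer(uint64_t value, uint64_t vm_maxuser_address)
{
	assert(vm_maxuser_address != 0);
	return (value >= vm_maxuser_address);
}
```

The check is deliberately imprecise: it only has to separate pointers from small scalar values, not validate that the address is mapped.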
While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved. Annotations for
other platforms will be added in the future.
While here, move the common bits of <machine/cputypes.h> to
<x86/cputypes.h> as well.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D4670
This variable was added to sys/x86/include/x86_var.h recently.
This unbreaks building kernel source that #includes both md_var.h and x86_var.h
with gcc 4.2.1 on amd64
Differential Revision: https://reviews.freebsd.org/D4686
Reviewed by: kib
X-MFC with: r291949
Sponsored by: EMC / Isilon Storage Division
new headers x86/include x86_var.h and x86_smp.h.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4358
the PG_G global pte flag, pmap_invalidate_all() fails to flush global
TLB entries [*]. This is because the TLB shootdown handler for such
configs reloads CR3, and on i386 pmap_invalidate_all() does the same
for the initiating CPU; a CR3 reload does not flush PG_G entries. Note
that the current code does not issue total invalidation requests for the
kernel_pmap.
Rename the amd64 function invltlb_globpcid() to invltlb_glob(), since it
has not been specific to PCID for quite some time, and implement the same
functionality for i386. Use the function instead of invltlb() in
shootdown handlers and in i386 pmap_invalidate_all(), but only for the
kernel pmap (which maps pages with the PG_G attribute set), which
takes care of PG_G TLB entries on flush.
To detect the affected pmap in i386 TLB shootdown handler, pmap should
be passed to the smp_masked_invltlb() function, which makes amd64 and
i386 TLB shootdown code almost identical. Merge the code under x86/.
Noted by: jhb [*]
Reviewed by: cem, jhb, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4346
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.
One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.
A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.
The 'pcb_size' variable is used to index into the stoppcbs[] array.
The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.
While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved. Annotations for
other platforms will be added in the future.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3773
amd64 and i386 platform code contain very similar xen/xen-os.h headers.
The only differences are:
- Functions/variables/types which were unused in i386/xen/xen-os.h:
* xen_xchg
* __xchg_dummy
* __xg
* __xchg
* atomic_t
* atomic_inc
* rdtscll
The functions/variables/types unused in xen-os.h can be dropped, after which
there are no more differences between amd64 and i386.
The new header is placed in x86/include/xen and each platform will have
dummy headers that include x86/xen/*.h. This makes it possible to include
machine/xen/*.h in the PV drivers.
Submitted by: Julien Grall <julien.grall@citrix.com>
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D3880
Sponsored by: Citrix Systems R&D
The current Xen console driver crashes very quickly when used on
an ARM guest. This is because the console lock is recursive, which may
lead to recursion on the tty lock and/or corrupt the ring pointer.
Furthermore, the console lock is not always taken where it should be and has
to be released too early because of the way the console has been designed.
Over the years, code has been modified to support various new features but
the driver has not been reworked.
This new driver has been rewritten with the idea of having only a small set
of specific functions to write either via the shared ring or the hypercall
interface.
Note that HVM support has been left aside for now because it requires
additional features which are not yet supported. A follow-up patch will be
sent with HVM guest support.
List of items that may be good to have but are not mandatory:
- Avoid flushing for each character written when using the tty
- Support multiple consoles
Submitted by: Julien Grall <julien.grall@citrix.com>
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D3698
Sponsored by: Citrix Systems R&D
Pull the latest headers for Xen which allow us to add support for ARM and
use new features in FreeBSD.
This is a verbatim copy of xen/include/public, so every header which
no longer exists in the Xen repositories has been dropped.
Note that the interface version hasn't been bumped; that will be done in a
follow-up. However, some fixes in the code are required to get it to compile:
- sys/xen/xen_intr.h: evtchn_port_t is already defined in the headers so
drop it.
- {amd64,i386}/include/intr_machdep.h: NR_EVENT_CHANNELS now depends on
xen/interface/event_channel.h, so include it.
- {amd64,i386}/{amd64,i386}/support.S: It's not necessary to include
machine/intr_machdep.h. This also fixes compilation with the
new headers.
- dev/xen/blkfront/blkfront.c: The typedef for blkif_request_segment has
been dropped, so use struct blkif_request_segment directly.
Finally, modify xen/interface/xen-compat.h to throw a preprocessing error if
__XEN_INTERFACE_VERSION__ is not set. This allows us to catch any file
where xen/xen-os.h is not correctly included.
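The guard described above follows a common preprocessor pattern, sketched here. The version value and the function xen_interface_version() are hypothetical and exist only so the fragment compiles on its own; the actual wording of the check in xen-compat.h may differ.

```c
#include <assert.h>

/*
 * Normally supplied by the including header (xen/xen-os.h per the text).
 * The value below is a placeholder for illustration only.
 */
#define __XEN_INTERFACE_VERSION__ 0x00040600

/* Fail the build early if a file forgot to set the interface version. */
#ifndef __XEN_INTERFACE_VERSION__
#error "__XEN_INTERFACE_VERSION__ must be defined"
#endif

static_assert(__XEN_INTERFACE_VERSION__ > 0, "version must be non-zero");

unsigned long
xen_interface_version(void)
{
	return (__XEN_INTERFACE_VERSION__);
}
```

Removing the #define line makes the translation unit stop at the #error, which is the behaviour the commit relies on to find incorrect includes.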
Submitted by: Julien Grall <julien.grall@citrix.com>
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D3805
Sponsored by: Citrix Systems R&D
use vtophys() directly instead of vtomach() and retire the no-longer-used
headers <machine/xenfunc.h> and <machine/xenvar.h>.
Reported by: bde (stale bits in <machine/xenfunc.h>)
Reviewed by: royger (earlier version)
Differential Revision: https://reviews.freebsd.org/D3266
reported, on APs. We already did this on BSP.
Otherwise, userspace software which depends on the features
reported by the high CPUID levels misbehaves. In particular, AVX
detection is non-functional, depending on which CPU thread happens to
execute when doing CPUID. Another victim is the libthr signal
handlers interposer, which needs to save full FPU extended state.
Reported and tested by: Andre Meiser <ortadur@web.de>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
ordering semantic of x86 CPUs makes only the compiler barrier
necessary to give the acquire behaviour.
The existing implementation ensured sequentially consistent semantics for
load_acq, a much stronger guarantee than required by the standard's
definition of load acquire. Consumers which depend on the barrier
are believed to have been identified and already fixed to use the proper
operations.
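For readers less familiar with acquire semantics, here is a minimal userland sketch using C11 <stdatomic.h> rather than the kernel's atomic(9) primitives. The struct mailbox and both functions are hypothetical names. The acquire load only guarantees that later accesses are not reordered before it, which is the weaker contract discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mailbox {
	int		payload;	/* plain data guarded by 'ready' */
	atomic_int	ready;		/* 0 = empty, 1 = payload valid */
};

/* Producer: write the data, then publish it with a release store. */
void
mailbox_put(struct mailbox *mb, int value)
{
	assert(mb != NULL);
	mb->payload = value;
	atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/* Consumer: returns 1 and fills *out if a value was published, else 0. */
int
mailbox_try_get(struct mailbox *mb, int *out)
{
	assert(mb != NULL && out != NULL);
	if (atomic_load_explicit(&mb->ready, memory_order_acquire) == 0)
		return (0);
	*out = mb->payload;	/* cannot be reordered before the acquire load */
	return (1);
}
```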
Noted by: alc (long time ago)
Reviewed by: alc, bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Make the creation of the free lists dynamic, i.e., it is based on the
available physical memory at boot time. For amd64 systems with 64 GB
or more of physical memory, create free lists for managing pages with
physical addresses below 4 GB.
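The policy described above can be modelled roughly as follows. The names want_dma32_freelist(), freelist_for_page() and the enum are hypothetical; the sketch only captures the two thresholds stated in the text (64 GB of memory, 4 GB address boundary) and not the kernel's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB	(UINT64_C(1) << 30)

enum freelist { FREELIST_DEFAULT, FREELIST_DMA32 };

/* A separate below-4 GB list is only created on large-memory systems. */
bool
want_dma32_freelist(uint64_t physmem_bytes)
{
	return (physmem_bytes >= 64 * GiB);
}

/* Pick the free list for a page, given the boot-time memory size. */
enum freelist
freelist_for_page(uint64_t paddr, uint64_t physmem_bytes)
{
	assert(physmem_bytes > 0);
	if (want_dma32_freelist(physmem_bytes) && paddr < 4 * GiB)
		return (FREELIST_DMA32);
	return (FREELIST_DEFAULT);
}
```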
PR: 185727
Requested by: alc
Approved by: re (gjb)
provide a semantic defined by the C11 fences with corresponding
memory_order.
atomic_thread_fence_acq() gives r | r, w, where r and w are read and
write accesses, and | denotes the fence itself.
atomic_thread_fence_rel() is r, w | w.
atomic_thread_fence_acq_rel() is the combination of the acquire and
release in a single operation. Note that reads after the acq+rel fence
could be made visible before writes preceding the fence.
atomic_thread_fence_seq_cst() orders all accesses before/after the
fence, and the fence itself is globally ordered against other
sequentially consistent atomic operations.
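Since the fences are defined in terms of C11, a userland sketch with <stdatomic.h> may help. The wrapper names fence_acq() and friends, and the fence_publish()/fence_consume() pair, are hypothetical stand-ins and not the kernel functions themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins mapping each fence to its C11 memory_order. */
void fence_acq(void)     { atomic_thread_fence(memory_order_acquire); }
void fence_rel(void)     { atomic_thread_fence(memory_order_release); }
void fence_acq_rel(void) { atomic_thread_fence(memory_order_acq_rel); }
void fence_seq_cst(void) { atomic_thread_fence(memory_order_seq_cst); }

static int		data;
static atomic_int	flag;

void
fence_publish(int value)
{
	data = value;
	fence_rel();		/* r, w | w: earlier accesses before the store */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int
fence_consume(int *out)
{
	assert(out != NULL);
	if (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
		return (0);
	fence_acq();		/* r | r, w: the load before later accesses */
	*out = data;
	return (1);
}
```

Here the relaxed accesses to 'flag' carry no ordering themselves; the two fences supply it.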
Reviewed by: alc
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
macros on amd64 and i386. Move the definition to machine/param.h.
kgdb defines INKERNEL() too; the conflict is resolved by renaming the kgdb
version to PINKERNEL().
On i386, correct the lowest kernel address. After the shared page was
introduced, USRSTACK no longer points to the last user address + 1 [*]
Submitted by: Oliver Pinter [*]
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
atomic_load_acq(9), on its source, for x86.
Right now, atomic_load_acq() on x86 is sequentially consistent with
other atomics; the code ensures this with a store/load barrier,
performing a locked nop on the source. Provide a separate primitive,
__storeload_barrier(), which is implemented as a locked nop done on
a cpu-private variable, and put __storeload_barrier() before the load, to
keep the seq_cst semantic but avoid introducing a false dependency on the
no-modification of the source for its later use.
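The idea can be approximated in portable C11, with heavy caveats. The real primitive is x86 inline assembly on a per-CPU variable; the sketch below substitutes a thread-local variable and an atomic read-modify-write, and all names in it are hypothetical. On x86 such an operation typically compiles to a LOCK-prefixed instruction or an equivalent full barrier, but the C11 abstract machine does not promise fence semantics for it, so portable code would use atomic_thread_fence(memory_order_seq_cst) instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Private to each thread, so the atomic operation never contends. */
static _Thread_local atomic_int private_dummy;

/* Approximation only: an atomic RMW that adds zero to private data. */
void
storeload_barrier_sketch(void)
{
	(void)atomic_fetch_add_explicit(&private_dummy, 0,
	    memory_order_seq_cst);
}

/* Barrier first, then a plain acquire load that never writes to *src. */
int
load_acq_sketch(atomic_int *src)
{
	assert(src != NULL);
	storeload_barrier_sketch();
	return (atomic_load_explicit(src, memory_order_acquire));
}
```

The point of the design is visible in load_acq_sketch(): the source location is only read, so it can live in read-only or shared memory without the load itself dirtying the cache line.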
Note that seq_cst property of x86 atomic_load_acq() is not documented
and not carried by atomics implementations on other architectures,
although some kernel code relies on the behaviour. This commit does
not intend to change this.
Reviewed by: alc
Discussed with: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Emulate the 'bit test' instruction.
MFC r282259:
Re-implement RTC current time calculation to eliminate the possibility of
losing time.
MFC r282281:
Advertise the MTRR feature via CPUID and emulate the minimal set of MTRR MSRs.
MFC r282284:
When an instruction cannot be decoded just return to userspace so bhyve(8)
can dump the instruction bytes.
MFC r282287:
Don't require <sys/cpuset.h> to be always included before <machine/vmm.h>.
MFC r282296:
Emulate MSR_SYSCFG which is accessed by Linux on AMD cpus when MTRRs are
enabled.
MFC r282301:
Relax limits when transitioning a vector from the IRR to the ISR and also
when extinguishing it from the ISR in response to an EOI.
MFC r282335:
Advertise an additional memory BAR in the "dummy" device emulation.
MFC r282336:
Emulate machine check related MSRs to allow guest OSes like Windows to boot.
MFC r282351:
Don't advertise the Intel SMX capability to the guest.
MFC r282407:
Emulate the 'CMP r/m8, imm8' instruction.
MFC r282519:
Add macros for AMD-specific bits in MSR_EFER: LMSLE, FFXSR and TCE.
MFC r282520:
Emulate guest writes to EFER_MSR properly.
MFC r282558:
Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup().
MFC r282571:
Check 'td_owepreempt' and yield the vcpu thread if it is set.
MFC r282595:
Allow byte reads of AHCI registers.
MFC r282784:
Handling indirect descriptors is a capability of the host and not one that
needs to be negotiated. Use the host capabilities field and not the negotiated
field when verifying that indirect descriptors are supported.
MFC r282788:
Allow configuration of the sector size advertised to the guest.
MFC r282865:
Set the subvendor field in config space to the vendor ID. This is required
by the Windows virtio drivers to correctly match a device.
MFC r282922:
Bump the size of the blockif scatter-gather list to 67.
MFC r283075:
Fix off-by-one in array index bounds check. bhyveload would allow you to
create 33 entries in an array that only has 32 slots.
MFC r283168:
Temporarily revert r282922 which bumped the max descriptors.
MFC r283255:
Emulate the "CMP r/m, reg" instruction (opcode 39H).
MFC r283256:
Add an option "--get-vmcs-exit-inst-length" to display the instruction length
of the instruction that caused the VM-exit.
MFC r283264:
Change the header type of the emulated host-bridge from type 1 to type 0.
MFC r283293:
Don't rely on the 'VM-exit instruction length' field in the VMCS to always
have an accurate length on an EPT violation.
MFC r283299:
Remove bogus verification of instruction length after instruction decode.
MFC r283308:
Exceptions don't deliver an error code in real mode.
MFC r283657:
Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state.
MFC r283973:
Use tunable 'hw.vmm.svm.features' to disable specific SVM features even
though they might be available in hardware. Use tunable 'hw.vmm.svm.num_asids'
to limit the number of ASIDs used by the hypervisor.
MFC r284046:
Fix regression in 'verify_gla()' with the RIP-relative addressing mode.
MFC r284174:
Support guest writes to the TSC by enabling the "use TSC offsetting"
execution control.
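The off-by-one fix from r283075 in the list above (33 entries accepted into a 32-slot array) is a classic bounds-check mistake. The sketch below is a generic illustration and not the bhyveload code; struct table, NSLOTS and table_add() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS	32

struct table {
	int	slot[NSLOTS];
	size_t	used;
};

/* Returns 0 on success, -1 when the table is already full. */
int
table_add(struct table *t, int value)
{
	assert(t != NULL);
	/* '>=' is the correct test; '>' would accept a 33rd entry. */
	if (t->used >= NSLOTS)
		return (-1);
	t->slot[t->used++] = value;
	return (0);
}
```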
Allow passthrough devices to be hinted.
MFC r279683:
When ICW1 is issued the edge sense circuit is reset, which means that
following an initialization a low-to-high transition is necessary to
generate an interrupt.
MFC r279925:
Add -p parameter to list PCI device to pass through to the guest.
MFC r281559:
Fix handling of BUS_PROBE_NOWILDCARD in 'device_probe_child()'.
MFC r280447:
When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.
MFC r280725:
Move legacy interrupt allocation for virtio devices to common code.
MFC r280775:
Fix the RTC device model to operate correctly in 12-hour mode.
MFC r280929:
Fix "MOVS" instruction memory to MMIO emulation.
MFC r280968:
Display instruction bytes and %rip prior to aborting due to an instruction
emulation error.
MFC r281145:
Enhance the support for Group 1 Extended opcodes for CMP, AND, OR instructions.
MFC r281542:
Initialize 'error' before use (Coverity IDs 1249748, 1249747, 1249751, 1249749)
MFC r281561:
Prior to aborting due to an ioport error, it is always interesting to see what
the guest's %rip is.
MFC r281611:
If the number of guest vcpus is less than '1' then flag it as an error.
MFC r281612:
Prefer 'vcpu_should_yield()' over checking 'curthread->td_flags' directly.
MFC r281630:
Relax the check on which vectors can be delivered through the APIC. According
to the Intel SDM vectors 16 through 255 are allowed to be delivered via the
local APIC.
MFC r281879:
Missing break in switch case (Coverity ID 1292499)
MFC r281946:
Don't allow guest to modify readonly bits in the PCI config 'status' register.
MFC r281987:
STOS/STOSB/STOSW/STOSD/STOSQ instruction emulation.
MFC r282206:
Implement the century byte in the RTC.
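The relaxed vector check from r281630 in the list above reduces to a range test. The sketch below is illustrative only; the macro and function names are hypothetical, and the range 16 through 255 is taken directly from the commit text.

```c
#include <assert.h>
#include <stdbool.h>

#define LAPIC_VECTOR_MIN	16
#define LAPIC_VECTOR_MAX	255

static_assert(LAPIC_VECTOR_MIN < LAPIC_VECTOR_MAX, "range must be non-empty");

/* True if 'vector' may be delivered through the local APIC. */
bool
lapic_vector_allowed(int vector)
{
	return (vector >= LAPIC_VECTOR_MIN && vector <= LAPIC_VECTOR_MAX);
}
```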
Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
MFC r276432:
Initialize all fields of 'struct vm_exception exception' before passing it
to vm_inject_exception().
MFC r276763:
Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception.
MFC r277149:
Clean up usage of 'struct vm_exception' to only communicate information
from userspace to vmm.ko when injecting an exception.
MFC r277168:
Fix typo (missing comma).
MFC r277309:
Make the error message explicit instead of just printing the usage if the
virtual machine name is not specified.
MFC r277310:
Simplify instruction restart logic in bhyve.
MFC r277359:
Fix a bug in libvmmapi 'vm_copy_setup()' where it would return success even
if the 'gpa' was in the guest MMIO region.
MFC r277360:
MOVS instruction emulation.
MFC r277626:
Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.
MFC r279220:
Don't close a block context if it couldn't be opened avoiding a null deref.
MFC r279225:
Add "-u" option to bhyve(8) to indicate that the RTC should maintain UTC time.
MFC r279227:
Emulate MSR 0xC0011024 when running on AMD processors.
MFC r279228:
Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore
capability of VT-x. This lets bhyve run nested in older VMware versions that
don't support the PAT save/restore capability.
MFC r279540:
Fix warnings/errors when building vmm.ko with gcc.
Revert MFC of r270223, which bumped MAXCPU on amd64 from 64 to 256.
The cpuset_getaffinity(2) and cpuset_setaffinity(2) check minimum set
size, which now fails for binaries compiled on 10.0 with MAXCPU == 64.
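The failure mode can be modelled in a few lines. This is a simplified sketch of a minimum-size check, not the kernel code; cpuset_bytes() and check_user_setsize() are hypothetical names, and ERANGE is used because cpuset_getaffinity(2) documents it for a set size smaller than the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Bytes in a CPU set sized for 'maxcpu' CPUs, rounded up to whole longs. */
size_t
cpuset_bytes(unsigned maxcpu)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;

	assert(maxcpu > 0);
	return (((maxcpu + bits_per_long - 1) / bits_per_long) * sizeof(long));
}

/* Model of the minimum-size check: 0 if accepted, ERANGE if too small. */
int
check_user_setsize(size_t user_bytes, unsigned kernel_maxcpu)
{
	return (user_bytes < cpuset_bytes(kernel_maxcpu) ? ERANGE : 0);
}
```

A binary built with MAXCPU == 64 passes a set sized for 64 CPUs; against a kernel sized for 256 CPUs the model rejects it, which matches the breakage that motivated the revert.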
Submitted by: jhb
PR: 200802
devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.
devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.
Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).
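A userland consumer could map such a device node with ordinary open(2) and mmap(2) calls. The helper below is a generic sketch: map_readonly() is a hypothetical name, the caller is assumed to know the segment length, and the path (for example /dev/vmm/testvm.bootrom from the text) is passed in rather than hard-coded.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map 'len' bytes of the file or device node at 'path' read-only.
 * Returns the mapping, or NULL on failure.
 */
void *
map_readonly(const char *path, size_t len)
{
	void *p;
	int fd;

	assert(path != NULL && len > 0);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (NULL);
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the close */
	return (p == MAP_FAILED ? NULL : p);
}
```

Because the host mapping goes through the device node, it is unaffected when the guest moves the segment within its own address space.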
Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D2762
MFC after: 4 weeks