the vcpu should be kicked to process a pending interrupt. This will be useful
in the implementation of the Posted Interrupt APICv feature.
Change the return value of 'vlapic_pending_intr()' to indicate whether or not
an interrupt is available to be delivered to the vcpu depending on the value
of the PPR.
Add KTR tracepoints to debug guest IPI delivery.
'vmx_vminit()' that does customization.
This makes it easier to turn on optional features (e.g. APICv) without
having to keep adding new parameters to 'vmcs_set_defaults()'.
Reviewed by: grehan@
guest disables the HPET.
The HPET timer interrupt is triggered from the callout handler associated with
the timer. It is possible for the callout handler to be delayed before it gets
a chance to execute. If the guest disables the HPET during this window then the
handler never gets a chance to execute and the timer interrupt is lost.
This is now fixed by injecting a timer interrupt into the guest if the callout
time is detected to be in the past when the HPET is disabled.
times [1]. Assert that the pmap passed to pmap_remove_pages() is only
active on current CPU.
Submitted by: alc [1]
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used
to enter guest context and vmx_exit_guest() is used to transition back into
host context.
Fix a longstanding race where a vcpu interrupt notification might be ignored
if it happens after vmx_inject_interrupts() but before host interrupts are
disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with
host interrupts disabled to prevent this.
Suggested by: grehan@
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.
This also implies that we need to keep a snapshot of the last value written
to a LVT register. We can no longer rely on the LVT registers in the APIC
page to be "clean" because the guest can write anything to it before the
hypervisor has had a chance to sanitize it.
registers.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.
We can no longer rely on the value of 'icr_timer' on the APIC page
in the callout handler. With APIC register virtualization the value of
'icr_timer' will be updated by the processor in guest-context before an
APIC-write VM-exit.
Clear the 'delivery status' bit in the ICRLO register in the write handler.
With APIC register virtualization the write happens in guest-context and
we cannot prevent a (buggy) guest from setting this bit.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.
Additionally, mask all the LVT entries when the vlapic is software-disabled.
The handlers are now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.
Additionally, we need to ensure that the value of these registers is always
correctly reflected in the virtual APIC page, because there is no VM exit
when the guest reads these registers with APIC register virtualization.
emulation.
The vlapic initialization and cleanup is done via processor specific vmm_ops.
This will allow the VT-x/SVM modules to layer any hardware-assist for APIC
emulation or virtual interrupt delivery on top of the vlapic device model.
Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic
interrupts versus other events (e.g. NMI). This provides an opportunity to
use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM)
to deliver an interrupt to a guest without causing a VM-exit.
Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the
vlapic_xxx() counterparts directly.
Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'.
The 'Apic Page' is intended to be referenced from the Intel VMCS as the
'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
- Add a generic routine to trigger an LVT interrupt that supports both
fixed and NMI delivery modes.
- Add an ioctl and bhyvectl command to trigger local interrupts inside a
guest. In particular, a global NMI similar to that raised by SERR# or
PERR# can be simulated by asserting LINT1 on all vCPUs.
- Extend the LVT table in the vCPU local APIC to support CMCI.
- Flesh out the local APIC error reporting a bit to cache errors and
report them via ESR when ESR is written to. Add support for asserting
the error LVT when an error occurs. Raise illegal vector errors when
attempting to signal an invalid vector for an interrupt or when sending
an IPI.
- Ignore writes to reserved bits in LVT entries.
- Export table entries the MADT and MP Table advertising the stock x86
config of LINT0 set to ExtInt and LINT1 wired to NMI.
Reviewed by: neel (earlier version)
state before the requested state transition. This guarantees that there is
exactly one ioctl() operating on a vcpu at any point in time and prevents
unintended state transitions.
More details available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html
Reviewed by: grehan
Reported by: Markiyan Kushnir (markiyan.kushnir at gmail.com)
MFC after: 3 days
The least significant 8 bits of 'pm_flags' are now used for the IPI vector
to use for nested page table TLB shootdown.
Previously we used IPI_AST to interrupt the host cpu which is functionally
correct but could lead to misleading interrupt counts for AST handler. The
AST handler was also doing a lot more than what is required for the nested
page table TLB shootdown (EOI and IRET).
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by: grehan@
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.
Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.
Reviewed by: grehan@
This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is
now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'.
Discussed with: grehan@
Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.
With this change "halt" from a Linux guest works correctly.
Reviewed by: grehan@
Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.
Reviewed by: grehan@
shifts into the sign bit. Instead use (1U << 31) which gets the
expected result.
This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.
A similar change was made in OpenBSD.
Discussed with: -arch, rdivacky
Reviewed by: cperciva
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.
MFC after: 3 days
commit level triggered interrupts would work as long as the pin was not shared
among multiple interrupt sources.
The vlapic now keeps track of level triggered interrupts in the trigger mode
register and will forward the EOI for a level triggered interrupt to the
vioapic. The vioapic in turn uses the EOI to sample the level on the pin and
re-inject the vector if the pin is still asserted.
The vhpet is the first consumer of level triggered interrupts and advertises
that it can generate interrupts on pins 20 through 23 of the vioapic.
Discussed with: grehan@
compilation results in inclusion of the header, a confict arises due
to savefpu being union for i386, but used as struct in the pcb
definition. The 32bit code should not need amd64 variant of the
struct pcb anyway.
For struct region_descriptor, use __uint64_t instead of unsigned long,
as the base type for bit-fields. Unsigned long cannot have width 64
for -m32.
The changes allowed to use sys/sysctl.h for cc -m32.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
bhyve supports a single timer block with 8 timers. The timers are all 32-bit
and capable of being operated in periodic mode. All timers support interrupt
delivery using MSI. Timers 0 and 1 also support legacy interrupt routing.
At the moment the timers are not connected to any ioapic pins but that will
be addressed in a subsequent commit.
This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
to inject edge triggered legacy interrupts into the guest.
Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Extracted from the projects/uefi branch, this change is a reasonable
cleanup and will reduce the diffs to review when bringing in the
UEFI work.
Reviewed by: kib@
Sponsored by: The FreeBSD Foundation
The page presence memory test takes a long time on large memory systems
and has little value on contemporary amd64 hardware.
Sponsored by: The FreeBSD Foundation
sys/i386/i386/machdep.c:
sys/amd64/amd64/machdep.c:
The value reported by FreeBSD as "real memory" when booting
doesn't match what is later reported by sysctl as hw.realmem.
This is due to the fact that the value printed during the
boot process is fetched from smbios data (when possible),
and accounts for holes in physical memory. On the other
hand, the value of hw.realmem is unconditionally set to be
one larger than the highest page of the physical address
space.
Fix this by setting hw.realmem to the same value printed
during boot, this makes hw.realmem honour it's name and
account properly for physical memory present in the system.
Submitted by: Roger Pau Monné
Reviewed by: gibbs
Debuggers may need to change PSL_RF. Note that tf_eflags is already stored
in the signal context during signal handling and PSL_RF previously could be
modified via sigreturn, so this change should not provide any new ability
to userspace.
For background see the thread at:
http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html
Reviewed by: jhb, kib
Sponsored by: DARPA, AFRL
upcoming in-kernel device emulations like the HPET.
The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.
Discussed with: grehan@
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf.
Partially based on the patch submitted by: Dmitry Luhtionov <dmitryluhtionov@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
words, every architecture is now auto-sizing the kmem arena. This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.
Replace or eliminate all existing definitions of VM_KMEM_SIZE. With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for
clarity.
Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size. Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.
Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena. Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.
Reviewed by: kib (an earlier version)
Sponsored by: EMC / Isilon Storage Division
in the kernel. This abstraction was redundant because the only device emulated
inside vmm.ko is the local apic and it is always at a fixed guest physical
address.
Discussed with: grehan
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).
Discussed with: rpaulo
1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture
Specification. The Extended Context and PASIDs from the rev. 2.2 are
not supported, but I am not aware of any released hardware which
implements them. Code does not use queued invalidation, see comments
for the reason, and does not provide interrupt remapping services.
Code implements the management of the guest address space per domain
and allows to establish and tear down arbitrary mappings, but not
partial unmapping. The superpages are created as needed, but not
promoted. Faults are recorded, fault records could be obtained
programmatically, and printed on the console.
Implement the busdma(9) using DMARs. This busdma backend avoids
bouncing and provides security against misbehaving hardware and driver
bad programming, preventing leaks and corruption of the memory by wild
DMA accesses.
By default, the implementation is compiled into amd64 GENERIC kernel
but disabled; to enable, set hw.dmar.enable=1 loader tunable. Code is
written to work on i386, but testing there was low priority, and
driver is not enabled in GENERIC. Even with the DMAR turned on,
individual devices could be directed to use the bounce busdma with the
hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable. If
DMARs are capable of the pass-through translations, it is used,
otherwise, an identity-mapping page table is constructed.
The driver was tested on Xeon 5400/5500 chipset legacy machine,
Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4),
ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices. It also
works with em(4) and igb(4), but there some fixes are needed for
drivers, which are not committed yet. Intel GPUs do not work with
DMAR (yet).
Many thanks to John Baldwin, who explained me the newbus integration;
Peter Holm, who did all testing and helped me to discover and
understand several incredible bugs; and to Jim Harris for the access
to the EDS and BWG and for listening when I have to explain my
findings to somebody.
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
In report_progress(), use nitems(progress_track) instead of manually
hard-coding array size. Wrap long line.
In blk_write(), code verifies that ptr and pa cannot be non-zero
simultaneously. The later check for the page-alignment of the ptr
argument never triggers due to pa != 0 always implying ptr == NULL. I
believe that the intent was to ensure that physicall address passed is
page-aligned, since the address is (temporary) mapped for the duration
of the page write.
Clear the progress_track.visited fields when starting minidump. If
minidump is restarted or taken second time during the system lifetime,
progress is not printed otherwise, making operator suspectible to the
dump status.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
'invpcid' instruction to the guest. Currently bhyve will try to enable this
capability unconditionally if it is available.
Consolidate code in bhyve to set the capabilities so it is no longer
duplicated in BSP and AP bringup.
Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid'
instruction is available.
Reviewed by: grehan
MFC after: 3 days
the 'vmmdev_mtx' in vmmdev_rw().
Rely on the 'si_threadcount' accounting to ensure that we never destroy the
VM device node while it has operations in progress (e.g. ioctl, mmap etc).
Reported by: grehan
Reviewed by: grehan
field. Perform vcpu enumeration for Xen PV and HVM environments
and convert all Xen drivers to use vcpu_id instead of a hard coded
assumption of the mapping algorithm (acpi or apic ID) in use.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
amd64/include/pcpu.h:
i386/include/pcpu.h:
Add vcpu_id to the amd64 and i386 pcpu structures.
dev/xen/timer/timer.c
x86/xen/xen_intr.c
Use new vcpu_id instead of assuming acpi_id == vcpu_id.
i386/xen/mp_machdep.c:
i386/xen/mptable.c
x86/xen/hvm.c:
Perform Xen HVM and Xen full PV vcpu_id mapping.
x86/xen/hvm.c:
x86/acpica/madt.c
Change SYSINIT ordering of acpi CPU enumeration so that it
is guaranteed to be available at the time of Xen HVM vcpu
id mapping.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.
Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.
pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.
The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.
Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.
An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
Bit Position Interpreted By
PG_V 52 software (accessed bit emulation handler)
PG_RW 53 software (dirty bit emulation handler)
PG_A 0 hardware (aka EPT_PG_RD)
PG_M 1 hardware (aka EPT_PG_WR)
The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).
The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.
TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.
Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.
PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.
Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.
Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.
Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.
Special thanks to Peter Holm for testing the patch on short notice.
Approved by: re
Discussed with: grehan
Reviewed by: alc, kib
Tested by: pho
page, otherwise the small mappings loop would use uninitialized value.
Note that currently pmap_clear_modify() is not called for fictitious
pages.
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)
registers, to make the restarted syscall instruction pass the correct
arguments.
PR: kern/182161
Reported by: Russ Cox <rsc@swtch.com>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Approved by: re (marius)
semblance of API stability and growth during the 10.* timeframe.
Userland/kernel bhyve will have to be recompiled after this.
Reviewed by: neel
Approved by: re@ (blanket)
amd64 and i386.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
- Introduce two new CPU hooks for initialization and resume
purposes. This allows us to get rid of the XENHVM ifdefs in
mp_machdep, and also sets some hooks into common code that can be
used by other hypervisor implementations.
sys/amd64/conf/XENHVM:
sys/i386/conf/XENHVM:
- Remove these configs now that GENERIC has builtin support for Xen
HVM.
sys/kern/subr_smp.c:
- Make sure there are no pending IPIs when suspending a system.
sys/x86/xen/hvm.c:
- Add cpu init and resume vectors that are called from mp_machdep
using the new hooks.
- Only clear the vcpu_info mapping data on resume. It is already
clear for the BSP on a cold boot and is set correctly as APs
are started.
- Gate xen_hvm_init_cpu only to systems running under Xen.
sys/x86/xen/xen_intr.c:
- Gate the setup of event channels only to systems running under Xen.
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
default pmap initialization routine 'pmap_pinit()'.
These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.
Reviewed by: kib@, alc@
Approved by: re (glebius)
Xen PVHVM guest.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
- Make sure that are no MMU related IPIs pending on migration.
- Reset pending IPI_BITMAP on resume.
- Init vcpu_info on resume.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
- Add a "suspend_cancelled" parameter to pic_resume(). For the
Xen PIC, restoration of interrupt services differs between
the aborted suspend and normal resume cases, so we must provide
this information.
sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
- Don't swap out "suspend safe" timers across a suspend/resume
cycle. This includes the Xen PV and ACPI timers.
sys/dev/xen/control/control.c:
- Perform proper suspend/resume process for PVHVM:
- Suspend all APs before going into suspension, this allows us
to reset the vcpu_info on resume for each AP.
- Reset shared info page and callback on resume.
sys/dev/xen/timer/timer.c:
- Implement suspend/resume support for the PV timer. Since FreeBSD
doesn't perform a per-cpu resume of the timer, we need to call
smp_rendezvous in order to correctly resume the timer on each CPU.
sys/dev/xen/xenpci/xenpci.c:
- Don't reset the PCI interrupt on each suspend/resume.
sys/kern/subr_smp.c:
- When suspending a PVHVM domain make sure there are no MMU IPIs
in-flight, or we will get a lockup on resume due to the fact that
pending event channels are not carried over on migration.
- Implement a generic version of restart_cpus that can be used by
suspended and stopped cpus.
sys/x86/xen/hvm.c:
- Implement resume support for the hypercall page and shared info.
- Clear vcpu_info so it can be reset by APs when resuming from
suspension.
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
- Support UP kernel configurations.
sys/x86/xen/xen_intr.c:
- Properly rebind per-cpus VIRQs and IPIs on resume.
pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8. Now, that call no
longer exists.
Approved by: re (kib)
Sponsored by: EMC / Isilon Storage Division
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.
Reported by: Daniel Peyrolon
Tested by: Daniel Peyrolon
Approved by: re (marius)
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.
Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.
Approved by: re (delphij)
Sponsored by: Google Summer of Code
Submitted by: Yuri Victorovich <yuri at rawbw dot com>
Tested by: Yuri Victorovich <yuri at rawbw dot com>
immediate operand. The presence of an SIB byte in decoding the ModR/M field
would cause 'imm_bytes' to not be set to the correct value.
Fix this by initializing 'imm_bytes' independent of the ModR/M decoding.
Reported by: grehan@
Approved by: re@
not cover entire superpage, avoid copying. Doing partial copy would
require demotion, which is incompatible with the already held locks.
Reported by: cperciva
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (delphij)
the maximum number of VT-d domains (256 on a Sandybridge). We now allocate a
VT-d domain for a guest only if the administrator has explicitly configured
one or more PCI passthru device(s).
If there are no PCI passthru devices configured (the common case) then the
number of virtual machines is no longer limited by the maximum number of
VT-d domains.
Reviewed by: grehan@
Approved by: re@
This should be sufficient for 10.0 and will do
until forthcoming work to avoid limitations
in this area is complete.
Thanks to Bela Lubkin at tidalscale for the
headsup on the apic/cpu id/io apic ASL parameters
that are actually hex values and broke when
written as decimal when 11 vCPUs were configured.
Approved by: re@
amount of free memory was close to the point at which we would begin
reclaiming pages. Now, we continuously scan the active page queue,
regardless of the amount of free memory. Consequently, we are continuously
calling pmap_ts_referenced() on active pages.
Prior to this change, pmap_ts_referenced() would always demote superpage
mappings in order to obtain finer-grained reference information. This made
sense because we were coming under memory pressure and would soon have to
begin reclaiming pages. Now, however, with continuous scanning of the
active page queue, these demotions are taking a toll on performance. For
example, on one of my test machines, the running time for the HPCC Random
Access benchmark (also known as GUPS) has increased by 54%. To address this
problem, I have replaced the demotion with a heuristic for periodically
clearing the reference flag on superpage mappings.
Reviewed by: kib
Approved by: re (glebius)
Sponsored by: EMC / Isilon Storage Division
IPI implmementations.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Submitted by: gibbs (misc cleanup, table driven config)
Reviewed by: gibbs
MFC after: 2 weeks
sys/amd64/include/cpufunc.h:
sys/amd64/amd64/pmap.c:
Move invltlb_globpcid() into cpufunc.h so that it can be
used by the Xen HVM version of tlb shootdown IPI handlers.
sys/x86/xen/xen_intr.c:
sys/xen/xen_intr.h:
Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(),
and remove the ipi vector parameter. This api allocates
an event channel port that can be used for ipi services,
but knows nothing of the actual ipi for which that port
will be used. Removing the unused argument and cleaning
up the comments surrounding its declaration helps clarify
its actual role.
sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
Implement a generic framework for amd64 and i386 that allows
the implementation of certain CPU management functions to
be selected at runtime. Currently this is only used for
the ipi send function, which we optimize for Xen when running
on a Xen hypervisor, but can easily be expanded to support
more operations.
sys/x86/xen/hvm.c:
Implement Xen PV IPI handlers and operations, replacing native
send IPI.
sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
sys/i386/include/smp.h:
Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS
is defined already for us in the xen interface files.
NR_IPIS is only needed in one file per Xen platform and is
easily inferred by the IPI vector table that is defined in
those files.
sys/i386/xen/mp_machdep.c:
Restructure to more closely match the HVM implementation by
performing table driven IPI setup.
pmap_is_modified() and pmap_is_referenced(), same as it was done for
pmap_ts_referenced().
Consolidate identical code for pmap_is_modified() and
pmap_is_referenced() into helper pmap_page_test_mappings().
Reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely
empty functions.
Reviewed by: alc, kib, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation