Move asm IPIs handlers to C code, so both Xen and native IPI handlers
share the same code.
Reviewed by: jhb
Approved by: gibbs
Sponsored by: Citrix Systems R&D
amd64/amd64/apic_vector.S:
i386/i386/apic_vector.s:
- Remove asm coded IPI handlers and instead call the newly introduced
C variants.
amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
- Add C coded clones to the asm IPI handlers (moved from
x86/xen/hvm.c).
i386/include/smp.h:
amd64/include/smp.h:
- Add prototypes for the C IPI handlers.
x86/xen/hvm.c:
- Move the C IPI handlers to mp_machdep and call those in the Xen IPI
handlers.
i386/xen/mp_machdep.c:
- Add dummy IPI handlers to the i386 Xen PV port (this port doesn't
support SMP).
processor-specific VMCS or VMCB. The pending exception will be delivered right
before entering the guest.
The order of event injection into the guest is:
- hardware exception
- NMI
- maskable interrupt
In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt
window-exiting and inject it as soon as possible after the hardware exception
is injected. Also since interrupts are inherently asynchronous, injecting
them after the hardware exception should not affect correctness from the
guest perspective.
Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict
it to only deliver x86 hardware exceptions. This new ioctl is now used to
inject a protection fault when the guest accesses an unimplemented MSR.
Discussed with: grehan, jhb
Reviewed by: jhb
simplify the implementation of the x2APIC virtualization assist in VT-x.
Prior to this change the vlapic allowed the guest to change its mode from
xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked
when the virtual machine is created. This is not very constraining because
operating systems already have to deal with BIOS setting up the APIC in
x2APIC mode at boot.
Fix a bug in the CPUID emulation where the x2APIC capability was leaking
from the host to the guest.
Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore
MSR accesses to the vlapic when it is in xAPIC mode.
The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8)
can be used to change the mode to x2APIC instead.
Discussed with: grehan@
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
PCI domain/segment. Host-PCI bridge drivers use this API to allocate
bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
full range of bus numbers from secbus to subbus from their parent bridge.
The drivers also always program their primary bus register. The bridge
drivers also support growing their bus range by extending the bus resource
and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.
Reviewed by: imp
MFC after: 1 month
- Similar to the hack for bootinfo32.c in userboot, define
_MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot.
This allows userboot to load 32-bit kernels and modules.
- Copy the SMAP generation code out of bootinfo64.c and into its own
file so it can be shared with bootinfo32.c to pass an SMAP to the i386
kernel.
- Use uint32_t instead of u_long when aligning module metadata in
bootinfo32.c in userboot, as otherwise the metadata used 64-bit
alignment which corrupted the layout.
- Populate the basemem and extmem members of the bootinfo struct passed
to 32-bit kernels.
- Fix the 32-bit stack in userboot to start at the top of the stack
instead of the bottom so that there is room to grow before the
kernel switches to its own stack.
- Push a fake return address onto the 32-bit stack in addition to the
arguments normally passed to exec() in the loader. This return
address is needed to convince recover_bootinfo() in the 32-bit
locore code that it is being invoked from a "new" boot block.
- Add a routine to libvmmapi to setup a 32-bit flat mode register state
including a GDT and TSS that is able to start the i386 kernel and
update bhyveload to use it when booting an i386 kernel.
- Use the guest register state to determine the CPU's current instruction
mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long
mode) in the instruction emulation code. Update the gla2gpa() routine
used when fetching instructions to handle flat mode, 32-bit paging, and
PAE paging in addition to long mode paging. Don't look for a REX
prefix when the CPU is in 32-bit mode, and use the detected mode to
enable the existing 32-bit mode code when decoding the mod r/m byte.
Reviewed by: grehan, neel
MFC after: 1 month
the virtio backends.
- Add a new ioctl to export the count of pins on the I/O APIC from vmm
to the hypervisor.
- Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for
ISA interrupts.
- Populate the MP Table with I/O interrupt entries for any PCI INTx
interrupts.
- Create a _PRT table under the PCI root bridge in ACPI to route any
PCI INTx interrupts appropriately.
- Track which INTx interrupts are in use per-slot so that functions
that share a slot attempt to distribute their INTx interrupts across
the four available pins.
- Implicitly mask INTx interrupts if either MSI or MSI-X is enabled
and when the INTx DIS bit is set in a function's PCI command register.
Either assert or deassert the associated I/O APIC pin when the
state of one of those conditions changes.
- Add INTx support to the virtio backends.
- Always advertise the MSI capability in the virtio backends.
Submitted by: neel (7)
Reviewed by: neel
MFC after: 2 weeks
The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector.
If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger
an EOI-induced VM-exit when it is doing EOI virtualization.
The EOI-induced VM-exit results in the EOI being forwarded to the vioapic
so that level triggered interrupts can be properly handled.
Tested by: Anish Gupta (akgupt3@gmail.com)
can be initiated in the context of a vcpu thread or from the bhyve(8) control
process.
The first use of this functionality is to update the vlapic trigger-mode
register when the IOAPIC pin configuration is changed.
Prior to this change we would update the TMR in the virtual-APIC page at
the time of interrupt delivery. But this doesn't work with Posted Interrupts
because there is no way to program the EOI_exit_bitmap[] in the VMCS of
the target at the time of interrupt delivery.
Discussed with: grehan@
hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used
to enter guest context and vmx_exit_guest() is used to transition back into
host context.
Fix a longstanding race where a vcpu interrupt notification might be ignored
if it happens after vmx_inject_interrupts() but before host interrupts are
disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with
host interrupts disabled to prevent this.
Suggested by: grehan@
emulation.
The vlapic initialization and cleanup is done via processor specific vmm_ops.
This will allow the VT-x/SVM modules to layer any hardware-assist for APIC
emulation or virtual interrupt delivery on top of the vlapic device model.
Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic
interrupts versus other events (e.g. NMI). This provides an opportunity to
use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM)
to deliver an interrupt to a guest without causing a VM-exit.
Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the
vlapic_xxx() counterparts directly.
Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'.
The 'Apic Page' is intended to be referenced from the Intel VMCS as the
'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
- Add a generic routine to trigger an LVT interrupt that supports both
fixed and NMI delivery modes.
- Add an ioctl and bhyvectl command to trigger local interrupts inside a
guest. In particular, a global NMI similar to that raised by SERR# or
PERR# can be simulated by asserting LINT1 on all vCPUs.
- Extend the LVT table in the vCPU local APIC to support CMCI.
- Flesh out the local APIC error reporting a bit to cache errors and
report them via ESR when ESR is written to. Add support for asserting
the error LVT when an error occurs. Raise illegal vector errors when
attempting to signal an invalid vector for an interrupt or when sending
an IPI.
- Ignore writes to reserved bits in LVT entries.
- Export table entries the MADT and MP Table advertising the stock x86
config of LINT0 set to ExtInt and LINT1 wired to NMI.
Reviewed by: neel (earlier version)
state before the requested state transition. This guarantees that there is
exactly one ioctl() operating on a vcpu at any point in time and prevents
unintended state transitions.
More details available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html
Reviewed by: grehan
Reported by: Markiyan Kushnir (markiyan.kushnir at gmail.com)
MFC after: 3 days
The least significant 8 bits of 'pm_flags' are now used for the IPI vector
to use for nested page table TLB shootdown.
Previously we used IPI_AST to interrupt the host cpu which is functionally
correct but could lead to misleading interrupt counts for AST handler. The
AST handler was also doing a lot more than what is required for the nested
page table TLB shootdown (EOI and IRET).
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by: grehan@
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.
With this change "halt" from a Linux guest works correctly.
Reviewed by: grehan@
Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.
Reviewed by: grehan@
compilation results in inclusion of the header, a confict arises due
to savefpu being union for i386, but used as struct in the pcb
definition. The 32bit code should not need amd64 variant of the
struct pcb anyway.
For struct region_descriptor, use __uint64_t instead of unsigned long,
as the base type for bit-fields. Unsigned long cannot have width 64
for -m32.
The changes allowed to use sys/sysctl.h for cc -m32.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
bhyve supports a single timer block with 8 timers. The timers are all 32-bit
and capable of being operated in periodic mode. All timers support interrupt
delivery using MSI. Timers 0 and 1 also support legacy interrupt routing.
At the moment the timers are not connected to any ioapic pins but that will
be addressed in a subsequent commit.
This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
to inject edge triggered legacy interrupts into the guest.
Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
upcoming in-kernel device emulations like the HPET.
The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.
Discussed with: grehan@
Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
words, every architecture is now auto-sizing the kmem arena. This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.
Replace or eliminate all existing definitions of VM_KMEM_SIZE. With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for
clarity.
Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size. Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.
Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena. Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.
Reviewed by: kib (an earlier version)
Sponsored by: EMC / Isilon Storage Division
'invpcid' instruction to the guest. Currently bhyve will try to enable this
capability unconditionally if it is available.
Consolidate code in bhyve to set the capabilities so it is no longer
duplicated in BSP and AP bringup.
Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid'
instruction is available.
Reviewed by: grehan
MFC after: 3 days
field. Perform vcpu enumeration for Xen PV and HVM environments
and convert all Xen drivers to use vcpu_id instead of a hard coded
assumption of the mapping algorithm (acpi or apic ID) in use.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
amd64/include/pcpu.h:
i386/include/pcpu.h:
Add vcpu_id to the amd64 and i386 pcpu structures.
dev/xen/timer/timer.c
x86/xen/xen_intr.c
Use new vcpu_id instead of assuming acpi_id == vcpu_id.
i386/xen/mp_machdep.c:
i386/xen/mptable.c
x86/xen/hvm.c:
Perform Xen HVM and Xen full PV vcpu_id mapping.
x86/xen/hvm.c:
x86/acpica/madt.c
Change SYSINIT ordering of acpi CPU enumeration so that it
is guaranteed to be available at the time of Xen HVM vcpu
id mapping.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.
Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.
pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.
The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.
Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.
An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
Bit Position Interpreted By
PG_V 52 software (accessed bit emulation handler)
PG_RW 53 software (dirty bit emulation handler)
PG_A 0 hardware (aka EPT_PG_RD)
PG_M 1 hardware (aka EPT_PG_WR)
The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).
The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.
TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.
Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.
PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.
Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.
Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.
Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.
Special thanks to Peter Holm for testing the patch on short notice.
Approved by: re
Discussed with: grehan
Reviewed by: alc, kib
Tested by: pho
semblance of API stability and growth during the 10.* timeframe.
Userland/kernel bhyve will have to be recompiled after this.
Reviewed by: neel
Approved by: re@ (blanket)
amd64 and i386.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
- Introduce two new CPU hooks for initialization and resume
purposes. This allows us to get rid of the XENHVM ifdefs in
mp_machdep, and also sets some hooks into common code that can be
used by other hypervisor implementations.
sys/amd64/conf/XENHVM:
sys/i386/conf/XENHVM:
- Remove these configs now that GENERIC has builtin support for Xen
HVM.
sys/kern/subr_smp.c:
- Make sure there are no pending IPIs when suspending a system.
sys/x86/xen/hvm.c:
- Add cpu init and resume vectors that are called from mp_machdep
using the new hooks.
- Only clear the vcpu_info mapping data on resume. It is already
clear for the BSP on a cold boot and is set correctly as APs
are started.
- Gate xen_hvm_init_cpu only to systems running under Xen.
sys/x86/xen/xen_intr.c:
- Gate the setup of event channels only to systems running under Xen.
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
default pmap initialization routine 'pmap_pinit()'.
These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.
Reviewed by: kib@, alc@
Approved by: re (glebius)
Xen PVHVM guest.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
- Make sure that are no MMU related IPIs pending on migration.
- Reset pending IPI_BITMAP on resume.
- Init vcpu_info on resume.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
- Add a "suspend_cancelled" parameter to pic_resume(). For the
Xen PIC, restoration of interrupt services differs between
the aborted suspend and normal resume cases, so we must provide
this information.
sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
- Don't swap out "suspend safe" timers across a suspend/resume
cycle. This includes the Xen PV and ACPI timers.
sys/dev/xen/control/control.c:
- Perform proper suspend/resume process for PVHVM:
- Suspend all APs before going into suspension, this allows us
to reset the vcpu_info on resume for each AP.
- Reset shared info page and callback on resume.
sys/dev/xen/timer/timer.c:
- Implement suspend/resume support for the PV timer. Since FreeBSD
doesn't perform a per-cpu resume of the timer, we need to call
smp_rendezvous in order to correctly resume the timer on each CPU.
sys/dev/xen/xenpci/xenpci.c:
- Don't reset the PCI interrupt on each suspend/resume.
sys/kern/subr_smp.c:
- When suspending a PVHVM domain make sure there are no MMU IPIs
in-flight, or we will get a lockup on resume due to the fact that
pending event channels are not carried over on migration.
- Implement a generic version of restart_cpus that can be used by
suspended and stopped cpus.
sys/x86/xen/hvm.c:
- Implement resume support for the hypercall page and shared info.
- Clear vcpu_info so it can be reset by APs when resuming from
suspension.
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
- Support UP kernel configurations.
sys/x86/xen/xen_intr.c:
- Properly rebind per-cpus VIRQs and IPIs on resume.
This should be sufficient for 10.0 and will do
until forthcoming work to avoid limitations
in this area is complete.
Thanks to Bela Lubkin at tidalscale for the
headsup on the apic/cpu id/io apic ASL parameters
that are actually hex values and broke when
written as decimal when 11 vCPUs were configured.
Approved by: re@
IPI implmementations.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Submitted by: gibbs (misc cleanup, table driven config)
Reviewed by: gibbs
MFC after: 2 weeks
sys/amd64/include/cpufunc.h:
sys/amd64/amd64/pmap.c:
Move invltlb_globpcid() into cpufunc.h so that it can be
used by the Xen HVM version of tlb shootdown IPI handlers.
sys/x86/xen/xen_intr.c:
sys/xen/xen_intr.h:
Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(),
and remove the ipi vector parameter. This api allocates
an event channel port that can be used for ipi services,
but knows nothing of the actual ipi for which that port
will be used. Removing the unused argument and cleaning
up the comments surrounding its declaration helps clarify
its actual role.
sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
Implement a generic framework for amd64 and i386 that allows
the implementation of certain CPU management functions to
be selected at runtime. Currently this is only used for
the ipi send function, which we optimize for Xen when running
on a Xen hypervisor, but can easily be expanded to support
more operations.
sys/x86/xen/hvm.c:
Implement Xen PV IPI handlers and operations, replacing native
send IPI.
sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
sys/i386/include/smp.h:
Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS
is defined already for us in the xen interface files.
NR_IPIS is only needed in one file per Xen platform and is
easily inferred by the IPI vector table that is defined in
those files.
sys/i386/xen/mp_machdep.c:
Restructure to more closely match the HVM implementation by
performing table driven IPI setup.
sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely
empty functions.
Reviewed by: alc, kib, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
- Restore the pre-PCID TLB shootdown handlers for whole address space
and single page invalidation asm code, and assign the IPI handler to
them when PCID is not supported or disabled. Old handlers have
linear control flow. But, still use the common return sequence.
- Stop using pcpu for INVPCID descriptors in the invlrg handler. It
is enough to allocate descriptors on the stack. As result, two
SWAPGS instructions are shaved off from the code for Haswell+.
- Fix the reverted condition in invlrng for checking of the PCID
support [1], also in invlrng check that pmap is kernel pmap before
performing other tests. For the kernel pmap, which provides global
mappings, the INVLPG must be used for invalidation always.
- Save the pre-computed pmap' %CR3 register in the struct pmap. This
allows to remove several checks for pm_pcid validity when %CR3 is
reloaded [2].
Noted by: gibbs [1]
Discussed with: alc [2]
Tested by: pho, flo
Sponsored by: The FreeBSD Foundation
Intel CPUs. The feature tags TLB entries with the Id of the address
space and allows to avoid TLB invalidation on the context switch, it
is available only in the long mode. In the microbenchmarks, using the
PCID decreased latency of the context switches by ~30% on SandyBridge
class desktop CPUs, measured with the lat_ctx program from lmbench.
If available, use INVPCID instruction when a TLB entry in non-current
address space needs to be invalidated. The instruction is typically
available on the Haswell.
If needed, the use of PCID can be turned off with the
vm.pmap.pcid_enabled loader tunable set to 0. The state of the
feature is reported by the vm.pmap.pcid_enabled sysctl. The sysctl
vm.pmap.pcid_save_cnt reports the number of context switches which
avoided invalidating the TLB; compare with the total number of context
switches, available as sysctl vm.stats.sys.v_swtch.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
Tested by: pho, bf
Re-structure Xen HVM support so that:
- Xen is detected and hypercalls can be performed very
early in system startup.
- Xen interrupt services are implemented using FreeBSD's native
interrupt delivery infrastructure.
- the Xen interrupt service implementation is shared between PV
and HVM guests.
- Xen interrupt handlers can optionally use a filter handler
in order to avoid the overhead of dispatch to an interrupt
thread.
- interrupt load can be distributed among all available CPUs.
- the overhead of accessing the emulated local and I/O apics
on HVM is removed for event channel port events.
- a similar optimization can eventually, and fairly easily,
be used to optimize MSI.
Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure,
and misc Xen cleanups:
Sponsored by: Spectra Logic Corporation
Unification of PV & HVM interrupt infrastructure, bug fixes,
and misc Xen cleanups:
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
sys/x86/x86/local_apic.c:
sys/amd64/include/apicvar.h:
sys/i386/include/apicvar.h:
sys/amd64/amd64/apic_vector.S:
sys/i386/i386/apic_vector.s:
sys/amd64/amd64/machdep.c:
sys/i386/i386/machdep.c:
sys/i386/xen/exception.s:
sys/x86/include/segments.h:
Reserve IDT vector 0x93 for the Xen event channel upcall
interrupt handler. On Hypervisors that support the direct
vector callback feature, we can request that this vector be
called directly by an injected HVM interrupt event, instead
of a simulated PCI interrupt on the Xen platform PCI device.
This avoids all of the overhead of dealing with the emulated
I/O APIC and local APIC. It also means that the Hypervisor
can inject these events on any CPU, allowing upcalls for
different ports to be handled in parallel.
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
Map Xen per-vcpu area during AP startup.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
Increase the FreeBSD IRQ vector table to include space
for event channel interrupt sources.
sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
Remove Xen HVM per-cpu variable data. These fields are now
allocated via the dynamic per-cpu scheme. See xen_intr.c
for details.
sys/amd64/include/xen/hypercall.h:
sys/dev/xen/blkback/blkback.c:
sys/i386/include/xen/xenvar.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/xen/gnttab.c:
Prefer FreeBSD primatives to Linux ones in Xen support code.
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
sys/dev/xen/balloon/balloon.c:
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/console/xencons_ring.c:
sys/dev/xen/control/control.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/dev/xen/xenpci/xenpci.c:
sys/i386/i386/machdep.c:
sys/i386/include/pmap.h:
sys/i386/include/xen/xenfunc.h:
sys/i386/isa/npx.c:
sys/i386/xen/clock.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/mptable.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/xen_rtc.c:
sys/xen/evtchn/evtchn_dev.c:
sys/xen/features.c:
sys/xen/gnttab.c:
sys/xen/gnttab.h:
sys/xen/hvm.h:
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbus_if.m:
sys/xen/xenbus/xenbusb_front.c:
sys/xen/xenbus/xenbusvar.h:
sys/xen/xenstore/xenstore.c:
sys/xen/xenstore/xenstore_dev.c:
sys/xen/xenstore/xenstorevar.h:
Pull common Xen OS support functions/settings into xen/xen-os.h.
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
Remove constants, macros, and functions unused in FreeBSD's Xen
support.
sys/xen/xen-os.h:
sys/i386/xen/xen_machdep.c:
sys/x86/xen/hvm.c:
Introduce new functions xen_domain(), xen_pv_domain(), and
xen_hvm_domain(). These are used in favor of #ifdefs so that
FreeBSD can dynamically detect and adapt to the presence of
a hypervisor. The goal is to have an HVM optimized GENERIC,
but more is necessary before this is possible.
sys/amd64/amd64/machdep.c:
sys/dev/xen/xenpci/xenpcivar.h:
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/sys/kernel.h:
Refactor magic ioport, Hypercall table and Hypervisor shared
information page setup, and move it to a dedicated HVM support
module.
HVM mode initialization is now triggered during the
SI_SUB_HYPERVISOR phase of system startup. This currently
occurs just after the kernel VM is fully setup which is
just enough infrastructure to allow the hypercall table
and shared info page to be properly mapped.
sys/xen/hvm.h:
sys/x86/xen/hvm.c:
Add definitions and a method for configuring Hypervisor event
delievery via a direct vector callback.
sys/amd64/include/xen/xen-os.h:
sys/x86/xen/hvm.c:
sys/conf/files:
sys/conf/files.amd64:
sys/conf/files.i386:
Adjust kernel build to reflect the refactoring of early
Xen startup code and Xen interrupt services.
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
sys/dev/xen/control/control.c:
sys/dev/xen/evtchn/evtchn_dev.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/xen/xenstore/xenstore.c:
sys/xen/evtchn/evtchn_dev.c:
sys/dev/xen/console/console.c:
sys/dev/xen/console/xencons_ring.c
Adjust drivers to use new xen_intr_*() API.
sys/dev/xen/blkback/blkback.c:
Since blkback defers all event handling to a taskqueue,
convert this task queue to a "fast" taskqueue, and schedule
it via an interrupt filter. This avoids an unnecessary
ithread context switch.
sys/xen/xenstore/xenstore.c:
The xenstore driver is MPSAFE. Indicate as much when
registering its interrupt handler.
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbusvar.h:
Remove unused event channel APIs.
sys/xen/evtchn.h:
Remove all kernel Xen interrupt service API definitions
from this file. It is now only used for structure and
ioctl definitions related to the event channel userland
device driver.
Update the definitions in this file to match those from
NetBSD. Implementing this interface will be necessary for
Dom0 support.
sys/xen/evtchn/evtchnvar.h:
Add a header file for implemenation internal APIs related
to managing event channels event delivery. This is used
to allow, for example, the event channel userland device
driver to access low-level routines that typical kernel
consumers of event channel services should never access.
sys/xen/interface/event_channel.h:
sys/xen/xen_intr.h:
Standardize on the evtchn_port_t type for referring to
an event channel port id. In order to prevent low-level
event channel APIs from leaking to kernel consumers who
should not have access to this data, the type is defined
twice: Once in the Xen provided event_channel.h, and again
in xen/xen_intr.h. The double declaration is protected by
__XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared
twice within a given compilation unit.
sys/xen/xen_intr.h:
sys/xen/evtchn/evtchn.c:
sys/x86/xen/xen_intr.c:
sys/dev/xen/xenpci/evtchn.c:
sys/dev/xen/xenpci/xenpcivar.h:
New implementation of Xen interrupt services. This is
similar in many respects to the i386 PV implementation with
the exception that events for bound to event channel ports
(i.e. not IPI, virtual IRQ, or physical IRQ) are further
optimized to avoid mask/unmask operations that aren't
necessary for these edge triggered events.
Stubs exist for supporting physical IRQ binding, but will
need additional work before this implementation can be
fully shared between PV and HVM.
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
sys/i386/xen/mp_machdep.c
sys/x86/xen/hvm.c:
Add support for placing vcpu_info into an arbritary memory
page instead of using HYPERVISOR_shared_info->vcpu_info.
This allows the creation of domains with more than 32 vcpus.
sys/i386/i386/machdep.c:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/exception.s:
Add support for new event channle implementation.
It was actually done in r86301 but reverted in r150182 because GCC 3.x was
not able to handle it for a memory operand. Apparently, this problem was
fixed in GCC 4.1+ and several contrib sources already rely on this feature.
Bump up the KVA size proportionally from 512GB to 2TB.
The number of page table pages used by the direct map is now calculated at
run time based on 'Maxmem'. This means the small memory systems will not
see any additional tax in terms of page table pages for the direct map.
However all amd64 systems, regardless of the memory size, will use 3 more
pages to accomodate the bump in the KVA size.
More details available here:
http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.htmlhttp://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html
Tested with the following configurations:
- Sandybridge server with 64GB of memory.
- bhyve VM with 64MB of memory.
- bhyve VM with a 8GB of memory with the memory segment above 4GB cuddling
right up against the 4TB maximum memory limit.
Discussed on: hackers@, current@
Submitted by: Chris Torek (torek@torek.net)
The variable _logname_valid is not exported via the version script;
therefore, change C and i386/amd64 assembler code to remove indirection
(which allowed interposition). This makes the code slightly smaller and
faster.
Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC,
there is no place containing the address of each variable, so there is no
possible definition for PIC_GOT.
pvh_global_lock. This allows the method to be executed in parallel,
avoiding undue contention on the pvh_global_lock for the multithreaded
pagedaemon.
The pmap_ts_referenced() function has to inspect the page mappings for
several pmaps, which need to be locked while pv list lock is owned.
This contradicts to the lock order, where pmap lock is before pv list
lock. Introduce the generation count for the pv list of the page or
superpage, which indicate any change in the pv list, and, as usual,
perform restart of the iteration if generation changed while pv lock
was dropped for blocking acquire of a pmap lock.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Issues were noted by Bruce Evans and are present on all architectures.
On i386, a counter fetch should use atomic read of 64bit value,
otherwise carry from the increment on other CPU could be lost for the
given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not
available on the machine, it cannot be SMP and it is enough to disable
preemption around read to avoid the split read.
On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to override just
zeroed per-cpu slot. The effect would be a counter going off by
arbitrary value after zeroing. Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive. On i386, same as for the fetching, if the
cmpxchg8b is not available, machine is not SMP and we disable
preemption for zeroing.
PowerPC64 is treated the same as amd64.
For other architectures, the changes made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching. It
should be possible to handle them by using the 64bit loads and stores
atomic WRT preemption (assuming the architectures also converted from
using critical sections to proper asm). If architecture does not
provide the facility, using global (spin) mutex would be non-optimal
but working solution.
Noted by: bde
Sponsored by: The FreeBSD Foundation
Move FreeBSD from interface version 0x00030204 to 0x00030208.
Updates are required to our grant table implementation before we
can bump this further.
sys/xen/hvm.h:
Replace the implementation of hvm_get_parameter(), formerly located
in sys/xen/interface/hvm/params.h. Linux has a similar file which
primarily stores this function.
sys/xen/xenstore/xenstore.c:
Include new xen/hvm.h header file to get hvm_get_parameter().
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
Correctly protect function definition and variables from being
included into assembly files in xen-os.h
Xen memory barriers are now prefixed with "xen_" to avoid conflicts
with OS native primatives. Define Xen memory barriers in terms of
the native FreeBSD primatives.
Sponsored by: Spectra Logic Corporation
Reviewed by: Roger Pau Monné
Tested by: Roger Pau Monné
Obtained from: Roger Pau Monné (bug fixes)
registers also on other CPUs, besides the CPU which happens to execute
the ddb. The debugging registers are stored in the pcpu area,
together with the command which is executed by the IPI stop handler
upon resume.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
1. Common headers for fdt.h and ofw_machdep.h under x86/include
with indirections under i386/include and amd64/include.
2. New modinfo for loader provided FDT blob.
3. Common x86_init_fdt() called from hammer_time() on amd64 and
init386() on i386.
4. Split-off FDT specific low-level console functions from FDT
bus methods for the uart(4) driver. The low-level console
logic has been moved to uart_cpu_fdt.c and is used for arm,
mips & powerpc only. The FDT bus methods are shared across
all architectures.
5. Add dev/fdt/fdt_x86.c to hold the fdt_fixup_table[] and the
fdt_pic_table[] arrays. Both are empty right now.
FDT addresses are I/O ports on x86. Since the core FDT code does
not handle different address spaces, adding support for both I/O
ports and memory addresses requires some thought and discussion.
It may be better to use a compile-time option that controls this.
Obtained from: Juniper Networks, Inc.
order to match the MAXCPU concept. The change should also be useful
for consolidation and consistency.
Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: alc
Rework the guest register fetch code to allow the RIP to
be extracted from the VMCS while the kernel decoder is
functioning.
Hit by the OpenBSD local-apic code.
Submitted by: neel
Reviewed by: grehan
Obtained from: NetApp
- use clock_gettime(2) as the time base for the emulated ACPI timer instead
of directly using rdtsc().
- don't advertise the invariant TSC capability to the guest to discourage it
from using the TSC as its time base.
Discussed with: jhb@ (about making 'smp_tsc' a global)
Reported by: Dan Mack on freebsd-virtualization@
Obtained from: NetApp
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.
See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.
In collaboration with: kib
Reviewed by: luigi
Tested by: ae, ray
Sponsored by: Nginx, Inc.
decode. This is to accomodate hardware assist implementations that do not
provide the 'guest linear address' as part of nested page fault collateral.
Submitted by: Anish Gupta (akgupt3 at gmail dot com)
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.
This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
resident/cached pages collections as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.
The radix trie does rely on the consumers locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves. However, not all the times a new bisection node is
really needed.
The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.
This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.
The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.
Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/
Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)
Rename the pv_entry_t iterator from pv_list to pv_next.
Besides being more correct technically (as the name seems to suggest
this is a list while it is an iterator), it will also be needed by
vm_radix work to avoid a nameclash on macro expansions.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc, jeff
Tested by: flo, pho, jhb, davide
machine/signal.h and machine/ucontext.h into common x86 includes,
copying from amd64 and merging with i386.
Kernel-only compat definitions are kept in the i386/include/sigframe.h
and i386/include/signal.h, to reduce amd64 kernel namespace pollution.
The amd64 compat uses its own definitions so far.
The _MACHINE_ELF_WANT_32BIT definition is to allow the
sys/boot/userboot/userboot/elf32_freebsd.c to use i386 ELF definitions
on the amd64 compile host. The same hack could be usefully abused by
other code too.
Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING)
that called 'sched_bind()' on behalf of the user thread.
The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn
runs afoul of the assertion '(td_pinned == 0)' in userret().
Using the cpuset affinity to implement pinning of the vcpu threads works with
both 4BSD and ULE schedulers and has the happy side-effect of getting rid
of a bunch of code in vmm.ko.
Discussed with: grehan
This eliminates the need to recompile the kernel when the default value
of NKPT is not big enough - for e.g. when loading large kernel modules
or memory disk images from the loader.
If NKPT is defined in the kernel configuration file then it overrides the
dynamic calculation.
Reviewed by: alc, kib
- change 'pics' from STAILQ to TAILQ
- ensure that Local APIC is always first in 'pics'
Reviewed by: jhb
Tested by: Sergey V. Dyatko <sergey.dyatko@gmail.com>,
KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
MFC after: 12 days
During the early days of bhyve it did not support instruction emulation
which necessitated the use of x2apic to access the local apic. This is no
longer the case and the dependency on x2apic has gone away.
The x2apic patches can be considered independently of bhyve and will be
merged into head via projects/x2apic.
Discussed with: grehan
Provide a tunable 'machdep.x2apic_desired' to let the administrator override
the default behavior.
Provide a read-only sysctl 'machdep.x2apic' to let the administrator know
whether the kernel is using x2apic or legacy mmio to access local apic.
Tested with Parallels Desktop 8 and bhyve hypervisors.
Also tested running on bare metal Intel Xeon E5-2658.
Obtained from: NetApp
Discussed with: jhb, attilio, avg, grehan
On a nested page table fault the hypervisor will:
- fetch the instruction using the guest %rip and %cr3
- decode the instruction in 'struct vie'
- emulate the instruction in host kernel context for local apic accesses
- any other type of mmio access is punted up to user-space (e.g. ioapic)
The decoded instruction is passed as collateral to the user-space process
that is handling the PAGING exit.
The emulation code is fleshed out to include more addressing modes (e.g. SIB)
and more types of operands (e.g. imm8). The source code is unified into a
single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well
as /usr/sbin/bhyve.
Reviewed by: grehan
Obtained from: NetApp
introduced with the IvyBridge CPUs. Provide the definitions for new
bits in CR3 and CR4 registers.
Tested by: avg, Michael Moll <kvedulv@kvedulv.de>
MFC after: 2 weeks
instruction loads/stores at its will.
The macro __compiler_membar() is currently supported for both gcc and
clang, but kernel compilation will fail otherwise.
Reviewed by: bde, kib
Discussed with: dim, theraven
MFC after: 2 weeks
r234247.
Use, instead, the static intializer introduced in r239923 for x86 and
sparc64 intr_cpus, unwinding the code to the initial version.
Reviewed by: marius
chunks. This breaks the assumption that the entire memory segment is
contiguously allocated in the host physical address space.
This also paves the way to satisfy the 4KB page allocations by requesting
free pages from the VM subsystem as opposed to hard-partitioning host memory
at boot time.
associated with guest physical memory is contiguous.
In this case vm_malloc() was using vm_gpa2hpa() to indirectly infer whether
or not the address range had already been allocated.
Replace this instead with an explicit API 'vm_gpa_available()' that returns
TRUE if a page is available for allocation in guest physical address space.
bits under #ifdef _KERNEL but leave definitions for various structures
defined by standards ($PIR table, SMAP entries, etc.) available to
userland.
- Consolidate duplicate SMBIOS table structure definitions in ipmi(4)
and smbios(4) in <machine/pc/bios.h> and make them available to
userland.
MFC after: 2 weeks
AP needs to be activated by spinning up an execution context for it.
The local apic emulation is now completely done in the hypervisor and it will
detect writes to the ICR_LO register that try to bring up the AP. In response
to such writes it will return to userspace with an exit code of SPINUP_AP.
Reviewed by: grehan
amd64. It is implemented as __pure2 inline with non-volatile asm read
from pcpu, which allows a compiler to cache its results.
Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb.
Note that __curthread() uses magic value 0 as an offsetof(struct pcpu,
pc_curthread). It seems to be done this way due to machine/pcpu.h
needs to be processed before sys/pcpu.h, because machine/pcpu.h
contributes machine-depended fields to the struct pcpu definition. As
result, machine/pcpu.h cannot use struct pcpu yet.
The __curpcb() also uses a magic constant instead of offsetof(struct
pcpu, pc_curpcb) for the same reason. The constants are now defined as
symbols and CTASSERTs are added to ensure that future KBI changes do
not break the code.
Requested and reviewed by: bde
MFC after: 3 weeks
occurs using the SSE math processor. Update comments describing the
handling of the exception status bits in coprocessors control words.
Remove GET_FPU_CW and GET_FPU_SW macros which were used only once.
Prefer to use curpcb to access pcb_save over the longer path of
referencing pcb through the thread structure.
Based on the submission by: Ed Alley <wea llnl gov>
PR: amd64/169927
Reviewed by: bde
MFC after: 3 weeks
mostly meets the guidelines set by the Intel SDM:
1. We use XRSTOR and XSAVE from the same CPL using the same linear
address for the store area
2. Contrary to the recommendations, we cannot zero the FPU save area
for a new thread, since fork semantic requires the copy of the
previous state. This advice seemingly contradicts to the advice
from the item 6.
3. We do use XSAVEOPT in the context switch code only, and the area
for XSAVEOPT already always contains the data saved by XSAVE.
4. We do not modify the save area between XRSTOR, when the area is
loaded into FPU context, and XSAVE. We always spit the fpu context
into save area and start emulation when directly writing into FPU
context.
5. We do not use segmented addressing to access save area, or rather,
always address it using %ds basing.
6. XSAVEOPT can be only executed in the area which was previously
loaded with XRSTOR, since context switch code checks for FPU use by
outgoing thread before saving, and thread which stopped emulation
forcibly get context loaded with XRSTOR.
7. The PCB cannot be paged out while FPU emulation is turned off, since
stack of the executing thread is never swapped out.
The context switch code is patched to issue XSAVEOPT instead of XSAVE
if supported. This approach eliminates one conditional in the context
switch code, which would be needed otherwise.
For user-visible machine context to have proper data, fpugetregs()
checks for unsaved extension blocks and manually copies pristine FPU
state into them, according to the description provided by CPUID leaf
0xd.
MFC after: 1 month
on x86 and use that to implement stop_emulating() in the fpu/npx code.
Reimplement start_emulating() in the non-XEN case by using load_cr0() and
rcr0() instead of the 'lmsw' and 'smsw' instructions. Intel explicitly
discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in
the description of these instructions in Volume 2 of the ADM.
Reviewed by: kib
MFC after: 1 month
natively rather than hand-assembled versions. For xgetbv/xsetbv, add a
wrapper API to deal with xcr* registers: rxcr() and load_xcr().
Reviewed by: kib
MFC after: 1 month
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.
The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.
The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.
Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.
Minimal stubs neccessary for non-x86 architectures to still compile
are provided.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.
Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.
As an added bonus, tidy up some nearby comments concerning page flags.
Reviewed by: kib
MFC after: 6 weeks
suspend/resume procedures are minimized among them.
common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
among amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().
amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().
i386:
- Merge(and remove) suspendctx() into savectx() in order to match with
amd64 code.
Reviewed by: attilio@, acpi@
locked xchg instruction. IA32 memory model guarantees that store has
release semantic, since stores cannot pass loads or stores.
Reviewed by: bde, jhb
Tested by: pho
MFC after: 2 weeks
in_cksum.h required ip.h to be included for struct ip. To be
able to use some general checksum functions like in_addword()
in a non-IPv4 context, limit the (also exported to user space)
IPv4 specific functions to the times, when the ip.h header is
present and IPVERSION is defined (to 4).
We should consider more general checksum (updating) functions
to also allow easier incremental checksum updates in the L3/4
stack and firewalls, as well as ponder further requirements by
certain NIC drivers needing slightly different pseudo values
in offloading cases. Thinking in terms of a better "library".
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
longer uses the active and inactive paging queues. Instead, the pmap now
maintains an LRU-ordered list of pv entry pages, and pmap_pv_reclaim() uses
this list to select pv entries for reclamation.
Note: The old pmap_collect() tried to avoid reclaiming mappings for pages
that have either a hold_count or a busy field that is non-zero. However,
this isn't necessary for correctness, and the locking in pmap_collect() was
insufficient to guarantee that such mappings weren't reclaimed. The new
pmap_pv_reclaim() doesn't even try.
Reviewed by: kib
MFC after: 6 weeks
intr_bind() on x86.
This has been requested by jhb and I strongly disagree with this,
but as long as he is the x86 and interrupt subsystem maintainer I will
follow his directives.
The disagreement cames from what we should really consider as a
public KPI. IMHO, if we really need a selection between the kernel
functions, we may need an explicit protection like _KERNEL_KPI, which
defines which subset of the kernel function might really be considered
as part of the KPI (for thirdy part modules) and which not.
As long as we don't have this mechanism I just consider any possible
function as usable by thirdy part code, thus intr_bind() included.
MFC after: 1 week
Includes instruction emulation for memory r/w access. This
opens the door for io-apic, local apic, hpet timer, and
legacy device emulation.
Submitted by: ryan dot berryhill at sandvine dot com
Reviewed by: grehan
Obtained from: Sandvine
discrepancy between modules and kernel, but deal with SMP differences
within the functions themselves.
As an added bonus this also helps in terms of code readability.
Requested by: gibbs
Reviewed by: jhb, marius
MFC after: 1 week
bridges. Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.
To implement this, each x86 Host-PCI bridge driver has to be able to
locate it's actual backing device on bus 0. For ACPI, use the _ADR
method to find the slot and function of the device. For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices. Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.
This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.
Tested by: bapt
MFC after: 1 week
be less ambiguous and more clearly identify what it means. This
attribute is what Intel refers to as UC-, and it's only difference
relative to normal UC memory is that a WC MTRR will override a UC-
PAT entry causing the memory to be treated as WC, whereas a UC PAT
entry will always override the MTRR.
- Remove the VM_MEMATTR_UNCACHED alias from powerpc.
New kernel events can be added at various location for sampling or counting.
This will for example allow easy system profiling whatever the processor is
with known tools like pmcstat(8).
Simultaneous usage of software PMC and hardware PMC is possible, for example
looking at the lock acquire failure, page fault while sampling on
instructions.
Sponsored by: NETASQ
MFC after: 1 month
kernel.
When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.
In collaboration with: kib
Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.
MFC after: 1 week
segments.h to a new x86 segments.h.
Add __packed attribute to some structs (just to be sure).
Also make it clear that i386 GDT and LDT entries are used in ia64 code.
reg.h with stubs.
The tREGISTER macros are only made visible on i386. These macros are
deprecated and should not be available on amd64.
The i386 and amd64 versions of struct reg have been renamed to struct
__reg32 and struct __reg64. During compilation either __reg32 or __reg64
is defined as reg depending on the machine architecture. On amd64 the i386
struct is also available as struct reg32 which is used in COMPAT_FREEBSD32
code.
Most of compat/ia32/ia32_reg.h is now IA64 only.
Reviewed by: kib (previous version)
Remove FPU types from compat/ia32/ia32_reg.h that are no longer needed.
Create machine/npx.h on amd64 to allow compiling i386 code that uses
this header.
The original npx.h and fpu.h define struct envxmm differently. Both
definitions have been included in the new x86 header as struct __envxmm32
and struct __envxmm64. During compilation either __envxmm32 or __envxmm64
is defined as envxmm depending on machine architecture. On amd64 the i386
struct is also available as struct envxmm32.
Reviewed by: kib
amd64/i386/pc98 ptrace.h with stubs.
For amd64 PT_GETXSTATE and PT_SETXSTATE have been redefined to match the
i386 values. The old values are still supported but should no longer be
used.
Reviewed by: kib
amd64/i386/pc98 endian.h with stubs.
In __bswap64_const(x) the conflict between 0xffUL and 0xffULL has been
resolved by reimplementing the macro in terms of __bswap32(x). As a side
effect __bswap64_var(x) is now implemented using two bswap instructions on
i386 and should be much faster. __bswap32_const(x) has been reimplemented
in terms of __bswap16(x) for consistency.