the Vahalia' "Unix Internals" section 15.12 "Other TLB Consistency
Algorithms". The same algorithm is already utilized by the MIPS pmap
to handle ASIDs.
The PCID for the address space is now allocated per-cpu during context
switch to the thread using pmap, when no PCID on the cpu was ever
allocated, or the current PCID is invalidated. If the PCID is reused,
bit 63 of %cr3 can be set to avoid TLB flush.
Each cpu has PCID' algorithm generation count, which is saved in the
pmap pcpu block when pcpu PCID is allocated. On invalidation, the
pmap generation count is zeroed, which signals the context switch code
that already allocated PCID is no longer valid. The implication is
the TLB shootdown for the given cpu/address space, due to the
allocation of new PCID.
The pm_save mask is no longer has to be tracked, which (significantly)
reduces the targets of the TLB shootdown IPIs. Previously, pm_save
was reset only on pmap_invalidate_all(), which made it accumulate the
cpuids of all processors on which the thread was scheduled between
full TLB shootdowns.
Besides reducing the amount of TLB shootdowns and removing atomics to
update pm_saves in the context switch code, the algorithm is much
simpler than the maintanence of pm_save and selection of the right
address space in the shootdown IPI handler.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
interacts with interrupts, query ACPI and use MWAIT for entrance into
Cx sleep states. Support C1 "I/O then halt" mode. See Intel'
document 302223-007 "Intelб╝ Processor Vendor-Specific ACPI Interface
Specification" for description.
Move the acpi_cpu_c1() function into x86/cpu_machdep.c and use
it instead of inlining "sti; hlt" sequence in several places.
In the acpi(4) man page, besides documenting the dev.cpu.N.cx_methods
sysctl, correct the names for dev.cpu.N.{cx_usage,cx_lowest,cx_supported}
sysctls.
Both jkim and avg have some other patches implementing the mwait
functionality; this work is unrelated. Linux does not rely on the
ACPI to provide correct tables describing Cx modes. Instead, the
driver has pre-defined knowledge of the CPU models, it was supplied by
Intel.
Tested by: pho (previous versions)
Sponsored by: The FreeBSD Foundation
In order to map memory from other domains when running on Xen FreeBSD uses
unused physical memory regions. Until now this memory has been allocated
using bus_alloc_resource, but this is not completely safe as we can end up
using unreclaimed MMIO or ACPI regions.
Fix this by introducing a new newbus method that can be used by Xen drivers
to request for unused memory regions. On amd64 we make sure this memory
comes from regions above 4GB in order to prevent clashes with MMIO/ACPI
regions. On i386 there's nothing we can do, so just fall back to the
previous mechanism.
Sponsored by: Citrix Systems R&D
Tested by: Gustau Pérez <gperez@entel.upc.edu>
a basic ACPI SLIT table parser.
For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.
* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.
Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.
Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes
- Vmbus multi channel support.
- Vector interrupt support.
- Signal optimization.
- Storvsc driver performance improvement.
- Scatter and gather support for storvsc driver.
- Minor bug fix for KVP driver.
Thanks royger, jhb and delphij from FreeBSD community for the reviews
and comments. Also thanks Hovy Xu from NetApp for the contributions to
the storvsc driver.
PR: 195238
Submitted by: whu
Reviewed by: royger, jhb, delphij
Approved by: royger
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Microsoft OSTC
pages which pass a NULL virtual address. If the BUS_DMA_KEEP_PG_OFFSET
flag is set, use the physical address to compute the page offset
instead. The physical address should always be valid when adding
bounce pages and should contain the same page offset like the virtual
address.
Submitted by: Svatopluk Kraus <onwahe@gmail.com>
MFC after: 1 week
Reviewed by: jhb@
sys/amd64/amd64/mp_machdep.c, to the new common x86 source
sys/x86/x86/mp_x86.c.
Proposed and reviewed by: jhb
Review: https://reviews.freebsd.org/D2347
Sponsored by: The FreeBSD Foundation
sys/i386/i386/machdep.c to new file sys/x86/x86/cpu_machdep.c. Most
of the code is related to the idle handling.
Discussed with: pluknet
Sponsored by: The FreeBSD Foundation
shows no difference with the code removed.
On both amd64 and i386, assert that a released pmap is not active.
Proposed and reviewed by: alc
Discussed with: Svatopluk Kraus <onwahe@gmail.com>, peter
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
use PAE format for the page tables, but does not incur other
consequences of the full PAE config. In particular, vm_paddr_t and
bus_addr_t are left 32bit, and max supported memory is still limited
by 4GB.
The option allows to have nx permissions for memory mappings on i386
kernel, while keeping the usual i386 KBI and avoiding the kernel data
sizing problems typical for the PAE config.
Intel documented that the PAE format for page tables is available
starting with the Pentium Pro, but it is possible that the plain
Pentium CPUs have the required support (Appendix H). The goal is to
enable the option and non-exec mappings on i386 for the GENERIC
kernel. Anybody wanting a useful system on 486, have to reconfigure
the modern i386 kernel anyway.
Discussed with: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
and export them to userland.
- Define __HAVE_REG32 on platforms that define a reg32 structure and check
for this in <sys/procfs.h> to control when to export prstatus32, etc.
- Add prstatus32_t and prpsinfo32_t typedefs for the 32-bit structures.
libbfd looks for these types, and having them fixes 'gcore' in gdb of a
32-bit process on a 64-bit platform.
- Use the structure definitions from <sys/procfs.h> in gcore's elf32 core
dump code instead of duplicating the definitions.
Differential Revision: https://reviews.freebsd.org/D2142
Reviewed by: kib, nathanw (powerpc bits)
MFC after: 1 week
dmar_map_entry. Non-zero offset both increases the required mapping
size, which is handled in dmar_bus_dmamap_load_something1(), and makes
it possible that allocated range crosses boundary, which needs a check
in dmar_gas_match_one().
Reported and tested by: jimharris
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
requested size. If tag restrictions caused split entry, its size is
less then requsted.
Hardware provided by: Michael Fuckner <michael@fuckner.net>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
translation. In particular, despite IO-APICs only take 8bit apic id,
IR translation structures accept 32bit APIC Id, which allows x2APIC
mode to function properly. Extend msi_cpu of struct msi_intrsrc and
io_cpu of ioapic_intsrc to full int from one byte.
KPI of IR is isolated into the x86/iommu/iommu_intrmap.h, to avoid
bringing all dmar headers into interrupt code. The non-PCI(e) devices
which generate message interrupts on FSB require special handling. The
HPET FSB interrupts are remapped, while DMAR interrupts are not.
For each msi and ioapic interrupt source, the iommu cookie is added,
which is in fact index of the IRE (interrupt remap entry) in the IR
table. Cookie is made at the source allocation time, and then used at
the map time to fill both IRE and device registers. The MSI
address/data registers and IO-APIC redirection registers are
programmed with the special values which are recognized by IR and used
to restore the IRE index, to find proper delivery mode and target.
Map all MSI interrupts in the block when msi_map() is called.
Since an interrupt source setup and dismantle code are done in the
non-sleepable context, flushing interrupt entries cache in the IR
hardware, which is done async and ideally waits for the interrupt,
requires busy-wait for queue to drain. The dmar_qi_wait_for_seq() is
modified to take a boolean argument requesting busy-wait for the
written sequence number instead of waiting for interrupt.
Some interrupts are configured before IR is initialized, e.g. ACPI
SCI. Add intr_reprogram() function to reprogram all already
configured interrupts, and call it immediately before an IR unit is
enabled. There is still a small window after the IO-APIC redirection
entry is reprogrammed with cookie but before the unit is enabled, but
to fix this properly, IR must be started much earlier.
Add workarounds for 5500 and X58 northbridges, some revisions of which
have severe flaws in handling IR. Use the same identification methods
as employed by Linux.
Review: https://reviews.freebsd.org/D1892
Reviewed by: neel
Discussed with: jhb
Tested by: glebius, pho (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
queue. They are for first-level translations and device TLB.
Review: https://reviews.freebsd.org/D1892
Reviewed by: neel
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
allocator tries to move the entry up, after the boundary. The new
location may still fail to satisfy boundary requirement, for instance,
if the boundary is set to page size, and allocation is of multiple
pages.
Recheck that boundary is not crossed after the move. If it is
crossed, give up on allocating the whole entry and split it.
Reported by: Michael Fuckner <michael@fuckner.net>, running nvme(4)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
next entry does not intersect with the tail of the new entry, but also
that previous entry is also before new entry start.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vectors to be dynamically allocated. This allows kernel modules like vmm.ko
to allocate unique IPI slots when loaded (as opposed to hard allocating one
or more vectors).
Also, reorganize the fixed IPI vectors to create a contiguous space for
dynamic IPI allocation.
Reviewed by: kib, jhb
Differential Revision: https://reviews.freebsd.org/D2042
Change the numeric value of IPI_STOP_HARD so it doesn't occupy a valid IPI
slot. This can be done because IPI_STOP_HARD is actually delivered via NMI.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D1983
the cache line flush in the LAPIC page, keep direct map page covering
LAPIC mapped uncached.
To have the (incomplete) check for the LAPIC range in
pmap_invalidate_cache_range() working, lapic_paddr must be initialized
in x2APIC mode too.
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
r278854 introduced a race in the event channel handling code. We must make
sure that the pending bit is cleared before executing the filter, or else we
might miss other events that would be injected after the filter has ran but
before the pending bit is cleared.
While there also mask event channels while FreeBSD executes the ithread
bound to that event channel. This refrains Xen from injecting more
interrupts while the ithread has not finished it's work.
Sponsored by: Citrix Systems R&D
Reported by: sbruno, robak
Tested by: robak
level-triggered interrupt does not broadcast the EOI message to all
APICs in the system. Instead, interrupt handler must follow LAPIC EOI
with IOAPIC EOI. For modern IOAPICs, the later is done by writing to
EOIR register. Otherwise, Intel provided Linux with a trick of
temporary switching the pin config to edge and then back to level.
Detect presence of EOIR register by reading IO-APIC version. The
summary table in the comments was taken from the Linux kernel. For
Intel, newer IO-APICs are only briefly documented as part of the
ICH/PCH datasheet. According to the BKDG and chipset documentation,
AMD LAPICs do not provide EOI suppression, althought IO-APICs do
declare version 0x21 and implement EOIR.
The trick to temporary switch pin to edge mode to clear IRR was tested
on modern chipset, by pretending that EOIR is not present, i.e. by
forcing io_haseoi to zero.
Tunable hw.lapic_eoi_suppression disables the optimization.
Reviewed by: neel
Tested by: pho
Review: https://reviews.freebsd.org/D1943
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
declares support for it. Newer versions of Xen works fine with x2APIC
code, but e.g. Xen 4.2 delivers GPF on the LAPIC MSR write, despite
x2APIC mode being known to hypervisor.
Discussed with: royger
Sponsored by: The FreeBSD Foundation
follow specification and do not provide PCIe capability.
Verify if the port above such bridge is downstream PCIe (or root port)
and treat the bridge as PCIe/PCI then. This allows to avoid
maintaining the table of device ids for bridges without capability,
while still calculate correct request originator for devices behind
the bridge.
Submitted by: Jason Harmening <jason.harmening@gmail.com>
MFC after: 1 week
Remove unneeded disable of LAPIC in the native_lapic_xapic_mode(). We
attempt to send wakeup IPI on the resume path right after BSP wakeup,
so disabling is wrong.
Reported and tested by: glebius, "Ranjan1018 ." <214748mv@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
Devices that use ISA IRQs expect them to be already configured, and don't
call bus_config_intr, which prevents those IRQs from working on Xen. In
order to solve it pre-register all the legacy IRQs with the default values
(edge triggered, low polarity) if no override is found.
While there add a panic if the registration of an interrupt override fails.
Sponsored by: Citrix Systems R&D
Improve and cleanup the Xen PIRQ event channel code:
- Remove the xi_shared field as it is unused.
- Clean the "pending" bit in the EOI handler, this is more similar to how
native interrupts are handled.
- Don't mask edge triggered PIRQs, edge trigger interrupts cannot be
masked.
- Panic if PHYSDEVOP_eoi fails.
- Remove the usage of the PHYSDEVOP_alloc_irq_vector hypercall because
it's just a no-op in the Xen versions that are supported by FreeBSD Dom0.
Sponsored by: Citrix Systems R&D
redirection support. Older versions of the hypervisor mis-interpret
the cpuid format in ioapic registers when x2APIC is turned on, but IR
is not used by the guest OS.
Based on: Linux commit 4cca6ea04d31c22a7d0436949c072b27bde41f86
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
VT-d specification. Also add definitions for the interrupt remapping
table and IEC.
Print new capabilities on boot. although there is no hardware which
support it.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
hw.x2apic_enable tunable allows disabling it from the loader prompt.
To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications. This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered. This
may be changed later.
In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.
Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.
Reviewed by: neel
Tested by: pho (real hardware), neel (on bhyve)
Discussed with: jhb, grehan
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
Intel Multiprocessor Specification v1.4. The Intel SDM claims that
the INIT IPIs here are invalid, but other systems follow the MP
spec instead.
While here, fix the IPI wait routine to accept a timeout in microseconds
instead of a raw spin count, and don't spin forever during AP startup.
Instead, panic if a STARTUP IPI is not delivered after 20 us.
PR: 196542
Differential Revision: https://reviews.freebsd.org/D1719
MFC after: 2 weeks
This can later use this to determine the TSC frequency like is done with
VMware, instead of using a DELAY loop that is not always accurate in an VM.
MFC after: 1 month
KVM clock shares the same data structures between the guest and the host
as Xen so it makes sense to just have a single copy of this code.
Differential Revision: https://reviews.freebsd.org/D1429
Reviewed by: royger (eariler version)
MFC after: 1 month
P-state but not C-state invariant TSC by changing the default behavior
to leaving the TSC enabled as the timecounter and disabling C2+ instead
of disabling the TSC by default.
Discussed with: jkim
Tested by: Jan Kokemuller <jan.kokemueller@gmail.com>
The data in MODINFOMD_MODULEP is packed by the loader as a 4 byte type, but
the amd64 kernel expects a vm_paddr_t, which is of size 8 bytes. Fix this by
saving it as 8 bytes in the loader and retrieving it using the proper type
in the kernel.
Sponsored by: Citrix Systems R&D
Prior to this change CLOCK_MONOTONIC could go backwards when the timecounter
hardware was changed via 'sysctl kern.timecounter.hardware'. This happened
because the vdso timehands update was missing the special treatment in
tc_windup() when changing timecounters.
Reviewed by: kib