If BIOS performed hand-off to OS with BSP LAPIC in the x2APIC mode,
system usually consumes such configuration without a notice, since
x2APIC is turned on by OS if possible (nop). But if BIOS
simultaneously requested OS to not use x2APIC, code assumption that
that xAPIC is active breaks.
In my opinion, we cannot safely turn off x2APIC if control is passed
in this mode. Make madt.c ignore user or BIOS requests to turn x2APIC
off, and do not check the x2APIC black list. Just trust the config
and try to continue, giving a warning in dmesg.
Reported and tested by: Slawa Olhovchenkov <slw@zxy.spb.ru> (previous version)
Diagnosed by and discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Since these pages are allocated from a narrow range of memory, this makes
the allocation more likely to succeed.
Suggested by: kib
Reviewed by: jkim, kib
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D7154
If the allocation attempt fails, we may otherwise VM_WAIT after a failed
attempt to reclaim contiguous memory in the requested range. After r297466,
this results in the thread going to sleep, causing a hang during boot.
Reviewed by: jkim, kib
Approved by: re (gjb)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D6945
Instead of panicking when parsing an invalid ACPI SRAT table,
just ignore it, effectively disabling NUMA.
https://lists.freebsd.org/pipermail/freebsd-current/2016-May/060984.html
Reported and tested by: Bill O'Hanlon (bill.ohanlon at gmail.com)
Reviewed by: jhb
MFC after: 1 week
Relnotes: If dmesg shows "SRAT: Duplicate local APIC ID",
try updating your BIOS to fix NUMA support.
Sponsored by: Dell Inc.
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Reviewed by: wblock (manpage)
Differential Revision: https://reviews.freebsd.org/D5519
If we reached MAXMEMDOM, we would previously try to insert an additional
element and only detect overflow after causing (probably trivial) memory
overflow. Instead, detect the ndomain > MAXMEMDOM case before we write past
the end.
Reported by: Coverity
CID: 1354783
Sponsored by: EMC / Isilon Storage Division
system. This uses the hints mechnanism. This mostly works today
because when there's no static hints (the default), this value can be
fetched from the hint. When there is a static hints file, the hint
passed from the boot loader to the kernel is ignored, but for the BIOS
case we're able to find it anyway. However, with UEFI, the fallback
doesn't work, so we get a panic instead.
Switch to acpi.rsdp and use TUNABLE_ULONG_FETCH instead. Continue to
generate the old values to allow for transitions. In addition, fall
back to the old method if the new method isn't present.
Add comments about all this.
Differential Revision: https://reviews.freebsd.org/D5866
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().
MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782
that was recently added for Lenovo laptops.
This is a prime candidate for conversion into a table and also
checking other fields like "product".
Tested:
* ASUS UX31E
believe that the bug only affects mobile CPUs, at least I did not see
other reports, but it is impossible to detect it in madt_setup_local().
While there, reduce duplication in the information strings printed
when x2APIC is auto-disabled, and do not print the line when user
manually override the setting.
Tested and reviewed by: royger (previous version)
Sponsored by: The FreeBSD Foundation
BIOS has been seen to include such entries even though the relevant specs
require that X2APIC entries only be used for CPUs with an APIC ID >= 255.
This was tested on a system with "plain" local APIC entries in the MADT
to ensure no regressions, but it has not yet been tested on a system with
X2APIC entries in the MADT. Currently such systems do not boot at all,
and with this change they might now boot correctly.
Differential Revision: https://reviews.freebsd.org/D2521
Reviewed by: kib
MFC after: 2 weeks
a basic ACPI SLIT table parser.
For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.
* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.
Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
use PAE format for the page tables, but does not incur other
consequences of the full PAE config. In particular, vm_paddr_t and
bus_addr_t are left 32bit, and max supported memory is still limited
by 4GB.
The option allows to have nx permissions for memory mappings on i386
kernel, while keeping the usual i386 KBI and avoiding the kernel data
sizing problems typical for the PAE config.
Intel documented that the PAE format for page tables is available
starting with the Pentium Pro, but it is possible that the plain
Pentium CPUs have the required support (Appendix H). The goal is to
enable the option and non-exec mappings on i386 for the GENERIC
kernel. Anybody wanting a useful system on 486, have to reconfigure
the modern i386 kernel anyway.
Discussed with: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
declares support for it. Newer versions of Xen works fine with x2APIC
code, but e.g. Xen 4.2 delivers GPF on the LAPIC MSR write, despite
x2APIC mode being known to hypervisor.
Discussed with: royger
Sponsored by: The FreeBSD Foundation
Remove unneeded disable of LAPIC in the native_lapic_xapic_mode(). We
attempt to send wakeup IPI on the resume path right after BSP wakeup,
so disabling is wrong.
Reported and tested by: glebius, "Ranjan1018 ." <214748mv@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
redirection support. Older versions of the hypervisor mis-interpret
the cpuid format in ioapic registers when x2APIC is turned on, but IR
is not used by the guest OS.
Based on: Linux commit 4cca6ea04d31c22a7d0436949c072b27bde41f86
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
hw.x2apic_enable tunable allows disabling it from the loader prompt.
To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications. This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered. This
may be changed later.
In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.
Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.
Reviewed by: neel
Tested by: pho (real hardware), neel (on bhyve)
Discussed with: jhb, grehan
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
kernel via the global cpuset_domain[] array. To export these to userland,
add a CPU_WHICH_DOMAIN level that can be used to fetch the mask for a
specific domain. Add a -d flag to cpuset(1) that can be used to fetch
the mask for a given domain.
Differential Revision: https://reviews.freebsd.org/D1232
Submitted by: jeff (kernel bits)
Reviewed by: adrian, jeff
support for AVX on i386.
- Similar to amd64, move the FPU save area out of the PCB and instead
store saved FPU state in a variable-sized buffer after the PCB on the
stack.
- To support the variable PCB location, alter the locore code to only use
the bottom-most page of proc0stack for init386(). init386() returns
the correct stack pointer to locore which adjusts the stack for thread0
before calling mi_startup().
- Don't bother setting cr3 in thread0's pcb in locore before calling
init386(). It wasn't used (init386() overwrote it at the end) and
it doesn't work with the variable-sized FPU save area.
- Remove the new-bus attachment from npx. This was only ever useful for
external co-processors using IRQ13, but those have not been supported
for several years. npxinit() is now called much earlier during boot
(init386()) similar to amd64.
- Implement PT_{GET,SET}XSTATE and I386_GET_XFPUSTATE.
- npxsave() is now only called from context switch contexts so it can
use XSAVEOPT.
Differential Revision: https://reviews.freebsd.org/D1058
Reviewed by: kib
Tested on: FreeBSD/i386 VM under bhyve on Intel i5-2520
resume that is a superset of a pcb. Move the FPU state out of the pcb and
into this new structure. As part of this, move the FPU resume code on
amd64 into a C function. This allows resumectx() to still operate only on
a pcb and more closely mirrors the i386 code.
Reviewed by: kib (earlier version)
of this patch, resumectx() called npxresume() directly, but that doesn't
work because resumectx() runs with a non-standard %cs selector. Instead,
all of the FPU suspend/resume handling is done in C.
MFC after: 1 week
Split a portion of the code in madt_parse_interrupt_override to a
separate function, that is public and can be used from other code.
This will be needed by the Xen port, since FreeBSD needs to parse the
interrupt overrides and notify Xen about them.
This commit should not introduce any functional change.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb, gibbs
x86/acpica/madt.c:
- Introduce madt_parse_interrupt_values() that parses the intr
information from ACPI and returns the triggering and the polarity.
This is a subset of the functionality that used to be part of
madt_parse_interrupt_override().
- Make madt_found_sci_override a global variable that can be used
from other files.
x86/include/acpica_machdep.h:
- Prototype of madt_parse_interrupt_values.
- Extern declaration of madt_found_sci_override.
Lower the quality of the MADT ACPI enumerator, so on Xen Dom0 we can
force the usage of the Xen mptable enumerator even when ACPI is
detected.
This is needed because Xen might restrict the number of vCPUs
available to Dom0, but the MADT ACPI table parsed in FreeBSD is the
native one (which enumerates all the CPUs available in the system).
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
x86/acpica/madt.c:
- Lower MADT enumerator quality to -50.
x86/xen/pvcpu_enum.c:
- Rise Xen PV enumerator to 0.
field. Perform vcpu enumeration for Xen PV and HVM environments
and convert all Xen drivers to use vcpu_id instead of a hard coded
assumption of the mapping algorithm (acpi or apic ID) in use.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
amd64/include/pcpu.h:
i386/include/pcpu.h:
Add vcpu_id to the amd64 and i386 pcpu structures.
dev/xen/timer/timer.c
x86/xen/xen_intr.c
Use new vcpu_id instead of assuming acpi_id == vcpu_id.
i386/xen/mp_machdep.c:
i386/xen/mptable.c
x86/xen/hvm.c:
Perform Xen HVM and Xen full PV vcpu_id mapping.
x86/xen/hvm.c:
x86/acpica/madt.c
Change SYSINIT ordering of acpi CPU enumeration so that it
is guaranteed to be available at the time of Xen HVM vcpu
id mapping.
Xen PVHVM guest.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
- Make sure that are no MMU related IPIs pending on migration.
- Reset pending IPI_BITMAP on resume.
- Init vcpu_info on resume.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
- Add a "suspend_cancelled" parameter to pic_resume(). For the
Xen PIC, restoration of interrupt services differs between
the aborted suspend and normal resume cases, so we must provide
this information.
sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
- Don't swap out "suspend safe" timers across a suspend/resume
cycle. This includes the Xen PV and ACPI timers.
sys/dev/xen/control/control.c:
- Perform proper suspend/resume process for PVHVM:
- Suspend all APs before going into suspension, this allows us
to reset the vcpu_info on resume for each AP.
- Reset shared info page and callback on resume.
sys/dev/xen/timer/timer.c:
- Implement suspend/resume support for the PV timer. Since FreeBSD
doesn't perform a per-cpu resume of the timer, we need to call
smp_rendezvous in order to correctly resume the timer on each CPU.
sys/dev/xen/xenpci/xenpci.c:
- Don't reset the PCI interrupt on each suspend/resume.
sys/kern/subr_smp.c:
- When suspending a PVHVM domain make sure there are no MMU IPIs
in-flight, or we will get a lockup on resume due to the fact that
pending event channels are not carried over on migration.
- Implement a generic version of restart_cpus that can be used by
suspended and stopped cpus.
sys/x86/xen/hvm.c:
- Implement resume support for the hypercall page and shared info.
- Clear vcpu_info so it can be reset by APs when resuming from
suspension.
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
- Support UP kernel configurations.
sys/x86/xen/xen_intr.c:
- Properly rebind per-cpus VIRQs and IPIs on resume.
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
freelist.
o Split the pool of free pages queues really by domain and not rely on
definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
function that is called when calculating the allocation domain.
The RR counter is kept, currently, per-thread.
In the future it is expected that such function evolves in a real
policy decision referee, based on specific informations retrieved by
per-thread and per-vm_object attributes.
o Add the concept of "probed domains" under the form of vm_ndomains.
It is responsibility for every architecture willing to support multiple
memory domains to correctly probe vm_ndomains along with mem_affinity
segments attributes. Those two values are supposed to remain always
consistent.
Please also note that vm_ndomains and td_dom_rr_idx are both int
because segments already store domains as int. Ideally u_int would
have much more sense. Probabilly this should be cleaned up in the
future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().
Sponsored by: EMC / Isilon storage division
Partly obtained from: jeff
Reviewed by: alc
Tested by: jeff