Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts. Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254. MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI. Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.
Replace statically assigned ranges with dynamic ranges. Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required. Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed. To minimize runtime complexity these arrays are only sized
once during bootup. The PIC range is determined by the PICs present
in the system. The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.
As a result, various places are updated to use dynamic limits instead
of constants. In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump. This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.
This change reverts r189404 which worked around a buggy BIOS which
enumerated an I/O APIC twice (using the same memory mapped address for
both entries but using an IRQ base of 256 for one entry and a valid
IRQ base for the second entry). Making the "base" of MSI IRQ values
dynamic avoids the panic that r189404 worked around, and there may now
be valid I/O APICs with an IRQ base above 256 which this workaround
would incorrectly skip.
If in the future the issue reported in PR 130483 reoccurs, we will
have to add a pass over the I/O APIC entries in the MADT to detect
duplicates using the memory mapped address and use some strategy to
choose the "correct" one.
While here, reserve room in intrcnts for the Hyper-V counters.
PR: 229429, 130483
Reviewed by: kib, royger, cem
Tested by: royger (Xen), kib (DMAR)
Approved by: re (gjb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16861
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel. Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method. For the time being only Intel updates are supported.
Microcode update files are passed to the kernel via loader(8). The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update. Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update. Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16370
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
This change adds a new optional console method cn_resume and a kernel
console interface cnresume. Consoles that may need to re-initialize
their hardware after suspend (e.g., because firmware does not care to do
it) will implement cn_resume. Note that it is called in rather early
environment not unlike early boot, so the same restrictions apply.
Platform specific code, for platforms that support hardware suspend,
should call cnresume early after resume, before any console output is
expected.
This change fixes a problem with a system of mine failing to resume when
a serial console is used. I found that the serial port was in a strange
configuration and an attempt to write to it likely resulted in an
infinite loop.
To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER
macro has been extended to support optional methods.
Reviewed by: imp, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15552
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.
Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto
Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).
SSBD bit detection and control was verified with prerelease microcode.
Security: CVE-2018-3639
Tested by: emaste (previous version, without updated microcode)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Resume starts CPU from the init state, which clears any loaded
microcode updates. As result, IBRS MSRs are no longer available,
until the microcode is reloaded.
I have to forcibly clear cpu_stdext_feature3, which assumes that CPUID
leaf 7 reg %ebx does not report anything except Meltdown/Spectre bugs
bits. If future CPUs add new bits there, hw_ibrs_recalculate() and
identify_cpu1()/identify_cpu2() need to be adjusted for that.
Submitted and tested by: Michael Danilov <mike.d.ft402@gmail.com>
PR: 227866
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15236
This sysctl allows a deeper dive into the sleep abyss comparing to
debug.acpi.suspend_bounce. When the new sysctl is set the system will
execute the suspend sequence up to the call to AcpiEnterSleepState().
That includes saving processor contexts and parking APs. Then, instead
of actually entering the sleep state, the BSP will call resumectx() to
emulate the wakeup. The APs should get restarted by the sequence of
Init and Startup IPIs that BSP sends to them.
MFC after: 8 days
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.
By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap. The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.
There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.
copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline. If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.
The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done. The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging. I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.
Tested by: pho
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14633
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now legacy numa_* syscalls.
Cleanup some header polution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
restart_cpus() worked well enough by accident. Before this set of fixes,
resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to
restart) as restart_cpus(). resume_cpus() waited for the wrong cpuset
(stopped_cpus) to become empty, but since mixtures of stopped and suspended
CPUs are not close to working, stopped_cpus must be empty when resuming so
the wait is null -- restart_cpus just allows the other CPUs to restart and
returns without waiting.
Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and
add further kludges to try to keep it working for the XEN case. It
was only used for XEN. It waited on suspended_cpus. This works for
XEN. However, for ACPI, resuming is a 2-step process. ACPI has already
woken up the other CPUs and removed them from suspended_cpus. This
fix records the move by putting them in a new cpuset resuming_cpus.
Waiting on suspended_cpus would give the same null wait as waiting on
stopped_cpus. Wait on resuming_cpus instead.
Add a cpuset toresume_cpus to map the CPUs being told to resume to keep
this separate from the cpuset started_cpus for mapping the CPUs being told
to restart. Mixtures of stopped and suspended/resuming CPUs are still far
from working. Describe new and some old cpusets in comments.
Add further kludges to cpususpend_handler() to try to avoid breaking it
for XEN. XEN doesn't use resumectx(), so it doesn't use the second
return path for savectx(), and it goes from the suspended state directly
to the restarted state, while ACPI resume goes through the resuming state.
Enter the resuming state early for all cases so that resume_cpus can test
for being in this state and not have to worry about the intermediate
!suspended state for ACPI only.
Reviewed by: kib
it by a transient double mapping for the one instruction in ACPI wakeup
where it is needed (and for many surrounding instructions in ACPI resume).
Invalidate the TLB as soon as convenient after undoing the transient
mapping. ACPI resume already has the strict ordering needed for this.
This fixes the non-trapping of null pointers and other garbage pointers
below NBPDR (except transiently). NBPDR is quite large (4MB, or 2MB for
PAE).
This fixes spurious traps at the first instruction in VM86 bioscalls.
The traps are for transiently missing read permission in the first
VM86 page (physical page 0) which was just written to at KERNBASE in
the kernel. The mechanism is unknown (it is not simply PG_G).
locore uses a similar but larger transient double mapping and needs
it for 2 instructions instead of 1. Unmap the first PDE in it after
the 2 instructions to detect most garbage pointers while bootstrapping.
pmap_bootstrap() finishes the unmapping.
Remove the avoidance of the double mapping for a recently fixed special
case. ACPI resume could use this avoidance (made non-special) to avoid
any problems with the transient double mapping, but no such problems
are known.
Update comments in locore. Many were for old versions of FreeBSD which
tried to map low memory r/o except for special cases, or might have
allowed access to low memory via physical offsets. Now all kernel
maps are r/w, and removal of of the double map disallows use of physical
offsets again.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Fix from fallout introduced in r322348 that moved the cpus array to a
dynamic allocation without zeroing the area.
Reported by: mjg
MFC with: r322348
Reviewed by: mjg
Differential revision: https://reviews.freebsd.org/D12220
Introduce a new define to take int account the xAPIC ID limit, for
systems where x2APIC is not available/reliable.
Also change some of the usages of the APIC ID to use an unsigned int
(which is the correct storage type to deal with x2APIC IDs as found in
x2APIC MADT entries).
This allows booting FreeBSD on a box with 256 CPUs and APIC IDs up to
295:
FreeBSD/SMP: Multiprocessor System Detected: 256 CPUs
FreeBSD/SMP: 1 package(s) x 64 core(s) x 4 hardware threads
Package HW ID = 0
Core HW ID = 0
CPU0 (BSP): APIC ID: 0
CPU1 (AP/HT): APIC ID: 1
CPU2 (AP/HT): APIC ID: 2
CPU3 (AP/HT): APIC ID: 3
[...]
Core HW ID = 73
CPU252 (AP): APIC ID: 292
CPU253 (AP/HT): APIC ID: 293
CPU254 (AP/HT): APIC ID: 294
CPU255 (AP/HT): APIC ID: 295
Submitted by: kib (previous version)
Relnotes: yes
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11913
So that MAX_APIC_ID can be bumped without wasting memory.
Note that the usage of MAX_APIC_ID in the SRAT parsing forces the
parser to allocate memory directly from the phys_avail physical memory
array, which is not the best approach probably, but I haven't found
any other way to allocate memory so early in boot. This memory is not
returned to the system afterwards, but at least it's sized according
to the maximum APIC ID found in the MADT table.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11912
Populate the lapics arrays and call cpu_add/lapic_create in the setup
phase instead. Also store the max APIC ID found in the newly
introduced max_apic_id global variable.
This is a requirement in order to make the static arrays currently
using MAX_LAPIC_ID dynamic.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11911
Do not try to set LMA bit while CPU is still in legacy mode.
Apparently Intel CPUs ignore non-id writes to LMA, while AMD's
(over-)react with #GP.
Reported and tested by: danfe
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
operation after processor is configured to allow all required
features.
In particular, NX must be enabled in EFER, otherwise load of page
table element with nx bit set causes reserved bit page fault. Since
malloc uses direct mapping for small allocations, in particular for
the suspension pcbs, and DMAP is nx after r316767, this commit tripped
fault on resume path.
Restore complete state of EFER while wakeup code is still executing
with custom page table, before calling resumectx, instead of trying to
guess which features might be needed before resumectx restored EFER on
its own.
Bisected and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
Ignore them like it's done in the MADT parser. This allows booting on a box
with SRAT and APIC IDs > 255.
Reported by: Wei Liu <wei.liu2@citrix.com>
Tested by: Wei Liu <wei.liu2@citrix.com>
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Citrix Systems R&D
and device npx.
This means that FPU is always initialized and handled when available,
and SSE+ register file and exception are handled when available. This
makes the kernel FPU code much easier to maintain by the cost of
slight bloat for CPUs older than 25 years.
CPU_DISABLE_CMPXCHG outlived its usefulness, see the removed comment
explaining the original purpose.
Suggested by and discussed with: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
If BIOS performed hand-off to OS with BSP LAPIC in the x2APIC mode,
system usually consumes such configuration without a notice, since
x2APIC is turned on by OS if possible (nop). But if BIOS
simultaneously requested OS to not use x2APIC, code assumption that
that xAPIC is active breaks.
In my opinion, we cannot safely turn off x2APIC if control is passed
in this mode. Make madt.c ignore user or BIOS requests to turn x2APIC
off, and do not check the x2APIC black list. Just trust the config
and try to continue, giving a warning in dmesg.
Reported and tested by: Slawa Olhovchenkov <slw@zxy.spb.ru> (previous version)
Diagnosed by and discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Since these pages are allocated from a narrow range of memory, this makes
the allocation more likely to succeed.
Suggested by: kib
Reviewed by: jkim, kib
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D7154
If the allocation attempt fails, we may otherwise VM_WAIT after a failed
attempt to reclaim contiguous memory in the requested range. After r297466,
this results in the thread going to sleep, causing a hang during boot.
Reviewed by: jkim, kib
Approved by: re (gjb)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D6945
Instead of panicking when parsing an invalid ACPI SRAT table,
just ignore it, effectively disabling NUMA.
https://lists.freebsd.org/pipermail/freebsd-current/2016-May/060984.html
Reported and tested by: Bill O'Hanlon (bill.ohanlon at gmail.com)
Reviewed by: jhb
MFC after: 1 week
Relnotes: If dmesg shows "SRAT: Duplicate local APIC ID",
try updating your BIOS to fix NUMA support.
Sponsored by: Dell Inc.
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Reviewed by: wblock (manpage)
Differential Revision: https://reviews.freebsd.org/D5519
If we reached MAXMEMDOM, we would previously try to insert an additional
element and only detect overflow after causing (probably trivial) memory
overflow. Instead, detect the ndomain > MAXMEMDOM case before we write past
the end.
Reported by: Coverity
CID: 1354783
Sponsored by: EMC / Isilon Storage Division
system. This uses the hints mechnanism. This mostly works today
because when there's no static hints (the default), this value can be
fetched from the hint. When there is a static hints file, the hint
passed from the boot loader to the kernel is ignored, but for the BIOS
case we're able to find it anyway. However, with UEFI, the fallback
doesn't work, so we get a panic instead.
Switch to acpi.rsdp and use TUNABLE_ULONG_FETCH instead. Continue to
generate the old values to allow for transitions. In addition, fall
back to the old method if the new method isn't present.
Add comments about all this.
Differential Revision: https://reviews.freebsd.org/D5866
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().
MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782
that was recently added for Lenovo laptops.
This is a prime candidate for conversion into a table and also
checking other fields like "product".
Tested:
* ASUS UX31E
believe that the bug only affects mobile CPUs, at least I did not see
other reports, but it is impossible to detect it in madt_setup_local().
While there, reduce duplication in the information strings printed
when x2APIC is auto-disabled, and do not print the line when user
manually override the setting.
Tested and reviewed by: royger (previous version)
Sponsored by: The FreeBSD Foundation
BIOS has been seen to include such entries even though the relevant specs
require that X2APIC entries only be used for CPUs with an APIC ID >= 255.
This was tested on a system with "plain" local APIC entries in the MADT
to ensure no regressions, but it has not yet been tested on a system with
X2APIC entries in the MADT. Currently such systems do not boot at all,
and with this change they might now boot correctly.
Differential Revision: https://reviews.freebsd.org/D2521
Reviewed by: kib
MFC after: 2 weeks
a basic ACPI SLIT table parser.
For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.
* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.
Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
use PAE format for the page tables, but does not incur other
consequences of the full PAE config. In particular, vm_paddr_t and
bus_addr_t are left 32bit, and max supported memory is still limited
by 4GB.
The option allows to have nx permissions for memory mappings on i386
kernel, while keeping the usual i386 KBI and avoiding the kernel data
sizing problems typical for the PAE config.
Intel documented that the PAE format for page tables is available
starting with the Pentium Pro, but it is possible that the plain
Pentium CPUs have the required support (Appendix H). The goal is to
enable the option and non-exec mappings on i386 for the GENERIC
kernel. Anybody wanting a useful system on 486, have to reconfigure
the modern i386 kernel anyway.
Discussed with: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks