The ifdefs were '#if !defined(__i386__) || !defined(PC98)' previously,
so cpu_idle_acpi was enabled both i386 and amd64 except PC98.
I was obfuscated by '#if !defined(__i386__)' condition.
Submitted by: bde
Reported by: bde
Convert PCIe hot plug support over to asking the firmware, if any, for
permission to use the HotPlug hardware. Implement pci_request_feature
for ACPI. All other host pci connections to allowing all valid feature
requests.
Sponsored by: Netflix
We have an original panic. Then, instead of writing the core to the dump
device, the kernel has a second panic: "smp_targeted_tlb_shootdown:
interrupts disabled". This change is an attempt to fix that second panic.
When the other CPUs are stopped, we can't notify them of the TLB shootdown,
so we skip that operation. However, when the CPUs come back up, we
invalidate the TLB to ensure they correctly observe any changes to the
page mappings.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9786
On Core2 and older Intel CPUs, where TSC stops in C2, system does not
allow C2 entrance if timecounter hardware is TSC. This is done by
tc_windup() which tests for TC_FLAGS_C2STOP flag of the new
timecounter and increases cpu_disable_c2_sleep if flag is set. Right
now init_TSC_tc() only sets the flag if cpu_deepest_sleep >= 2, but
TSC is initialized too early for this variable to be set by
acpi_cpu.c.
There is no reason to require that ACPI reported C2 and deeper states
to set TC_FLAGS_C2STOP, so remove cpu_deepest_sleep test from
init_TSC_tc() condition. And since this is the only use of the
variable, remove it at all.
Reported and submitted by: Jia-Shiun Li <jiashiun@gmail.com>
Suggested by: jhb
MFC after: 2 weeks
Use large enough type for calculation of mtrr physmask. Typical
cpu_maxphyaddr is 36 or larger.
Reported and tested by: sbruno
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
and there is no reason to check cpu family or vendor.
Noted by: royger
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9657
code. Also fix cast and remove unneeded XXX in comment.
Noted and reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9657
compile options. Remove doxygen pointers to now deleted files. Remove
EISA and VME as examples in bus_space.9.
Retained EISA mode code for IO PIC and MPTABLES because that's not
EISA bus, per se, and some people have abused EISA to mean "EISA-like
behavior as opposed to ISA" rather than using it for EISA add-in
cards.
Relnotes: yes
machines, only a few 486 machines that used it, and those haven't had
enough memory to run FreeBSD for quite some time (often limited to
16MB).
Not to be confused with the Machine Check Architecture, which is still
very much alive and used (and untouched by this commit).
No Objection From: arch@
This solves several problems.
First of all, cmc_throttle is specified in seconds and there was no
conversion between ticks and seconds when they were mixed together.
Second, we avoid potential problems with ticks wrapping around.
Resolution of time_uptime should be sufficient for the throttling
purposes.
Discussed with: jhb
MFC after: 12 days
Previously, if the threshold was changed, then MC_CTL2_CMCI_EN would get
cleared and the logic would switch to the polling only mode.
Discussed with: jhb
MFC after: 2 weeks
using the ACPI C1/mwait sleep method.
Previously, the mwait instruction would return when an interrupt was
pending; however, the idle loop did not actually enable interrupts when
this occurred. This led to a situation where the idle loop could quickly
spin through the C1/mwait sleep method a number of times when an interrupt
was pending. (Eventually, the situation corrected itself when something
other than an interrupt triggered the idle loop to either enable interrupts
or schedule another thread.)
Reviewed by: kib, imp (earlier version)
Input from: jhb
MFC after: 1 week
Sponsored by: Netflix
The types are for the byte offset and page index in vm object. They
are similar to off_t, which is defined as 64bit MI integer. Using MI
definitions will allow to provide consistent MD values of vm
object-related maximum sizes.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and device npx.
This means that FPU is always initialized and handled when available,
and SSE+ register file and exception are handled when available. This
makes the kernel FPU code much easier to maintain by the cost of
slight bloat for CPUs older than 25 years.
CPU_DISABLE_CMPXCHG outlived its usefulness, see the removed comment
explaining the original purpose.
Suggested by and discussed with: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Replace archaic "busses" with modern form "buses."
Intentionally excluded:
* Old/random drivers I didn't recognize
* Old hardware in general
* Use of "busses" in code as identifiers
No functional change.
http://grammarist.com/spelling/buses-busses/
PR: 216099
Reported by: bltsrc at mail.ru
Sponsored by: Dell EMC Isilon
These are of the few cases where we use the GCC non-null attributes in
non-header code. As part of a review [1] of our use of such attributes we
are replacing such uses of the overly aggressive GCC attribute with clang's
_Nonnull attribute.
In this case the attributes serve little purpose as they just don't
enforce run time checks, If anything the attributes would cause NULL pointer
checks to be ignored but there are no such checks so only effect is
cosmetic.
The references appear to be left over from code development and likely
already fulfilled their purpose.
Reference [1]:
https://reviews.freebsd.org/D9004
Reviewed by: jhb
MFC after: 3 weeks
Current Xen IPI setup functions require that the caller provide a device in
order to obtain the name of the interrupt from it. With early AP startup this
device is no longer available at the point where IPIs are bound, and a KASSERT
would trigger:
panic: NULL pcpu device_t
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82233a20
vpanic() at vpanic+0x186/frame 0xffffffff82233aa0
kassert_panic() at kassert_panic+0x126/frame 0xffffffff82233b10
xen_setup_cpus() at xen_setup_cpus+0x5b/frame 0xffffffff82233b50
mi_startup() at mi_startup+0x118/frame 0xffffffff82233b70
btext() at btext+0x2c
Fix this by no longer requiring the presence of a device in order to bind IPIs,
and simply use the "cpuX" format where X is the CPU identifier in order to
describe the interrupt.
Reported by: sbruno, cperciva
Tested by: sbruno
X-MFC-With: r310177
Sponsored by: Citrix Systems R&D
This 6 times gettimeofday performance, as measured by
tools/tools/syscall_timing
Reviewed by: kib
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8789
The MCA taskqueue is not initialized until some time after CMCIs are
enabled on the BSP.
Reviewed by: cem, jhb
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8783
stack_machdep.c is compiled if either of the DDB or STACK options is
specified, but stack_save_td_running() isn't useable from DDB. Moreover,
stack_save_td_running() works by raising an NMI on the CPU running the
target thread, and the corresponding handler is compiled only if STACK is
configured.
Reported by: kib
MFC after: 1 week
actual numbers would help debugging (also, `MSR' and `ACPI' are standard
abbreviations and thus should be properly capitalized)
- Rephrase unsupported AMD CPUs message and wrap as an overly long line:
`sorry' 1) is wrongly spelled after period (starts with a small letter)
and 2) carries emotional "tinge" that is unnecessary and even bogus in
debug message; `implemented' is not the best word as `supported' suits
better in this context
- Improve readability when reporting resulted P-state transition (debug)
Approved by: jhb
(APIC-Timer-always-running) is not implemented.
If machine has ncpus >= 8 and non-FSB interrupt routing from HPET,
default HPET eventtimer quality 450 is reduced by 100, i.e. it is
350. On the other hand, LAPIC default quality is 600 and it is reduced
by 200 if ARAT is not reported. We end up with HPET quality 350 <
LAPIC quality 400, despite ARAT is not set. Then, since deep Cx
states are active by default, eventtimer fail.
E.g., on Nehalem Core i7 CPU and X58 chipset, LAPIC only works in
C0/C1/C1E and HPET does not implement FSB mode, which otherwise
requires manual switch to HPET to get working system.
Set LAPIC eventtimer quality to 100 if no ARAT.
While there, do not ignore deadlint TSC mode for LAPIC timer if ARAT
is not implemented. If user manually selected LAPIC eventtimer on
such CPU, there is no reason to not use deadline if available and not
disabled administratively.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
they shouldn't be.
I used this during driver bring-up to find that the Linux driver holds a
whole lot of locks whilst doing their equivalent of busdma operations.
If this works out well, it should be added to the other architecture busdma
implementations to aid in similar debugging.
Tested:
* bounce buffer and dmar busdma, Lenovo X230 laptop, all the internal
hardware
* ath(4) too
Discussed with: jhb
Add a reference count to xenisrc. This is required for implementation of
unmap-notifications in the grant table userspace device (gntdev). We need to
hold a reference to the event channel port, in case the user deallocates the
port before we send the notification.
Submitted by: jaggi
Reviewed by: royger
Differential review: https://reviews.freebsd.org/D7429
Use the same logic to calculate the nominal CPU frequency from the P-state
MSRs on family 0x12, 0x15, and 0x16 CPUs as is used for family 0x10.
Family 0x14 was included in the original patch in the PR but I left that
out as the BIOS writer's guide for family 0x14 CPUs show a different layout
for the relevant MSR and include a different formulate for calculating the
frequency.
While here, simplify a few expressions and print out the family of
unsupported CPUs in hex rather than decimal.
PR: 212020
Submitted by: Anthony Jenkins <Scoobi_doo@yahoo.com>
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7587
Reject attempts to read from or memory map offsets in /dev/mem that are
beyond the maximum-supported physical address of the current CPU.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7408
- Make !KDB config buildable.
- Simplify interface to nmi_handle_intr() by evaluating panic_on_nmi
in one place, namely nmi_call_kdb(). This allows to remove do_panic
argument from the functions, and to remove i386/amd64 duplication of
the variable and sysctl definitions. Note that now NMI causes
panic(9) instead of trap_fatal() reporting and then panic(9),
consistently for NMIs delivered while CPU operated in ring 0 and 3.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.
When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb. All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work. One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".
Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.
Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs. If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD. More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.
Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.
The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8249
All I/O APIC pins are masked when an I/O APIC is first probed. The
APIC enumerator (MP Table or MADT) then parses its associated tables to
configure individual pins to set custom delivery modes or alternate
routing (e.g. routing IRQ 0 to intpin 2). Pins for regular interrupt
pins are left masked until the first interrupt is assigned. However,
pins with unusual settings (e.g. NMI or SMI) are never assigned an
interrupt and thus never re-programmed. The I/O APIC code used to
reprogram all interrupt pins during registration but this was lost in
r151979.
In theory, this is mostly a no-op as the ACPI APIC table does not
include a way to enumerate NMI or SMI pins for the I/O APIC, so only
systems using an MP Table would be affected.
Reported by: avg
MFC after: 1 month
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one. (Running a basic file
server workload.)
Submitted by: Anton Rang <rang at acm.org>
Reviewed by: cem (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8041
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one. (Running a basic file
server workload.)
Submitted by: Anton Rang <rang at acm.org>
Reviewed by: cem (earlier version), kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8041
On Hyper-V:
- Stick to the first cpu for all I/O APIC pins.
- And don't allow destination cpu changes.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7949
If BIOS performed hand-off to OS with BSP LAPIC in the x2APIC mode,
system usually consumes such configuration without a notice, since
x2APIC is turned on by OS if possible (nop). But if BIOS
simultaneously requested OS to not use x2APIC, code assumption that
that xAPIC is active breaks.
In my opinion, we cannot safely turn off x2APIC if control is passed
in this mode. Make madt.c ignore user or BIOS requests to turn x2APIC
off, and do not check the x2APIC black list. Just trust the config
and try to continue, giving a warning in dmesg.
Reported and tested by: Slawa Olhovchenkov <slw@zxy.spb.ru> (previous version)
Diagnosed by and discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
i386-only section, and fix a comment about the amd64 kernel trapframe
not having stackregs.
tf_rsp doesn't need decoding on amd64, but had an old clone of i386
code to do this in 1 place, and since the amd64 kernel trapframe does
have stackregs, the result was an off-by-16 error for %rsp in an error
message.
The 'cpu' and 'cpu_class' variables were always set to the same value
on amd64 and are legacy holdovers from i386. Remove them entirely on
amd64.
Reviewed by: imp, kib (older version)
Differential Revision: https://reviews.freebsd.org/D7888
SEL_UPL and sometimes PSL_VM. This is just a style change on amd64,
but on i386 it fixes 1 unimportant place where the PSL_VM check was
missing and starts fixing 1 important place where the PSL_VM check
had a logic error.
Fix logic errors in treating vm86 bioscall mode as kernel mode. The
main place checked all the necessary flags, but put the necessary
parentheses for the PSL_VM and PCB_VM86CALL checks in the wrong
place. The broken case is only reached if a vm86 bioscall uses a
%cs which is nonzero mod 4, but that is unusual -- most bios calls
start with %cs = 0xc000 or 0xf000 and rarely change it. Another
place was missing the check for PCB_VM86CALL, but was only reachable
if there are bugs virtualizing PSL_I.
Add a macro TF_HAS_STACKREGS() and use this instead of converting
open-coded checks of SEL_UPL, etc. to TRAPF_USERMODE() when we only
care about whether the frame has stack registers. This fixes 3
places in my recent fix for register variables in vm86 mode where I
messed up the PSL_VM check and cleans up other places.
- Certain pic_assign_cpu, e.g. msi_assign_cpu can have quite a long
call chain. For msi_assign_cpu, mutex makes complex PCI bridge
drivers more tricky, e.g. sleep can note be called, etc, it will
be pretty tricky for upcoming Hyper-V PCI bridge driver for PCI
pass-through.
- It is not used on any hot code path nor non-sleepable context, so
sx should have the same effect as mutex.
PIC list is still protected by mutex to keep suspend/resume work.
Discussed with: jhb
Reviewed by: jhb
MFC after: 3 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7784
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.
PR: 212014
Reported by: Glyn Grinstead <glyn@grinstead.org>
MFC after: 3 days
Right now, userspace (fast) gettimeofday(2) on x86 only works for
RDTSC. For older machines, like Core2, where RDTSC is not C2/C3
invariant, and which fall to HPET hardware, this means that the call
has both the penalty of the syscall and of the uncached hw behind the
QPI or PCIe connection to the sought bridge. Nothing can me done
against the access latency, but the syscall overhead can be removed.
System already provides mappable /dev/hpetX devices, which gives
straight access to the HPET registers page.
Add yet another algorithm to the x86 'vdso' timehands. Libc is updated
to handle both RDTSC and HPET. For HPET, the index of the hpet device
to mmap is passed from kernel to userspace, index might be changed and
libc invalidates its mapping as needed.
Remove cpu_fill_vdso_timehands() KPI, instead require that
timecounters which can be used from userspace, to provide
tc_fill_vdso_timehands{,32}() methods. Merge i386 and amd64
libc/<arch>/sys/__vdso_gettc.c into one source file in the new
libc/x86/sys location. __vdso_gettc() internal interface is changed
to move timecounter algorithm detection into the MD code.
Measurements show that RDTSC even with the syscall overhead is faster
than userspace HPET access. But still, userspace HPET is three-four
times faster than syscall HPET on several Core2 and SandyBridge
machines.
Tested by: Howard Su <howard0su@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7473
Uses of commas instead of a semicolons can easily go undetected. The comma
can serve as a statement separator but this shouldn't be abused when
statements are meant to be standalone.
Detected with devel/coccinelle following a hint from DragonFlyBSD.
MFC after: 1 month
- Add constants for the fields in the root-entry table address register,
namely the root type type (RTT) and root table address (RTA) mask.
- Add macros for the bitmask of the domain ID field in the second word
of context table entries as well as a helper macro (DMAR_CTX2_GET_DID)
to extract the domain ID from a context table entry.
Reviewed by: kib
MFC after: 1 month
Sponsored by: Chelsio Communications
difference between files.
For pc98, put x86/mp_x86.c into the same place as used by i386 file list.
Fix typo in comment.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This was only needed for Xen, and a better way to deal with this issue has
been found, so this commit can be reverted.
Sponsored by: Citrix Systems R&D
MFC after: 5 days
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D7363
Event channel handlers cannot be removed during resume because there might
be an interrupt thread running on a CPU currently blocked in the
cpususpend_handler, which prevents the call to intr_remove_handler from
finishing and completely freezes the system during resume. r291022 tried to
fix this by allowing recursion in intr_remove_handler, but that's clearly
not enough.
Instead don't remove the handlers at the interrupt resume phase, and let
each driver remove the handler by itself during resume. In order to do this,
change the opaque event channel handler cookie to use the global interrupt
vector instead of the event channel port. The event channel port cannot be
used because after resume all event channels are reset, and the port numbers
can change.
Sponsored by: Citrix Systems R&D
MFC after: 5 days
Since these pages are allocated from a narrow range of memory, this makes
the allocation more likely to succeed.
Suggested by: kib
Reviewed by: jkim, kib
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D7154
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.
Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.). (Spotted
by: vangyzen).
Differential revision: https://reviews.freebsd.org/D7172
Sponsored by: Dell Inc.
Approved by: kib (mentor), vangyzen (mentor)
Reviewed by: alc
MFC after: 4 weeks
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.
It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Reviewed by: jhb
Differential revision: https://reviews.freebsd.org/D7148
The new 'machdep.disable_msix_migration' tunable can be set to 1 to
disable migration of MSI-X interrupts.
Xen versions prior to 4.6.0 do not properly handle updates to MSI-X
table entries after the initial write. In particular, the operation
to unmask a table entry after updating it during migration is not
propagated to the "real" table for passthrough devices causing the
interrupt to remain masked. At least some systems in EC2 are
affected by this bug when using SRIOV. The tunable can be set in
loader.conf as a workaround.
Submitted by: Jeremiah Lott <jlott@averesystems.com> (original patch)
Approved by: re (marius)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6947
If the allocation attempt fails, we may otherwise VM_WAIT after a failed
attempt to reclaim contiguous memory in the requested range. After r297466,
this results in the thread going to sleep, causing a hang during boot.
Reviewed by: jkim, kib
Approved by: re (gjb)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D6945
Reduce number of iterations used for calibrating ICR read loop. The
new number of iteration still gives the same ICR latency as before,
tested on Intel SandyBridge and Haswell machines, and on AMD. But it
significantly reduces the unneeded pause on boot in some VMs, from ~10
secs to less then 1 sec. It was reported to occur in bhyve on AMD
host.
Reported and tested by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The existing version depends on register_t and uintptr_t, which are only
available when including headers such as <sys/types.h>. As this macro is
used by <sys/socket.h>, for example, it should be written in such a way
that it doesn't depend on those types.
In r227474, this header file was changed to define SIG_ATOMIC_{MIN,MAX}
in terms of LONG_{MIN,MAX}. Unlike all of the definitions in this header
file, LONG_{MIN,MAX} is provided by <limits.h>. Remove the dependency on
<limits.h> by using __LONG_{MIN,MAX} instead and including
<machine/_limits.h>.
This change is needed to make SIG_ATOMIC_{MIN,MAX} work without
including any other header files.
This header uses __INT_MIN and __INT_MAX, which is provided by
<machine/_limits.h>. This is needed to make <stdint.h>'s WCHAR_MIN and
WCHAR_MAX work without including other headers as well.
switching between LAPIC modes is not supported, and there is no need
to wait for IPI ack in x2APIC mode. So the calibrated delay is only
needed for !x2APIC.
This saves around a second of boot time on the real hardware for
x2APIC.
Sponsored by: The FreeBSD Foundation
Add implementations of bus_map/unmap_resource to the x86 nexus driver.
Change bus_activate/deactivate_resource to honor RF_UNMAPPED and to
use bus_map/unmap_resource to create/destroy the implicit mapping when
RF_UNMAPPED is not set.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D5237
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.
This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.
This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.
However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.
Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.
As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.
These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).
PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix
if specific CPU features are not present.
Some simulation environments, e.g. gem5, have been found to require more
TLB management from the kernel in certain setups. It is currently unclear why.
Turning on the workaround_erratum383 seems to help and make problems (panics)
go away.
Given this is a fairly uncommon environment so far, allowing the workaround
to be manually enabled from loader in order to make debugging and comparing
traces easier, but also to allow gem5 run FreeBSD in X86 timing mode, seems
to be the least intrusive option for now until the issue if fully understood.
Sponsored by: DARPA/AFRL
Reviewed by: kib, alc (earlier)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6206
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
Instead of panicking when parsing an invalid ACPI SRAT table,
just ignore it, effectively disabling NUMA.
https://lists.freebsd.org/pipermail/freebsd-current/2016-May/060984.html
Reported and tested by: Bill O'Hanlon (bill.ohanlon at gmail.com)
Reviewed by: jhb
MFC after: 1 week
Relnotes: If dmesg shows "SRAT: Duplicate local APIC ID",
try updating your BIOS to fix NUMA support.
Sponsored by: Dell Inc.
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Reviewed by: wblock (manpage)
Differential Revision: https://reviews.freebsd.org/D5519
This is going to be used by the Xen clock on Dom0 in order to set the RTC of
the host. The current logic in atrtc_settime is moved to atrtc_set and the
unused device_t parameter is removed from the atrtc_set function call so it
can be safely used by other callers.
Sponsored by: Citrix Systems R&D
Reviewed by: kib, jhb
Differential revision: https://reviews.freebsd.org/D6067
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
If we reached MAXMEMDOM, we would previously try to insert an additional
element and only detect overflow after causing (probably trivial) memory
overflow. Instead, detect the ndomain > MAXMEMDOM case before we write past
the end.
Reported by: Coverity
CID: 1354783
Sponsored by: EMC / Isilon Storage Division
which queued invalidation completion interrupt is requested with
regard to the queued invalidation requests. In other words, setting
the value of the knob to N requests completion interrupt after N items
are processed. Existing behaviour is restored by setting
hw.dmar.batch_coalesce=1.
The knob significantly decreases the DMAR qi interrupt rate at the
cost of slightly longer DMAR map entries recycling.
Sponsored by: The FreeBSD Foundation
initially configured in the TSC deadline mode, eventtimer subsystem
can be switched to periodic, and then DCR register is loaded with
unitialized value.
Reset the LAPIC eventtimer frequency and min/max periods when changing
between deadline and counted periodic modes.
Reported and tested by: Vladimir Zakharov <zakharov.vv@gmail.com>
Sponsored by: The FreeBSD Foundation
Revert r292255 because it can create bounced regions without contiguous
page offsets, which is needed for USB devices.
Another solution would be to force bouncing the full buffer always (even
when only one page requires bouncing), but this seems overly complicated and
unnecessary, and it will probably involve using more bounce pages than the
current code.
Reported by: phk
system. This uses the hints mechnanism. This mostly works today
because when there's no static hints (the default), this value can be
fetched from the hint. When there is a static hints file, the hint
passed from the boot loader to the kernel is ignored, but for the BIOS
case we're able to find it anyway. However, with UEFI, the fallback
doesn't work, so we get a panic instead.
Switch to acpi.rsdp and use TUNABLE_ULONG_FETCH instead. Continue to
generate the old values to allow for transitions. In addition, fall
back to the old method if the new method isn't present.
Add comments about all this.
Differential Revision: https://reviews.freebsd.org/D5866
Some BIOSes disable AMD Topology extension on AMD Family 15h notebook
processors. We re-enable the extension, so that we can properly discover
core and cache topology. Linux seems to do the same.
Reported by: Johannes Dieterich <dieterich.joh@gmail.com>
Reviewed by: jhb, kib
Tested by: Johannes Dieterich <dieterich.joh@gmail.com>
(earlier version)
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D5883
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().
MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782
It is needed by the hypervisor FreeBSD guest to allocate/free private
interrupt vectors.
Reviewed by: kib, jhb, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5849