Summary: POWER8 and POWER9 use a single CPU register, per core, to change clock
speed. Everything else is handled by the on-chip controller. This change
necessitates a change to the cpufreq global kernel driver to bump supported
levels, as the device tree for these systems can have theoretically 256
different options. On my POWER9 Talos, the list consists of 100 items. At
16.67MHz intervals, that allows for a change of roughly 1.67GHz between lowest
and highest.
This has only been tested on the POWER9. However, since they're similar, this
should work on POWER8 as well.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15932
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency. This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields. This can lead to poor perfomance,
and to things potentially getting stuck on Giant.
Reviewed by: kib (imp reviewed earlier version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15246
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.
This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.
This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.
However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.
Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.
As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.
These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).
PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
running at, guess the nearest value instead of looking for a value within
25 MHz of the observed frequency.
Prior to this change, if a system booted with Intel Turbo Boost enabled,
the dev.cpu.0.freq sysctl is nonfunctional, since the ACPI-reported
frequency for Turbo Boost states does not match the actual clock frequency
(and thus no levels are within 25 MHz of the observed frequency) and the
current performance level is read before a new level is set.
MFC after: 3 days
Relnotes: Bug fix in power management on CPUs with Intel Turbo Boost
the cpufreq code. Replace its use with smp_started. There's at least
one userland tool that still looks at the kern.smp.active sysctl, so
preserve it but point it to smp_started as well.
Discussed with: peter, jhb
MFC after: 3 days
Obtained from: Netflix
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.
ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.
I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.
MFC after: 2 weeks
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
At present the cpufreq sysctl handler for current level setting would
allocate and deallocate a temporary buffer of 24KB even to handle a
read-only query. This puts unnecessary load on memory subsystem when
current level is checked frequently, e.g. when the likes of powerd
and system monitoring software are running.
Change the strategy to allocating a long-lived buffer for handling the
requests.
Reviewed by: njl
MFC after: 2 weeks
CPU, if available. This is meant to solve the issue of cpufreq misreporting
speeds on CPUs that boot in a reduced power mode and have only relative
speed control.
On HyperThreading CPUs logical cores have same frequency, so setting it
on any core will change the other's one. In most cases first request
to the second core will be the "set" request, done after setting frequency
of the first core. In such case second CPU will obtain throttled frequency
of the first core as it's max_mhz making cpufreq broken due to different
frequency sets.
method:
- If the last of the child cpufreq drivers returns an error while trying to
fetch its list of supported frequencies but an earlier driver found the
requested frequency, don't return an error to the caller.
- If all of the child cpufreq drivers fail and the attempt to match the
frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than
returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN.
MFC after: 3 days
PR: kern/121433
Reported by: Eugene Grosbein eugen ! kuzbass dot ru
other. The first one survives, the rest are removed. So far, it appears
only some acpi_perf(4) BIOS tables have these invalid states, but address
this in the core to be sure to handle other potential driver data.
PR: kern/114722
Tested by: stefan.lambrev / moneybookers.com
MFC after: 3 days
to change the freq before the other CPUs are active. The current code
always attempts to change all CPUs to match each other, and the requisite
sched_bind() call won't work before APs are launched.
have caused a hang, but we got lucky with the available multi-CPU states
on actual hardware.
Submitted by: Bjorn Koenig <bkoenig / alpha-tierchen.de>
Approved by: re
MFC after: 3 days
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
cpufreq_pre_change is called before the change, giving each driver a chance
to revoke the change. cpufreq_post_change provides the results of the
change (success or failure). cpufreq_levels_changed gives the unit number
of the cpufreq device whose number of available levels has changed. Hook
in all the drivers I could find that needed it.
* TSC: update TSC frequency value. When the available levels change, take the
highest possible level and notify the timecounter set_cputicker() of that
freq. This gets rid of the "calcru: runtime went backwards" messages.
* identcpu: updates the sysctl hw.clockrate value
* Profiling: if profiling is active when the clock changes, let the user
know the results may be inaccurate.
Reviewed by: bde, phk
MFC after: 1 month
1. Walk the absolute list in reverse to prefer duplicated levels that have
a lower absolute setting, i.e. 800 Mhz/50% is better than 1600 Mhz/25% even
though both have the same actual frequency. This also removes the need to
check for already-modified levels since by definition, those will be added
later in the sorted list.
2. Compare the absolute settings for derived levels and don't use the new
level if it's higher. For example, a level of 800 Mhz/75% is preferable to
1600 Mhz/25% even though the latter has a lower total frequency.
This work is based on a patch from the submitter but reworked by myself.
Submitted by: Tijl Coosemans (tijl/ulyssis.org)
debug.cpufreq.lowest tunable and sysctl. Some systems seem to have problems
with the lowest frequencies so setting this prevents them from being
available or used.
driver. This used to be handled by cpufreq_drv_settings() but it's
useful to get the type/flags separately from getting the settings.
(For example, you don't have to pass an array of cf_setting just to find
the driver type.)
Use this new method in our in-tree drivers to detect reliably if acpi_perf
is present and owns the hardware. This simplifies logic in drivers as well
as fixing a bug introduced in my last commit where too many drivers attached.
the rate for the 100% state once. Afterwards, use that value for deriving
states. This should fix the problem where the calibrated frequency was
different once a switch was done, giving a different set of levels each
time. Also, properly search for the right cpufreqX device when detaching.
override the current freq level temporarily and restore it when the
higher priority condition is past. Note that only the first overridden
value is saved. Callers pass NULL to CPUFREQ_SET to restore the saved
level. Priorities are not yet used so this commit should have no effect.
are not added to the list(s) of available settings. However, other drivers
can call the CPUFREQ_DRV_SETTINGS() method on those devices directly to
get info about available settings.
Update the acpi_perf(4) driver to use this flag in the presence of
"functional fixed hardware." Thus, future drivers like Powernow can
query acpi_perf for platform info but perform frequency transitions
themselves.
on dev.cpu.0 will affect all of the CPUs together. In the future,
independent control will be supported but this is good enough for now.
Check that the timecounter isn't TSC before switching (from Colin Percival.)
frequency as a percentage of the base rate and do not change the base
rate directly. The cpufreq framework combines these with absolute drivers
to produce synthesized levels made of one or more settings.