Commit Graph

48 Commits

Author SHA1 Message Date
Emmanuel Vadot
326867616f kern_cpu: When adding abs frequency allow for unordered insertion
Keep the list ordered as some code assume that it is but allow for
unordered cf_settings sets.
2018-07-19 11:28:14 +00:00
Justin Hibbits
22c1b4c0f1 Introduce PMCR-based cpufreq(4) driver, for IBM POWER8 and POWER9 systems
Summary: POWER8 and POWER9 use a single CPU register, per core, to change clock
speed.  Everything else is handled by the on-chip controller.  This change
necessitates a change to the cpufreq global kernel driver to bump supported
levels, as the device tree for these systems can have theoretically 256
different options.  On my POWER9 Talos, the list consists of 100 items.  At
16.67MHz intervals, that allows for a change of roughly 1.67GHz between lowest
and highest.

This has only been tested on the POWER9.  However, since they're similar, this
should work on POWER8 as well.

Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15932
2018-06-21 14:26:43 +00:00
Andrew Gallatin
e7bd0750af Boost thread priority while changing CPU frequency
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency.   This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields.   This can lead to poor perfomance,
and to things potentially getting stuck on Giant.

Reviewed by:	kib (imp reviewed earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15246
2018-05-07 15:24:03 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Pedro F. Giffuni
a18a2290cd kern: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:18:04 +00:00
Pedro F. Giffuni
8a36da99de sys/kern: adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:20:12 +00:00
John Baldwin
fdce57a042 Add an EARLY_AP_STARTUP option to start APs earlier during boot.
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.

This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed).  This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP.  It also permits all CPUs to be available for
handling interrupts before any devices are probed.

This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot.  Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.

However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system.  In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU.  Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.

Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code.  This removes the need to treat the single-CPU boot environment
as a special case.

As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP).  This will allow the option to be turned off
if need be during initial testing.  I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0.  Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.

These changes have only been tested on x86.  Other platform maintainers
are encouraged to port their architectures over as well.  The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

PR:		kern/199321
Reviewed by:	markj, gnn, kib
Sponsored by:	Netflix
2016-05-14 18:22:52 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Colin Percival
760f4dec67 In cf_get_method, when we don't already know what clock speed the CPU is
running at, guess the nearest value instead of looking for a value within
25 MHz of the observed frequency.

Prior to this change, if a system booted with Intel Turbo Boost enabled,
the dev.cpu.0.freq sysctl is nonfunctional, since the ACPI-reported
frequency for Turbo Boost states does not match the actual clock frequency
(and thus no levels are within 25 MHz of the observed frequency) and the
current performance level is read before a new level is set.

MFC after:	3 days
Relnotes:	Bug fix in power management on CPUs with Intel Turbo Boost
2014-05-11 10:32:58 +00:00
Christian Brueffer
ed472910ba Free resources in an error case.
CID:		1018947
Found with:	Coverity Prevent(tm)
MFC after:	1 week
2014-05-02 21:34:17 +00:00
Scott Long
60ad8150c7 Retire smp_active. It was racey and caused demonstrated problems with
the cpufreq code.  Replace its use with smp_started.  There's at least
one userland tool that still looks at the kern.smp.active sysctl, so
preserve it but point it to smp_started as well.

Discussed with: peter, jhb
MFC after: 3 days
Obtained from: Netflix
2014-04-26 20:27:54 +00:00
Alexander Motin
5f3818a56e Revert r175376 and tune cpufreq(4) frequency comparison logic instead.
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.

ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.

I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.

MFC after:	2 weeks
2012-03-10 18:56:16 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Andriy Gapon
dac509311f cpufreq: allocate long-lived buffer for handling of sysctl requests
At present the cpufreq sysctl handler for current level setting would
allocate and deallocate a temporary buffer of 24KB even to handle a
read-only query.  This puts unnecessary load on memory subsystem when
current level is checked frequently, e.g. when the likes of powerd
and system monitoring software are running.
Change the strategy to allocating a long-lived buffer for handling the
requests.

Reviewed by:	njl
MFC after:	2 weeks
2010-07-23 16:46:42 +00:00
Christian Brueffer
30215f483e Free allocated sbufs before returning ENOMEM.
PR:		128335
Submitted by:	Mateusz Guzik <mjguzik@gmail.com>
MFC after:	2 week
2010-01-08 22:58:50 +00:00
Nathan Whitehorn
f436f17508 Provide a new CPU device driver ivar to report the nominal speed of the
CPU, if available. This is meant to solve the issue of cpufreq misreporting
speeds on CPUs that boot in a reduced power mode and have only relative
speed control.
2009-05-31 08:59:15 +00:00
Alexander Motin
d288bcc4df If possible, try to obtain max_mhz on cpufreq attach instead of first request.
On HyperThreading CPUs logical cores have same frequency, so setting it
on any core will change the other's one. In most cases first request
to the second core will be the "set" request, done after setting frequency
of the first core. In such case second CPU will obtain throttled frequency
of the first core as it's max_mhz making cpufreq broken due to different
frequency sets.
2008-12-16 01:24:05 +00:00
John Baldwin
be00f6053b Fix a few edge cases with error handling in cpufreq(4)'s CPUFREQ_GET()
method:
- If the last of the child cpufreq drivers returns an error while trying to
  fetch its list of supported frequencies but an earlier driver found the
  requested frequency, don't return an error to the caller.
- If all of the child cpufreq drivers fail and the attempt to match the
  frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than
  returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN.

MFC after:	3 days
PR:		kern/121433
Reported by:	Eugene Grosbein  eugen ! kuzbass dot ru
2008-05-05 19:13:52 +00:00
Nate Lawson
e1f13773ec Remove duplicate cpufreq levels, i.e. ones that are within 25 Mhz of each
other.  The first one survives, the rest are removed.  So far, it appears
only some acpi_perf(4) BIOS tables have these invalid states, but address
this in the core to be sure to handle other potential driver data.

PR:		kern/114722
Tested by:	stefan.lambrev / moneybookers.com
MFC after:	3 days
2008-01-16 01:05:21 +00:00
Nate Lawson
a15e947d54 If we're on an SMP kernel and there is more than 1 CPU, reject any attempts
to change the freq before the other CPUs are active.  The current code
always attempts to change all CPUs to match each other, and the requisite
sched_bind() call won't work before APs are launched.
2007-10-30 22:18:08 +00:00
Nate Lawson
62db376af3 Always call sched_bind(), even if on the CPU in question. It is wrong to
check if we're already on that cpu and skip the bind since the thread could
be migrated off in the meantime.

Suggested by:	jeff
Approved by:	re
2007-08-20 06:28:26 +00:00
Nate Lawson
2145b9d207 Use a different loop variable for the inner loop. This previous reuse could
have caused a hang, but we got lucky with the available multi-CPU states
on actual hardware.

Submitted by:	Bjorn Koenig <bkoenig / alpha-tierchen.de>
Approved by:	re
MFC after:	3 days
2007-08-19 20:34:13 +00:00
Jeff Roberson
982d11f836 Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
   sychronization.
 - Use the per-process spinlock rather than the sched_lock for per-process
   scheduling synchronization.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
Nate Lawson
0d4ac62a35 Add an interface for drivers to be notified of changes to CPU frequency.
cpufreq_pre_change is called before the change, giving each driver a chance
to revoke the change.  cpufreq_post_change provides the results of the
change (success or failure).  cpufreq_levels_changed gives the unit number
of the cpufreq device whose number of available levels has changed.  Hook
in all the drivers I could find that needed it.

* TSC: update TSC frequency value.  When the available levels change, take the
highest possible level and notify the timecounter set_cputicker() of that
freq.  This gets rid of the "calcru: runtime went backwards" messages.
* identcpu: updates the sysctl hw.clockrate value
* Profiling: if profiling is active when the clock changes, let the user
know the results may be inaccurate.

Reviewed by:	bde, phk
MFC after:	1 month
2007-03-26 18:03:29 +00:00
Marcus Alves Grando
b4130b8ae0 - Print message about cpufreq and timecounter TSC
Approved by:	njl
MFC after:	1 day
2006-03-03 02:06:04 +00:00
Hajimu UMEMOTO
56e5a87a55 make saved cpu level stackable. 2005-10-03 06:57:29 +00:00
Nate Lawson
9000b91eb9 Break out the checks for duplicates and absolute settings being too high
instead of trying to do them all at once.  This should fix the level sorting
problems from the previous revision.

Testing help:	ume
2005-09-02 16:32:43 +00:00
Nate Lawson
5308b2a64e Eliminate cpufreq levels for two cases that are less than optimal:
1. Walk the absolute list in reverse to prefer duplicated levels that have
a lower absolute setting, i.e. 800 Mhz/50% is better than 1600 Mhz/25% even
though both have the same actual frequency.  This also removes the need to
check for already-modified levels since by definition, those will be added
later in the sorted list.

2. Compare the absolute settings for derived levels and don't use the new
level if it's higher.  For example, a level of 800 Mhz/75% is preferable to
1600 Mhz/25% even though the latter has a lower total frequency.

This work is based on a patch from the submitter but reworked by myself.

Submitted by:	Tijl Coosemans (tijl/ulyssis.org)
2005-08-30 04:45:32 +00:00
Hajimu UMEMOTO
1fea6ce7dd - don't forget to save freqency when priority is raised.
- nuke redundant variable initialization.
2005-08-18 16:41:25 +00:00
Hajimu UMEMOTO
5f36393468 don't forget to update curr_priority. even when frequency is
not changed, priority may be changed.
2005-08-18 16:08:56 +00:00
Hajimu UMEMOTO
961f7f911f Save cpu level only when priority is greater than PRIO_USER
to make CPUFREQ_SET(NULL, prio) work.
TODO: implement saved_level as stack.

Reviewed by:	njl
2005-08-16 20:03:08 +00:00
Nate Lawson
da8a77c1f1 The "lowest" sysctl setting makes more sense as the lowest one to use, so
discard all levels less than this setting, not less than/equal to.

MFC after:	1 day
2005-08-11 18:40:58 +00:00
Nate Lawson
8d9134815e Add debugging prints to all the methods in case there are problems with
managing levels.  This can be enabled with the debug.cpufreq.verbose
tunable and sysctl.
2005-04-10 19:11:23 +00:00
Nate Lawson
71ab130c9b Add a check for cpufreq_unregister() being called with no cpufreq device
active.  Note that the logic indicates this should not be possible so
generate a warning if this ever happens.

Found by:	Coverity Prevent (via sam)
2005-03-31 18:56:54 +00:00
Nate Lawson
789f03ceb4 Add locking to handle multiple threads getting/setting frequencies at the
same time.  We use an sx lock and serialize the cpufreq device's
get/set/levels methods.
2005-02-27 01:34:08 +00:00
Nate Lawson
b070969b48 Allow users to reject levels below a given frequency (in MHz) via the
debug.cpufreq.lowest tunable and sysctl.  Some systems seem to have problems
with the lowest frequencies so setting this prevents them from being
available or used.
2005-02-26 22:37:49 +00:00
Nate Lawson
d269386a24 Bump the maximum number of levels to 64 and add warning messages about
what to do to fix reduced functionality if the number of levels is too low.
2005-02-24 20:21:41 +00:00
Nate Lawson
e959a70bad Add the "freq_settings" sysctl to each device that registers with cpufreq
so their individual settings can be seen separately for debugging.
2005-02-20 00:59:15 +00:00
Nate Lawson
e94a0c1a18 Introduce a new method, cpufreq_drv_type(), that returns the type of the
driver.  This used to be handled by cpufreq_drv_settings() but it's
useful to get the type/flags separately from getting the settings.
(For example, you don't have to pass an array of cf_setting just to find
the driver type.)

Use this new method in our in-tree drivers to detect reliably if acpi_perf
is present and owns the hardware.  This simplifies logic in drivers as well
as fixing a bug introduced in my last commit where too many drivers attached.
2005-02-18 00:23:36 +00:00
Nate Lawson
67c8649f7f When dealing with systems with no absolute drivers attached, only calibrate
the rate for the 100% state once.  Afterwards, use that value for deriving
states.  This should fix the problem where the calibrated frequency was
different once a switch was done, giving a different set of levels each
time.  Also, properly search for the right cpufreqX device when detaching.
2005-02-15 07:43:48 +00:00
Nate Lawson
1196826af5 Bind to the driver's parent cpu before switching, for both absolute and
relative drivers.  Remove some extraneous KASSERTs since NULL pointers
will be found when they're used right afterwards.
2005-02-15 07:22:42 +00:00
Nate Lawson
5f0afa0415 Implement priorities. This allows a driver (say, for cooling purposes) to
override the current freq level temporarily and restore it when the
higher priority condition is past.  Note that only the first overridden
value is saved.  Callers pass NULL to CPUFREQ_SET to restore the saved
level.  Priorities are not yet used so this commit should have no effect.
2005-02-14 18:16:35 +00:00
Nate Lawson
e22cd41c01 Add support for the CPUFREQ_FLAG_INFO_ONLY flag. Devices that report this
are not added to the list(s) of available settings.  However, other drivers
can call the CPUFREQ_DRV_SETTINGS() method on those devices directly to
get info about available settings.

Update the acpi_perf(4) driver to use this flag in the presence of
"functional fixed hardware."  Thus, future drivers like Powernow can
query acpi_perf for platform info but perform frequency transitions
themselves.
2005-02-13 18:49:48 +00:00
Nate Lawson
0325089dad Set levels on all CPUs and attach a cpufreq device to each one. Sysctl
on dev.cpu.0 will affect all of the CPUs together.  In the future,
independent control will be supported but this is good enough for now.
Check that the timecounter isn't TSC before switching (from Colin Percival.)
2005-02-13 17:31:56 +00:00
Nate Lawson
88c9b54c47 Add support for relative cpufreq drivers. Such drivers modulate clock
frequency as a percentage of the base rate and do not change the base
rate directly.  The cpufreq framework combines these with absolute drivers
to produce synthesized levels made of one or more settings.
2005-02-06 21:08:35 +00:00
Nate Lawson
73347b071d Add the cpufreq framework. This code manages multiple drivers and presents
a unified kernel and user interface for controlling cpu frequencies.
2005-02-04 05:39:19 +00:00