freebsd-dev

Author	SHA1	Message	Date
John Baldwin	fdce57a042	Add an EARLY_AP_STARTUP option to start APs earlier during boot. Currently, Application Processors (non-boot CPUs) are started by MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until SI_SUB_SMP at which point they are released to run kernel threads. SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter the scheduler and start running threads until fairly late in the boot. This change moves SI_SUB_SMP up to just before software interrupt threads are created allowing the APs to start executing kernel threads much sooner (before any devices are probed). This allows several initialization routines that need to perform initialization on all CPUs to now perform that initialization in one step rather than having to defer the AP initialization to a second SYSINIT run at SI_SUB_SMP. It also permits all CPUs to be available for handling interrupts before any devices are probed. This last feature fixes a problem on with interrupt vector exhaustion. Specifically, in the old model all device interrupts were routed onto the boot CPU during boot. Later after the APs were released at SI_SUB_SMP, interrupts were redistributed across all CPUs. However, several drivers for multiqueue hardware allocate N interrupts per CPU in the system. In a system with many CPUs, just a few drivers doing this could exhaust the available pool of interrupt vectors on the boot CPU as each driver was allocating N * mp_ncpu vectors on the boot CPU. Now, drivers will allocate interrupts on their desired CPUs during boot meaning that only N interrupts are allocated from the boot CPU instead of N * mp_ncpu. Some other bits of code can also be simplified as smp_started is now true much earlier and will now always be true for these bits of code. This removes the need to treat the single-CPU boot environment as a special case. As a transition aid, the new behavior is available under a new kernel option (EARLY_AP_STARTUP). This will allow the option to be turned off if need be during initial testing. I plan to enable this on x86 by default in a followup commit in the next few days and to have all platforms moved over before 11.0. Once the transition is complete, the option will be removed along with the !EARLY_AP_STARTUP code. These changes have only been tested on x86. Other platform maintainers are encouraged to port their architectures over as well. The main things to check for are any uses of smp_started in MD code that can be simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in the EARLY_AP_STARTUP case (e.g. the interrupt shuffling). PR: kern/199321 Reviewed by: markj, gnn, kib Sponsored by: Netflix	2016-05-14 18:22:52 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Colin Percival	760f4dec67	In cf_get_method, when we don't already know what clock speed the CPU is running at, guess the nearest value instead of looking for a value within 25 MHz of the observed frequency. Prior to this change, if a system booted with Intel Turbo Boost enabled, the dev.cpu.0.freq sysctl is nonfunctional, since the ACPI-reported frequency for Turbo Boost states does not match the actual clock frequency (and thus no levels are within 25 MHz of the observed frequency) and the current performance level is read before a new level is set. MFC after: 3 days Relnotes: Bug fix in power management on CPUs with Intel Turbo Boost	2014-05-11 10:32:58 +00:00
Christian Brueffer	ed472910ba	Free resources in an error case. CID: 1018947 Found with: Coverity Prevent(tm) MFC after: 1 week	2014-05-02 21:34:17 +00:00
Scott Long	60ad8150c7	Retire smp_active. It was racey and caused demonstrated problems with the cpufreq code. Replace its use with smp_started. There's at least one userland tool that still looks at the kern.smp.active sysctl, so preserve it but point it to smp_started as well. Discussed with: peter, jhb MFC after: 3 days Obtained from: Netflix	2014-04-26 20:27:54 +00:00
Alexander Motin	5f3818a56e	Revert r175376 and tune cpufreq(4) frequency comparison logic instead. Instead of using 25MHz equality threshold, look for the nearest value when handling dev.cpu.0.freq sysctl and for exact match when it is expected. ACPI may report extra level with frequency 1MHz above the nominal to control Intel Turbo Boost operation. It is not a bug, but feature: dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ... In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz. I've found that my Core i7-870 based system has Intel Turbo Boost disabled by default and without this change it was absolutely invisible and hard to control. MFC after: 2 weeks	2012-03-10 18:56:16 +00:00
Ed Schouten	6472ac3d8a	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
Andriy Gapon	dac509311f	cpufreq: allocate long-lived buffer for handling of sysctl requests At present the cpufreq sysctl handler for current level setting would allocate and deallocate a temporary buffer of 24KB even to handle a read-only query. This puts unnecessary load on memory subsystem when current level is checked frequently, e.g. when the likes of powerd and system monitoring software are running. Change the strategy to allocating a long-lived buffer for handling the requests. Reviewed by: njl MFC after: 2 weeks	2010-07-23 16:46:42 +00:00
Christian Brueffer	30215f483e	Free allocated sbufs before returning ENOMEM. PR: 128335 Submitted by: Mateusz Guzik <mjguzik@gmail.com> MFC after: 2 week	2010-01-08 22:58:50 +00:00
Nathan Whitehorn	f436f17508	Provide a new CPU device driver ivar to report the nominal speed of the CPU, if available. This is meant to solve the issue of cpufreq misreporting speeds on CPUs that boot in a reduced power mode and have only relative speed control.	2009-05-31 08:59:15 +00:00
Alexander Motin	d288bcc4df	If possible, try to obtain max_mhz on cpufreq attach instead of first request. On HyperThreading CPUs logical cores have same frequency, so setting it on any core will change the other's one. In most cases first request to the second core will be the "set" request, done after setting frequency of the first core. In such case second CPU will obtain throttled frequency of the first core as it's max_mhz making cpufreq broken due to different frequency sets.	2008-12-16 01:24:05 +00:00
John Baldwin	be00f6053b	Fix a few edge cases with error handling in cpufreq(4)'s CPUFREQ_GET() method: - If the last of the child cpufreq drivers returns an error while trying to fetch its list of supported frequencies but an earlier driver found the requested frequency, don't return an error to the caller. - If all of the child cpufreq drivers fail and the attempt to match the frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN. MFC after: 3 days PR: kern/121433 Reported by: Eugene Grosbein eugen ! kuzbass dot ru	2008-05-05 19:13:52 +00:00
Nate Lawson	e1f13773ec	Remove duplicate cpufreq levels, i.e. ones that are within 25 Mhz of each other. The first one survives, the rest are removed. So far, it appears only some acpi_perf(4) BIOS tables have these invalid states, but address this in the core to be sure to handle other potential driver data. PR: kern/114722 Tested by: stefan.lambrev / moneybookers.com MFC after: 3 days	2008-01-16 01:05:21 +00:00
Nate Lawson	a15e947d54	If we're on an SMP kernel and there is more than 1 CPU, reject any attempts to change the freq before the other CPUs are active. The current code always attempts to change all CPUs to match each other, and the requisite sched_bind() call won't work before APs are launched.	2007-10-30 22:18:08 +00:00
Nate Lawson	62db376af3	Always call sched_bind(), even if on the CPU in question. It is wrong to check if we're already on that cpu and skip the bind since the thread could be migrated off in the meantime. Suggested by: jeff Approved by: re	2007-08-20 06:28:26 +00:00
Nate Lawson	2145b9d207	Use a different loop variable for the inner loop. This previous reuse could have caused a hang, but we got lucky with the available multi-CPU states on actual hardware. Submitted by: Bjorn Koenig <bkoenig / alpha-tierchen.de> Approved by: re MFC after: 3 days	2007-08-19 20:34:13 +00:00
Jeff Roberson	982d11f836	Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling sychronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)	2007-06-05 00:00:57 +00:00
Nate Lawson	0d4ac62a35	Add an interface for drivers to be notified of changes to CPU frequency. cpufreq_pre_change is called before the change, giving each driver a chance to revoke the change. cpufreq_post_change provides the results of the change (success or failure). cpufreq_levels_changed gives the unit number of the cpufreq device whose number of available levels has changed. Hook in all the drivers I could find that needed it. * TSC: update TSC frequency value. When the available levels change, take the highest possible level and notify the timecounter set_cputicker() of that freq. This gets rid of the "calcru: runtime went backwards" messages. * identcpu: updates the sysctl hw.clockrate value * Profiling: if profiling is active when the clock changes, let the user know the results may be inaccurate. Reviewed by: bde, phk MFC after: 1 month	2007-03-26 18:03:29 +00:00
Marcus Alves Grando	b4130b8ae0	- Print message about cpufreq and timecounter TSC Approved by: njl MFC after: 1 day	2006-03-03 02:06:04 +00:00
Hajimu UMEMOTO	56e5a87a55	make saved cpu level stackable.	2005-10-03 06:57:29 +00:00
Nate Lawson	9000b91eb9	Break out the checks for duplicates and absolute settings being too high instead of trying to do them all at once. This should fix the level sorting problems from the previous revision. Testing help: ume	2005-09-02 16:32:43 +00:00
Nate Lawson	5308b2a64e	Eliminate cpufreq levels for two cases that are less than optimal: 1. Walk the absolute list in reverse to prefer duplicated levels that have a lower absolute setting, i.e. 800 Mhz/50% is better than 1600 Mhz/25% even though both have the same actual frequency. This also removes the need to check for already-modified levels since by definition, those will be added later in the sorted list. 2. Compare the absolute settings for derived levels and don't use the new level if it's higher. For example, a level of 800 Mhz/75% is preferable to 1600 Mhz/25% even though the latter has a lower total frequency. This work is based on a patch from the submitter but reworked by myself. Submitted by: Tijl Coosemans (tijl/ulyssis.org)	2005-08-30 04:45:32 +00:00
Hajimu UMEMOTO	1fea6ce7dd	- don't forget to save freqency when priority is raised. - nuke redundant variable initialization.	2005-08-18 16:41:25 +00:00
Hajimu UMEMOTO	5f36393468	don't forget to update curr_priority. even when frequency is not changed, priority may be changed.	2005-08-18 16:08:56 +00:00
Hajimu UMEMOTO	961f7f911f	Save cpu level only when priority is greater than PRIO_USER to make CPUFREQ_SET(NULL, prio) work. TODO: implement saved_level as stack. Reviewed by: njl	2005-08-16 20:03:08 +00:00
Nate Lawson	da8a77c1f1	The "lowest" sysctl setting makes more sense as the lowest one to use, so discard all levels less than this setting, not less than/equal to. MFC after: 1 day	2005-08-11 18:40:58 +00:00
Nate Lawson	8d9134815e	Add debugging prints to all the methods in case there are problems with managing levels. This can be enabled with the debug.cpufreq.verbose tunable and sysctl.	2005-04-10 19:11:23 +00:00
Nate Lawson	71ab130c9b	Add a check for cpufreq_unregister() being called with no cpufreq device active. Note that the logic indicates this should not be possible so generate a warning if this ever happens. Found by: Coverity Prevent (via sam)	2005-03-31 18:56:54 +00:00
Nate Lawson	789f03ceb4	Add locking to handle multiple threads getting/setting frequencies at the same time. We use an sx lock and serialize the cpufreq device's get/set/levels methods.	2005-02-27 01:34:08 +00:00
Nate Lawson	b070969b48	Allow users to reject levels below a given frequency (in MHz) via the debug.cpufreq.lowest tunable and sysctl. Some systems seem to have problems with the lowest frequencies so setting this prevents them from being available or used.	2005-02-26 22:37:49 +00:00
Nate Lawson	d269386a24	Bump the maximum number of levels to 64 and add warning messages about what to do to fix reduced functionality if the number of levels is too low.	2005-02-24 20:21:41 +00:00
Nate Lawson	e959a70bad	Add the "freq_settings" sysctl to each device that registers with cpufreq so their individual settings can be seen separately for debugging.	2005-02-20 00:59:15 +00:00
Nate Lawson	e94a0c1a18	Introduce a new method, cpufreq_drv_type(), that returns the type of the driver. This used to be handled by cpufreq_drv_settings() but it's useful to get the type/flags separately from getting the settings. (For example, you don't have to pass an array of cf_setting just to find the driver type.) Use this new method in our in-tree drivers to detect reliably if acpi_perf is present and owns the hardware. This simplifies logic in drivers as well as fixing a bug introduced in my last commit where too many drivers attached.	2005-02-18 00:23:36 +00:00
Nate Lawson	67c8649f7f	When dealing with systems with no absolute drivers attached, only calibrate the rate for the 100% state once. Afterwards, use that value for deriving states. This should fix the problem where the calibrated frequency was different once a switch was done, giving a different set of levels each time. Also, properly search for the right cpufreqX device when detaching.	2005-02-15 07:43:48 +00:00
Nate Lawson	1196826af5	Bind to the driver's parent cpu before switching, for both absolute and relative drivers. Remove some extraneous KASSERTs since NULL pointers will be found when they're used right afterwards.	2005-02-15 07:22:42 +00:00
Nate Lawson	5f0afa0415	Implement priorities. This allows a driver (say, for cooling purposes) to override the current freq level temporarily and restore it when the higher priority condition is past. Note that only the first overridden value is saved. Callers pass NULL to CPUFREQ_SET to restore the saved level. Priorities are not yet used so this commit should have no effect.	2005-02-14 18:16:35 +00:00
Nate Lawson	e22cd41c01	Add support for the CPUFREQ_FLAG_INFO_ONLY flag. Devices that report this are not added to the list(s) of available settings. However, other drivers can call the CPUFREQ_DRV_SETTINGS() method on those devices directly to get info about available settings. Update the acpi_perf(4) driver to use this flag in the presence of "functional fixed hardware." Thus, future drivers like Powernow can query acpi_perf for platform info but perform frequency transitions themselves.	2005-02-13 18:49:48 +00:00
Nate Lawson	0325089dad	Set levels on all CPUs and attach a cpufreq device to each one. Sysctl on dev.cpu.0 will affect all of the CPUs together. In the future, independent control will be supported but this is good enough for now. Check that the timecounter isn't TSC before switching (from Colin Percival.)	2005-02-13 17:31:56 +00:00
Nate Lawson	88c9b54c47	Add support for relative cpufreq drivers. Such drivers modulate clock frequency as a percentage of the base rate and do not change the base rate directly. The cpufreq framework combines these with absolute drivers to produce synthesized levels made of one or more settings.	2005-02-06 21:08:35 +00:00
Nate Lawson	73347b071d	Add the cpufreq framework. This code manages multiple drivers and presents a unified kernel and user interface for controlling cpu frequencies.	2005-02-04 05:39:19 +00:00

42 Commits