Commit Graph

36 Commits

Author SHA1 Message Date
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Davide Italiano
4bc38a5ab0 Hide internal details of sbintime_t implementation wrapping INT64_MAX into
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.

Requested by:	bapt, jmg, Ryan Lortie (implictly)
Reviewed by:	mav, bde
2014-04-12 23:29:29 +00:00
Ian Lepore
cfc4b56b57 Add support for event timers whose clock frequency can change while running. 2014-04-02 15:56:11 +00:00
Alexander Motin
e37e08c7bf Fix periodic per-CPU timers startup on boot.
Reported by:	neel
MFC after:	2 weeks
2013-12-16 13:52:18 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Davide Italiano
5b999a6be0 - Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.

This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.

Together with:	mav
Reviewed by:	attilio, bde, luigi, phk
Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo (amd64, sparc64), marius (sparc64), ian (arm),
		markj (amd64), mav, Fabian Keil
2013-03-04 11:09:56 +00:00
Alexander Motin
fdc5dd2d2f MFcalloutng:
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not speaking about practical inexpediency.
This change legitimates the status quo and cleans up the code.
2013-02-28 13:46:03 +00:00
Davide Italiano
acccf7d8b4 MFcalloutng:
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.

The commmit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
Alexander Motin
1af19ee4a2 Add support for good old 8192Hz profiling clock to software PMC.
Reviewed by:	fabient
2013-02-26 18:13:42 +00:00
Grzegorz Bernacki
2d7d16429c Get time of next event from other cores only if SMP is already started.
Reviewed by: mav
Obtained from: Semihalf
2013-02-01 11:39:03 +00:00
Alexander Motin
803a9b3efd panic() with reasonable message instead of returning zero frequency causing
division by zero later if event timer's minimal period is above one second.
For now it is just a theoretical possibility.

Found by:	Clang Static Analyzer
2012-10-10 19:46:46 +00:00
Alexander Motin
2038943013 Particlly MFcalloutng r238425 (by davide):
Fix an issue related to old periodic timers. The code in kern_clocksource.c
uses interrupt to keep track of time, and this time may not match with
binuptime(). In order to address such incoherency, switch periodic timers
to binuptime().

Except further calloutng it is needed for already present cyclic subsystem.
2012-08-04 08:06:37 +00:00
Alexander Motin
9b71c63a8b Partialy MFcalloutng r236894 (by davide):
...
While here, Bruce Evans told me that "unsigned int" is spelled "u_int" in
KNF, so replace it where needed.
2012-08-04 07:46:58 +00:00
Alexander Motin
c0722d20d3 Microoptimize time math. As soon as our event periods are always below ome
second we may not add intereger parts by using bintime_addx() instead of
bintime_add().  Profiling shows handleevents() time redction by 15%.
2012-08-03 09:08:20 +00:00
Alexander Motin
fd053fae73 Add kern.eventtimer.activetick tunable/sysctl, specifying whether each
hardclock() tick should be run on every active CPU, or on only one.

On my tests, avoiding extra interrupts because of this on 8-CPU Core i7
system with HZ=10000 saves about 2% of performance. At this moment option
implemented only for global timers, as reprogramming per-CPU timers is
too expensive now to be compensated by this benefit, especially since we
still have to regularly run hardclock() on at least one active CPU to
update system uptime. For global timer it is quite trivial: timer runs
always, but we just skip IPIs to other CPUs when possible.

Option is enabled by default now, keeping previous behavior, as periodic
hardclock() calls are still used at least to implement setitimer(2) with
ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't
depend on it since r232917, we are much more free to experiment with it.

MFC after:	1 month
2012-03-13 10:21:08 +00:00
Alexander Motin
bcfd016cff Idle ticks optimization:
- Pass number of events to the statclock() and profclock() functions
   same as to hardclock() before to not call them many times in a loop.
 - Rename them into statclock_cnt() and profclock_cnt().
 - Turn statclock() and profclock() into compatibility wrappers,
   still needed for arm.
 - Rename hardclock_anycpu() into hardclock_cnt() for unification.

MFC after:	1 week
2012-03-10 14:57:21 +00:00
Alexander Motin
55c71d634f Be more polite when setting state->nextevent inside cpu_new_callout().
Hardclock is not the only who wakes idle CPU since kdtrace cyclic addition.

MFC after:	2 weeks
2012-03-09 07:30:48 +00:00
Jung-uk Kim
a49399a903 Set negative quality to TSC timecounter when C3 state is enabled for Intel
processors unless the invariant TSC bit of CPUID is set.  Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4).  This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).

Reported by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		Ian FREISLICH (ianf at clue dot co dot za)
Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		- Core2 Duo T5870 (C3 state available/enabled)
		jkim - Xeon X5150 (C3 state unavailable)
2011-06-22 16:40:45 +00:00
Andriy Gapon
dd7498ae03 better integrate cyclic module with clocksource/eventtimer subsystem
Now in the case when one-shot timers are used cyclic events should fire
closer to theier scheduled times.  As the cyclic is currently used only
to drive DTrace profile provider, this is the area where the change
makes a difference.

Reviewed by:	mav (earlier version, a while ago)
X-MFC after:	clocksource/eventtimer subsystem
2011-05-16 15:29:59 +00:00
Alexander Motin
167aee3895 Refactor Xen PV code to use new event timers subsystem. That uses one-shot
Xen timer and time counter to provide one-shot and periodic time events.

On my tests this reduces idle interruts rate down to about 30Hz, and accor-
ding to Xen VM Manager reduces host CPU load by three times comparing to
the previous periodic 100Hz clock. Also now, when needed, it is possible to
increase HZ rate without useless CPU burning during idle periods.

Now only ia64 and some ARMs left not migrated to the new event timers.
2011-05-13 12:39:37 +00:00
Matthew D Fleming
fbbb13f962 sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the kernel changes.
2011-01-12 19:54:19 +00:00
Dimitry Andric
3e288e6238 After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
Dimitry Andric
31c6a0037e Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
Alexander Motin
c70410e6f5 On APs startup skip hard-/statclock events, which time passed before CPU
was lauched. Few seconds event burst, accumulated during long startup,
reported to cause panic in SCHED_ULE priority calculation logic.
2010-11-08 15:25:12 +00:00
Alexander Motin
9dfc483c4a If kernel built with DEVICE_POLLING, keep one CPU always in active state
to handle it.
2010-09-22 05:32:37 +00:00
Alexander Motin
bcb74c4c95 If new callout scheduled to another CPU and we are using global timer,
there is high probability that timer is already programmed by some other
CPU. Especially by one that registered this callout, and so active now.
2010-09-21 17:37:28 +00:00
Alexander Motin
afe41f2da7 Remember last kern.eventtimer.periodic value, explicitly set by user.
If timer capabilities forcing us to change periodicity mode, try to restore
it back later, as soon as new choosen timer capable to do it. Without this,
timer change like HPET->RTC->HPET always results in enabling periodic mode.
2010-09-21 16:50:24 +00:00
Alexander Motin
8e860de4bf When global timer used at SMP system, update nextevent field on BSP before
sending IPI to other CPUs. Otherwise, other CPUs will try to honor stale
value, programming timer for zero interval. If timer is fast enough,
it caused extra interrupt before timer correctly reprogrammed by BSP.
2010-09-18 07:18:30 +00:00
Alexander Motin
0e18987383 Make kern_tc.c provide minimum frequency of tc_ticktock() calls, required
to handle current timecounter wraps. Make kern_clocksource.c to honor that
requirement, scheduling sleeps on first CPU for no more then specified
period. Allow other CPUs to sleep up to 1/4 second (for any case).
2010-09-14 08:48:06 +00:00
Alexander Motin
dd9595e7fa Add some foot shooting protection by checking singlemul value correctness.
Rephrase sysctls descriptions.

Suggested by:	edmaste
2010-09-14 04:48:04 +00:00
Alexander Motin
a157e42516 Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.

There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
  kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
  kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
  kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
  kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.

As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.

Tested by:	many (on i386, amd64, sparc64 and powerc)
H/W donated by:	Gheorghe Ardelean
Sponsored by:	iXsystems, Inc.
2010-09-13 07:25:35 +00:00
Alexander Motin
599cf0f197 Fix several un-/signedness bugs of r210290 and r210293. Add one more check. 2010-07-20 15:48:29 +00:00
Alexander Motin
51636352b6 Extend timer driver API to report also minimal and maximal supported period
lengths. Make MI wrapper code to validate periods in request. Make kernel
clock management code to honor these hardware limitations while choosing hz,
stathz and profhz values.
2010-07-20 10:58:56 +00:00
Alexander Motin
43fe7d458a Rename timeevents.c to kern_clocksource.c.
Suggested by:	jhb@
2010-07-14 18:43:27 +00:00