Commit Graph

390 Commits

Author SHA1 Message Date
Ian Lepore
b97fa22cd6 Use sbuf_new_for_sysctl() instead of plain sbuf_new() to ensure sysctl
string returned to userland is nulterminated.

PR:           195668
2015-03-14 18:42:30 +00:00
Warner Losh
5837276ce2 Put back Andy's void for gcc happiness.
Submitted by:	jchandra@
2015-02-27 23:14:08 +00:00
Warner Losh
b250ad3499 Make sched_random() return an unsigned number, and use uint32_t
consistently. This also matches the per-cpu pointer declaration
anyway.

This changes the tweak we give to the load from -32..31 to be 0..31
which seems more inline with the rest of the code (- rnd and the -=
64). It should also provide the randomness we need, and may fix a
signedness bug in the old code (it isn't clear that the effect was
intentional as opposed to sloppy, and the right shift of a signed
value is undefined to boot).

This stores sched_balance() behavior when it used random().

Differential Revision: https://reviews.freebsd.org/D1981
2015-02-27 21:15:12 +00:00
Andrew Turner
ccc41f3e66 Fix sched_ule on sparc64, gcc complains sched_random is not a correct
prototype.

Sponsored by:	The FreeBSD Foundation
2015-02-27 15:05:20 +00:00
Andrew Turner
09d0653552 sched_random is only called for SMP, only define it there.
Sponsored by:	The FreeBSD Foundation
2015-02-27 12:38:24 +00:00
Warner Losh
0567b6cc16 Create sched_rand() and move the LCG code into that. Call this when
we need randomness in ULE. This removes random() call from the
rebalance interval code.

Submitted by: Harrison Grundy
Differential Revision: https://reviews.freebsd.org/D1968
2015-02-27 02:56:58 +00:00
Adrian Chadd
e77f9fed15 Update the ULE scheduler + thread and kinfo structs to use int for cpuid
rather than u_char.

To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254.  The new fields now contain the full CPU ID, or -1 for no cpu.

Differential Revision:	D955
Reviewed by:	jhb, kib
Sponsored by:	Norse Corp, Inc.
2014-10-18 19:36:11 +00:00
Alexander Motin
ae9e9b4fda Reprase r271616 comments.
Submitted by:	alc
MFC after:	1 month
2014-09-17 17:43:32 +00:00
Alexander Motin
7965496958 Add comments describing r271604 change.
MFC after:	3 days
2014-09-15 11:17:36 +00:00
Alexander Motin
7e9b58eaaa Add couple memory barries to serialize tdq_cpu_idle and tdq_load accesses.
This change fixes transient performance drops in some of my benchmarks,
vanishing as soon as I am trying to collect any stats from the scheduler.
It looks like reordered access to those variables sometimes caused loss of
IPI_PREEMPT, that delayed thread execution until some later interrupt.

MFC after:	3 days
2014-09-14 22:13:19 +00:00
Alexander Motin
2e7d7bb294 Restore pre-r239157 handling of sched_yield(), when thread time slice was
aborted, allowing other threads to run.  Without this change thread is just
rescheduled again, that was illustrated by provided test tool.

PR:		192926
Submitted by:	eric@vangyzen.net
MFC after:	2 weeks
2014-08-23 17:31:56 +00:00
Konstantin Belousov
2499a5ccef Micro-manage clang to get the expected inlining for cpu_search().
Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline,
while cpu_search() gets always_inline.  With the attributes set,
cpu_search() is inlined in wrappers, and if()s with constant
conditionals are optimized.

On some tests on many-core machine, the hwpmc reported samples for
cpu_search*() are reduced from 25% total to 9%.

Submitted by:	"Rang, Anton" <anton.rang@isilon.com>
MFC after:	1 week
2014-07-03 11:06:27 +00:00
Konstantin Belousov
a288c757d4 Remove write-only local variable.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:56:25 +00:00
Attilio Rao
c149e542a5 Fix GENERIC build. 2014-03-19 00:38:27 +00:00
Jeff Roberson
8bc713f6c5 - Make runq_steal_from more aggressive. Previously it would examine only
a single priority queue.  If that queue had a thread or threads which
   could not be migrated we would fail to steal load.  This could cause
   starvation in situations where cores are idle.

Submitted by:	Doug Kilpatrick <dkilpatrick@isilon.com>
Tested by:	pho
Reviewed by:	mav
Sponsored by:	EMC / Isilon Storage Division
2014-03-08 00:35:06 +00:00
Nathan Whitehorn
a8a9b1c250 ULE works on Book-E since r258002, so remove statements to the contrary. 2014-02-01 20:46:35 +00:00
Dimitry Andric
3371b88c7b In sys/kern/sched_ule.c, remove static function sched_both(), which is
unused since r232207.

MFC after:	3 days
2013-12-25 16:25:54 +00:00
John Baldwin
5457fa234b Fix an off-by-one error in r228960. The maximum priority delta provided
by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting
priority value (before nice adjustment) is between SCHED_PRI_MIN and
SCHED_PRI_MAX, inclusive.

Submitted by:	kib
Reported by:	pho
MFC after:	1 week
2013-12-03 14:50:12 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Alexander Motin
58909b74b9 Micro-optimize cpu_search(), allowing compiler to use more efficient inline
ffsl() implementation, when it is available, instead of homegrown iteration.

On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
2013-09-07 15:16:30 +00:00
George V. Neville-Neil
8f2ba63493 Point args[0] not at the thread that is ending but at the one that
is starting.  This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR:		177706
Submitted by:	Tiwei Bie
MFC after:	1 month
2013-04-15 17:21:02 +00:00
Alexander Motin
2fd4047f32 Fix bug in r242852 that prevented CPU from becoming idle if kernel built
without SMP support.
2012-11-15 14:10:51 +00:00
Alexander Motin
2c27cb3a34 Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there was no contest switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If current CPU was idle, then it is quite unlikely that some
other CPU has load to steal.  Under high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on 24-CPU system load stealing code may consume up to
25% of all CPU time without giving any benefits.
 - Change code that implements spinning for load to restart spin in case of
context switch.  Previous code periodically called cpu_idle() even under
high interrupt/context switch rate.
 - Rise spinning threshold to 10KHz, where it gives at least some effect
that may worth consumed power.

Reviewed by:	jeff@
2012-11-10 07:02:57 +00:00
Jeff Roberson
5e5c387373 - Change ULE to use dynamic slice sizes for the timeshare queue in order
to further reduce latency for threads in this queue.  This should help
   as threads transition from realtime to timeshare.  The latency is
   bound to a max of sched_slice until we have more than sched_slice / 6
   threads runnable.  Then the min slice is allotted to all threads and
   latency becomes (nthreads - 1) * min_slice.

Discussed with: mav
2012-11-08 01:46:47 +00:00
Attilio Rao
4ceaf45de5 Rework the known mutexes to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct mtx_padalign.

The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviwed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
Attilio Rao
a049aa05c9 tdq_lock_pair() already does spinlock_enter() so migration is not
possible in sched_balance_pair(). Remove redundant sched_pin().

Reviewed by:	marius, jeff
2012-10-30 12:25:52 +00:00
Jim Harris
39f819e2fc Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock.  Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).

Sponsored by:	Intel
Reviewed by:	jeff
2012-10-24 18:36:41 +00:00
Eitan Adler
db702c59cf remove duplicate semicolons where possible.
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:00:37 +00:00
Andriy Gapon
e87fc7cf7b sched_ule: fix inverted condition in reporting of priority lending via ktr
Reviewed by:	kan
MFC after:	1 week
2012-09-14 19:55:28 +00:00
John Baldwin
ba96d2d816 Mark the idle threads as non-sleepable and also assert that an idle
thread never blocks on a turnstile.
2012-08-22 20:01:38 +00:00
Alexander Motin
37f4e0254f Some more minor tunings inspired by bde@. 2012-08-11 20:24:39 +00:00
Alexander Motin
bf89d544d0 Allow idle threads to steal second threads from other cores on systems with
8 or more cores to improve utilization.  None of my tests on 2xXeon (2x6x2)
system shown any slowdown from mentioned "excess thrashing".  Same time in
pbzip2 test with number of threads more then number of CPUs I see up to 10%
speedup with SMT disabled and up 5% with SMT enabled.  Thinking about
trashing I was trying to limit that stealing within same last level cache,
but got only worse results.  Present code any way prefers to steal threads
from topologically closer cores.

Sponsored by:	iXsystems, Inc.
2012-08-11 15:08:19 +00:00
Alexander Motin
579895df01 Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
 - restore (4BSD) and implement (ULE) hogticks variable setting;
 - make sched_rr_interval() more tolerant to options;
 - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
 - tune some sysctl descriptions;
 - make some style fixes.
2012-08-10 19:02:49 +00:00
Alexander Motin
3d7f41175d Rework r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.

Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.

On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.

Reviewed by:	fabient
MFC after:	2 months
Sponsored by:	iXsystems, Inc.
2012-08-09 19:26:13 +00:00
Rafal Jaworowski
17f4cae4a5 Let us manage differences of Book-E PowerPC variations i.e. vendor /
implementation specific vs. the common architecture definition.

Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under
BOOKE_PPC4XX are not used in the code yet.

This change set is not supposed to affect existing E500 support, it's just
another reorg step before bringing support for E500mc, E5500 and PPC465.

Obtained from:	AppliedMicro, Freescale, Semihalf
2012-05-27 10:25:20 +00:00
Ryan Stone
b3e9e682cf Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives.  Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire.  These probes are defined
to fire when Solaris-specific features perform certain actions.  As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change.  These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument.  As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after:	2 weeks
2012-05-15 01:30:25 +00:00
Alexander Motin
70801abe8f Microoptimize cpu_search().
According to profiling, it makes one take 6% of CPU time on hackbench
with its million of context switches per second, instead of 8% before.
2012-04-09 18:24:58 +00:00
Alexander Motin
7295465e33 Rewrite thread CPU usage percentage math to not depend on periodic calls
with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.

MFC after:	1 month
2012-03-13 08:18:54 +00:00
Alexander Motin
b3f40a4107 Make kern.sched.idlespinthresh default value adaptive depending of HZ.
Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.
2012-03-09 19:09:08 +00:00
John Baldwin
44ad547522 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
Alexander Motin
6022f0bcb3 Fix bug of r232207, when cpu_search() could prefer CPU group with best
load, but with no CPU matching given limitations. It caused kernel panics
in some cases when thread was bound to specific CPUs with cpuset(1).
2012-03-03 11:50:48 +00:00
Alexander Motin
36acfc6507 Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
 - In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
 - Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
 - Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
 - Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
 - Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.

All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.

Reviewed by:	jeff
Tested by:	flo, hackers@
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2012-02-27 10:31:54 +00:00
John Baldwin
7e3a96ea37 Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
  for the very first context switch on APs during boot.  This avoids a
  small gap between the middle of thread_exit() and sched_throw() where
  time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
  to mi_switch() introduced by td_rux so that the code once again matches
  the comment claiming it is mimicing mi_switch().  Specifically, only
  update the per-thread stats directly and depend on ruxagg() to update
  p_rux rather than adjusting p_rux directly.  While here, move the
  timestamp bookkeeping as late in the function as possible.

Reviewed by:	bde, kib
MFC after:	1 week
2012-01-03 21:03:28 +00:00
John Baldwin
0c0d27d5dd Cap the priority calculated from the current thread's running tick count
at SCHED_PRI_RANGE to prevent overflows in the priority value.  This can
happen due to irregularities with clock interrupts under certain
virtualization environments.

Tested by:	Larry Rosenman  ler lerctr org
MFC after:	2 weeks
2011-12-29 16:17:16 +00:00
Andriy Gapon
167057914b ule: ensure that batch timeshare threads are scheduled fairly
With the previous code, if the range of priorities for timeshare batch
threads was greater than RQ_NQS, then the threads with low priorities in
the part of the range above RQ_NQS would be scheduled to the run-queues
as if they had high priorities at the beginning of the range.
In other words, threads with a nice level of +N could be scheduled as
if they had a nice level of -M.

Reported by:	George Mitchell <george@m5p.com>
Reviewed by:	jhb
Tested by:	George Mitchell <george@m5p.com> (earlier version)
MFC after:	1 week
2011-12-19 20:01:21 +00:00
Marius Strobl
880bf8b9bd - Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to
itself, which sparc64 hardware doesn't support. One way to solve this
  would be to directly call sched_preempt() instead of issuing a self-IPI.
  However, quoting jhb@:
  "On the other hand, you can probably just skip the IPI entirely if we are
  going to send it to the current CPU.  Presumably, once this routine
  finishes, the current CPU will exit softlock (or will do so "soon") and
  will then pick the next thread to run based on the adjustments made in
  this routine, so there's no need to IPI the CPU running this routine
  anyway.  I think this is the better solution.  Right now what is probably
  happening on other platforms is as soon as this routine finishes the CPU
  processes its self-IPI and causes mi_switch() which will just switch back
  to the softclock thread it is already running."
- With r226054 and the the above change in place, sparc64 now no longer is
  incompatible with ULE and vice versa. However, powerpc/E500 still is.

Submitted by:	jhb
Reviewed by:	jeff
2011-10-06 11:48:13 +00:00
Xin LI
cd39bb098e Fix format strings for KTR_STATE in 4BSD ad ULE schedulers.
Submitted by:	Ivan Klymenko <fidaj@ukr.net>
PR:		kern/159904, kern/159905
MFC after:	2 weeks
Approved by:	re (kib)
2011-08-26 18:00:07 +00:00
Attilio Rao
6338c57958 Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace
pollution.  That is a step further in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.

Reported by:	jhb
Reviewed and tested by:	pluknet
Approved by:	re (kib)
2011-07-19 16:50:55 +00:00
Attilio Rao
71a19bdc64 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
Fabien Thomas
586cb6ec77 Clearing the flag when preempting will let the preempted thread run
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.

As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by:	jhb, jeff
MFC after:	1 month
2011-03-31 13:59:47 +00:00
John Baldwin
2dc29adb9f Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
  just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
  the timesharing priority band can be increased.  The new timeshare range
  is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
  interactive threads.  Instead, the larger timeshare range is now split
  into separate subranges for interactive and non-interactive ("batch")
  threads.  The end result is that interactive threads and non-interactive
  threads still use the same priority ranges as before, but realtime
  threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
  or via cv_broadcastpri().  Realtime and idle priority threads will
  no longer have their priorities affected by sleeping in the kernel.

Reviewed by:	jeff
2011-01-14 17:06:54 +00:00
John Baldwin
12d56c0f63 Introduce two new helper macros to define the priority ranges used for
interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME
and PRI_*_TIMESHARE.  No functional change.

Reviewed by:	jeff
2011-01-13 14:22:27 +00:00
John Baldwin
c9a8cba456 Always use PRI_BASE() when checking the base type of a thread's priority
class.

MFC after:	2 weeks
2011-01-11 22:13:19 +00:00
John Baldwin
789200082c Fix two harmless off-by-one errors.
Reviewed by:	jeff
MFC after:	2 weeks
2011-01-10 20:48:10 +00:00
John Baldwin
22d19207e9 - Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
  proc.  Otherwise attempts to modify thread or process data in sched_fork()
  could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
  sched_fork_thread() in ULE.  This is already done courtesy the bcopy()
  of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
  new thread's base priority (td_base_pri) to avoid bogusly inheriting a
  borrowed priority from the parent thread.

MFC after:	2 weeks
2011-01-06 22:24:00 +00:00
David Xu
c8e368a933 - Follow r216313, the sched_unlend_user_prio is no longer needed, always
use sched_lend_user_prio to set lent priority.
- Improve pthread priority-inherit mutex, when a contender's priority is
  lowered, repropagete priorities, this may cause mutex owner's priority
  to be lowerd, in old code, mutex owner's priority is rise-only.
2010-12-29 09:26:46 +00:00
David Xu
acbe332a58 MFp4:
It is possible a lower priority thread lending priority to higher priority
thread, in old code, it is ignored, however the lending should always be
recorded, add field td_lend_user_pri to fix the problem, if a thread does
not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week
2010-12-09 02:42:02 +00:00
Edward Tomasz Napierala
4220337804 Remove unused variables. 2010-11-13 11:54:04 +00:00
Attilio Rao
9f518f2068 Fix typos.
Submitted by:	gianni
MFC after:	3 days
2010-11-10 21:06:49 +00:00
David Xu
444528c026 Use integer for size of cpuset, as it won't be bigger than INT_MAX,
This is requested by bge.
Also move the sysctl into file kern_cpuset.c, because it should
always be there, it is independent of thread scheduler.
2010-11-01 00:42:25 +00:00
David Xu
b67cc292dc Add sysctl kern.sched.cpusetsize to export the size of kernel cpuset,
also add sysconf() key _SC_CPUSET_SIZE to get sysctl value.

Submitted by: gcooper
2010-10-29 13:31:10 +00:00
John Baldwin
a8103ae8ca Comment nit, set TDF_NEEDRESCHED after the comment describing why it is
done rather than before.

MFC after:	1 week
2010-09-21 19:12:22 +00:00
Andriy Gapon
19b8a6dbc1 kern.sched.topology_spec sysctl: use step of 1 for group levels numeration
This is just a cosmetic change for prettier output.
'indent' variable/parameter serves two purposes: it specifies whitespace
indentation level and also implies cpu group level/depth.
It would have been better to split those two uses,
but for now just a simple change.

MFC after:	1 week
2010-09-18 11:16:43 +00:00
Alexander Motin
a157e42516 Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.

There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
  kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
  kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
  kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
  kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.

As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.

Tested by:	many (on i386, amd64, sparc64 and powerc)
H/W donated by:	Gheorghe Ardelean
Sponsored by:	iXsystems, Inc.
2010-09-13 07:25:35 +00:00
Alexander Motin
9f9ad565a1 Do not IPI CPU that is already spinning for load. It doubles effect of
spining (comparing to MWAIT) on some heavly switching test loads.
2010-09-10 13:24:47 +00:00
Matthew D Fleming
ba4932b5a2 Fix UP build.
MFC after:	2 weeks
2010-09-02 16:23:05 +00:00
Matthew D Fleming
0f7a0ebd59 Fix a bug with sched_affinity() where it checks td_pinned of another
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU.  Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.

KASSERT in sched_bind() that the thread is not yet pinned to a CPU.

KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.

sched_affinity code came from jhb@.

MFC after:	2 weeks
2010-09-01 20:32:47 +00:00
John Baldwin
8c7a92bd4a Remove unused KTRACE includes. 2010-08-19 16:41:27 +00:00
John Baldwin
d9d8d1449d Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid.  Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead.  This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by:	peter, sbruno
Reviewed by:	rookie
Obtained from:	Yahoo! (x86)
MFC after:	1 month
2010-08-06 15:36:59 +00:00
Ivan Voras
611daf7e62 A cosmetic change - don't output empty <flags>. 2010-07-15 13:46:30 +00:00
John Baldwin
3aa6d94e0c Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
Ivan Voras
a401f2d098 Unconfuse THREAD and SMT flags 2010-06-10 11:48:14 +00:00
Ivan Voras
5368befb66 Cosmetic change to XML - less ugly newlines 2010-06-10 11:01:17 +00:00
John Baldwin
3da35a0a52 Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it.  All of the current callers already hold the
lock.

MFC after:	1 month
2010-06-03 16:02:11 +00:00
John Baldwin
1d7830edd5 Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after:	1 month
2010-05-21 17:15:56 +00:00
Randall Stewart
4542827d4d This pushes all of JC's patches that I have in place. I
am now able to run 32 cores ok.. but I still will hang
on buildworld with a NFS problem. I suspect I am missing
a patch for the netlogic rge driver.

JC check and see if I am missing anything except your
core-mask changes

Obtained from:	JC
2010-05-16 19:43:48 +00:00
Attilio Rao
b0b9dee5c9 - Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
  sched_lock was acquired (without the aid of the td_lock interface) and
  the td_lock was dropped. This was going to break locking rules on other
  threads willing to access to the thread (via the td_lock interface) and
  modify his flags (allowed as long as the container lock was different
  by the one used in sched_switch).
  In order to prevent this situation, while sched_lock is acquired there
  the td_lock gets blocked. [0]
- Merge the ULE's internal function thread_block_switch() into the global
  thread_lock_block() and make the former semantic as the default for
  thread_lock_block(). This means that thread_lock_block() will not
  disable interrupts when called (and consequently thread_unlock_block()
  will not re-enabled them when called). This should be done manually
  when necessary.
  Note, however, that ULE's thread_unblock_switch() is not reaped
  because it does reflect a difference in semantic due in ULE (the
  td_lock may not be necessarilly still blocked_lock when calling this).
  While asymmetric, it does describe a remarkable difference in semantic
  that is good to keep in mind.

[0] Reported by:	Kohji Okuno
			<okuno dot kohji at jp dot panasonic dot com>
Tested by:		Giovanni Trematerra
			<giovanni dot trematerra at gmail dot com>
MFC:			2 weeks
2010-01-23 15:54:21 +00:00
Konstantin Belousov
17c4c3563c Allow swap out of the kernel stack for the thread with priority greater
or equial then PSOCK, not less or equial. Higher priority has lesser
numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2009-12-31 18:52:58 +00:00
Ed Schouten
62375ca8c1 Don't forget to use `void' for sched_balance(). It has no arguments. 2009-12-28 23:12:12 +00:00
Ivan Voras
cbc4ea28e2 Make ULE process usage (%CPU) accounting usable again by keeping track
of the last tick we incremented on.

Submitted by:	matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by:	jeff (who thinks there should be a better way in the future)
Approved by:	gnn (mentor)
MFC after:	3 weeks
2009-11-24 19:57:41 +00:00
Attilio Rao
1b9d701fee Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.

Reported by:	jeff (originally), kris (currently)
Reviewed by:	jhb
Tested by:	Giuseppe Cocomazzi <sbudella at email dot it>
2009-11-03 16:46:52 +00:00
John Baldwin
a0f1535205 Fix a sign bug in the handling of nice priorities when computing the
interactive score for a thread.

Submitted by:	Taku YAMAMOTO  taku of tackymt.homeip.net
Reviewed by:	jeff
MFC after:	3 days
2009-10-15 11:41:12 +00:00
Attilio Rao
435068aab7 Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are nomore shared even in the
  HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
  even with different locks, with the contribution of tdq_lock_pair().
  An explanation is here:
  (hypotesis: a thread needs to migrate on another CPU, thread1 is doing
  sched_switch_migrate() and thread2 is the one handling the sched_switch()
  request or in other words, thread1 is the thread that needs to migrate
  and thread2 is a thread that is going to be preempted, most likely an
  idle thread. Also, 'old' is referred to the context (in terms of
  run-queue and CPU) thread1 is leaving and 'new' is referred to the
  context thread1 is going into.  Finally, thread3 is doing tdq_idletd()
  or sched_balance() and definitively doing tdq_lock_pair())

  * thread1 blocks its td_lock. Now td_lock is 'blocked'
  * thread1 drops its old runqueue lock
  * thread1 acquires the new runqueue lock
  * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
    through tdq_notify() to the new CPU
  * thread1 drops the new lock
  * thread3, scanning the runqueues, locks the old lock
  * thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
    pointing to the new runqueue
  * thread3 wants to acquire the new runqueue lock, but it can't because
    it is held by thread2 so it spins
  * thread1 wants to acquire old lock, but as long as it is held by
    thread3 it can't
  * thread2 going further, at some point wants to switchin in thread1,
    but it will wait forever because thread1->td_lock is in blocked state

This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.

Reviewed by:	jeff
Reported by:	des, others
Tested by:	des, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
		(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
Jeff Roberson
c76ee82799 - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE
supports arbitrary numbers of cpus rather than being limited by
   cpumask_t to the number of bits in a long.
2009-06-23 22:12:37 +00:00
Jeff Roberson
09c8a4cc21 - Fix non-SMP build by encapsulating idle spin logic in a macro.
Pointy hat to:	me
2009-04-29 23:04:31 +00:00
Jeff Roberson
113dda8a7c - Fix the FBSDID line. 2009-04-29 03:26:30 +00:00
Jeff Roberson
7b55ab0534 - Remove the bogus idle thread state code. This may have a race in it
and it only optimized out an ipi or mwait in very few cases.
 - Skip the adaptive idle code when running on SMT or HTT cores.  This
   just wastes cpu time that could be used on a busy thread on the same
   core.
 - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive.  Re-use
   CG_FLAG_THREAD to mean SMT or HTT.

Sponsored by:   Nokia
2009-04-29 03:15:43 +00:00
Jeff Roberson
53a6c8b3ac - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh
is calculated as 0 which causes errors elsewhere.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>

 - When sched_affinity() is called with a thread that is not curthread we
   need to handle the ON_RUNQ() case by adding the thread to the correct
   run queue.

Submitted by:	Justin Teller <justin.teller@gmail.com>

MFC after:	1 Week
2009-03-14 11:41:36 +00:00
Jeff Roberson
0d2cf8374a - Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
   something more reasonable such as sizeof("32").  This shouldn't have
   caused any ill effect until we run on machines with 1000000 or more
   cpus.
2009-01-25 07:35:10 +00:00
Jeff Roberson
8f51ad55e7 - Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py.  This allows developers to quickly
   create a graphical view of ktr data for any resource in the system.
 - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
   identifying records associated with a thread or cpu.
 - Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from:	attilio
Discussed with:	jhb
Sponsored by:	Nokia
2009-01-17 07:17:57 +00:00
Ivan Voras
59d9578919 Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by:	jeff (original version)
Approved by:	gnn (mentor) (original version)
2008-12-23 16:19:59 +00:00
John Baldwin
02f0ff6d92 When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by:	Ravi Murty @ Intel
Reviewed by:	jeff
2008-11-18 05:41:34 +00:00
Ivan Voras
aa880b9018 Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by:	gnn (mentor) (older version)
Pointed out by:	rdivacky
2008-11-02 23:11:20 +00:00
Ivan Voras
07095abf5d Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.

An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
   <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
   <flags></flags>
   <children>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
       <flags></flags>
     </group>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
       <flags></flags>
     </group>
   </children>
 </group>
</groups>

Reviewed by:	jeff
Approved by:	gnn (mentor)
2008-10-29 13:36:23 +00:00
Jeff Roberson
e980fff622 - Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick.  This pushes
   the value out of range and breaks priority calculation.

Reviewed by:	kib
Found by:	pho/nokia
Sponsored by:	Nokia
MFC after:	3 days
2008-07-19 05:13:47 +00:00
John Birrell
6f5f25e521 Add the vtime (virtual time) hooks for DTrace. 2008-05-25 01:44:58 +00:00
Jeff Roberson
6c47aaae12 - Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
 - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
   suspended in cpu specific states.  This function can fail and cause the
   scheduler to fall back to another mechanism (ipi).
 - Implement support for mwait in cpu_idle() on i386/amd64 machines that
   support it.  mwait is a higher performance way to synchronize cpus
   as compared to hlt & ipis.
 - Allow selecting the idle routine by name via sysctl machdep.idle.  This
   replaces machdep.cpu_idle_hlt.  Only idle routines supported by the
   current machine are permitted.

Sponsored by:	Nokia
2008-04-25 05:18:50 +00:00
Jeff Roberson
1690c6c1be - Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
   sched_clock() is called.
 - If the busy metric exceeds a threshold allow the idle thread to spin
   waiting for new work for a brief period to avoid using IPIs.  This
   reduces the cost on the sender and receiver as well as reducing wakeup
   latency considerably when it works.

Sponsored by:	Nokia
2008-04-17 09:56:01 +00:00
Jeff Roberson
8df78c41d6 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00