Commit Graph

245 Commits

Author SHA1 Message Date
jeff
b2f69d1b1e - Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick.  This pushes
   the value out of range and breaks priority calculation.

Reviewed by:	kib
Found by:	pho/nokia
Sponsored by:	Nokia
MFC after:	3 days
2008-07-19 05:13:47 +00:00
jb
1c6ecc547f Add the vtime (virtual time) hooks for DTrace. 2008-05-25 01:44:58 +00:00
jeff
14b586bf96 - Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
 - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
   suspended in cpu specific states.  This function can fail and cause the
   scheduler to fall back to another mechanism (ipi).
 - Implement support for mwait in cpu_idle() on i386/amd64 machines that
   support it.  mwait is a higher performance way to synchronize cpus
   as compared to hlt & ipis.
 - Allow selecting the idle routine by name via sysctl machdep.idle.  This
   replaces machdep.cpu_idle_hlt.  Only idle routines supported by the
   current machine are permitted.

Sponsored by:	Nokia
2008-04-25 05:18:50 +00:00
jeff
3f4fde5950 - Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
   sched_clock() is called.
 - If the busy metric exceeds a threshold allow the idle thread to spin
   waiting for new work for a brief period to avoid using IPIs.  This
   reduces the cost on the sender and receiver as well as reducing wakeup
   latency considerably when it works.

Sponsored by:	Nokia
2008-04-17 09:56:01 +00:00
jeff
9d30d1d7a4 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00
marcel
da8b8894d6 Support and switch to the ULE scheduler:
o  Implement IPI_PREEMPT,
o  Set td_lock for the thread being switched out,
o  For ULE & SMP, loop while td_lock points to blocked_lock for
   the thread being switched in,
o  Enable ULE by default in GENERIC and SKI,
2008-04-15 05:02:42 +00:00
jeff
85d3ffe23c - Allow static_boost to specify no boost with '0', traditional kernel
fixed pri boost with '1' or any priority less than the current thread's
   priority with a value greater than two.  Default the boost to
   PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
   threads in the kernel.  This prevents these user-threads from also
   being scheduled as if they are high fixed-priority kernel threads.
 - Restore the setting of lowpri in tdq_choose().  It has to be either here
   or in sched_switch().  I accidentally removed it from both places.

Tested by:	kris
2008-04-04 01:16:18 +00:00
jeff
c50de590cc - Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't
do this either.  Simply check P_NOLOAD.  It'd be nice if this was
   in a thread flag so we didn't have an extra cache miss every time we
   add and remove a thread from the run-queue.
2008-04-04 01:04:43 +00:00
jeff
a3f8e0c20d - Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
 - Compile kern_switch.c independently again and stop #include'ing it from
   schedulers.
 - Remove the ts_thread backpointers and convert most code to go from
   struct thread to struct td_sched.
 - Cleanup the ts_flags #define garbage that was causing us to sometimes
   do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
 - Export the kern.sched sysctl node in sysctl.h
2008-03-20 05:51:16 +00:00
jeff
19aab7bccf - ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread().  The td_sched pointer is already
   setup by thread_init anyway.
2008-03-20 03:06:33 +00:00
jeff
d6d07d2730 - Remove some dead code and comments related to KSE.
- Don't set tdq_lowpri on every switch, it should be precisely maintained now.
 - Add some comments to sched_thread_priority().
2008-03-19 07:36:37 +00:00
jeff
46f09d5bc3 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
jhb
c162b59cc2 Make the function prototype for cpu_search() match the declaration so that
this still compiles with gcc3.
2008-03-14 15:22:38 +00:00
jeff
acb93d599c Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
jeff
3b1acbdce2 - Pass the priority argument from *sleep() into sleepq and down into
sched_sleep().  This removes extra thread_lock() acquisition and
   allows the scheduler to decide what to do with the static boost.
 - Change the priority arguments to cv_* to match sleepq/msleep/etc.
   where 0 means no priority change.  Catch -1 in cv_broadcastpri() and
   convert it to 0 for now.
 - Set a flag when sleeping in a way that is compatible with swapping
   since direct priority comparisons are meaningless now.
 - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
   controls the boost behavior.  Turning it off gives better performance
   in some workloads but needs more investigation.
 - While we're modifying sleepq, change signal and broadcast to both
   return with the lock held as the lock was held on enter.

Reviewed by:	jhb, peter
2008-03-12 06:31:06 +00:00
jeff
540fa064d9 - Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
   when we adjusted the priority.  This involves the same number of
   branches as before so should perform identically without the extra
   fragility.

Tested by:	bz
Reviewed by:	bz
2008-03-10 22:48:27 +00:00
jeff
e53ae3b798 - Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0.  Set it directly in sched_setup().  This fixes traps on boot
   seen on some machines.

Reported by:	phk
2008-03-10 09:50:29 +00:00
jeff
14a6f96adb Reduce ULE context switch time by over 25%.
- Only calculate timeshare priorities once per tick or when a thread is woken
   from sleeping.
 - Keep the ts_runq pointer valid after all priority changes.
 - Call tdq_runq_add() directly from sched_switch() without passing in via
   tdq_add().  We don't need to adjust loads or runqs anymore.
 - Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by:	Nokia
2008-03-10 03:15:19 +00:00
jeff
7dc7c824ee - Add an implementation of sched_preempt() that avoids excessive IPIs.
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
   so the logical is identical and not repeated between tdq_notify() and
   sched_setpreempt().
 - In tdq_notify() don't set NEEDRESCHED as we may not actually own the thread lock
   this could have caused us to lose td_flags settings.
 - Garbage collect some tunables that are no longer relevant.
2008-03-10 01:32:01 +00:00
jeff
c12e39a76d Add support for the new cpu topology api:
- When searching for affinity search backwards in the tree from the last
   cpu we ran on while the thread still has affinity for the group.   This
   can take advantage of knowledge of shared L2 or L3 caches among a
   group of cores.
 - When searching for the least loaded cpu find the least loaded cpu via
   the least loaded path through the tree.  This load balances system bus
   links, individual cache levels, and hyper-threaded/SMT cores.
 - Make the periodic balancer recursively balance the highest and lowest
   loaded cpu across each link.

Add support for cpusets:
 - Convert the cpuset to a simple native cpumask_t while the kernel still
   only supports cpumask.
 - Pass the derived cpumask down through the cpu_search functions to
   restrict the result cpus.
 - Make the various steal functions resilient to failure since all threads
   can not run on all cpus any longer.

General improvements:
 - Precisely track the lowest priority thread on every runq with
   tdq_setlowpri().  Before it was more advisory but this ended up having
   pathological behaviors.
 - Remove many #ifdef SMP conditions to simplify the code.
 - Get rid of the old cumbersome tdq_group.  This is more naturally
   expressed via the cpu_group tree.

Sponsored by:	Nokia
Testing by:	kris
2008-03-02 08:20:59 +00:00
jeff
ad2a31513f - Remove the old smp cpu topology specification with a new, more flexible
tree structure that encodes the level of cache sharing and other
   properties.
 - Provide several convenience functions for creating one and two level
   cpu trees as well as a default flat topology.  The system now always
   has some topology.
 - On i386 and amd64 create a seperate level in the hierarchy for HTT
   and multi-core cpus.  This will allow the scheduler to intelligently
   load balance non-uniform cores.  Presently we don't detect what level
   of the cache hierarchy is shared at each level in the topology.
 - Add a mechanism for testing common topologies that have more information
   than the MD code is able to provide via the kern.smp.topology tunable.
   This should be considered a debugging tool only and not a stable api.

Sponsored by:	Nokia
2008-03-02 07:58:42 +00:00
jeff
3bd7de5a7c - Add a new sched_affinity() api to be used in the upcoming cpuset
implementation.
 - Add empty implementations of sched_affinity() to 4BSD and ULE.

Sponsored by:	Nokia
2008-03-02 07:19:35 +00:00
jeff
be58be75dd - sched_prio() should only adjust tdq_lowpri if the thread is running or on
a run-queue.  If the priority is numerically raised only change lowpri
   if we're certain it will be correct.  Some slop is allowed however
   previously we could erroneously raise lowpri for an idle cpu that a
   thread had recently run on which lead to errors in load balancing
   decisions.
2008-01-23 03:10:18 +00:00
jeff
7a1cea460a - When executing the 'tryself' branch in sched_pickcpu() look at the
lowest priority on the queue for the current cpu vs curthread's
   priority.  In the case that curthread is waking up many threads of a
   lower priority as would happen with a turnstile_broadcast() or wakeup()
   of many threads this prevents them from all ending up on the current cpu.
 - In sched_add() make the relationship between a scheduled ithread and
   the current cpu advisory rather than strict.  Only give the ithread
   affinity for the current cpu if it's actually being scheduled from
   a hardware interrupt.  This prevents it from migrating when it simply
   blocks on a lock.

Sponsored by:	Nokia
2008-01-15 09:03:09 +00:00
jeff
ebfbe60329 - Restore timeslicing code for all bit SCHED_FIFO priority classes.
Reported by:	Peter Jeremy <peterjeremy@optushome.com.au>
2008-01-05 04:47:31 +00:00
wkoszek
7d4fbe3a46 Make SCHED_ULE buildable with gcc3.
Reviewed by:	cognet (mentor), jeffr
Approved by:	cognet (mentor), jeffr
2007-12-21 23:30:18 +00:00
jeff
12adc443d6 - Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled.  There is no longer an embedded lock_profile_object
   in each lock.  Instead a list of lock_profile_objects is kept per-thread
   for each lock it may own.  The cnt_hold statistic is now always 0 to
   facilitate this.
 - Support shared locking by tracking individual lock instances and
   statistics in the per-thread per-instance lock_profile_object.
 - Make the lock profiling hash table a per-cpu singly linked list with a
   per-cpu static lock_prof allocator.  This removes the need for an array
   of spinlocks and reduces cache contention between cores.
 - Use a seperate hash for spinlocks and other locks so that only a
   critical_enter() is required and not a spinlock_enter() to modify the
   per-cpu tables.
 - Count time spent spinning in the lock statistics.
 - Remove the LOCK_PROFILE_SHARED option as it is always supported now.
 - Specifically drop and release the scheduler locks in both schedulers
   since we track owners now.

In collaboration with:	Kip Macy
Sponsored by:	Nokia
2007-12-15 23:13:31 +00:00
davidxu
3695a78889 Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days
2007-12-11 08:25:36 +00:00
julian
b2732e0c22 generally we are interested in what thread did something as
opposed to what process. Since threads by default have teh name of the
process unless over-written with more useful information, just print the
thread name instead.
2007-11-14 06:21:24 +00:00
grehan
7c82e0520d Cut over to ULE on PowerPC
kern/sched_ule.c - Add __powerpc__ to the list of supported architectures

powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE

powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct

powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
 updating the old thread's lock. Note: uniprocessor-only, will require
 modification for MP support.

powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.

Reviewed by:	marcel, jeffr (slightly older version)
2007-10-23 00:52:25 +00:00
sam
026e744752 ULE works fine on arm; allow it to be used
Reviewed by:	jeff, cognet, imp
MFC after:	1 week
2007-10-16 19:25:26 +00:00
jeff
2bbd1c2e70 - Bail out of tdq_idled if !smp_started or idle stealing is disabled. This
fixes a bug on UP machines with SMP kernels where the idle thread
   constantly switches after trying to steal work from the local cpu.
 - Make the idle stealing code more robust against self selection.
 - Prefer to steal from the cpu with the highest load that has at least one
   transferable thread.  Before we selected the cpu with the highest
   transferable count which excludes bound threads.

Collaborated with:	csjp
Approved by:		re
2007-10-08 23:50:39 +00:00
jeff
57102cf5ad - Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching.  This allows
   threads of equal priority to run but not lesser priority.

Discussed with:	davidxu
Reported by:	NIIMI Satoshi <sa2c@sa2c.net>
Approved by:	re
2007-10-08 23:45:24 +00:00
jeff
4a1456cf5d - Reassign the thread queue lock to newtd prior to switching. Assigning
after the switch leads to a race where the outgoing thread still owns
   the local queue lock while another cpu may switch it in.  This race
   is only possible on machines where cpu_switch can take significantly
   longer on different cpus which in practice means HTT machines with
   unfair thread scheduling algorithms.

Found by:	kris (of course)
Approved by:	re
2007-10-02 01:30:18 +00:00
jeff
de71038e60 - Move the rebalancer back into hardclock to prevent potential softclock
starvation caused by unbalanced interrupt loads.
 - Change the rebalancer to work on stathz ticks but retain randomization.
 - Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
   complex sequences of locks to avoid deadlock.

Reported by:	kris
Approved by:	re
2007-10-02 00:36:06 +00:00
jeff
644f587c97 - Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
value for kern.sched.preempt_thresh appropriately.  It can still by
   adjusted at runtime.  ULE will still use IPI_PREEMPT in certain
   migration situations.
 - Assert that we're not trying to compile ULE on an unsupported
   architecture.  To date, I believe only i386 and amd64 have implemented
   the third cpu switch argument required.

Approved by:	re
2007-09-27 16:39:27 +00:00
jeff
ea91086eb3 - Bound the interactivity score so that it cannot become negative.
Approved by:	re
2007-09-24 00:28:54 +00:00
jeff
280a86b2ee - Improve grammar. s/it's/its/.
- Improve load long-term load balancer by always IPIing exactly once.
   Previously the delay after rebalancing could cause problems with
   uneven workloads.
 - Allow nice to have a linear effect on the interactivity score.  This
   allows negatively niced programs to stay interactive longer.  It may be
   useful with very expensive Xorg servers under high loads.  In general
   it should not be necessary to alter the nice level to improve interactive
   response.  We may also want to consider never allowing positively niced
   processes to become interactive at all.
 - Initialize ccpu to 0 rather than 0.0.  The decimal point was leftover
   from when the code was copied from 4bsd.  ccpu is 0 in ULE because ULE
   only exports weighted cpu values.

Reported by:	Steve Kargl (Load balancing problem)
Approved by:	re
2007-09-22 02:20:14 +00:00
jeff
bc0eadb21d - Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
   in/out.  ULE does not have a periodic timer that scans all threads in
   the system and as such maintaining a per-second counter is difficult.
 - Change computations requiring the unit in seconds to subtract ticks
   and divide by hz.  This does make the wraparound condition hz times
   more frequent but this is still in the range of several months to
   years and the adverse effects are minimal.

Approved by:	re
2007-09-21 04:10:23 +00:00
jeff
3fc0f8b973 - Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
   previously the sched_lock.  These bugs have existed for some time.
 - Allow swapout to try each thread in a process individually and then
   swapin the whole process if any of these fail.  This allows us to move
   most scheduler related swap flags into td_flags.
 - Keep ki_sflag for backwards compat but change all in source tools to
   use the new and more correct location of P_INMEM.

Reported by:	pho
Reviewed by:	attilio, kib
Approved by:	re (kensmith)
2007-09-17 05:31:39 +00:00
jeff
0f3cc9a72e - Set steal_thresh to log2(ncpus). This improves idle-time load balancing
on 2cpu machines by reducing it to 1 by default.  This improves loaded
   operation on 8cpu machines by increasing it to 3 where the extra idle
   time is not as critical.

Approved by:	re
2007-08-20 06:34:20 +00:00
jeff
155be8b518 - Fix one line that erroneously crept in my last commit.
Approved by:	re
2007-08-04 01:21:28 +00:00
jeff
8b42bd66ad - Share scheduler locks between hyper-threaded cores to protect the
tdq_group structure.  Hyper-threaded cores won't really benefit from
   seperate locks anyway.
 - Seperate out the migration case from sched_switch to simplify the main
   switch code.  We only migrate here if called via sched_bind().
 - When preempted place the preempted thread back in the same queue at
   the head.
 - Improve the cpu group and topology infrastructure.

Tested by:	many on current@
Approved by:	re
2007-08-03 23:38:46 +00:00
jeff
e2ebe96ef4 - Refine the load balancer to improve buildkernel times on dual core
machines.
 - Leave the long-term load balancer running by default once per second.
 - Enable stealing load from the idle thread only when the remote processor
   has more than two transferable tasks.  Setting this to one further
   improves buildworld.  Setting it higher improves mysql.
 - Remove the bogus pick_zero option.  I had not intended to commit this.
 - Entirely disallow migration for threads with SRQ_YIELDING set.  This
   balances out the extra migration allowed for with the load balancers.
   It also makes pick_pri perform better as I had anticipated.

Tested by:	Dmitry Morozovsky <marck@rinet.ru>
Approved by:	re
2007-07-19 20:03:15 +00:00
jeff
550dacee12 - When newtd is specified to sched_switch() it was not being initialized
properly.  We have to temporarily unlock the TDQ lock so we can lock
   the thread and add it to the run queue.  This is used only for KSE.
 - When we add a thread from the tdq_move() via sched_balance() we need to
   ipi the target if it's sitting in the idle thread or it'll never run.

Reported by:	Rene Landan
Approved by:	re
2007-07-19 19:51:45 +00:00
jeff
119fa6328d ULE 3.0: Fine grain scheduler locking and affinity improvements. This has
been in development for over 6 months as SCHED_SMP.
 - Implement one spin lock per thread-queue.  Threads assigned to a
   run-queue point to this lock via td_lock.
 - Improve the facility for assigning threads to CPUs now that sched_lock
   contention no longer dominates scheduling decisions on larger SMP
   machines.
 - Re-write idle time stealing in an attempt to make it less damaging to
   general performance.  This is still disabled by default. See
   kern.sched.steal_idle.
 - Call the long-term load balancer from a callout rather than sched_clock()
   so there are no locks held.  This is disabled by default.  See
   kern.sched.balance.
 - Parameterize many scheduling decisions via sysctls.  Try to document
   these via sysctl descriptions.
 - General structural and naming cleanups.
 - Document each function with comments.

Tested by:	current@ amd64, x86, UP, SMP.
Approved by:	re
2007-07-17 22:53:23 +00:00
jeff
6164fe7d05 - Fix an off by one error in sched_pri_range.
- In tdq_choose() only assert that a thread does not have too high a
   priority (low value) for the queue we removed it from.  This will catch
   bugs in priority elevation.  It's not a serious error for the thread
   to have too low a priority as we don't change queues in this case as
   an optimization.

Reported by:	kris
2007-06-15 19:33:58 +00:00
jeff
bc31b141bb - Move some common code out of sched_fork_exit() and back into fork_exit(). 2007-06-12 07:47:09 +00:00
jeff
7a9c95c100 - Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by:	tegge
2007-06-06 03:40:47 +00:00