Commit Graph

298 Commits

Author SHA1 Message Date
John Baldwin
02f0ff6d92 When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by:	Ravi Murty @ Intel
Reviewed by:	jeff
2008-11-18 05:41:34 +00:00
Ivan Voras
aa880b9018 Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by:	gnn (mentor) (older version)
Pointed out by:	rdivacky
2008-11-02 23:11:20 +00:00
Ivan Voras
07095abf5d Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.

An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
   <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
   <flags></flags>
   <children>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
       <flags></flags>
     </group>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
       <flags></flags>
     </group>
   </children>
 </group>
</groups>

Reviewed by:	jeff
Approved by:	gnn (mentor)
2008-10-29 13:36:23 +00:00
Jeff Roberson
e980fff622 - Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick.  This pushes
   the value out of range and breaks priority calculation.

Reviewed by:	kib
Found by:	pho/nokia
Sponsored by:	Nokia
MFC after:	3 days
2008-07-19 05:13:47 +00:00
John Birrell
6f5f25e521 Add the vtime (virtual time) hooks for DTrace. 2008-05-25 01:44:58 +00:00
Jeff Roberson
6c47aaae12 - Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
 - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
   suspended in cpu specific states.  This function can fail and cause the
   scheduler to fall back to another mechanism (ipi).
 - Implement support for mwait in cpu_idle() on i386/amd64 machines that
   support it.  mwait is a higher performance way to synchronize cpus
   as compared to hlt & ipis.
 - Allow selecting the idle routine by name via sysctl machdep.idle.  This
   replaces machdep.cpu_idle_hlt.  Only idle routines supported by the
   current machine are permitted.

Sponsored by:	Nokia
2008-04-25 05:18:50 +00:00
Jeff Roberson
1690c6c1be - Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
   sched_clock() is called.
 - If the busy metric exceeds a threshold allow the idle thread to spin
   waiting for new work for a brief period to avoid using IPIs.  This
   reduces the cost on the sender and receiver as well as reducing wakeup
   latency considerably when it works.

Sponsored by:	Nokia
2008-04-17 09:56:01 +00:00
Jeff Roberson
8df78c41d6 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00
Marcel Moolenaar
495168ba8d Support and switch to the ULE scheduler:
o  Implement IPI_PREEMPT,
o  Set td_lock for the thread being switched out,
o  For ULE & SMP, loop while td_lock points to blocked_lock for
   the thread being switched in,
o  Enable ULE by default in GENERIC and SKI,
2008-04-15 05:02:42 +00:00
Jeff Roberson
0502fe2e43 - Allow static_boost to specify no boost with '0', traditional kernel
fixed pri boost with '1' or any priority less than the current thread's
   priority with a value greater than two.  Default the boost to
   PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
   threads in the kernel.  This prevents these user-threads from also
   being scheduled as if they are high fixed-priority kernel threads.
 - Restore the setting of lowpri in tdq_choose().  It has to be either here
   or in sched_switch().  I accidentally removed it from both places.

Tested by:	kris
2008-04-04 01:16:18 +00:00
Jeff Roberson
03d17db7d5 - Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't
do this either.  Simply check P_NOLOAD.  It'd be nice if this was
   in a thread flag so we didn't have an extra cache miss every time we
   add and remove a thread from the run-queue.
2008-04-04 01:04:43 +00:00
Jeff Roberson
9727e63745 - Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
 - Compile kern_switch.c independently again and stop #include'ing it from
   schedulers.
 - Remove the ts_thread backpointers and convert most code to go from
   struct thread to struct td_sched.
 - Cleanup the ts_flags #define garbage that was causing us to sometimes
   do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
 - Export the kern.sched sysctl node in sysctl.h
2008-03-20 05:51:16 +00:00
Jeff Roberson
8b16c208e6 - ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread().  The td_sched pointer is already
   setup by thread_init anyway.
2008-03-20 03:06:33 +00:00
Jeff Roberson
6d55b3ec9c - Remove some dead code and comments related to KSE.
- Don't set tdq_lowpri on every switch, it should be precisely maintained now.
 - Add some comments to sched_thread_priority().
2008-03-19 07:36:37 +00:00
Jeff Roberson
374ae2a393 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
John Baldwin
d628fbfa98 Make the function prototype for cpu_search() match the declaration so that
this still compiles with gcc3.
2008-03-14 15:22:38 +00:00
Jeff Roberson
6617724c5f Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
Jeff Roberson
c5aa6b581d - Pass the priority argument from *sleep() into sleepq and down into
sched_sleep().  This removes extra thread_lock() acquisition and
   allows the scheduler to decide what to do with the static boost.
 - Change the priority arguments to cv_* to match sleepq/msleep/etc.
   where 0 means no priority change.  Catch -1 in cv_broadcastpri() and
   convert it to 0 for now.
 - Set a flag when sleeping in a way that is compatible with swapping
   since direct priority comparisons are meaningless now.
 - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
   controls the boost behavior.  Turning it off gives better performance
   in some workloads but needs more investigation.
 - While we're modifying sleepq, change signal and broadcast to both
   return with the lock held as the lock was held on enter.

Reviewed by:	jhb, peter
2008-03-12 06:31:06 +00:00
Jeff Roberson
c143ac21af - Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
   when we adjusted the priority.  This involves the same number of
   branches as before so should perform identically without the extra
   fragility.

Tested by:	bz
Reviewed by:	bz
2008-03-10 22:48:27 +00:00
Jeff Roberson
7217d8d1ee - Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0.  Set it directly in sched_setup().  This fixes traps on boot
   seen on some machines.

Reported by:	phk
2008-03-10 09:50:29 +00:00
Jeff Roberson
73daf66f41 Reduce ULE context switch time by over 25%.
- Only calculate timeshare priorities once per tick or when a thread is woken
   from sleeping.
 - Keep the ts_runq pointer valid after all priority changes.
 - Call tdq_runq_add() directly from sched_switch() without passing in via
   tdq_add().  We don't need to adjust loads or runqs anymore.
 - Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by:	Nokia
2008-03-10 03:15:19 +00:00
Jeff Roberson
ff256d9c47 - Add an implementation of sched_preempt() that avoids excessive IPIs.
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
   so the logical is identical and not repeated between tdq_notify() and
   sched_setpreempt().
 - In tdq_notify() don't set NEEDRESCHED as we may not actually own the thread lock
   this could have caused us to lose td_flags settings.
 - Garbage collect some tunables that are no longer relevant.
2008-03-10 01:32:01 +00:00
Jeff Roberson
62fa74d95a Add support for the new cpu topology api:
- When searching for affinity search backwards in the tree from the last
   cpu we ran on while the thread still has affinity for the group.   This
   can take advantage of knowledge of shared L2 or L3 caches among a
   group of cores.
 - When searching for the least loaded cpu find the least loaded cpu via
   the least loaded path through the tree.  This load balances system bus
   links, individual cache levels, and hyper-threaded/SMT cores.
 - Make the periodic balancer recursively balance the highest and lowest
   loaded cpu across each link.

Add support for cpusets:
 - Convert the cpuset to a simple native cpumask_t while the kernel still
   only supports cpumask.
 - Pass the derived cpumask down through the cpu_search functions to
   restrict the result cpus.
 - Make the various steal functions resilient to failure since all threads
   can not run on all cpus any longer.

General improvements:
 - Precisely track the lowest priority thread on every runq with
   tdq_setlowpri().  Before it was more advisory but this ended up having
   pathological behaviors.
 - Remove many #ifdef SMP conditions to simplify the code.
 - Get rid of the old cumbersome tdq_group.  This is more naturally
   expressed via the cpu_group tree.

Sponsored by:	Nokia
Testing by:	kris
2008-03-02 08:20:59 +00:00
Jeff Roberson
81aa71755b - Remove the old smp cpu topology specification with a new, more flexible
tree structure that encodes the level of cache sharing and other
   properties.
 - Provide several convenience functions for creating one and two level
   cpu trees as well as a default flat topology.  The system now always
   has some topology.
 - On i386 and amd64 create a seperate level in the hierarchy for HTT
   and multi-core cpus.  This will allow the scheduler to intelligently
   load balance non-uniform cores.  Presently we don't detect what level
   of the cache hierarchy is shared at each level in the topology.
 - Add a mechanism for testing common topologies that have more information
   than the MD code is able to provide via the kern.smp.topology tunable.
   This should be considered a debugging tool only and not a stable api.

Sponsored by:	Nokia
2008-03-02 07:58:42 +00:00
Jeff Roberson
885d51a38a - Add a new sched_affinity() api to be used in the upcoming cpuset
implementation.
 - Add empty implementations of sched_affinity() to 4BSD and ULE.

Sponsored by:	Nokia
2008-03-02 07:19:35 +00:00
Jeff Roberson
317da70593 - sched_prio() should only adjust tdq_lowpri if the thread is running or on
a run-queue.  If the priority is numerically raised only change lowpri
   if we're certain it will be correct.  Some slop is allowed however
   previously we could erroneously raise lowpri for an idle cpu that a
   thread had recently run on which lead to errors in load balancing
   decisions.
2008-01-23 03:10:18 +00:00
Jeff Roberson
a755f21484 - When executing the 'tryself' branch in sched_pickcpu() look at the
lowest priority on the queue for the current cpu vs curthread's
   priority.  In the case that curthread is waking up many threads of a
   lower priority as would happen with a turnstile_broadcast() or wakeup()
   of many threads this prevents them from all ending up on the current cpu.
 - In sched_add() make the relationship between a scheduled ithread and
   the current cpu advisory rather than strict.  Only give the ithread
   affinity for the current cpu if it's actually being scheduled from
   a hardware interrupt.  This prevents it from migrating when it simply
   blocks on a lock.

Sponsored by:	Nokia
2008-01-15 09:03:09 +00:00
Jeff Roberson
fd0b8c783d - Restore timeslicing code for all bit SCHED_FIFO priority classes.
Reported by:	Peter Jeremy <peterjeremy@optushome.com.au>
2008-01-05 04:47:31 +00:00
Wojciech A. Koszek
731016fe36 Make SCHED_ULE buildable with gcc3.
Reviewed by:	cognet (mentor), jeffr
Approved by:	cognet (mentor), jeffr
2007-12-21 23:30:18 +00:00
Jeff Roberson
eea4f254fe - Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled.  There is no longer an embedded lock_profile_object
   in each lock.  Instead a list of lock_profile_objects is kept per-thread
   for each lock it may own.  The cnt_hold statistic is now always 0 to
   facilitate this.
 - Support shared locking by tracking individual lock instances and
   statistics in the per-thread per-instance lock_profile_object.
 - Make the lock profiling hash table a per-cpu singly linked list with a
   per-cpu static lock_prof allocator.  This removes the need for an array
   of spinlocks and reduces cache contention between cores.
 - Use a seperate hash for spinlocks and other locks so that only a
   critical_enter() is required and not a spinlock_enter() to modify the
   per-cpu tables.
 - Count time spent spinning in the lock statistics.
 - Remove the LOCK_PROFILE_SHARED option as it is always supported now.
 - Specifically drop and release the scheduler locks in both schedulers
   since we track owners now.

In collaboration with:	Kip Macy
Sponsored by:	Nokia
2007-12-15 23:13:31 +00:00
David Xu
435806d31b Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days
2007-12-11 08:25:36 +00:00
Julian Elischer
431f890614 generally we are interested in what thread did something as
opposed to what process. Since threads by default have teh name of the
process unless over-written with more useful information, just print the
thread name instead.
2007-11-14 06:21:24 +00:00
Peter Grehan
cbdd62ad04 Cut over to ULE on PowerPC
kern/sched_ule.c - Add __powerpc__ to the list of supported architectures

powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE

powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct

powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
 updating the old thread's lock. Note: uniprocessor-only, will require
 modification for MP support.

powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.

Reviewed by:	marcel, jeffr (slightly older version)
2007-10-23 00:52:25 +00:00
Sam Leffler
58590eb06b ULE works fine on arm; allow it to be used
Reviewed by:	jeff, cognet, imp
MFC after:	1 week
2007-10-16 19:25:26 +00:00
Jeff Roberson
88f530cc25 - Bail out of tdq_idled if !smp_started or idle stealing is disabled. This
fixes a bug on UP machines with SMP kernels where the idle thread
   constantly switches after trying to steal work from the local cpu.
 - Make the idle stealing code more robust against self selection.
 - Prefer to steal from the cpu with the highest load that has at least one
   transferable thread.  Before we selected the cpu with the highest
   transferable count which excludes bound threads.

Collaborated with:	csjp
Approved by:		re
2007-10-08 23:50:39 +00:00
Jeff Roberson
05dc0eb204 - Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching.  This allows
   threads of equal priority to run but not lesser priority.

Discussed with:	davidxu
Reported by:	NIIMI Satoshi <sa2c@sa2c.net>
Approved by:	re
2007-10-08 23:45:24 +00:00
Jeff Roberson
59c6813475 - Reassign the thread queue lock to newtd prior to switching. Assigning
after the switch leads to a race where the outgoing thread still owns
   the local queue lock while another cpu may switch it in.  This race
   is only possible on machines where cpu_switch can take significantly
   longer on different cpus which in practice means HTT machines with
   unfair thread scheduling algorithms.

Found by:	kris (of course)
Approved by:	re
2007-10-02 01:30:18 +00:00
Jeff Roberson
7fcf154aef - Move the rebalancer back into hardclock to prevent potential softclock
starvation caused by unbalanced interrupt loads.
 - Change the rebalancer to work on stathz ticks but retain randomization.
 - Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
   complex sequences of locks to avoid deadlock.

Reported by:	kris
Approved by:	re
2007-10-02 00:36:06 +00:00
Jeff Roberson
02e2d6b445 - Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
value for kern.sched.preempt_thresh appropriately.  It can still by
   adjusted at runtime.  ULE will still use IPI_PREEMPT in certain
   migration situations.
 - Assert that we're not trying to compile ULE on an unsupported
   architecture.  To date, I believe only i386 and amd64 have implemented
   the third cpu switch argument required.

Approved by:	re
2007-09-27 16:39:27 +00:00
Jeff Roberson
e270652ba3 - Bound the interactivity score so that it cannot become negative.
Approved by:	re
2007-09-24 00:28:54 +00:00
Jeff Roberson
a5423ea313 - Improve grammar. s/it's/its/.
- Improve load long-term load balancer by always IPIing exactly once.
   Previously the delay after rebalancing could cause problems with
   uneven workloads.
 - Allow nice to have a linear effect on the interactivity score.  This
   allows negatively niced programs to stay interactive longer.  It may be
   useful with very expensive Xorg servers under high loads.  In general
   it should not be necessary to alter the nice level to improve interactive
   response.  We may also want to consider never allowing positively niced
   processes to become interactive at all.
 - Initialize ccpu to 0 rather than 0.0.  The decimal point was leftover
   from when the code was copied from 4bsd.  ccpu is 0 in ULE because ULE
   only exports weighted cpu values.

Reported by:	Steve Kargl (Load balancing problem)
Approved by:	re
2007-09-22 02:20:14 +00:00
Jeff Roberson
54b0e65f84 - Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
   in/out.  ULE does not have a periodic timer that scans all threads in
   the system and as such maintaining a per-second counter is difficult.
 - Change computations requiring the unit in seconds to subtract ticks
   and divide by hz.  This does make the wraparound condition hz times
   more frequent but this is still in the range of several months to
   years and the adverse effects are minimal.

Approved by:	re
2007-09-21 04:10:23 +00:00
Jeff Roberson
b61ce5b0e6 - Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
   previously the sched_lock.  These bugs have existed for some time.
 - Allow swapout to try each thread in a process individually and then
   swapin the whole process if any of these fail.  This allows us to move
   most scheduler related swap flags into td_flags.
 - Keep ki_sflag for backwards compat but change all in source tools to
   use the new and more correct location of P_INMEM.

Reported by:	pho
Reviewed by:	attilio, kib
Approved by:	re (kensmith)
2007-09-17 05:31:39 +00:00
Jeff Roberson
9862717afe - Set steal_thresh to log2(ncpus). This improves idle-time load balancing
on 2cpu machines by reducing it to 1 by default.  This improves loaded
   operation on 8cpu machines by increasing it to 3 where the extra idle
   time is not as critical.

Approved by:	re
2007-08-20 06:34:20 +00:00
Jeff Roberson
3a78f9658b - Fix one line that erroneously crept in my last commit.
Approved by:	re
2007-08-04 01:21:28 +00:00
Jeff Roberson
c47f202b45 - Share scheduler locks between hyper-threaded cores to protect the
tdq_group structure.  Hyper-threaded cores won't really benefit from
   seperate locks anyway.
 - Seperate out the migration case from sched_switch to simplify the main
   switch code.  We only migrate here if called via sched_bind().
 - When preempted place the preempted thread back in the same queue at
   the head.
 - Improve the cpu group and topology infrastructure.

Tested by:	many on current@
Approved by:	re
2007-08-03 23:38:46 +00:00
Jeff Roberson
28994a5852 - Refine the load balancer to improve buildkernel times on dual core
machines.
 - Leave the long-term load balancer running by default once per second.
 - Enable stealing load from the idle thread only when the remote processor
   has more than two transferable tasks.  Setting this to one further
   improves buildworld.  Setting it higher improves mysql.
 - Remove the bogus pick_zero option.  I had not intended to commit this.
 - Entirely disallow migration for threads with SRQ_YIELDING set.  This
   balances out the extra migration allowed for with the load balancers.
   It also makes pick_pri perform better as I had anticipated.

Tested by:	Dmitry Morozovsky <marck@rinet.ru>
Approved by:	re
2007-07-19 20:03:15 +00:00
Jeff Roberson
08c9a16c4f - When newtd is specified to sched_switch() it was not being initialized
properly.  We have to temporarily unlock the TDQ lock so we can lock
   the thread and add it to the run queue.  This is used only for KSE.
 - When we add a thread from the tdq_move() via sched_balance() we need to
   ipi the target if it's sitting in the idle thread or it'll never run.

Reported by:	Rene Landan
Approved by:	re
2007-07-19 19:51:45 +00:00
Jeff Roberson
ae7a6b38d5 ULE 3.0: Fine grain scheduler locking and affinity improvements. This has
been in development for over 6 months as SCHED_SMP.
 - Implement one spin lock per thread-queue.  Threads assigned to a
   run-queue point to this lock via td_lock.
 - Improve the facility for assigning threads to CPUs now that sched_lock
   contention no longer dominates scheduling decisions on larger SMP
   machines.
 - Re-write idle time stealing in an attempt to make it less damaging to
   general performance.  This is still disabled by default. See
   kern.sched.steal_idle.
 - Call the long-term load balancer from a callout rather than sched_clock()
   so there are no locks held.  This is disabled by default.  See
   kern.sched.balance.
 - Parameterize many scheduling decisions via sysctls.  Try to document
   these via sysctl descriptions.
 - General structural and naming cleanups.
 - Document each function with comments.

Tested by:	current@ amd64, x86, UP, SMP.
Approved by:	re
2007-07-17 22:53:23 +00:00
Jeff Roberson
dda713dfb8 - Fix an off by one error in sched_pri_range.
- In tdq_choose() only assert that a thread does not have too high a
   priority (low value) for the queue we removed it from.  This will catch
   bugs in priority elevation.  It's not a serious error for the thread
   to have too low a priority as we don't change queues in this case as
   an optimization.

Reported by:	kris
2007-06-15 19:33:58 +00:00
Jeff Roberson
fe54587ffa - Move some common code out of sched_fork_exit() and back into fork_exit(). 2007-06-12 07:47:09 +00:00
Jeff Roberson
710eacdc5f - Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by:	tegge
2007-06-06 03:40:47 +00:00
Jeff Roberson
95e3a0bca3 - Better fix for previous error; use DEVOLATILE on the td_lock pointer
it can actually sometimes be something other than sched_lock even on
   schedulers which rely on a global scheduler lock.

Tested by:	kan
2007-06-05 04:12:46 +00:00
Jeff Roberson
c219b097af - Pass &sched_lock as the third argument to cpu_switch() as this will
always be the correct lock and we don't get volatile warnings this
   way.

Pointed out by:	kan
2007-06-05 03:46:54 +00:00
Jeff Roberson
36b369163b - Define TDQ_ID() for the !SMP case.
- Default pick_pri to off.  It is not faster in most cases.
2007-06-05 02:53:51 +00:00
Jeff Roberson
7b20fb19fb Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
   similar to solaris's container locking.
 - A per-process spinlock is now used to protect the queue of threads,
   thread count, suspension count, p_sflags, and other process
   related scheduling fields.
 - The new thread lock is actually a pointer to a spinlock for the
   container that the thread is currently owned by.  The container may
   be a turnstile, sleepqueue, or run queue.
 - thread_lock() is now used to protect access to thread related scheduling
   fields.  thread_unlock() unlocks the lock and thread_set_lock()
   implements the transition from one lock to another.
 - A new "blocked_lock" is used in cases where it is not safe to hold the
   actual thread's lock yet we must prevent access to the thread.
 - sched_throw() and sched_fork_exit() are introduced to allow the
   schedulers to fix-up locking at these points.
 - Add some minor infrastructure for optionally exporting scheduler
   statistics that were invaluable in solving performance problems with
   this patch.  Generally these statistics allow you to differentiate
   between different causes of context switches.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
Kip Macy
fb1e3ccd7e Schedule the ithread on the same cpu as the interrupt
Tested by: kmacy
Submitted by: jeffr
2007-04-20 05:45:46 +00:00
Jeff Roberson
52bc574cc7 - Handle the case where slptime == runtime.
Submitted by:	Atoine Brodin
2007-03-17 23:32:48 +00:00
Jeff Roberson
4499aff6ec - Cast the intermediate value in priority computtion back down to
unsigned char.  Weirdly, casting the 1 constant to u_char still produces
   a signed integer result that is then used in the % computation.  This
   avoids that mess all together and causes a 0 pri to turn into 255 % 64
   as we expect.

Reported by:	kkenn (about 4 times, thanks)
2007-03-17 18:13:32 +00:00
Julian Elischer
486a941418 Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permenent and never chenges for the life of the
system so it doesn't need locking.
2007-03-08 06:44:34 +00:00
Kip Macy
fe68a91631 general LOCK_PROFILING cleanup
- only collect timestamps when a lock is contested - this reduces the overhead
  of collecting profiles from 20x to 5x

- remove unused function from subr_lock.c

- generalize cnt_hold and cnt_lock statistics to be kept for all locks

- NOTE: rwlock profiling generates invalid statistics (and most likely always has)
  someone familiar with that should review
2007-02-26 08:26:44 +00:00
Jeff Roberson
ed0e8f2fe9 - Change types for necent runq additions to u_char rather than int.
- Fix these types in ULE as well.  This fixes bugs in priority index
   calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by:	kkenn using pho's stress2.
2007-02-08 01:52:25 +00:00
Jeff Roberson
fc3a97dcb7 - Implement much more intelligent ipi sending. This algorithm tries to
minimize IPIs and rescheduling when scheduling like tasks while keeping
   latency low for important threads.
   1) An idle thread is running.
   2) The current thread is worse than realtime and the new thread is
      better than realtime.  Realtime to realtime doesn't preempt.
   3) The new thread's priority is less than the threshold.
2007-01-25 23:51:59 +00:00
Jeff Roberson
1461899028 - Get rid of the unused DIDRUN flag. This was really only present to
support sched_4bsd.
 - Rename the KTR level for non schedgraph parsed events.  They take event
   space from things we'd like to graph.
 - Reset our slice value after we sleep.  The slice is simply there to
   prevent starvation among equal priorities.  A thread which had almost
   exhausted it's slice and then slept doesn't need to be rescheduled a
   tick after it wakes up.
 - Set the maximum slice value to a more conservative 100ms now that it is
   more accurately enforced.
2007-01-25 19:14:11 +00:00
Jeff Roberson
9a93305a2e - With a sleep time over 2097 seconds hzticks and slptime could end up
negative.  Use unsigned integers for sleep and run time so this doesn't
   disturb sched_interact_score().  This should fix the invalid interactive
   priority panics reported by several users.
2007-01-24 18:18:43 +00:00
Jeff Roberson
7a5e5e2a59 - Catch up to setrunqueue/choosethread/etc. api changes.
- Define our own maybe_preempt() as sched_preempt().  We want to be able
   to preempt idlethread in all cases.
 - Define our idlethread to require preemption to exit.
 - Get the cpu estimation tick from sched_tick() so we don't have to worry
   about errors from a sampling interval that differs from the time
   domain.  This was the source of sched_priority prints/panics and
   inaccurate pctcpu display in top.
2007-01-23 08:50:34 +00:00
Jeff Roberson
5cea64d54f - Disable the long-term load balancer. I believe that steal_busy works
better and gives more predictable results.
2007-01-20 21:24:05 +00:00
Jeff Roberson
c95d2db298 - We do need to IPI the idlethread on some systems. It may be stuck in
a power saving mode otherwise.
 - If the thread is already bound in sched_bind() unbind it before
   re-binding it to a new cpu.  I don't like these semantics but they are
   expected by some code in the tree.  Patch by jkoshy.
2007-01-20 17:03:33 +00:00
Jeff Roberson
6b2f763f7c - In tdq_transfer() always set NEEDRESCHED when necessary regardless of
the ipi settings.  If NEEDRESCHED is set and an ipi is later delivered
   it will clear it rather than cause extra context switches.  However, if
   we miss setting it we can have terrible latency.
 - In sched_bind() correctly implement bind.  Also be slightly more
   tolerant of code which calls bind multiple times.  However, we don't
   change binding if another call is made with a different cpu.  This
   does not presently work with hwpmc which I believe should be changed.
2007-01-20 09:03:43 +00:00
Jeff Roberson
7b8bfa0de9 Major revamp of ULE's cpu load balancing:
- Switch back to direct modification of remote CPU run queues.  This added
   a lot of complexity with questionable gain.  It's easy enough to
   reimplement if it's shown to help on huge machines.
 - Re-implement the old tdq_transfer() call as tdq_pickidle().  Change
   sched_add() so we have selectable cpu choosers and simplify the logic
   a bit here.
 - Implement tdq_pickpri() as the new default cpu chooser.  This algorithm
   is similar to Solaris in that it tries to always run the threads with
   the best priorities.  It is actually slightly more complex than
   solaris's algorithm because we also tend to favor the local cpu over
   other cpus which has a boost in latency but also potentially enables
   cache sharing between the waking thread and the woken thread.
 - Add a bunch of tunables that can be used to measure effects of different
   load balancing strategies.  Most of these will go away once the
   algorithm is more definite.
 - Add a new mechanism to steal threads from busy cpus when we idle.  This
   is enabled with kern.sched.steal_busy and kern.sched.busy_thresh.  The
   threshold is the required length of a tdq's run queue before another
   cpu will be able to steal runnable threads.  This prevents most queue
   imbalances that contribute the long latencies.
2007-01-19 21:56:08 +00:00
Jeff Roberson
eddb4efacd - Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer
divide faults in roundup() later if it is able to return 0.  For some
   reason this bug only shows up on my laptop and not my testboxes.
2007-01-06 12:33:43 +00:00
Jeff Roberson
1e516cf534 - Fix the sched_priority() invalid priority bugs. Use roundup() instead
of max() when computing the divisor in SCHED_TICK_PRI().  This prevents
   cases where rounding down would allow the quotient to exceed
   SCHED_PRI_RANGE.
 - Garbage collect some unused flags and fields.
 - Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
   duplicated this functionality.
 - Re-enable the rebalancer by default and fix the sysctl so it can be
   modified.
2007-01-06 08:44:13 +00:00
Jeff Roberson
9330bbbb61 - Don't IPI unless we're going to interrupt something exiting in the kernel.
otherwise we can afford the latency.  This makes a significant performance
   improvement.
2007-01-06 02:34:23 +00:00
Jeff Roberson
155b6ca12b - Fix a comparison in sched_choose() that caused cpus to be constantly
marked idle, thus breaking cpu load balancing.
 - Change sched_interact_update() to fix cases where the stored history
   has expanded significantly rather than handling them in the callers.  This
   fixes a case where sched_priority() could compute a bad value.
 - Add a sysctl to disable the global load balancer for experimentation.
2007-01-05 23:45:38 +00:00
Jeff Roberson
8ab80cf009 - ftick was initialized to -1 for init and any of it's children. Fix this by
setting ftick = ltick = ticks in schedinit().
 - Update the priority when we are pulled off of the run queue and when we
   are inserted onto the run queue so that it more accurately reflects our
   present status.  This is important for efficient priority propagation
   functioning.
 - Move the frequency test into sched_pctcpu_update() so we don't repeat it
   each time we'd like to call it.
 - Put some temporary work-around code in sched_priority() in case the tick
   mechanism produces a bad priority.  Eventually this should revert to an
   assert again.
2007-01-05 08:50:38 +00:00
Jeff Roberson
3f872f85d2 - Only allow the tdq_idx to increase by one each tick rather than up to
the most recently chosen index.  This significantly improves nice
   behavior.  This allows a lower priority thread to run some multiple of
   times before the higher priority thread makes it to the front of
   the queue.  A nice +20 cpu hog now only gets ~5% of the cpu when running
   with a nice 0 cpu hog and about 1.5% with a nice -20 hog.  A nice
   difference of 1 makes a 4% difference in cpu usage between two hogs.
 - Track a seperate insert and removal index.  When the removal index is
   empty it is updated to point at the current insert index.
 - Don't remove and re-add a thread to the runq when it is being adjusted
   down in priority.
 - Pull some conditional code out of sched_tick().  It's looking a bit
   large now.
2007-01-04 12:16:19 +00:00
Jeff Roberson
e7d50326de ULE 2.0:
- Remove the double queue mechanism for timeshare threads.  It was slow
   due to excess cache lines in play, caused suboptimal scheduling behavior
   with niced and other non-interactive processes, complicated priority
   lending, etc.
 - Use a circular queue with a floating starting index for timeshare threads.
   Enforces fairness by moving the insertion point closer to threads with
   worse priorities over time.
 - Give interactive timeshare threads real-time user-space priorities and
   place them on the realtime/ithd queue.
 - Select non-interactive timeshare thread priorities based on their cpu
   utilization over the last 10 seconds combined with the nice value.  This
   gives us more sane priorities and behavior in a loaded system as
   compared to the old method of using the interactivity score.  The
   interactive score quickly hit a ceiling if threads were non-interactive
   and penalized new hog threads.
 - Use one slice size for all threads.  The slice is not currently
   dynamically set to adjust scheduling behavior of different threads.
 - Add some new sysctls for scheduling parameters.

Bug fixes/Clean up:
 - Fix zeroing of td_sched after initialization in sched_fork_thread() caused
   by recent ksegrp removal.
 - Fix KSE interactivity issues related to frequent forking and exiting of
   kse threads.  We simply disable the penalty for thread creation and exit
   for kse threads.
 - Cleanup the cpu estimator by using tickincr here as well.  Keep ticks and
   ltick/ftick in the same frequency.  Previously ticks were stathz and
   others were hz.
 - Lots of new and updated comments.
 - Many many others.

Tested on:	up x86/amd64, 8way amd64.
2007-01-04 08:56:25 +00:00
Jeff Roberson
c02bbb43a0 - More search and replace prettying. 2006-12-29 12:55:32 +00:00
Jeff Roberson
d2ad694caa - Clean up a bit after the most recent KSE restructuring. 2006-12-29 10:37:07 +00:00
Julian Elischer
fc6c30f6c6 Changes to try fix sched_ule.c courtesy of David Xu. 2006-12-06 06:55:59 +00:00
Julian Elischer
ad1e7d285a Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
Maxim Konovalov
f645b5da88 o Fix a couple of obvious typos. 2006-11-08 09:09:07 +00:00
John Birrell
8460a577a4 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
David Xu
3db720fdce Add user priority loaning code to support priority propagation for
1:1 threading's POSIX priority mutexes, the code is no-op unless
priority-aware umtx code is committed.
2006-08-25 06:12:53 +00:00
David Xu
36ec198bd5 Add scheduler API sched_relinquish(), the API is used to implement
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish cpu, the ULE and CORE schedulers have two internal run-
queues, a timesharing thread which calls yield() syscall should be
moved to inactive queue.
2006-06-15 06:37:39 +00:00
David Xu
b41f1452d9 Add scheduler CORE, the work I have done half a year ago, recent,
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.

Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
   timeslice and interaction detecting algorithm are based
   on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.
2006-06-13 13:12:56 +00:00
David Xu
0ae716e5ee Make ke_rqindex unsigned. 2006-06-06 12:26:17 +00:00
David Xu
9f8eb3cb52 Use variable i instead of variable cpus as an index to get correct kseq. 2005-12-27 12:02:03 +00:00
David Xu
a1d4fe69d2 Fix a bug in slice calculation code, current code uses hz but
sched_clock() is called by state clock.

Submitted by: taku at tackymt dot homeip dot net
2005-12-19 08:26:09 +00:00
David Xu
a861574011 Temporarily disable nice threshold detection code, as it can starve
a thread holding critical resource, e.g mutex or other implicit
synchronous flags. Give thread which exceeds nice threshold a minimum
time slice.

PR: kern/86087
2005-09-22 01:19:37 +00:00
David Xu
f8ec133ed0 Move up code for testing KEF_HOLD to avoid ke_cpu being changed unexpectly
for PRI_ITHD and PRI_REALTIME threads.
2005-08-19 11:51:41 +00:00
David Xu
1278181c6c Try best to keep a preempted thread at front of run queue, this seems
improved performance a bit for some workloads, but still seeing interactive
lagging unless cpu idling race is fixed.
2005-08-08 14:20:10 +00:00
David Xu
3d16f519b6 If a thread was removed from system run queue, kse_assign shouldn't
add it again.
2005-07-31 15:11:21 +00:00
Xin LI
05a6b7ad62 Cast to uintptr_t when the compiler complains. This unbreaks ULE
scheduler breakage accompanied by the recent atomic_ptr() change.
2005-07-25 10:21:49 +00:00
Peter Wemm
4da0d332f4 Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit
being in opt_global.h and forcing a global recompile when only a few files
reference it.

Approved by:  re
2005-06-24 00:16:57 +00:00
Jeff Roberson
6680bbd529 - Fix the case where we're not preempting but there is already a newtd
as this happens via thread_switchout().  I don't particularly like the
   structure of the code here.  We twice call out to thread code when
   a thread is voluntarily switching.  Once to thread_switchout() and once
   to slot_fill(), while sched_4BSD does even more work which is redundant
   to select another thread to use our remaining slice.  This should be
   simplified in the future, but for now I'm only going to fix the bug not
   the bad design.
2005-06-07 02:59:16 +00:00
Jeff Roberson
9fe02f7e16 - It's 2005 already, I've been working on this for three years. 2005-06-04 09:24:15 +00:00
Jeff Roberson
21381d1b9e - Don't SLOT_USE() in the preempt case, sched_add() has already taken the
slot for us.  Previously, we would take two slots on every preempt, and
   setrunqueue() would fix it up for us in the non threaded case.  The
   threaded case was simply broken.
 - Clean up flags, prototypes, comments.
2005-06-04 09:23:28 +00:00
Joseph Koshy
ebccf1e3a6 Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by:	alc, jhb (kernel changes)
2005-04-19 04:01:25 +00:00