Commit Graph

190 Commits

Jeff Roberson
52bc574cc7 - Handle the case where slptime == runtime.
Submitted by:	Antoine Brodin
2007-03-17 23:32:48 +00:00
Jeff Roberson
4499aff6ec - Cast the intermediate value in priority computation back down to
unsigned char.  Weirdly, casting the 1 constant to u_char still produces
   a signed integer result that is then used in the % computation.  This
   avoids that mess altogether and causes a 0 pri to turn into 255 % 64
   as we expect.

Reported by:	kkenn (about 4 times, thanks)
2007-03-17 18:13:32 +00:00
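
The integer-promotion behavior behind 4499aff6ec above can be seen in a standalone program (illustrative only, not the committed code):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned char pri = 0;

        /* Both operands are promoted to int, so the subtraction yields -1
         * even with the 1 constant cast to u_char, and -1 % 64 == -1. */
        printf("%d\n", (pri - (unsigned char)1) % 64);

        /* Casting the intermediate value back down to u_char gives 255,
         * and 255 % 64 == 63, the wrap-around the commit expects. */
        printf("%d\n", (unsigned char)(pri - 1) % 64);
        return (0);
    }
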
Julian Elischer
486a941418 Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permanent and never changes for the life of the
system so it doesn't need locking.
2007-03-08 06:44:34 +00:00
Kip Macy
fe68a91631 general LOCK_PROFILING cleanup
- only collect timestamps when a lock is contested - this reduces the overhead
  of collecting profiles from 20x to 5x

- remove unused function from subr_lock.c

- generalize cnt_hold and cnt_lock statistics to be kept for all locks

- NOTE: rwlock profiling generates invalid statistics (and most likely always has);
  someone familiar with that should review it
2007-02-26 08:26:44 +00:00
Jeff Roberson
ed0e8f2fe9 - Change types for recent runq additions to u_char rather than int.
- Fix these types in ULE as well.  This fixes bugs in priority index
   calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by:	kkenn using pho's stress2.
2007-02-08 01:52:25 +00:00
Jeff Roberson
fc3a97dcb7 - Implement much more intelligent ipi sending. This algorithm tries to
minimize IPIs and rescheduling when scheduling like tasks while keeping
   latency low for important threads.  We only IPI or preempt when one of the following holds:
   1) An idle thread is running.
   2) The current thread is worse than realtime and the new thread is
      better than realtime.  Realtime to realtime doesn't preempt.
   3) The new thread's priority is less than the threshold.
2007-01-25 23:51:59 +00:00
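
In outline, the decision described in fc3a97dcb7 above looks roughly like the sketch below; the function name, the realtime cutoff, and the threshold constant are placeholders, not the actual kern/sched_ule.c code:

    #define RT_PRI_MAX      100     /* placeholder realtime cutoff; lower is better */
    #define PREEMPT_THRESH  64      /* placeholder preemption threshold */

    /* Should the target cpu be interrupted for the newly runnable thread? */
    static int
    should_ipi_sketch(int cur_pri, int new_pri, int cpu_is_idle)
    {
        /* 1) An idle thread is running there: always worth waking it. */
        if (cpu_is_idle)
            return (1);
        /* 2) The current thread is worse than realtime and the new one is
         *    better than realtime; realtime to realtime does not preempt. */
        if (cur_pri > RT_PRI_MAX && new_pri <= RT_PRI_MAX)
            return (1);
        /* 3) The new thread's priority is under the threshold. */
        if (new_pri < PREEMPT_THRESH)
            return (1);
        return (0);
    }
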
Jeff Roberson
1461899028 - Get rid of the unused DIDRUN flag. This was really only present to
support sched_4bsd.
 - Rename the KTR level for non schedgraph parsed events.  They take event
   space from things we'd like to graph.
 - Reset our slice value after we sleep.  The slice is simply there to
   prevent starvation among equal priorities.  A thread which had almost
   exhausted its slice and then slept doesn't need to be rescheduled a
   tick after it wakes up.
 - Set the maximum slice value to a more conservative 100ms now that it is
   more accurately enforced.
2007-01-25 19:14:11 +00:00
Jeff Roberson
9a93305a2e - With a sleep time over 2097 seconds, hzticks and slptime could end up
negative.  Use unsigned integers for sleep and run time so this doesn't
   disturb sched_interact_score().  This should fix the invalid interactive
   priority panics reported by several users.
2007-01-24 18:18:43 +00:00
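
The 2097-second figure in 9a93305a2e above is consistent with a signed 32-bit tick count kept with a 10-bit fixed-point shift at hz = 1000 (values assumed here for illustration):

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        long long hz = 1000;    /* assumed timer frequency */
        long long shift = 10;   /* assumed fixed-point shift */
        long long secs = 2098;  /* just past the reported limit */
        long long hzticks = (secs * hz) << shift;

        /* 2,148,352,000 exceeds INT_MAX (2,147,483,647): stored in a signed
         * 32-bit int, hzticks would wrap negative; unsigned it would not. */
        printf("%lld > %d: %s\n", hzticks, INT_MAX,
            hzticks > INT_MAX ? "yes" : "no");
        return (0);
    }
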
Jeff Roberson
7a5e5e2a59 - Catch up to setrunqueue/choosethread/etc. api changes.
- Define our own maybe_preempt() as sched_preempt().  We want to be able
   to preempt idlethread in all cases.
 - Define our idlethread to require preemption to exit.
 - Get the cpu estimation tick from sched_tick() so we don't have to worry
   about errors from a sampling interval that differs from the time
   domain.  This was the source of sched_priority prints/panics and
   inaccurate pctcpu display in top.
2007-01-23 08:50:34 +00:00
Jeff Roberson
5cea64d54f - Disable the long-term load balancer. I believe that steal_busy works
better and gives more predictable results.
2007-01-20 21:24:05 +00:00
Jeff Roberson
c95d2db298 - We do need to IPI the idlethread on some systems. It may be stuck in
a power saving mode otherwise.
 - If the thread is already bound in sched_bind() unbind it before
   re-binding it to a new cpu.  I don't like these semantics but they are
   expected by some code in the tree.  Patch by jkoshy.
2007-01-20 17:03:33 +00:00
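
The rebinding behavior described in c95d2db298 above amounts to the pattern below; the struct and flag are stand-ins for illustration, not the real kernel structures:

    struct thread_sketch {
        int flags;
        int boundcpu;
    };
    #define TDF_BOUND_SKETCH  0x0001    /* placeholder "bound" flag */

    static void
    unbind_sketch(struct thread_sketch *td)
    {
        td->flags &= ~TDF_BOUND_SKETCH;
    }

    static void
    bind_sketch(struct thread_sketch *td, int cpu)
    {
        /* If the thread is already bound, drop the old binding first
         * instead of failing, as some in-tree callers expect. */
        if (td->flags & TDF_BOUND_SKETCH)
            unbind_sketch(td);
        td->flags |= TDF_BOUND_SKETCH;
        td->boundcpu = cpu;
        /* ...then migrate to the target cpu if we are not already on it. */
    }
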
Jeff Roberson
6b2f763f7c - In tdq_transfer() always set NEEDRESCHED when necessary regardless of
the ipi settings.  If NEEDRESCHED is set and an ipi is later delivered
   it will clear it rather than cause extra context switches.  However, if
   we miss setting it we can have terrible latency.
 - In sched_bind() correctly implement bind.  Also be slightly more
   tolerant of code which calls bind multiple times.  However, we don't
   change binding if another call is made with a different cpu.  This
   does not presently work with hwpmc which I believe should be changed.
2007-01-20 09:03:43 +00:00
Jeff Roberson
7b8bfa0de9 Major revamp of ULE's cpu load balancing:
- Switch back to direct modification of remote CPU run queues.  This added
   a lot of complexity with questionable gain.  It's easy enough to
   reimplement if it's shown to help on huge machines.
 - Re-implement the old tdq_transfer() call as tdq_pickidle().  Change
   sched_add() so we have selectable cpu choosers and simplify the logic
   a bit here.
 - Implement tdq_pickpri() as the new default cpu chooser.  This algorithm
   is similar to Solaris in that it tries to always run the threads with
   the best priorities.  It is actually slightly more complex than
   Solaris's algorithm because we also tend to favor the local cpu over
   other cpus, which adds a bit of latency but also potentially enables
   cache sharing between the waking thread and the woken thread.
 - Add a bunch of tunables that can be used to measure effects of different
   load balancing strategies.  Most of these will go away once the
   algorithm is more definite.
 - Add a new mechanism to steal threads from busy cpus when we idle.  This
   is enabled with kern.sched.steal_busy and kern.sched.busy_thresh.  The
   threshold is the required length of a tdq's run queue before another
   cpu will be able to steal runnable threads.  This prevents most queue
   imbalances that contribute to the long latencies.
2007-01-19 21:56:08 +00:00
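
The kern.sched.steal_busy/kern.sched.busy_thresh mechanism from 7b8bfa0de9 above boils down to a check like the one sketched here (placeholder types and default values; the real tdq code also handles locking and the queue manipulation):

    #include <stddef.h>

    struct tdq_sketch {
        int load;               /* runnable threads queued on this cpu */
    };

    static int steal_busy = 1;  /* kern.sched.steal_busy */
    static int busy_thresh = 4; /* kern.sched.busy_thresh (value assumed) */

    /* About to idle: find another cpu's queue long enough to steal from. */
    static struct tdq_sketch *
    find_busy_sketch(struct tdq_sketch *tdqs, int ncpus, int self)
    {
        int i;

        if (!steal_busy)
            return (NULL);
        for (i = 0; i < ncpus; i++) {
            if (i == self)
                continue;
            /* Only steal once the victim's run queue reaches the threshold,
             * which is what prevents most queue imbalances. */
            if (tdqs[i].load >= busy_thresh)
                return (&tdqs[i]);
        }
        return (NULL);
    }
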
Jeff Roberson
eddb4efacd - Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer
divide faults in roundup() later if it is able to return 0.  For some
   reason this bug only shows up on my laptop and not my testboxes.
2007-01-06 12:33:43 +00:00
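
The clamp in eddb4efacd above can be pictured as the macro below (a sketch of the idea, not the exact SCHED_TICK_TOTAL() definition):

    /* Never report a tick window smaller than one second's worth of ticks,
     * so divisors later derived from it (e.g. via roundup()) cannot be 0. */
    #define TICK_TOTAL_SKETCH(ltick, ftick, hz) \
        (((ltick) - (ftick)) > (hz) ? ((ltick) - (ftick)) : (hz))
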
Jeff Roberson
1e516cf534 - Fix the sched_priority() invalid priority bugs. Use roundup() instead
of max() when computing the divisor in SCHED_TICK_PRI().  This prevents
   cases where rounding down would allow the quotient to exceed
   SCHED_PRI_RANGE.
 - Garbage collect some unused flags and fields.
 - Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
   duplicated this functionality.
 - Re-enable the rebalancer by default and fix the sysctl so it can be
   modified.
2007-01-06 08:44:13 +00:00
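
The roundup()-versus-max() point in 1e516cf534 above is easiest to see with numbers (chosen here purely for illustration):

    #include <stdio.h>

    #define ROUNDUP(x, y)  ((((x) + ((y) - 1)) / (y)) * (y))

    int
    main(void)
    {
        int total = 1000, range = 64, ticks = 1000;  /* illustrative values */

        /* Flooring the divisor (1000 / 64 == 15) lets the quotient reach 66,
         * which is past the end of a 64-entry priority range... */
        printf("floored divisor:    %d\n", ticks / (total / range));
        /* ...while rounding the total up to a multiple of the range first
         * (1024 / 64 == 16) keeps the quotient at 62, inside the range. */
        printf("rounded-up divisor: %d\n", ticks / (ROUNDUP(total, range) / range));
        return (0);
    }
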
Jeff Roberson
9330bbbb61 - Don't IPI unless we're going to interrupt something exiting in the kernel.
Otherwise we can afford the latency.  This makes a significant performance
   improvement.
2007-01-06 02:34:23 +00:00
Jeff Roberson
155b6ca12b - Fix a comparison in sched_choose() that caused cpus to be constantly
marked idle, thus breaking cpu load balancing.
 - Change sched_interact_update() to fix cases where the stored history
   has expanded significantly rather than handling them in the callers.  This
   fixes a case where sched_priority() could compute a bad value.
 - Add a sysctl to disable the global load balancer for experimentation.
2007-01-05 23:45:38 +00:00
Jeff Roberson
8ab80cf009 - ftick was initialized to -1 for init and any of its children. Fix this by
setting ftick = ltick = ticks in schedinit().
 - Update the priority when we are pulled off of the run queue and when we
   are inserted onto the run queue so that it more accurately reflects our
   present status.  This is important for efficient priority propagation
   functioning.
 - Move the frequency test into sched_pctcpu_update() so we don't repeat it
   each time we'd like to call it.
 - Put some temporary work-around code in sched_priority() in case the tick
   mechanism produces a bad priority.  Eventually this should revert to an
   assert again.
2007-01-05 08:50:38 +00:00
Jeff Roberson
3f872f85d2 - Only allow the tdq_idx to increase by one each tick rather than up to
the most recently chosen index.  This significantly improves nice
   behavior.  This allows a lower priority thread to run some multiple of
   times before the higher priority thread makes it to the front of
   the queue.  A nice +20 cpu hog now only gets ~5% of the cpu when running
   with a nice 0 cpu hog and about 1.5% with a nice -20 hog.  A nice
   difference of 1 makes a 4% difference in cpu usage between two hogs.
 - Track a separate insert and removal index.  When the removal index is
   empty it is updated to point at the current insert index.
 - Don't remove and re-add a thread to the runq when it is being adjusted
   down in priority.
 - Pull some conditional code out of sched_tick().  It's looking a bit
   large now.
2007-01-04 12:16:19 +00:00
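
The insert/removal index handling described in 3f872f85d2 above can be sketched as follows (placeholder names; the real run-queue plumbing in sched_ule.c is more involved):

    #define NQS_SKETCH  64      /* placeholder number of timeshare queue slots */

    struct tsq_sketch {
        int idx;                /* insert index, advanced by the clock */
        int ridx;               /* removal index, trails the insert index */
        int qlen[NQS_SKETCH];   /* threads queued at each slot */
    };

    /* Once per tick: the insert index moves forward by at most one slot,
     * rather than jumping to the most recently chosen index. */
    static void
    clock_advance_sketch(struct tsq_sketch *q)
    {
        q->idx = (q->idx + 1) % NQS_SKETCH;
    }

    /* When the removal slot drains, it catches up to the insert index. */
    static void
    removal_advance_sketch(struct tsq_sketch *q)
    {
        if (q->qlen[q->ridx] == 0)
            q->ridx = q->idx;
    }
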
Jeff Roberson
e7d50326de ULE 2.0:
- Remove the double queue mechanism for timeshare threads.  It was slow
   due to excess cache lines in play, caused suboptimal scheduling behavior
   with niced and other non-interactive processes, complicated priority
   lending, etc.
 - Use a circular queue with a floating starting index for timeshare threads.
   Enforces fairness by moving the insertion point closer to threads with
   worse priorities over time.
 - Give interactive timeshare threads real-time user-space priorities and
   place them on the realtime/ithd queue.
 - Select non-interactive timeshare thread priorities based on their cpu
   utilization over the last 10 seconds combined with the nice value.  This
   gives us more sane priorities and behavior in a loaded system as
   compared to the old method of using the interactivity score.  The
   interactive score quickly hit a ceiling if threads were non-interactive
   and penalized new hog threads.
 - Use one slice size for all threads.  The slice is not currently
   dynamically set to adjust scheduling behavior of different threads.
 - Add some new sysctls for scheduling parameters.

Bug fixes/Clean up:
 - Fix zeroing of td_sched after initialization in sched_fork_thread() caused
   by recent ksegrp removal.
 - Fix KSE interactivity issues related to frequent forking and exiting of
   kse threads.  We simply disable the penalty for thread creation and exit
   for kse threads.
 - Cleanup the cpu estimator by using tickincr here as well.  Keep ticks and
   ltick/ftick in the same frequency.  Previously ticks were stathz and
   others were hz.
 - Lots of new and updated comments.
 - Many many others.

Tested on:	up x86/amd64, 8way amd64.
2007-01-04 08:56:25 +00:00
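
For the non-interactive timeshare priorities described in e7d50326de above, the shape of the computation is roughly the sketch below; the constants and scaling are assumptions for illustration, not ULE's actual macros:

    #define TS_PRI_MIN_SKETCH  160  /* placeholder start of the timeshare range */
    #define TS_PRI_RANGE       64   /* placeholder size of the timeshare range */

    /* pctcpu: cpu utilization over the last 10 seconds, 0..100. */
    static int
    timeshare_pri_sketch(int pctcpu, int nice)
    {
        int pri;

        /* Busier threads and higher nice values drift toward worse
         * (numerically larger) priorities. */
        pri = TS_PRI_MIN_SKETCH + (pctcpu * (TS_PRI_RANGE - 1)) / 100 + nice;
        if (pri < TS_PRI_MIN_SKETCH)
            pri = TS_PRI_MIN_SKETCH;
        if (pri > TS_PRI_MIN_SKETCH + TS_PRI_RANGE - 1)
            pri = TS_PRI_MIN_SKETCH + TS_PRI_RANGE - 1;
        return (pri);
    }
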
Jeff Roberson
c02bbb43a0 - More search and replace prettying. 2006-12-29 12:55:32 +00:00
Jeff Roberson
d2ad694caa - Clean up a bit after the most recent KSE restructuring. 2006-12-29 10:37:07 +00:00
Julian Elischer
fc6c30f6c6 Changes to try to fix sched_ule.c, courtesy of David Xu. 2006-12-06 06:55:59 +00:00
Julian Elischer
ad1e7d285a Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference to the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
Maxim Konovalov
f645b5da88 o Fix a couple of obvious typos. 2006-11-08 09:09:07 +00:00
John Birrell
8460a577a4 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
David Xu
3db720fdce Add user priority loaning code to support priority propagation for
1:1 threading's POSIX priority mutexes; the code is a no-op unless
priority-aware umtx code is committed.
2006-08-25 06:12:53 +00:00
David Xu
36ec198bd5 Add scheduler API sched_relinquish(); the API is used to implement
the yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish the cpu; the ULE and CORE schedulers have two internal run-
queues, and a timesharing thread which calls the yield() syscall should be
moved to the inactive queue.
2006-06-15 06:37:39 +00:00
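
In outline, the behavior described for sched_relinquish() above is the following (placeholder types; the real function also handles locking and the actual context switch):

    enum runq_sketch { RUNQ_ACTIVE, RUNQ_INACTIVE };

    struct yielding_thread_sketch {
        int timesharing;            /* 1 if in the timesharing class */
        enum runq_sketch queue;
    };

    static void
    relinquish_sketch(struct yielding_thread_sketch *td)
    {
        /* A yielding timesharing thread is moved to the inactive queue, so
         * everything on the active queue runs before it gets the cpu back. */
        if (td->timesharing)
            td->queue = RUNQ_INACTIVE;
        /* ...requeue the thread and switch away... */
    }
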
David Xu
b41f1452d9 Add scheduler CORE, work I did half a year ago and recently
picked up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different from ULE's; it comes from the paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use the same word
"score" for the priority boost as in the ULE scheduler.

Briefly, the scheduler has the following characteristics:
1. A timesharing process's nice value is seriously respected;
   the timeslice and interactivity-detection algorithm are based
   on the nice value.
2. Per-cpu scheduling queues and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in the wakeup path.
5. Support for POSIX SCHED_FIFO and SCHED_RR.
Unlike the 4BSD and ULE schedulers, which use the fuzzy RQ_PPQ, this
scheduler uses 256 priority queues. Unlike ULE, which uses both pull and
push, this scheduler uses only the pull method; the main reason is to let a
relatively idle cpu do the work. Currently, however, the whole scheduler is
protected by the big sched_lock, so the benefit is not visible; it can
actually be worse than nothing because all other cpus are locked out while
we are doing the balancing work, a problem the 4BSD scheduler does not have.
The scheduler does not support hyperthreading very well; in fact, it does
not distinguish between physical and logical CPUs, and this should be
improved in the future. The scheduler has a priority inversion problem on
MP machines; it is not good for realtime scheduling and can cause realtime
processes to starve.
As a result, MySQL super-smack seems to run better on my Pentium-D machine
when using libthr, whether on a UP or an SMP kernel.
2006-06-13 13:12:56 +00:00
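
"O(1) scheduling" with 256 priority queues is usually done with a bitmap of non-empty queues plus a find-first-set; the generic sketch below illustrates the idea and is not the CORE scheduler's code:

    #include <strings.h>    /* ffs() */

    #define NQUEUES_SKETCH  256

    struct rq_sketch {
        unsigned int bits[NQUEUES_SKETCH / 32];  /* one bit per non-empty queue */
    };

    /* Return the best (lowest-numbered) non-empty queue, or -1 if none:
     * a fixed number of word scans regardless of how many threads exist. */
    static int
    rq_findbest_sketch(struct rq_sketch *rq)
    {
        int word, bit;

        for (word = 0; word < NQUEUES_SKETCH / 32; word++) {
            bit = ffs((int)rq->bits[word]);
            if (bit != 0)
                return (word * 32 + bit - 1);
        }
        return (-1);
    }
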
David Xu
0ae716e5ee Make ke_rqindex unsigned. 2006-06-06 12:26:17 +00:00
David Xu
9f8eb3cb52 Use variable i instead of variable cpus as an index to get the correct kseq. 2005-12-27 12:02:03 +00:00
David Xu
a1d4fe69d2 Fix a bug in the slice calculation code: the current code uses hz but
sched_clock() is called by the stat clock.

Submitted by: taku at tackymt dot homeip dot net
2005-12-19 08:26:09 +00:00
David Xu
a861574011 Temporarily disable the nice threshold detection code, as it can starve
a thread holding a critical resource, e.g. a mutex or other implicit
synchronization flags. Give a thread which exceeds the nice threshold a
minimum time slice.

PR: kern/86087
2005-09-22 01:19:37 +00:00
David Xu
f8ec133ed0 Move up code for testing KEF_HOLD to avoid ke_cpu being changed unexpectedly
for PRI_ITHD and PRI_REALTIME threads.
2005-08-19 11:51:41 +00:00
David Xu
1278181c6c Try our best to keep a preempted thread at the front of the run queue; this
seems to improve performance a bit for some workloads, but we are still seeing
interactivity lag unless the cpu idling race is fixed.
2005-08-08 14:20:10 +00:00
David Xu
3d16f519b6 If a thread was removed from the system run queue, kse_assign shouldn't
add it again.
2005-07-31 15:11:21 +00:00
Xin LI
05a6b7ad62 Cast to uintptr_t when the compiler complains.  This fixes the ULE
scheduler breakage introduced by the recent atomic_ptr() change.
2005-07-25 10:21:49 +00:00
Peter Wemm
4da0d332f4 Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit
being in opt_global.h and forcing a global recompile when only a few files
reference it.

Approved by:  re
2005-06-24 00:16:57 +00:00
Jeff Roberson
6680bbd529 - Fix the case where we're not preempting but there is already a newtd
as this happens via thread_switchout().  I don't particularly like the
   structure of the code here.  We twice call out to thread code when
   a thread is voluntarily switching.  Once to thread_switchout() and once
   to slot_fill(), while sched_4BSD does even more work which is redundant
   to select another thread to use our remaining slice.  This should be
   simplified in the future, but for now I'm only going to fix the bug not
   the bad design.
2005-06-07 02:59:16 +00:00
Jeff Roberson
9fe02f7e16 - It's 2005 already, I've been working on this for three years. 2005-06-04 09:24:15 +00:00
Jeff Roberson
21381d1b9e - Don't SLOT_USE() in the preempt case, sched_add() has already taken the
slot for us.  Previously, we would take two slots on every preempt, and
   setrunqueue() would fix it up for us in the non threaded case.  The
   threaded case was simply broken.
 - Clean up flags, prototypes, comments.
2005-06-04 09:23:28 +00:00
Joseph Koshy
ebccf1e3a6 Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by:	alc, jhb (kernel changes)
2005-04-19 04:01:25 +00:00
Stephan Uphoff
779186434a Sprinkle some volatile magic and rearrange things a bit to avoid race
conditions in critical_exit now that it no longer blocks interrupts.

Reviewed by:	jhb
2005-04-08 03:37:53 +00:00
Jeff Roberson
7a9507b60e - A test in sched_switch() is no longer necessary and it is incorrect
when td0 is preempted before it voluntarily switches.

Discovered by:	Arjan Van Leeuwen <avleeuwen@gmail.com>
2005-02-23 00:50:26 +00:00
Jeff Roberson
42a29039de - Add ke_runq == NULL to the conditions which will cause us to abort
adjusting timeshare loads in sched_class().  This is only important if
   the thread has never run, otherwise the state checks should work as
   expected.
2005-02-04 17:22:46 +00:00
John Baldwin
50aaa791ba Fix a typo and two whitespace nits. 2004-12-30 22:17:00 +00:00
John Baldwin
f5c157d986 Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
  sched_lend_prio() and sched_unlend_prio().  The turnstile code uses these
  functions to ask the scheduler to lend a thread a set priority and to
  tell the scheduler when it thinks it is ok for a thread to stop borrowing
  priority.  The unlend case is slightly complex in that the turnstile code
  tells the scheduler what the minimum priority of the thread needs to be
  to satisfy the requirements of any other threads blocked on locks owned
  by the thread in question.  The scheduler then decides whether the thread
  can go back to normal mode (if its normal priority is high enough to
  satisfy the pending lock requests) or if it should continue to use the
  priority specified to the sched_unlend_prio() call.  This involves adding
  a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
  for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
  borrowing another thread's priority.
- If a scheduler changes the priority of a thread that is currently sitting
  on a turnstile, it will call a new function turnstile_adjust() to inform
  the turnstile code of the change.  This function resorts the thread on
  the priority list of the turnstile if needed, and if the thread ends up
  at the head of the list (due to having the highest priority) and its
  priority was raised, then it will propagate that new priority to the
  owner of the lock it is blocked on.

Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
  of its associated kse group changes has been consolidated into a new static
  function resetpriority_thread().  One change to this function is that
  it will now only adjust the priority of a thread if it already has a
  time sharing priority, thus preserving any boosts from a tsleep() until
  the thread returns to userland.  Also, resetpriority() no longer calls
  maybe_resched() on each thread in the group. Instead, the code calling
  resetpriority() is responsible for calling resetpriority_thread() on
  any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
  sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
  directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
  group priority has been adjusted.

Discussed with:	bde
Reviewed by:	ups, jeffr
Tested on:	4bsd, ule
Tested on:	i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
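
A compressed sketch of the lend/unlend contract described above; only the contract itself, the TDF_BORROWING flag, and the sched_lend_prio()/sched_unlend_prio() entry points come from the commit, while the types, names, and bodies here are illustrative (in FreeBSD a numerically lower priority is better):

    struct lend_thread_sketch {
        int flags;
        int priority;       /* current (possibly borrowed) priority */
        int base_pri;       /* normal, unborrowed priority */
    };
    #define TDF_BORROWING_SKETCH  0x0001

    static void
    set_prio_sketch(struct lend_thread_sketch *td, int prio)
    {
        td->priority = prio;    /* stand-in for the scheduler's internal setter */
    }

    /* Turnstile code asks the scheduler to run the thread at a lent priority. */
    static void
    lend_prio_sketch(struct lend_thread_sketch *td, int prio)
    {
        td->flags |= TDF_BORROWING_SKETCH;
        set_prio_sketch(td, prio);
    }

    /* Turnstile code says lending may stop; 'prio' is the minimum priority
     * still required by threads blocked on locks this thread owns. */
    static void
    unlend_prio_sketch(struct lend_thread_sketch *td, int prio)
    {
        if (td->base_pri <= prio) {
            /* The normal priority already satisfies the waiters. */
            td->flags &= ~TDF_BORROWING_SKETCH;
            set_prio_sketch(td, td->base_pri);
        } else {
            /* Keep borrowing, at the priority the turnstile asked for. */
            lend_prio_sketch(td, prio);
        }
    }
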
Jeff Roberson
2ebf8eb132 - Unintentionally checked in a debugging panic. Remove that. 2004-12-26 23:21:48 +00:00
Jeff Roberson
598b368d6c - Fix a long standing problem where an ithread would not honor sched_pin().
- Remove the sched_add wrapper that used sched_add_internal() as a backend.
   Its only purpose was to interpret one flag and turn it into an int.  Do
   the right thing and interpret the flag in sched_add() instead.
 - Pass the flag argument from sched_add() through to kseq_runq_add() so that we can
   get the SRQ_PREEMPT optimization too.
 - Add a KEF_INTERNAL flag.  If KEF_INTERNAL is set we don't adjust the SLOT
   counts, otherwise the slot counts are adjusted as soon as we enter
   sched_add() or sched_rem() rather than when the thread is actually placed
   on the run queue.  This greatly simplifies the handling of slots.
 - Remove the explicit prevention of migration for ithreads on non-x86
   platforms.  This was never shown to have any real benefit.
 - Remove the unused class argument to KSE_CAN_MIGRATE().
 - Add ktr points for thread migration events.
 - Fix a long standing bug on platforms which don't initialize the cpu
   topology.  The ksg_maxid variable was never correctly set on these
   platforms which caused the long term load balancer to never inspect
   more than the first group or processor.
 - Fix another bug which prevented the long term load balancer from working
   properly.  If stathz != hz we can't expect sched_clock() to be called
   on the exact tick count that we're anticipating.
 - Rearrange sched_switch() a bit to reduce indentation levels.
2004-12-26 22:56:08 +00:00
Jeff Roberson
81d47d3f4b - Remove earlier KTR_ULE tracepoints.
- Define new KTR_SCHED points so that we can graph the operation of the
   scheduler.
2004-12-26 00:15:33 +00:00