- Improve load long-term load balancer by always IPIing exactly once.
Previously the delay after rebalancing could cause problems with
uneven workloads.
- Allow nice to have a linear effect on the interactivity score. This
allows negatively niced programs to stay interactive longer. It may be
useful with very expensive Xorg servers under high loads. In general
it should not be necessary to alter the nice level to improve interactive
response. We may also want to consider never allowing positively niced
processes to become interactive at all.
- Initialize ccpu to 0 rather than 0.0. The decimal point was leftover
from when the code was copied from 4bsd. ccpu is 0 in ULE because ULE
only exports weighted cpu values.
Reported by: Steve Kargl (Load balancing problem)
Approved by: re
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.
Approved by: re
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.
Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)
on 2cpu machines by reducing it to 1 by default. This improves loaded
operation on 8cpu machines by increasing it to 3 where the extra idle
time is not as critical.
Approved by: re
tdq_group structure. Hyper-threaded cores won't really benefit from
seperate locks anyway.
- Seperate out the migration case from sched_switch to simplify the main
switch code. We only migrate here if called via sched_bind().
- When preempted place the preempted thread back in the same queue at
the head.
- Improve the cpu group and topology infrastructure.
Tested by: many on current@
Approved by: re
machines.
- Leave the long-term load balancer running by default once per second.
- Enable stealing load from the idle thread only when the remote processor
has more than two transferable tasks. Setting this to one further
improves buildworld. Setting it higher improves mysql.
- Remove the bogus pick_zero option. I had not intended to commit this.
- Entirely disallow migration for threads with SRQ_YIELDING set. This
balances out the extra migration allowed for with the load balancers.
It also makes pick_pri perform better as I had anticipated.
Tested by: Dmitry Morozovsky <marck@rinet.ru>
Approved by: re
properly. We have to temporarily unlock the TDQ lock so we can lock
the thread and add it to the run queue. This is used only for KSE.
- When we add a thread from the tdq_move() via sched_balance() we need to
ipi the target if it's sitting in the idle thread or it'll never run.
Reported by: Rene Landan
Approved by: re
been in development for over 6 months as SCHED_SMP.
- Implement one spin lock per thread-queue. Threads assigned to a
run-queue point to this lock via td_lock.
- Improve the facility for assigning threads to CPUs now that sched_lock
contention no longer dominates scheduling decisions on larger SMP
machines.
- Re-write idle time stealing in an attempt to make it less damaging to
general performance. This is still disabled by default. See
kern.sched.steal_idle.
- Call the long-term load balancer from a callout rather than sched_clock()
so there are no locks held. This is disabled by default. See
kern.sched.balance.
- Parameterize many scheduling decisions via sysctls. Try to document
these via sysctl descriptions.
- General structural and naming cleanups.
- Document each function with comments.
Tested by: current@ amd64, x86, UP, SMP.
Approved by: re
- In tdq_choose() only assert that a thread does not have too high a
priority (low value) for the queue we removed it from. This will catch
bugs in priority elevation. It's not a serious error for the thread
to have too low a priority as we don't change queues in this case as
an optimization.
Reported by: kris
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
unsigned char. Weirdly, casting the 1 constant to u_char still produces
a signed integer result that is then used in the % computation. This
avoids that mess all together and causes a 0 pri to turn into 255 % 64
as we expect.
Reported by: kkenn (about 4 times, thanks)
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permenent and never chenges for the life of the
system so it doesn't need locking.
- only collect timestamps when a lock is contested - this reduces the overhead
of collecting profiles from 20x to 5x
- remove unused function from subr_lock.c
- generalize cnt_hold and cnt_lock statistics to be kept for all locks
- NOTE: rwlock profiling generates invalid statistics (and most likely always has)
someone familiar with that should review
- Fix these types in ULE as well. This fixes bugs in priority index
calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.
Reported by: kkenn using pho's stress2.
minimize IPIs and rescheduling when scheduling like tasks while keeping
latency low for important threads.
1) An idle thread is running.
2) The current thread is worse than realtime and the new thread is
better than realtime. Realtime to realtime doesn't preempt.
3) The new thread's priority is less than the threshold.
support sched_4bsd.
- Rename the KTR level for non schedgraph parsed events. They take event
space from things we'd like to graph.
- Reset our slice value after we sleep. The slice is simply there to
prevent starvation among equal priorities. A thread which had almost
exhausted it's slice and then slept doesn't need to be rescheduled a
tick after it wakes up.
- Set the maximum slice value to a more conservative 100ms now that it is
more accurately enforced.
negative. Use unsigned integers for sleep and run time so this doesn't
disturb sched_interact_score(). This should fix the invalid interactive
priority panics reported by several users.
- Define our own maybe_preempt() as sched_preempt(). We want to be able
to preempt idlethread in all cases.
- Define our idlethread to require preemption to exit.
- Get the cpu estimation tick from sched_tick() so we don't have to worry
about errors from a sampling interval that differs from the time
domain. This was the source of sched_priority prints/panics and
inaccurate pctcpu display in top.
a power saving mode otherwise.
- If the thread is already bound in sched_bind() unbind it before
re-binding it to a new cpu. I don't like these semantics but they are
expected by some code in the tree. Patch by jkoshy.
the ipi settings. If NEEDRESCHED is set and an ipi is later delivered
it will clear it rather than cause extra context switches. However, if
we miss setting it we can have terrible latency.
- In sched_bind() correctly implement bind. Also be slightly more
tolerant of code which calls bind multiple times. However, we don't
change binding if another call is made with a different cpu. This
does not presently work with hwpmc which I believe should be changed.
- Switch back to direct modification of remote CPU run queues. This added
a lot of complexity with questionable gain. It's easy enough to
reimplement if it's shown to help on huge machines.
- Re-implement the old tdq_transfer() call as tdq_pickidle(). Change
sched_add() so we have selectable cpu choosers and simplify the logic
a bit here.
- Implement tdq_pickpri() as the new default cpu chooser. This algorithm
is similar to Solaris in that it tries to always run the threads with
the best priorities. It is actually slightly more complex than
solaris's algorithm because we also tend to favor the local cpu over
other cpus which has a boost in latency but also potentially enables
cache sharing between the waking thread and the woken thread.
- Add a bunch of tunables that can be used to measure effects of different
load balancing strategies. Most of these will go away once the
algorithm is more definite.
- Add a new mechanism to steal threads from busy cpus when we idle. This
is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The
threshold is the required length of a tdq's run queue before another
cpu will be able to steal runnable threads. This prevents most queue
imbalances that contribute the long latencies.
of max() when computing the divisor in SCHED_TICK_PRI(). This prevents
cases where rounding down would allow the quotient to exceed
SCHED_PRI_RANGE.
- Garbage collect some unused flags and fields.
- Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
duplicated this functionality.
- Re-enable the rebalancer by default and fix the sysctl so it can be
modified.
marked idle, thus breaking cpu load balancing.
- Change sched_interact_update() to fix cases where the stored history
has expanded significantly rather than handling them in the callers. This
fixes a case where sched_priority() could compute a bad value.
- Add a sysctl to disable the global load balancer for experimentation.
setting ftick = ltick = ticks in schedinit().
- Update the priority when we are pulled off of the run queue and when we
are inserted onto the run queue so that it more accurately reflects our
present status. This is important for efficient priority propagation
functioning.
- Move the frequency test into sched_pctcpu_update() so we don't repeat it
each time we'd like to call it.
- Put some temporary work-around code in sched_priority() in case the tick
mechanism produces a bad priority. Eventually this should revert to an
assert again.
the most recently chosen index. This significantly improves nice
behavior. This allows a lower priority thread to run some multiple of
times before the higher priority thread makes it to the front of
the queue. A nice +20 cpu hog now only gets ~5% of the cpu when running
with a nice 0 cpu hog and about 1.5% with a nice -20 hog. A nice
difference of 1 makes a 4% difference in cpu usage between two hogs.
- Track a seperate insert and removal index. When the removal index is
empty it is updated to point at the current insert index.
- Don't remove and re-add a thread to the runq when it is being adjusted
down in priority.
- Pull some conditional code out of sched_tick(). It's looking a bit
large now.
- Remove the double queue mechanism for timeshare threads. It was slow
due to excess cache lines in play, caused suboptimal scheduling behavior
with niced and other non-interactive processes, complicated priority
lending, etc.
- Use a circular queue with a floating starting index for timeshare threads.
Enforces fairness by moving the insertion point closer to threads with
worse priorities over time.
- Give interactive timeshare threads real-time user-space priorities and
place them on the realtime/ithd queue.
- Select non-interactive timeshare thread priorities based on their cpu
utilization over the last 10 seconds combined with the nice value. This
gives us more sane priorities and behavior in a loaded system as
compared to the old method of using the interactivity score. The
interactive score quickly hit a ceiling if threads were non-interactive
and penalized new hog threads.
- Use one slice size for all threads. The slice is not currently
dynamically set to adjust scheduling behavior of different threads.
- Add some new sysctls for scheduling parameters.
Bug fixes/Clean up:
- Fix zeroing of td_sched after initialization in sched_fork_thread() caused
by recent ksegrp removal.
- Fix KSE interactivity issues related to frequent forking and exiting of
kse threads. We simply disable the penalty for thread creation and exit
for kse threads.
- Cleanup the cpu estimator by using tickincr here as well. Keep ticks and
ltick/ftick in the same frequency. Previously ticks were stathz and
others were hz.
- Lots of new and updated comments.
- Many many others.
Tested on: up x86/amd64, 8way amd64.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.
Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by David Xu, and Dan Eischen using libthr and libpthread.
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish cpu, the ULE and CORE schedulers have two internal run-
queues, a timesharing thread which calls yield() syscall should be
moved to inactive queue.
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.
Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
timeslice and interaction detecting algorithm are based
on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.