itself, which sparc64 hardware doesn't support. One way to solve this
would be to directly call sched_preempt() instead of issuing a self-IPI.
However, quoting jhb@:
"On the other hand, you can probably just skip the IPI entirely if we are
going to send it to the current CPU. Presumably, once this routine
finishes, the current CPU will exit softlock (or will do so "soon") and
will then pick the next thread to run based on the adjustments made in
this routine, so there's no need to IPI the CPU running this routine
anyway. I think this is the better solution. Right now what is probably
happening on other platforms is as soon as this routine finishes the CPU
processes its self-IPI and causes mi_switch() which will just switch back
to the softclock thread it is already running."
- With r226054 and the the above change in place, sparc64 now no longer is
incompatible with ULE and vice versa. However, powerpc/E500 still is.
Submitted by: jhb
Reviewed by: jeff
pollution. That is a step further in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.
Reported by: jhb
Reviewed and tested by: pluknet
Approved by: re (kib)
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.
One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.
As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.
Reviewed by: jhb, jeff
MFC after: 1 month
- Move the realtime priority range up above kernel sleep priorities and
just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
the timesharing priority band can be increased. The new timeshare range
is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
interactive threads. Instead, the larger timeshare range is now split
into separate subranges for interactive and non-interactive ("batch")
threads. The end result is that interactive threads and non-interactive
threads still use the same priority ranges as before, but realtime
threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
or via cv_broadcastpri(). Realtime and idle priority threads will
no longer have their priorities affected by sleeping in the kernel.
Reviewed by: jeff
interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME
and PRI_*_TIMESHARE. No functional change.
Reviewed by: jeff
thread and proc have been copied and zeroed from the old thread and
proc. Otherwise attempts to modify thread or process data in sched_fork()
could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
sched_fork_thread() in ULE. This is already done courtesy the bcopy()
of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
new thread's base priority (td_base_pri) to avoid bogusly inheriting a
borrowed priority from the parent thread.
MFC after: 2 weeks
use sched_lend_user_prio to set lent priority.
- Improve pthread priority-inherit mutex, when a contender's priority is
lowered, repropagete priorities, this may cause mutex owner's priority
to be lowerd, in old code, mutex owner's priority is rise-only.
It is possible a lower priority thread lending priority to higher priority
thread, in old code, it is ignored, however the lending should always be
recorded, add field td_lend_user_pri to fix the problem, if a thread does
not have borrowed priority, its value is PRI_MAX.
MFC after: 1 week
This is just a cosmetic change for prettier output.
'indent' variable/parameter serves two purposes: it specifies whitespace
indentation level and also implies cpu group level/depth.
It would have been better to split those two uses,
but for now just a simple change.
MFC after: 1 week
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU. Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.
KASSERT in sched_bind() that the thread is not yet pinned to a CPU.
KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.
sched_affinity code came from jhb@.
MFC after: 2 weeks
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
am now able to run 32 cores ok.. but I still will hang
on buildworld with a NFS problem. I suspect I am missing
a patch for the netlogic rge driver.
JC check and see if I am missing anything except your
core-mask changes
Obtained from: JC
In the case of the thread being on a sleepqueue or a turnstile, the
sched_lock was acquired (without the aid of the td_lock interface) and
the td_lock was dropped. This was going to break locking rules on other
threads willing to access to the thread (via the td_lock interface) and
modify his flags (allowed as long as the container lock was different
by the one used in sched_switch).
In order to prevent this situation, while sched_lock is acquired there
the td_lock gets blocked. [0]
- Merge the ULE's internal function thread_block_switch() into the global
thread_lock_block() and make the former semantic as the default for
thread_lock_block(). This means that thread_lock_block() will not
disable interrupts when called (and consequently thread_unlock_block()
will not re-enabled them when called). This should be done manually
when necessary.
Note, however, that ULE's thread_unblock_switch() is not reaped
because it does reflect a difference in semantic due in ULE (the
td_lock may not be necessarilly still blocked_lock when calling this).
While asymmetric, it does describe a remarkable difference in semantic
that is good to keep in mind.
[0] Reported by: Kohji Okuno
<okuno dot kohji at jp dot panasonic dot com>
Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
MFC: 2 weeks
or equial then PSOCK, not less or equial. Higher priority has lesser
numerical value.
Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.
Tested by: pho
Reviewed by: jhb
MFC after: 1 week
of the last tick we incremented on.
Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by: jeff (who thinks there should be a better way in the future)
Approved by: gnn (mentor)
MFC after: 3 weeks
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.
Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
- In 8.x and above the run-queue locks are nomore shared even in the
HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
even with different locks, with the contribution of tdq_lock_pair().
An explanation is here:
(hypotesis: a thread needs to migrate on another CPU, thread1 is doing
sched_switch_migrate() and thread2 is the one handling the sched_switch()
request or in other words, thread1 is the thread that needs to migrate
and thread2 is a thread that is going to be preempted, most likely an
idle thread. Also, 'old' is referred to the context (in terms of
run-queue and CPU) thread1 is leaving and 'new' is referred to the
context thread1 is going into. Finally, thread3 is doing tdq_idletd()
or sched_balance() and definitively doing tdq_lock_pair())
* thread1 blocks its td_lock. Now td_lock is 'blocked'
* thread1 drops its old runqueue lock
* thread1 acquires the new runqueue lock
* thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
through tdq_notify() to the new CPU
* thread1 drops the new lock
* thread3, scanning the runqueues, locks the old lock
* thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
pointing to the new runqueue
* thread3 wants to acquire the new runqueue lock, but it can't because
it is held by thread2 so it spins
* thread1 wants to acquire old lock, but as long as it is held by
thread3 it can't
* thread2 going further, at some point wants to switchin in thread1,
but it will wait forever because thread1->td_lock is in blocked state
This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.
Reviewed by: jeff
Reported by: des, others
Tested by: des, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
(STABLE_7 based version)
and it only optimized out an ipi or mwait in very few cases.
- Skip the adaptive idle code when running on SMT or HTT cores. This
just wastes cpu time that could be used on a busy thread on the same
core.
- Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
CG_FLAG_THREAD to mean SMT or HTT.
Sponsored by: Nokia
is calculated as 0 which causes errors elsewhere.
Submitted by: KOIE Hidetaka <koie@suri.co.jp>
- When sched_affinity() is called with a thread that is not curthread we
need to handle the ON_RUNQ() case by adding the thread to the correct
run queue.
Submitted by: Justin Teller <justin.teller@gmail.com>
MFC after: 1 Week
sizeof("MAXCPU") being used to calculate a string length rather than
something more reasonable such as sizeof("32"). This shouldn't have
caused any ill effect until we run on machines with 1000000 or more
cpus.
with src/tools/sched/schedgraph.py. This allows developers to quickly
create a graphical view of ktr data for any resource in the system.
- Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
identifying records associated with a thread or cpu.
- Reimplement the KTR_SCHED traces using the new generic facility.
Obtained from: attilio
Discussed with: jhb
Sponsored by: Nokia
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.
Approved by: gnn (mentor) (older version)
Pointed out by: rdivacky
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
sched_tick() to prevent multiple increments for one tick. This pushes
the value out of range and breaks priority calculation.
Reviewed by: kib
Found by: pho/nokia
Sponsored by: Nokia
MFC after: 3 days
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.
Sponsored by: Nokia
two ticks by counting the number of switches and the load when
sched_clock() is called.
- If the busy metric exceeds a threshold allow the idle thread to spin
waiting for new work for a brief period to avoid using IPIs. This
reduces the cost on the sender and receiver as well as reducing wakeup
latency considerably when it works.
Sponsored by: Nokia
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.
Sponsored by: Nokia
o Implement IPI_PREEMPT,
o Set td_lock for the thread being switched out,
o For ULE & SMP, loop while td_lock points to blocked_lock for
the thread being switched in,
o Enable ULE by default in GENERIC and SKI,
fixed pri boost with '1' or any priority less than the current thread's
priority with a value greater than two. Default the boost to
PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
threads in the kernel. This prevents these user-threads from also
being scheduled as if they are high fixed-priority kernel threads.
- Restore the setting of lowpri in tdq_choose(). It has to be either here
or in sched_switch(). I accidentally removed it from both places.
Tested by: kris
do this either. Simply check P_NOLOAD. It'd be nice if this was
in a thread flag so we didn't have an extra cache miss every time we
add and remove a thread from the run-queue.
rqindex back in struct thread.
- Compile kern_switch.c independently again and stop #include'ing it from
schedulers.
- Remove the ts_thread backpointers and convert most code to go from
struct thread to struct td_sched.
- Cleanup the ts_flags #define garbage that was causing us to sometimes
do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
- Export the kern.sched sysctl node in sysctl.h
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.
Reviewed by: jhb, peter
tdq_runq_add to select the runq rather than hoping we set it properly
when we adjusted the priority. This involves the same number of
branches as before so should perform identically without the extra
fragility.
Tested by: bz
Reviewed by: bz
- Only calculate timeshare priorities once per tick or when a thread is woken
from sleeping.
- Keep the ts_runq pointer valid after all priority changes.
- Call tdq_runq_add() directly from sched_switch() without passing in via
tdq_add(). We don't need to adjust loads or runqs anymore.
- Sort tdq and ts_sched according to utilization to improve cache behavior.
Sponsored by: Nokia
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
so the logical is identical and not repeated between tdq_notify() and
sched_setpreempt().
- In tdq_notify() don't set NEEDRESCHED as we may not actually own the thread lock
this could have caused us to lose td_flags settings.
- Garbage collect some tunables that are no longer relevant.
- When searching for affinity search backwards in the tree from the last
cpu we ran on while the thread still has affinity for the group. This
can take advantage of knowledge of shared L2 or L3 caches among a
group of cores.
- When searching for the least loaded cpu find the least loaded cpu via
the least loaded path through the tree. This load balances system bus
links, individual cache levels, and hyper-threaded/SMT cores.
- Make the periodic balancer recursively balance the highest and lowest
loaded cpu across each link.
Add support for cpusets:
- Convert the cpuset to a simple native cpumask_t while the kernel still
only supports cpumask.
- Pass the derived cpumask down through the cpu_search functions to
restrict the result cpus.
- Make the various steal functions resilient to failure since all threads
can not run on all cpus any longer.
General improvements:
- Precisely track the lowest priority thread on every runq with
tdq_setlowpri(). Before it was more advisory but this ended up having
pathological behaviors.
- Remove many #ifdef SMP conditions to simplify the code.
- Get rid of the old cumbersome tdq_group. This is more naturally
expressed via the cpu_group tree.
Sponsored by: Nokia
Testing by: kris
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a seperate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.
Sponsored by: Nokia
a run-queue. If the priority is numerically raised only change lowpri
if we're certain it will be correct. Some slop is allowed however
previously we could erroneously raise lowpri for an idle cpu that a
thread had recently run on which lead to errors in load balancing
decisions.
lowest priority on the queue for the current cpu vs curthread's
priority. In the case that curthread is waking up many threads of a
lower priority as would happen with a turnstile_broadcast() or wakeup()
of many threads this prevents them from all ending up on the current cpu.
- In sched_add() make the relationship between a scheduled ithread and
the current cpu advisory rather than strict. Only give the ithread
affinity for the current cpu if it's actually being scheduled from
a hardware interrupt. This prevents it from migrating when it simply
blocks on a lock.
Sponsored by: Nokia
the ABI when enabled. There is no longer an embedded lock_profile_object
in each lock. Instead a list of lock_profile_objects is kept per-thread
for each lock it may own. The cnt_hold statistic is now always 0 to
facilitate this.
- Support shared locking by tracking individual lock instances and
statistics in the per-thread per-instance lock_profile_object.
- Make the lock profiling hash table a per-cpu singly linked list with a
per-cpu static lock_prof allocator. This removes the need for an array
of spinlocks and reduces cache contention between cores.
- Use a seperate hash for spinlocks and other locks so that only a
critical_enter() is required and not a spinlock_enter() to modify the
per-cpu tables.
- Count time spent spinning in the lock statistics.
- Remove the LOCK_PROFILE_SHARED option as it is always supported now.
- Specifically drop and release the scheduler locks in both schedulers
since we track owners now.
In collaboration with: Kip Macy
Sponsored by: Nokia
opposed to what process. Since threads by default have teh name of the
process unless over-written with more useful information, just print the
thread name instead.
kern/sched_ule.c - Add __powerpc__ to the list of supported architectures
powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE
powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct
powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
updating the old thread's lock. Note: uniprocessor-only, will require
modification for MP support.
powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.
Reviewed by: marcel, jeffr (slightly older version)
fixes a bug on UP machines with SMP kernels where the idle thread
constantly switches after trying to steal work from the local cpu.
- Make the idle stealing code more robust against self selection.
- Prefer to steal from the cpu with the highest load that has at least one
transferable thread. Before we selected the cpu with the highest
transferable count which excludes bound threads.
Collaborated with: csjp
Approved by: re
to simply switch rather than lowering priority and switching. This allows
threads of equal priority to run but not lesser priority.
Discussed with: davidxu
Reported by: NIIMI Satoshi <sa2c@sa2c.net>
Approved by: re
after the switch leads to a race where the outgoing thread still owns
the local queue lock while another cpu may switch it in. This race
is only possible on machines where cpu_switch can take significantly
longer on different cpus which in practice means HTT machines with
unfair thread scheduling algorithms.
Found by: kris (of course)
Approved by: re
starvation caused by unbalanced interrupt loads.
- Change the rebalancer to work on stathz ticks but retain randomization.
- Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
complex sequences of locks to avoid deadlock.
Reported by: kris
Approved by: re
value for kern.sched.preempt_thresh appropriately. It can still by
adjusted at runtime. ULE will still use IPI_PREEMPT in certain
migration situations.
- Assert that we're not trying to compile ULE on an unsupported
architecture. To date, I believe only i386 and amd64 have implemented
the third cpu switch argument required.
Approved by: re
- Improve load long-term load balancer by always IPIing exactly once.
Previously the delay after rebalancing could cause problems with
uneven workloads.
- Allow nice to have a linear effect on the interactivity score. This
allows negatively niced programs to stay interactive longer. It may be
useful with very expensive Xorg servers under high loads. In general
it should not be necessary to alter the nice level to improve interactive
response. We may also want to consider never allowing positively niced
processes to become interactive at all.
- Initialize ccpu to 0 rather than 0.0. The decimal point was leftover
from when the code was copied from 4bsd. ccpu is 0 in ULE because ULE
only exports weighted cpu values.
Reported by: Steve Kargl (Load balancing problem)
Approved by: re
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.
Approved by: re
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.
Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)
on 2cpu machines by reducing it to 1 by default. This improves loaded
operation on 8cpu machines by increasing it to 3 where the extra idle
time is not as critical.
Approved by: re
tdq_group structure. Hyper-threaded cores won't really benefit from
seperate locks anyway.
- Seperate out the migration case from sched_switch to simplify the main
switch code. We only migrate here if called via sched_bind().
- When preempted place the preempted thread back in the same queue at
the head.
- Improve the cpu group and topology infrastructure.
Tested by: many on current@
Approved by: re
machines.
- Leave the long-term load balancer running by default once per second.
- Enable stealing load from the idle thread only when the remote processor
has more than two transferable tasks. Setting this to one further
improves buildworld. Setting it higher improves mysql.
- Remove the bogus pick_zero option. I had not intended to commit this.
- Entirely disallow migration for threads with SRQ_YIELDING set. This
balances out the extra migration allowed for with the load balancers.
It also makes pick_pri perform better as I had anticipated.
Tested by: Dmitry Morozovsky <marck@rinet.ru>
Approved by: re
properly. We have to temporarily unlock the TDQ lock so we can lock
the thread and add it to the run queue. This is used only for KSE.
- When we add a thread from the tdq_move() via sched_balance() we need to
ipi the target if it's sitting in the idle thread or it'll never run.
Reported by: Rene Landan
Approved by: re
been in development for over 6 months as SCHED_SMP.
- Implement one spin lock per thread-queue. Threads assigned to a
run-queue point to this lock via td_lock.
- Improve the facility for assigning threads to CPUs now that sched_lock
contention no longer dominates scheduling decisions on larger SMP
machines.
- Re-write idle time stealing in an attempt to make it less damaging to
general performance. This is still disabled by default. See
kern.sched.steal_idle.
- Call the long-term load balancer from a callout rather than sched_clock()
so there are no locks held. This is disabled by default. See
kern.sched.balance.
- Parameterize many scheduling decisions via sysctls. Try to document
these via sysctl descriptions.
- General structural and naming cleanups.
- Document each function with comments.
Tested by: current@ amd64, x86, UP, SMP.
Approved by: re
- In tdq_choose() only assert that a thread does not have too high a
priority (low value) for the queue we removed it from. This will catch
bugs in priority elevation. It's not a serious error for the thread
to have too low a priority as we don't change queues in this case as
an optimization.
Reported by: kris