Commit Graph

294 Commits

Author SHA1 Message Date
marius
1b1d84970a - Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to
itself, which sparc64 hardware doesn't support. One way to solve this
  would be to directly call sched_preempt() instead of issuing a self-IPI.
  However, quoting jhb@:
  "On the other hand, you can probably just skip the IPI entirely if we are
  going to send it to the current CPU.  Presumably, once this routine
  finishes, the current CPU will exit softclock (or will do so "soon") and
  will then pick the next thread to run based on the adjustments made in
  this routine, so there's no need to IPI the CPU running this routine
  anyway.  I think this is the better solution.  Right now what is probably
  happening on other platforms is as soon as this routine finishes the CPU
  processes its self-IPI and causes mi_switch() which will just switch back
  to the softclock thread it is already running."
- With r226054 and the above change in place, sparc64 is no longer
  incompatible with ULE and vice versa. However, powerpc/E500 still is.

Submitted by:	jhb
Reviewed by:	jeff
2011-10-06 11:48:13 +00:00
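
The approach quoted in commit 1b1d84970a (skip the IPI when the target is the CPU running the balancer) can be sketched as follows. This is a minimal illustration, not the kernel code: `send_preempt_ipi()`, `notify_cpu()` and the `ipis_sent` counter are hypothetical stand-ins.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's IPI primitive; it only counts calls. */
static int ipis_sent;

static void send_preempt_ipi(int target_cpu)
{
    (void)target_cpu;
    ipis_sent++;
}

/*
 * Ask target_cpu to reschedule, unless it is the CPU running this code.
 * Returns true if an IPI was sent. The local CPU needs no IPI because it
 * will pick its next thread once the calling routine finishes.
 */
static bool notify_cpu(int current_cpu, int target_cpu)
{
    if (target_cpu == current_cpu)
        return false;
    send_preempt_ipi(target_cpu);
    return true;
}
```

With this shape, a call such as `notify_cpu(2, 2)` sends nothing, while `notify_cpu(2, 3)` sends exactly one IPI.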
delphij
f687a7bf6f Fix format strings for KTR_STATE in 4BSD and ULE schedulers.
Submitted by:	Ivan Klymenko <fidaj@ukr.net>
PR:		kern/159904, kern/159905
MFC after:	2 weeks
Approved by:	re (kib)
2011-08-26 18:00:07 +00:00
attilio
750fc68e27 Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace
pollution.  That is a step further in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.

Reported by:	jhb
Reviewed and tested by:	pluknet
Approved by:	re (kib)
2011-07-19 16:50:55 +00:00
attilio
fe4de567b5 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architectures.
cpuset_t, on the other hand, is implemented as an array of longs and is
easily extensible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, need to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primarily done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to come soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
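
To illustrate why an array of longs (commit fe4de567b5) scales past 32 CPUs where a single int mask cannot, here is a minimal bit-set sketch. The type `sketch_cpuset`, the limit `SKETCH_MAXCPU` and the helper names are assumptions made for this example; they are not the kernel's cpuset_t or its CPU_* macros.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAXCPU 256 /* hypothetical limit; raising it only grows the array */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SET_WORDS     ((SKETCH_MAXCPU + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* A CPU set stored as an array of longs, one bit per CPU. */
typedef struct {
    unsigned long bits[SET_WORDS];
} sketch_cpuset;

static void sketch_zero(sketch_cpuset *s)
{
    memset(s, 0, sizeof(*s));
}

static void sketch_set(sketch_cpuset *s, size_t cpu)
{
    s->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

static void sketch_clr(sketch_cpuset *s, size_t cpu)
{
    s->bits[cpu / BITS_PER_WORD] &= ~(1UL << (cpu % BITS_PER_WORD));
}

static bool sketch_isset(const sketch_cpuset *s, size_t cpu)
{
    return ((s->bits[cpu / BITS_PER_WORD] >> (cpu % BITS_PER_WORD)) & 1UL) != 0;
}
```

CPU 40, which cannot be represented in a 32-bit mask, is handled the same way as CPU 0: the index selects a word and a bit inside it.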
fabient
e0588db8d2 Clearing the flag when preempting will let the preempted thread run
for too long. This can end in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.

As requested by jhb@, 4BSD has received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by:	jhb, jeff
MFC after:	1 month
2011-03-31 13:59:47 +00:00
jhb
b92da6d9e2 Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
  just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
  the timesharing priority band can be increased.  The new timeshare range
  is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
  interactive threads.  Instead, the larger timeshare range is now split
  into separate subranges for interactive and non-interactive ("batch")
  threads.  The end result is that interactive threads and non-interactive
  threads still use the same priority ranges as before, but realtime
  threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
  or via cv_broadcastpri().  Realtime and idle priority threads will
  no longer have their priorities affected by sleeping in the kernel.

Reviewed by:	jeff
2011-01-14 17:06:54 +00:00
jhb
fd41389dca Introduce two new helper macros to define the priority ranges used for
interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME
and PRI_*_TIMESHARE.  No functional change.

Reviewed by:	jeff
2011-01-13 14:22:27 +00:00
jhb
624ce4a8a0 Always use PRI_BASE() when checking the base type of a thread's priority
class.

MFC after:	2 weeks
2011-01-11 22:13:19 +00:00
jhb
27afded079 Fix two harmless off-by-one errors.
Reviewed by:	jeff
MFC after:	2 weeks
2011-01-10 20:48:10 +00:00
jhb
69d18cbe4f - Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
  proc.  Otherwise attempts to modify thread or process data in sched_fork()
  could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
  sched_fork_thread() in ULE.  This is already done courtesy of the bcopy()
  of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
  new thread's base priority (td_base_pri) to avoid bogusly inheriting a
  borrowed priority from the parent thread.

MFC after:	2 weeks
2011-01-06 22:24:00 +00:00
davidxu
3daac37e3c - Following r216313, sched_unlend_user_prio is no longer needed; always
use sched_lend_user_prio to set the lent priority.
- Improve the pthread priority-inherit mutex: when a contender's priority is
  lowered, repropagate priorities. This may cause the mutex owner's priority
  to be lowered; in the old code, the mutex owner's priority was rise-only.
2010-12-29 09:26:46 +00:00
davidxu
171976dba2 MFp4:
It is possible for a lower priority thread to lend its priority to a higher
priority thread. The old code ignored this, but the lending should always be
recorded. Add the field td_lend_user_pri to fix the problem; if a thread does
not have a borrowed priority, its value is PRI_MAX.

MFC after: 1 week
2010-12-09 02:42:02 +00:00
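
The idea in commit 171976dba2 (always record the lent priority, even when it is currently worse than the thread's own) can be sketched like this. The struct, the helper names and the value of `SKETCH_PRI_MAX` are hypothetical; only the "lower number means higher priority" convention and the "PRI_MAX means nothing lent" rule come from the commit messages in this log.

```c
#define SKETCH_PRI_MAX 255 /* hypothetical "nothing lent" marker */

struct sketch_thread {
    int base_user_pri; /* the thread's own user priority */
    int lend_user_pri; /* priority lent by a waiter, or SKETCH_PRI_MAX */
};

/* Always record the lent priority, whether or not it is currently better. */
static void sketch_lend_user_prio(struct sketch_thread *td, int prio)
{
    td->lend_user_pri = prio;
}

/* Effective priority: the numerically smaller (stronger) of the two. */
static int sketch_effective_user_pri(const struct sketch_thread *td)
{
    return td->lend_user_pri < td->base_user_pri ?
        td->lend_user_pri : td->base_user_pri;
}
```

Because the loan is recorded unconditionally, it starts to matter as soon as the thread's own priority is later lowered below the lent one.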
trasz
3d022d63eb Remove unused variables. 2010-11-13 11:54:04 +00:00
attilio
b6be89dafb Fix typos.
Submitted by:	gianni
MFC after:	3 days
2010-11-10 21:06:49 +00:00
davidxu
4c899bcdf5 Use integer for size of cpuset, as it won't be bigger than INT_MAX.
This was requested by bge.
Also move the sysctl into file kern_cpuset.c, because it should
always be there; it is independent of the thread scheduler.
2010-11-01 00:42:25 +00:00
davidxu
a5ea18413e Add sysctl kern.sched.cpusetsize to export the size of kernel cpuset,
also add sysconf() key _SC_CPUSET_SIZE to get sysctl value.

Submitted by: gcooper
2010-10-29 13:31:10 +00:00
jhb
e350ad7930 Comment nit, set TDF_NEEDRESCHED after the comment describing why it is
done rather than before.

MFC after:	1 week
2010-09-21 19:12:22 +00:00
avg
6fb9e57674 kern.sched.topology_spec sysctl: use step of 1 for group levels numeration
This is just a cosmetic change for prettier output.
'indent' variable/parameter serves two purposes: it specifies whitespace
indentation level and also implies cpu group level/depth.
It would have been better to split those two uses,
but for now just a simple change.

MFC after:	1 week
2010-09-18 11:16:43 +00:00
mav
eb4931dc6c Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When the CPU is busy, interrupts are generated at the full rate
of hz + stathz to fulfill scheduler and timekeeping requirements. But
when the CPU is idle, only the minimum set of interrupts (down to 8 interrupts per
second per CPU now) needed to handle scheduled callouts is executed.
This significantly increases idle CPU sleep time, increasing the effect
of static power-saving technologies. It should also reduce host CPU load
on virtualized systems when the guest system is idle.

There is a set of tunables, also available as writable sysctls, that
control the event timer subsystem behavior:
  kern.eventtimer.timer - selects the event timer hardware to use.
On x86 there are up to 4 different kinds of timers. Depending on whether
the chosen timer is per-CPU, the behavior of the other options differs slightly.
  kern.eventtimer.periodic - selects between periodic and one-shot
operation mode. In periodic mode, the current timer hardware is taken as the only
source of time for time events. This mode is quite similar to the previous kernel
behavior. One-shot mode instead uses the currently selected time counter
hardware to schedule all needed events one by one and programs the timer to
generate an interrupt exactly at the specified time. The default value depends on
the chosen timer's capabilities, but one-shot mode is preferred unless the other is
forced by the user or hardware.
  kern.eventtimer.singlemul - in periodic mode, specifies how many times
higher the timer frequency should be, so as not to strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
  kern.eventtimer.idletick - makes each CPU receive every timer interrupt
regardless of whether it is busy or not. By default this option is
disabled. If the chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generated.

Since this patch modifies cpu_idle() on some platforms, I have also
refactored the one on x86. It now makes use of MONITOR/MWAIT instructions
(if supported) under high sleep/wakeup rates, as a fast alternative to other
methods. This allows the SMP scheduler to wake up sleeping CPUs much faster
without using an IPI, significantly increasing performance on some highly
task-switching loads.

Tested by:	many (on i386, amd64, sparc64 and powerpc)
H/W donated by:	Gheorghe Ardelean
Sponsored by:	iXsystems, Inc.
2010-09-13 07:25:35 +00:00
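
The one-shot mode described in commit eb4931dc6c programs the timer for the next needed event instead of ticking at a fixed rate. A rough sketch of that selection step is shown below, under the assumption that pending events are available as a plain array of absolute deadlines; the function name, the units and the `max_idle` cap are hypothetical and not taken from the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the time at which a one-shot timer should fire next: the earliest
 * pending deadline, but never later than now + max_idle, so that an idle CPU
 * still wakes up occasionally. Deadlines already in the past are returned
 * unchanged; a real implementation would fire immediately instead.
 */
static uint64_t next_oneshot_deadline(const uint64_t *deadlines, size_t n,
    uint64_t now, uint64_t max_idle)
{
    uint64_t next = now + max_idle;

    for (size_t i = 0; i < n; i++) {
        if (deadlines[i] < next)
            next = deadlines[i];
    }
    return next;
}
```

With no pending events the timer is simply armed for the idle cap, which is how an idle CPU could drop to a handful of interrupts per second.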
mav
aa2a743453 Do not IPI a CPU that is already spinning for load. It doubles the effect of
spinning (compared to MWAIT) on some heavily switching test loads.
2010-09-10 13:24:47 +00:00
mdf
200dc21dcc Fix UP build.
MFC after:	2 weeks
2010-09-02 16:23:05 +00:00
mdf
bbc3957715 Fix a bug with sched_affinity() where it checks td_pinned of another
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU.  Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.

KASSERT in sched_bind() that the thread is not yet pinned to a CPU.

KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.

sched_affinity code came from jhb@.

MFC after:	2 weeks
2010-09-01 20:32:47 +00:00
jhb
d02cab2556 Remove unused KTRACE includes. 2010-08-19 16:41:27 +00:00
jhb
19ddbf5c38 Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid.  Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead.  This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by:	peter, sbruno
Reviewed by:	rookie
Obtained from:	Yahoo! (x86)
MFC after:	1 month
2010-08-06 15:36:59 +00:00
ivoras
3fb9f87a34 A cosmetic change - don't output empty <flags>. 2010-07-15 13:46:30 +00:00
jhb
9b74a62d73 Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
ivoras
04624ee0ea Unconfuse THREAD and SMT flags 2010-06-10 11:48:14 +00:00
ivoras
7937017072 Cosmetic change to XML - less ugly newlines 2010-06-10 11:01:17 +00:00
jhb
16dab63fe9 Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it.  All of the current callers already hold the
lock.

MFC after:	1 month
2010-06-03 16:02:11 +00:00
jhb
ce208e1f41 Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after:	1 month
2010-05-21 17:15:56 +00:00
rrs
8ea4ab29a0 This pushes all of JC's patches that I have in place. I
am now able to run 32 cores ok.. but I still will hang
on buildworld with a NFS problem. I suspect I am missing
a patch for the netlogic rge driver.

JC check and see if I am missing anything except your
core-mask changes

Obtained from:	JC
2010-05-16 19:43:48 +00:00
attilio
9a7f4738f4 - Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
  sched_lock was acquired (without the aid of the td_lock interface) and
  the td_lock was dropped. This was going to break locking rules for other
  threads wanting to access the thread (via the td_lock interface) and
  modify its flags (allowed as long as the container lock was different
  from the one used in sched_switch).
  In order to prevent this situation, while sched_lock is acquired there
  the td_lock gets blocked. [0]
- Merge the ULE's internal function thread_block_switch() into the global
  thread_lock_block() and make the former's semantics the default for
  thread_lock_block(). This means that thread_lock_block() will not
  disable interrupts when called (and consequently thread_unlock_block()
  will not re-enable them when called). This should be done manually
  when necessary.
  Note, however, that ULE's thread_unblock_switch() is not reaped
  because it does reflect a difference in semantics in ULE (the
  td_lock may not necessarily still be blocked_lock when calling this).
  While asymmetric, it does describe a remarkable difference in semantics
  that is good to keep in mind.

[0] Reported by:	Kohji Okuno
			<okuno dot kohji at jp dot panasonic dot com>
Tested by:		Giovanni Trematerra
			<giovanni dot trematerra at gmail dot com>
MFC:			2 weeks
2010-01-23 15:54:21 +00:00
kib
fe41ad464e Allow swap out of the kernel stack for a thread whose priority value is
greater than or equal to PSOCK, not less than or equal to it. Higher priority
has a lesser numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2009-12-31 18:52:58 +00:00
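
The direction of the comparison fixed in commit fe41ad464e is easy to get backwards because a numerically larger value means a less important thread. A minimal sketch, assuming an arbitrary numeric value for PSOCK (`SKETCH_PSOCK` is made up for this example, as is the function name):

```c
#include <stdbool.h>

#define SKETCH_PSOCK 100 /* hypothetical numeric value, for illustration only */

/*
 * A sleeping thread's kernel stack may be swapped out only if the thread is
 * at PSOCK or less important, i.e. its numeric priority is >= PSOCK.
 */
static bool sketch_may_swap_kstack(int pri)
{
    return pri >= SKETCH_PSOCK;
}
```

Under this test a low-priority sleeper (larger number) qualifies, while a high-priority waiter (smaller number) keeps its stack resident.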
ed
3297cb3093 Don't forget to use `void' for sched_balance(). It has no arguments. 2009-12-28 23:12:12 +00:00
ivoras
28bbcc383a Make ULE process usage (%CPU) accounting usable again by keeping track
of the last tick we incremented on.

Submitted by:	matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by:	jeff (who thinks there should be a better way in the future)
Approved by:	gnn (mentor)
MFC after:	3 weeks
2009-11-24 19:57:41 +00:00
attilio
1c940ef4f4 Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvement aims to avoid further cache misses in scheduler-specific
functions that need to keep track of average thread running
time, and further locking in places that set this flag.

Reported by:	jeff (originally), kris (currently)
Reviewed by:	jhb
Tested by:	Giuseppe Cocomazzi <sbudella at email dot it>
2009-11-03 16:46:52 +00:00
jhb
f88b32f139 Fix a sign bug in the handling of nice priorities when computing the
interactive score for a thread.

Submitted by:	Taku YAMAMOTO  taku of tackymt.homeip.net
Reviewed by:	jeff
MFC after:	3 days
2009-10-15 11:41:12 +00:00
attilio
867bd3bd7f Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are no longer shared even in the
  HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
  even with different locks, with the contribution of tdq_lock_pair().
  An explanation is here:
  (hypothesis: a thread needs to migrate to another CPU, thread1 is doing
  sched_switch_migrate() and thread2 is the one handling the sched_switch()
  request or in other words, thread1 is the thread that needs to migrate
  and thread2 is a thread that is going to be preempted, most likely an
  idle thread. Also, 'old' refers to the context (in terms of
  run-queue and CPU) thread1 is leaving and 'new' refers to the
  context thread1 is going into.  Finally, thread3 is doing tdq_idletd()
  or sched_balance() and definitely doing tdq_lock_pair())

  * thread1 blocks its td_lock. Now td_lock is 'blocked'
  * thread1 drops its old runqueue lock
  * thread1 acquires the new runqueue lock
  * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
    through tdq_notify() to the new CPU
  * thread1 drops the new lock
  * thread3, scanning the runqueues, locks the old lock
  * thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
    pointing to the new runqueue
  * thread3 wants to acquire the new runqueue lock, but it can't because
    it is held by thread2 so it spins
  * thread1 wants to acquire old lock, but as long as it is held by
    thread3 it can't
  * thread2, going further, at some point wants to switch in thread1,
    but it will wait forever because thread1->td_lock is in blocked state

This deadlock has manifested mostly on 7.x and has been reported several times
on mailing lists under the heading 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.

Reviewed by:	jeff
Reported by:	des, others
Tested by:	des, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
		(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
jeff
92b4ecdc77 - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE
supports arbitrary numbers of cpus rather than being limited by
   cpumask_t to the number of bits in a long.
2009-06-23 22:12:37 +00:00
jeff
9ff631ca46 - Fix non-SMP build by encapsulating idle spin logic in a macro.
Pointy hat to:	me
2009-04-29 23:04:31 +00:00
jeff
fe5d856f47 - Fix the FBSDID line. 2009-04-29 03:26:30 +00:00
jeff
88a1cd92bb - Remove the bogus idle thread state code. This may have a race in it
and it only optimized out an ipi or mwait in very few cases.
 - Skip the adaptive idle code when running on SMT or HTT cores.  This
   just wastes cpu time that could be used on a busy thread on the same
   core.
 - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive.  Re-use
   CG_FLAG_THREAD to mean SMT or HTT.

Sponsored by:   Nokia
2009-04-29 03:15:43 +00:00
jeff
96eaa9ff52 - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh
is calculated as 0 which causes errors elsewhere.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>

 - When sched_affinity() is called with a thread that is not curthread we
   need to handle the ON_RUNQ() case by adding the thread to the correct
   run queue.

Submitted by:	Justin Teller <justin.teller@gmail.com>

MFC after:	1 Week
2009-03-14 11:41:36 +00:00
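
Commit 96eaa9ff52 does not say how steal_thresh was computed. One way an odd CPU count can produce 0 is a log2-style estimate that scans for the lowest set bit instead of the highest. The sketch below assumes exactly that and caps the result at 3; both the cap and all function names are hypothetical.

```c
/* 1-based position of the lowest set bit, 0 if v == 0. */
static int lowest_set_bit(unsigned v)
{
    int pos = 1;

    if (v == 0)
        return 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        pos++;
    }
    return pos;
}

/* 1-based position of the highest set bit, 0 if v == 0. */
static int highest_set_bit(unsigned v)
{
    int pos = 0;

    while (v != 0) {
        v >>= 1;
        pos++;
    }
    return pos;
}

/* Threshold from the lowest set bit: any odd CPU count yields 0. */
static int thresh_from_lowest(unsigned ncpus)
{
    int t = lowest_set_bit(ncpus) - 1;
    return t < 3 ? t : 3;
}

/* Threshold from the highest set bit: roughly floor(log2(ncpus)). */
static int thresh_from_highest(unsigned ncpus)
{
    int t = highest_set_bit(ncpus) - 1;
    return t < 3 ? t : 3;
}
```

For power-of-two CPU counts the two variants agree, which is why such a mistake would go unnoticed on most machines.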
jeff
d4c94410f6 - Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
   something more reasonable such as sizeof("32").  This shouldn't have
   caused any ill effect until we run on machines with 1000000 or more
   cpus.
2009-01-25 07:35:10 +00:00
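
The bug fixed in commit d4c94410f6 comes from how the C preprocessor stringifies: `#x` turns the macro's name into a string unless a second macro level forces expansion first. The sketch below uses hypothetical `SKETCH_*` macros in place of the system's `__STRING`/`__XSTRING` and `MAXCPU`.

```c
#include <stddef.h>

#define SKETCH_MAXCPU 32                           /* hypothetical value */
#define SKETCH_STRING(x)  #x                       /* stringify without expanding x */
#define SKETCH_XSTRING(x) SKETCH_STRING(x)         /* expand x, then stringify */

/* Size of the string "SKETCH_MAXCPU", including the terminating NUL. */
static size_t len_unexpanded(void)
{
    return sizeof(SKETCH_STRING(SKETCH_MAXCPU));
}

/* Size of the string "32", including the terminating NUL. */
static size_t len_expanded(void)
{
    return sizeof(SKETCH_XSTRING(SKETCH_MAXCPU));
}
```

The unexpanded form happens to be the larger of the two here, which matches the commit's remark that the mistake was harmless for realistic CPU counts.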
jeff
3d8d825555 - Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py.  This allows developers to quickly
   create a graphical view of ktr data for any resource in the system.
 - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
   identifying records associated with a thread or cpu.
 - Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from:	attilio
Discussed with:	jhb
Sponsored by:	Nokia
2009-01-17 07:17:57 +00:00
ivoras
97219f9ae7 Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by:	jeff (original version)
Approved by:	gnn (mentor) (original version)
2008-12-23 16:19:59 +00:00
jhb
b08d457fbe When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by:	Ravi Murty @ Intel
Reviewed by:	jeff
2008-11-18 05:41:34 +00:00
ivoras
d819bb20f8 Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by:	gnn (mentor) (older version)
Pointed out by:	rdivacky
2008-11-02 23:11:20 +00:00
ivoras
483637ae39 Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.

An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
   <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
   <flags></flags>
   <children>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
       <flags></flags>
     </group>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
       <flags></flags>
     </group>
   </children>
 </group>
</groups>

Reviewed by:	jeff
Approved by:	gnn (mentor)
2008-10-29 13:36:23 +00:00
jeff
b2f69d1b1e - Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick.  This pushes
   the value out of range and breaks priority calculation.

Reviewed by:	kib
Found by:	pho/nokia
Sponsored by:	Nokia
MFC after:	3 days
2008-07-19 05:13:47 +00:00
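
The guard described in commit b2f69d1b1e (do not count the same tick twice when several CPUs run the tick handler) can be sketched as below. The struct and function names are hypothetical, and the sketch ignores locking, which a real scheduler would need.

```c
/* Per-thread accounting state; field names are made up for this sketch. */
struct sketch_sched {
    int last_tick;  /* global tick number most recently accounted */
    int tick_count; /* number of distinct ticks accounted so far */
};

/*
 * Account one tick for the thread. If the same global tick was already
 * recorded (for example by another CPU), do nothing.
 */
static void sketch_tick(struct sketch_sched *ts, int now)
{
    if (ts->last_tick == now)
        return;
    ts->tick_count++;
    ts->last_tick = now;
}
```

Calling `sketch_tick()` twice with the same tick number advances the counter only once, which keeps the accumulated value within its expected range.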