2003-01-26 05:23:15 +00:00
|
|
|
/*-
|
2007-01-04 08:56:25 +00:00
|
|
|
* Copyright (c) 2002-2007, Jeffrey Roberson <jeff@freebsd.org>
|
2003-01-26 05:23:15 +00:00
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice unmodified, this list of conditions, and the following
|
|
|
|
* disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
|
|
|
|
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
|
|
|
|
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
|
|
|
|
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
|
|
|
|
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
|
|
|
|
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
|
|
|
|
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
|
|
|
|
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
|
|
|
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
|
|
|
|
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* This file implements the ULE scheduler. ULE supports independent CPU
|
|
|
|
* run queues and fine grain locking. It has superior interactive
|
|
|
|
* performance under load even on uni-processor systems.
|
|
|
|
*
|
|
|
|
* etymology:
|
2007-09-22 02:20:14 +00:00
|
|
|
* ULE is the last three letters in schedule. It owes its name to a
|
2007-07-17 22:53:23 +00:00
|
|
|
* generic user created for a scheduling system by Paul Mikesell at
|
|
|
|
* Isilon Systems and a general lack of creativity on the part of the author.
|
|
|
|
*/
|
|
|
|
|
2003-06-11 00:56:59 +00:00
|
|
|
#include <sys/cdefs.h>
|
2009-04-29 03:26:30 +00:00
|
|
|
__FBSDID("$FreeBSD$");
|
2003-06-11 00:56:59 +00:00
|
|
|
|
2005-06-24 00:16:57 +00:00
|
|
|
#include "opt_hwpmc_hooks.h"
|
2008-05-25 01:44:58 +00:00
|
|
|
#include "opt_kdtrace.h"
|
2005-06-24 00:16:57 +00:00
|
|
|
#include "opt_sched.h"
|
2004-09-02 18:59:15 +00:00
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
2004-07-10 21:38:22 +00:00
|
|
|
#include <sys/kdb.h>
|
2003-01-26 05:23:15 +00:00
|
|
|
#include <sys/kernel.h>
|
|
|
|
#include <sys/ktr.h>
|
|
|
|
#include <sys/lock.h>
|
|
|
|
#include <sys/mutex.h>
|
|
|
|
#include <sys/proc.h>
|
2003-04-02 06:46:43 +00:00
|
|
|
#include <sys/resource.h>
|
2003-11-04 07:45:41 +00:00
|
|
|
#include <sys/resourcevar.h>
|
2003-01-26 05:23:15 +00:00
|
|
|
#include <sys/sched.h>
|
2012-05-15 01:30:25 +00:00
|
|
|
#include <sys/sdt.h>
|
2003-01-26 05:23:15 +00:00
|
|
|
#include <sys/smp.h>
|
|
|
|
#include <sys/sx.h>
|
|
|
|
#include <sys/sysctl.h>
|
|
|
|
#include <sys/sysproto.h>
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
#include <sys/turnstile.h>
|
2006-08-25 06:12:53 +00:00
|
|
|
#include <sys/umtx.h>
|
2003-01-26 05:23:15 +00:00
|
|
|
#include <sys/vmmeter.h>
|
2008-03-02 08:20:59 +00:00
|
|
|
#include <sys/cpuset.h>
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
#include <sys/sbuf.h>
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2005-04-19 04:01:25 +00:00
|
|
|
#ifdef HWPMC_HOOKS
|
|
|
|
#include <sys/pmckern.h>
|
|
|
|
#endif
|
|
|
|
|
2008-05-25 01:44:58 +00:00
|
|
|
#ifdef KDTRACE_HOOKS
|
|
|
|
#include <sys/dtrace_bsd.h>
|
|
|
|
int dtrace_vtime_active;
|
|
|
|
dtrace_vtime_switch_func_t dtrace_vtime_switch_func;
|
|
|
|
#endif
|
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
#include <machine/cpu.h>
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
#include <machine/smp.h>
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2012-05-27 10:25:20 +00:00
|
|
|
#if defined(__powerpc__) && defined(BOOKE_E500)
|
2007-09-27 16:39:27 +00:00
|
|
|
#error "This architecture is not currently compatible with ULE"
|
2007-01-23 08:50:34 +00:00
|
|
|
#endif
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
#define KTR_ULE 0
|
2007-01-25 19:14:11 +00:00
|
|
|
|
2009-01-25 07:35:10 +00:00
|
|
|
#define TS_NAME_LEN (MAXCOMLEN + sizeof(" td ") + sizeof(__XSTRING(UINT_MAX)))
|
|
|
|
#define TDQ_NAME_LEN (sizeof("sched lock ") + sizeof(__XSTRING(MAXCPU)))
|
2011-07-19 16:50:55 +00:00
|
|
|
#define TDQ_LOADNAME_LEN (sizeof("CPU ") + sizeof(__XSTRING(MAXCPU)) - 1 + sizeof(" load"))
|
2009-01-17 07:17:57 +00:00
|
|
|
|
2005-06-04 09:23:28 +00:00
|
|
|
/*
|
2007-07-17 22:53:23 +00:00
|
|
|
* Thread scheduler specific section. All fields are protected
|
|
|
|
* by the thread lock.
|
2004-09-05 02:09:54 +00:00
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
struct td_sched {
|
2007-07-17 22:53:23 +00:00
|
|
|
struct runq *ts_runq; /* Run-queue we're queued on. */
|
|
|
|
short ts_flags; /* TSF_* flags. */
|
2006-12-06 06:34:57 +00:00
|
|
|
u_char ts_cpu; /* CPU that we have affinity for. */
|
2008-03-10 03:15:19 +00:00
|
|
|
int ts_rltick; /* Real last tick, for affinity. */
|
2007-07-17 22:53:23 +00:00
|
|
|
int ts_slice; /* Ticks of slice remaining. */
|
|
|
|
u_int ts_slptime; /* Number of ticks we vol. slept */
|
|
|
|
u_int ts_runtime; /* Number of ticks we were running */
|
2006-12-06 06:34:57 +00:00
|
|
|
int ts_ltick; /* Last tick that we were running on */
|
|
|
|
int ts_ftick; /* First tick that we were running on */
|
|
|
|
int ts_ticks; /* Tick count */
|
2009-01-17 07:17:57 +00:00
|
|
|
#ifdef KTR
|
|
|
|
char ts_name[TS_NAME_LEN];
|
|
|
|
#endif
|
2004-09-05 02:09:54 +00:00
|
|
|
};
|
2006-12-06 06:34:57 +00:00
|
|
|
/* flags kept in ts_flags */
|
2007-01-19 21:56:08 +00:00
|
|
|
#define TSF_BOUND 0x0001 /* Thread can not migrate. */
|
|
|
|
#define TSF_XFERABLE 0x0002 /* Thread was added as transferable. */
|
2006-12-06 06:34:57 +00:00
|
|
|
|
|
|
|
static struct td_sched td_sched0;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
#define THREAD_CAN_MIGRATE(td) ((td)->td_pinned == 0)
|
|
|
|
#define THREAD_CAN_SCHED(td, cpu) \
|
|
|
|
CPU_ISSET((cpu), &(td)->td_cpuset->cs_mask)
|
|
|
|
|
2011-01-13 14:22:27 +00:00
|
|
|
/*
|
|
|
|
* Priority ranges used for interactive and non-interactive timeshare
|
2011-01-14 17:06:54 +00:00
|
|
|
* threads. The timeshare priorities are split up into four ranges.
|
|
|
|
* The first range handles interactive threads. The last three ranges
|
|
|
|
* (NHALF, x, and NHALF) handle non-interactive threads with the outer
|
|
|
|
* ranges supporting nice values.
|
2011-01-13 14:22:27 +00:00
|
|
|
*/
|
2011-01-14 17:06:54 +00:00
|
|
|
#define PRI_TIMESHARE_RANGE (PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE + 1)
|
|
|
|
#define PRI_INTERACT_RANGE ((PRI_TIMESHARE_RANGE - SCHED_PRI_NRESV) / 2)
|
2011-12-19 20:01:21 +00:00
|
|
|
#define PRI_BATCH_RANGE (PRI_TIMESHARE_RANGE - PRI_INTERACT_RANGE)
|
2011-01-14 17:06:54 +00:00
|
|
|
|
|
|
|
#define PRI_MIN_INTERACT PRI_MIN_TIMESHARE
|
|
|
|
#define PRI_MAX_INTERACT (PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE - 1)
|
|
|
|
#define PRI_MIN_BATCH (PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE)
|
2011-01-13 14:22:27 +00:00
|
|
|
#define PRI_MAX_BATCH PRI_MAX_TIMESHARE
|
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
2007-01-04 08:56:25 +00:00
|
|
|
* Cpu percentage computation macros and defines.
|
|
|
|
*
|
|
|
|
* SCHED_TICK_SECS: Number of seconds to average the cpu usage across.
|
|
|
|
* SCHED_TICK_TARG: Number of hz ticks to average the cpu usage across.
|
2007-01-05 08:50:38 +00:00
|
|
|
* SCHED_TICK_MAX: Maximum number of ticks before scaling back.
|
2007-01-04 08:56:25 +00:00
|
|
|
* SCHED_TICK_SHIFT: Shift factor to avoid rounding away results.
|
|
|
|
* SCHED_TICK_HZ: Compute the number of hz ticks for a given ticks count.
|
|
|
|
* SCHED_TICK_TOTAL: Gives the amount of time we've been recording ticks.
|
|
|
|
*/
|
|
|
|
#define SCHED_TICK_SECS 10
|
|
|
|
#define SCHED_TICK_TARG (hz * SCHED_TICK_SECS)
|
2007-01-05 08:50:38 +00:00
|
|
|
#define SCHED_TICK_MAX (SCHED_TICK_TARG + hz)
|
2007-01-04 08:56:25 +00:00
|
|
|
#define SCHED_TICK_SHIFT 10
|
|
|
|
#define SCHED_TICK_HZ(ts) ((ts)->ts_ticks >> SCHED_TICK_SHIFT)
|
2007-01-06 12:33:43 +00:00
|
|
|
#define SCHED_TICK_TOTAL(ts) (max((ts)->ts_ltick - (ts)->ts_ftick, hz))
|
2007-01-04 08:56:25 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* These macros determine priorities for non-interactive threads. They are
|
|
|
|
* assigned a priority based on their recent cpu utilization as expressed
|
|
|
|
* by the ratio of ticks to the tick total. NHALF priorities at the start
|
|
|
|
* and end of the MIN to MAX timeshare range are only reachable with negative
|
|
|
|
* or positive nice respectively.
|
2003-03-04 02:45:59 +00:00
|
|
|
*
|
2007-01-04 08:56:25 +00:00
|
|
|
* PRI_RANGE: Priority range for utilization dependent priorities.
|
2003-06-21 02:22:47 +00:00
|
|
|
* PRI_NRESV: Number of nice values.
|
2007-01-04 08:56:25 +00:00
|
|
|
* PRI_TICKS: Compute a priority in PRI_RANGE from the ticks count and total.
|
|
|
|
* PRI_NICE: Determines the part of the priority inherited from nice.
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2007-01-04 08:56:25 +00:00
|
|
|
#define SCHED_PRI_NRESV (PRIO_MAX - PRIO_MIN)
|
2003-11-02 03:49:32 +00:00
|
|
|
#define SCHED_PRI_NHALF (SCHED_PRI_NRESV / 2)
|
2011-01-13 14:22:27 +00:00
|
|
|
#define SCHED_PRI_MIN (PRI_MIN_BATCH + SCHED_PRI_NHALF)
|
|
|
|
#define SCHED_PRI_MAX (PRI_MAX_BATCH - SCHED_PRI_NHALF)
|
2011-01-10 20:48:10 +00:00
|
|
|
#define SCHED_PRI_RANGE (SCHED_PRI_MAX - SCHED_PRI_MIN + 1)
|
2007-01-04 08:56:25 +00:00
|
|
|
#define SCHED_PRI_TICKS(ts) \
|
|
|
|
(SCHED_TICK_HZ((ts)) / \
|
2007-01-06 08:44:13 +00:00
|
|
|
(roundup(SCHED_TICK_TOTAL((ts)), SCHED_PRI_RANGE) / SCHED_PRI_RANGE))
|
2007-01-04 08:56:25 +00:00
|
|
|
#define SCHED_PRI_NICE(nice) (nice)
|
2003-01-26 05:23:15 +00:00
|
|
|
|
|
|
|
/*
|
2007-01-04 08:56:25 +00:00
|
|
|
* These determine the interactivity of a process. Interactivity differs from
|
|
|
|
* cpu utilization in that it expresses the voluntary time slept vs time ran
|
|
|
|
* while cpu utilization includes all time not running. This more accurately
|
|
|
|
* models the intent of the thread.
|
2003-01-26 05:23:15 +00:00
|
|
|
*
|
2003-02-10 14:03:45 +00:00
|
|
|
* SLP_RUN_MAX: Maximum amount of sleep time + run time we'll accumulate
|
|
|
|
* before throttling back.
|
2003-11-02 03:36:33 +00:00
|
|
|
* SLP_RUN_FORK: Maximum slp+run time to inherit at fork time.
|
2003-06-15 02:18:29 +00:00
|
|
|
* INTERACT_MAX: Maximum interactivity value. Smaller is better.
|
2010-11-10 21:06:49 +00:00
|
|
|
* INTERACT_THRESH: Threshold for placement on the current runq.
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2007-01-04 08:56:25 +00:00
|
|
|
#define SCHED_SLP_RUN_MAX ((hz * 5) << SCHED_TICK_SHIFT)
|
|
|
|
#define SCHED_SLP_RUN_FORK ((hz / 2) << SCHED_TICK_SHIFT)
|
2003-06-15 02:18:29 +00:00
|
|
|
#define SCHED_INTERACT_MAX (100)
|
|
|
|
#define SCHED_INTERACT_HALF (SCHED_INTERACT_MAX / 2)
|
2003-10-16 08:17:43 +00:00
|
|
|
#define SCHED_INTERACT_THRESH (30)
|
2003-03-04 02:45:59 +00:00
|
|
|
|
2012-08-09 19:26:13 +00:00
|
|
|
/* Flags kept in td_flags. */
|
|
|
|
#define TDF_SLICEEND TDF_SCHED2 /* Thread time slice is over. */
|
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
2007-01-04 08:56:25 +00:00
|
|
|
* tickincr: Converts a stathz tick into a hz domain scaled by
|
|
|
|
* the shift factor. Without the shift the error rate
|
|
|
|
* due to rounding would be unacceptably high.
|
|
|
|
* realstathz: stathz is sometimes 0 and run off of hz.
|
|
|
|
* sched_slice: Runtime of each thread before rescheduling.
|
2007-07-17 22:53:23 +00:00
|
|
|
* preempt_thresh: Priority threshold for preemption and remote IPIs.
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2007-01-04 08:56:25 +00:00
|
|
|
static int sched_interact = SCHED_INTERACT_THRESH;
|
2012-08-10 19:02:49 +00:00
|
|
|
static int realstathz = 127;
|
|
|
|
static int tickincr = 8 << SCHED_TICK_SHIFT;;
|
|
|
|
static int sched_slice = 12;
|
2007-09-27 16:39:27 +00:00
|
|
|
#ifdef PREEMPTION
|
|
|
|
#ifdef FULL_PREEMPTION
|
|
|
|
static int preempt_thresh = PRI_MAX_IDLE;
|
|
|
|
#else
|
2007-07-17 22:53:23 +00:00
|
|
|
static int preempt_thresh = PRI_MIN_KERN;
|
2007-09-27 16:39:27 +00:00
|
|
|
#endif
|
|
|
|
#else
|
|
|
|
static int preempt_thresh = 0;
|
|
|
|
#endif
|
2011-01-13 14:22:27 +00:00
|
|
|
static int static_boost = PRI_MIN_BATCH;
|
2008-04-17 09:56:01 +00:00
|
|
|
static int sched_idlespins = 10000;
|
2012-03-09 19:09:08 +00:00
|
|
|
static int sched_idlespinthresh = -1;
|
2007-07-17 22:53:23 +00:00
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
2007-07-17 22:53:23 +00:00
|
|
|
* tdq - per processor runqs and statistics. All fields are protected by the
|
|
|
|
* tdq_lock. The load and lowpri may be accessed without to avoid excess
|
|
|
|
* locking in sched_pickcpu();
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
struct tdq {
|
2008-03-10 03:15:19 +00:00
|
|
|
/* Ordered to improve efficiency of cpu_search() and switch(). */
|
2008-03-02 08:20:59 +00:00
|
|
|
struct mtx tdq_lock; /* run queue lock. */
|
2008-03-10 03:15:19 +00:00
|
|
|
struct cpu_group *tdq_cg; /* Pointer to cpu topology. */
|
2008-04-17 09:56:01 +00:00
|
|
|
volatile int tdq_load; /* Aggregate load. */
|
2010-09-10 13:24:47 +00:00
|
|
|
volatile int tdq_cpu_idle; /* cpu_idle() is active. */
|
2008-03-02 08:20:59 +00:00
|
|
|
int tdq_sysload; /* For loadavg, !ITHD load. */
|
2008-03-10 03:15:19 +00:00
|
|
|
int tdq_transferable; /* Transferable thread count. */
|
2008-04-17 09:56:01 +00:00
|
|
|
short tdq_switchcnt; /* Switches this tick. */
|
|
|
|
short tdq_oldswitchcnt; /* Switches last tick. */
|
2007-07-17 22:53:23 +00:00
|
|
|
u_char tdq_lowpri; /* Lowest priority thread. */
|
2008-03-10 01:32:01 +00:00
|
|
|
u_char tdq_ipipending; /* IPI pending. */
|
2008-03-10 03:15:19 +00:00
|
|
|
u_char tdq_idx; /* Current insert index. */
|
|
|
|
u_char tdq_ridx; /* Current removal index. */
|
|
|
|
struct runq tdq_realtime; /* real-time run queue. */
|
|
|
|
struct runq tdq_timeshare; /* timeshare run queue. */
|
|
|
|
struct runq tdq_idle; /* Queue of IDLE threads. */
|
2009-01-17 07:17:57 +00:00
|
|
|
char tdq_name[TDQ_NAME_LEN];
|
|
|
|
#ifdef KTR
|
|
|
|
char tdq_loadname[TDQ_LOADNAME_LEN];
|
|
|
|
#endif
|
2007-07-17 22:53:23 +00:00
|
|
|
} __aligned(64);
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2008-04-17 09:56:01 +00:00
|
|
|
/* Idle thread states and config. */
|
|
|
|
#define TDQ_RUNNING 1
|
|
|
|
#define TDQ_IDLE 2
|
2007-01-19 21:56:08 +00:00
|
|
|
|
2003-12-11 03:57:10 +00:00
|
|
|
#ifdef SMP
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
struct cpu_group *cpu_top; /* CPU topology */
|
2007-01-19 21:56:08 +00:00
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
#define SCHED_AFFINITY_DEFAULT (max(1, hz / 1000))
|
|
|
|
#define SCHED_AFFINITY(ts, t) ((ts)->ts_rltick > ticks - ((t) * affinity))
|
2007-01-19 21:56:08 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Run-time tunables.
|
|
|
|
*/
|
2007-07-19 20:03:15 +00:00
|
|
|
static int rebalance = 1;
|
2007-10-02 00:36:06 +00:00
|
|
|
static int balance_interval = 128; /* Default set in sched_initticks(). */
|
2007-01-19 21:56:08 +00:00
|
|
|
static int affinity;
|
2007-07-19 20:03:15 +00:00
|
|
|
static int steal_idle = 1;
|
|
|
|
static int steal_thresh = 2;
|
2003-12-11 03:57:10 +00:00
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
2006-12-29 10:37:07 +00:00
|
|
|
* One thread queue per processor.
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
static struct tdq tdq_cpu[MAXCPU];
|
2007-10-02 00:36:06 +00:00
|
|
|
static struct tdq *balance_tdq;
|
|
|
|
static int balance_ticks;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
static DPCPU_DEFINE(uint32_t, randomval);
|
2004-06-02 05:46:48 +00:00
|
|
|
|
2006-12-06 06:34:57 +00:00
|
|
|
#define TDQ_SELF() (&tdq_cpu[PCPU_GET(cpuid)])
|
|
|
|
#define TDQ_CPU(x) (&tdq_cpu[(x)])
|
2007-08-03 23:38:46 +00:00
|
|
|
#define TDQ_ID(x) ((int)((x) - tdq_cpu))
|
2003-12-11 03:57:10 +00:00
|
|
|
#else /* !SMP */
|
2006-12-06 06:34:57 +00:00
|
|
|
static struct tdq tdq_cpu;
|
2004-06-02 05:46:48 +00:00
|
|
|
|
2007-06-05 02:53:51 +00:00
|
|
|
#define TDQ_ID(x) (0)
|
2006-12-06 06:34:57 +00:00
|
|
|
#define TDQ_SELF() (&tdq_cpu)
|
|
|
|
#define TDQ_CPU(x) (&tdq_cpu)
|
2003-01-29 07:00:51 +00:00
|
|
|
#endif
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
#define TDQ_LOCK_ASSERT(t, type) mtx_assert(TDQ_LOCKPTR((t)), (type))
|
|
|
|
#define TDQ_LOCK(t) mtx_lock_spin(TDQ_LOCKPTR((t)))
|
|
|
|
#define TDQ_LOCK_FLAGS(t, f) mtx_lock_spin_flags(TDQ_LOCKPTR((t)), (f))
|
|
|
|
#define TDQ_UNLOCK(t) mtx_unlock_spin(TDQ_LOCKPTR((t)))
|
2008-03-02 08:20:59 +00:00
|
|
|
#define TDQ_LOCKPTR(t) (&(t)->tdq_lock)
|
2007-07-17 22:53:23 +00:00
|
|
|
|
2006-10-26 21:42:22 +00:00
|
|
|
static void sched_priority(struct thread *);
|
2005-06-04 09:23:28 +00:00
|
|
|
static void sched_thread_priority(struct thread *, u_char);
|
2006-10-26 21:42:22 +00:00
|
|
|
static int sched_interact_score(struct thread *);
|
|
|
|
static void sched_interact_update(struct thread *);
|
|
|
|
static void sched_interact_fork(struct thread *);
|
2012-03-13 08:18:54 +00:00
|
|
|
static void sched_pctcpu_update(struct td_sched *, int);
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2003-02-03 05:30:07 +00:00
|
|
|
/* Operations on per processor queues */
|
2008-03-20 05:51:16 +00:00
|
|
|
static struct thread *tdq_choose(struct tdq *);
|
2006-12-06 06:34:57 +00:00
|
|
|
static void tdq_setup(struct tdq *);
|
2008-03-20 05:51:16 +00:00
|
|
|
static void tdq_load_add(struct tdq *, struct thread *);
|
|
|
|
static void tdq_load_rem(struct tdq *, struct thread *);
|
|
|
|
static __inline void tdq_runq_add(struct tdq *, struct thread *, int);
|
|
|
|
static __inline void tdq_runq_rem(struct tdq *, struct thread *);
|
2008-03-10 01:32:01 +00:00
|
|
|
static inline int sched_shouldpreempt(int, int, int);
|
2006-12-06 06:34:57 +00:00
|
|
|
void tdq_print(int cpu);
|
2007-01-04 08:56:25 +00:00
|
|
|
static void runq_print(struct runq *rq);
|
2007-07-17 22:53:23 +00:00
|
|
|
static void tdq_add(struct tdq *, struct thread *, int);
|
2003-02-03 05:30:07 +00:00
|
|
|
#ifdef SMP
|
2008-03-02 08:20:59 +00:00
|
|
|
static int tdq_move(struct tdq *, struct tdq *);
|
2006-12-06 06:34:57 +00:00
|
|
|
static int tdq_idled(struct tdq *);
|
2008-03-20 05:51:16 +00:00
|
|
|
static void tdq_notify(struct tdq *, struct thread *);
|
|
|
|
static struct thread *tdq_steal(struct tdq *, int);
|
|
|
|
static struct thread *runq_steal(struct runq *, int);
|
|
|
|
static int sched_pickcpu(struct thread *, int);
|
2007-10-02 00:36:06 +00:00
|
|
|
static void sched_balance(void);
|
2008-03-02 08:20:59 +00:00
|
|
|
static int sched_balance_pair(struct tdq *, struct tdq *);
|
2008-03-20 05:51:16 +00:00
|
|
|
static inline struct tdq *sched_setcpu(struct thread *, int, int);
|
2007-07-17 22:53:23 +00:00
|
|
|
static inline void thread_unblock_switch(struct thread *, struct mtx *);
|
2007-08-03 23:38:46 +00:00
|
|
|
static struct mtx *sched_switch_migrate(struct tdq *, struct thread *, int);
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
static int sysctl_kern_sched_topology_spec(SYSCTL_HANDLER_ARGS);
|
|
|
|
static int sysctl_kern_sched_topology_spec_internal(struct sbuf *sb,
|
|
|
|
struct cpu_group *cg, int indent);
|
2003-02-03 05:30:07 +00:00
|
|
|
#endif
|
|
|
|
|
2007-01-04 08:56:25 +00:00
|
|
|
static void sched_setup(void *dummy);
|
2008-03-16 10:58:09 +00:00
|
|
|
SYSINIT(sched_setup, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, sched_setup, NULL);
|
2007-01-04 08:56:25 +00:00
|
|
|
|
|
|
|
static void sched_initticks(void *dummy);
|
2008-03-16 10:58:09 +00:00
|
|
|
SYSINIT(sched_initticks, SI_SUB_CLOCKS, SI_ORDER_THIRD, sched_initticks,
|
|
|
|
NULL);
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROVIDER_DEFINE(sched);
|
|
|
|
|
|
|
|
SDT_PROBE_DEFINE3(sched, , , change_pri, change-pri, "struct thread *",
|
|
|
|
"struct proc *", "uint8_t");
|
|
|
|
SDT_PROBE_DEFINE3(sched, , , dequeue, dequeue, "struct thread *",
|
|
|
|
"struct proc *", "void *");
|
|
|
|
SDT_PROBE_DEFINE4(sched, , , enqueue, enqueue, "struct thread *",
|
|
|
|
"struct proc *", "void *", "int");
|
|
|
|
SDT_PROBE_DEFINE4(sched, , , lend_pri, lend-pri, "struct thread *",
|
|
|
|
"struct proc *", "uint8_t", "struct thread *");
|
|
|
|
SDT_PROBE_DEFINE2(sched, , , load_change, load-change, "int", "int");
|
|
|
|
SDT_PROBE_DEFINE2(sched, , , off_cpu, off-cpu, "struct thread *",
|
|
|
|
"struct proc *");
|
|
|
|
SDT_PROBE_DEFINE(sched, , , on_cpu, on-cpu);
|
|
|
|
SDT_PROBE_DEFINE(sched, , , remain_cpu, remain-cpu);
|
|
|
|
SDT_PROBE_DEFINE2(sched, , , surrender, surrender, "struct thread *",
|
|
|
|
"struct proc *");
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Print the threads waiting on a run-queue.
|
|
|
|
*/
|
2007-01-04 08:56:25 +00:00
|
|
|
static void
|
|
|
|
runq_print(struct runq *rq)
|
|
|
|
{
|
|
|
|
struct rqhead *rqh;
|
2008-03-20 05:51:16 +00:00
|
|
|
struct thread *td;
|
2007-01-04 08:56:25 +00:00
|
|
|
int pri;
|
|
|
|
int j;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < RQB_LEN; i++) {
|
|
|
|
printf("\t\trunq bits %d 0x%zx\n",
|
|
|
|
i, rq->rq_status.rqb_bits[i]);
|
|
|
|
for (j = 0; j < RQB_BPW; j++)
|
|
|
|
if (rq->rq_status.rqb_bits[i] & (1ul << j)) {
|
|
|
|
pri = j + (i << RQB_L2BPW);
|
|
|
|
rqh = &rq->rq_queues[pri];
|
2008-03-20 05:51:16 +00:00
|
|
|
TAILQ_FOREACH(td, rqh, td_runq) {
|
2007-01-04 08:56:25 +00:00
|
|
|
printf("\t\t\ttd %p(%s) priority %d rqindex %d pri %d\n",
|
2008-03-20 05:51:16 +00:00
|
|
|
td, td->td_name, td->td_priority,
|
|
|
|
td->td_rqindex, pri);
|
2007-01-04 08:56:25 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Print the status of a per-cpu thread queue. Should be a ddb show cmd.
|
|
|
|
*/
|
2003-04-11 03:47:14 +00:00
|
|
|
void
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq_print(int cpu)
|
2003-02-03 05:30:07 +00:00
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct tdq *tdq;
|
2003-04-03 00:29:28 +00:00
|
|
|
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq = TDQ_CPU(cpu);
|
2003-04-03 00:29:28 +00:00
|
|
|
|
2007-08-03 23:38:46 +00:00
|
|
|
printf("tdq %d:\n", TDQ_ID(tdq));
|
2008-03-02 08:20:59 +00:00
|
|
|
printf("\tlock %p\n", TDQ_LOCKPTR(tdq));
|
|
|
|
printf("\tLock name: %s\n", tdq->tdq_name);
|
2006-12-29 10:37:07 +00:00
|
|
|
printf("\tload: %d\n", tdq->tdq_load);
|
2008-04-17 09:56:01 +00:00
|
|
|
printf("\tswitch cnt: %d\n", tdq->tdq_switchcnt);
|
|
|
|
printf("\told switch cnt: %d\n", tdq->tdq_oldswitchcnt);
|
2007-07-17 22:53:23 +00:00
|
|
|
printf("\ttimeshare idx: %d\n", tdq->tdq_idx);
|
2007-01-04 12:16:19 +00:00
|
|
|
printf("\ttimeshare ridx: %d\n", tdq->tdq_ridx);
|
2008-04-17 09:56:01 +00:00
|
|
|
printf("\tload transferable: %d\n", tdq->tdq_transferable);
|
|
|
|
printf("\tlowest priority: %d\n", tdq->tdq_lowpri);
|
2007-01-04 08:56:25 +00:00
|
|
|
printf("\trealtime runq:\n");
|
|
|
|
runq_print(&tdq->tdq_realtime);
|
|
|
|
printf("\ttimeshare runq:\n");
|
|
|
|
runq_print(&tdq->tdq_timeshare);
|
|
|
|
printf("\tidle runq:\n");
|
|
|
|
runq_print(&tdq->tdq_idle);
|
2003-04-11 03:47:14 +00:00
|
|
|
}
|
2003-04-03 00:29:28 +00:00
|
|
|
|
2008-03-10 01:32:01 +00:00
|
|
|
static inline int
|
|
|
|
sched_shouldpreempt(int pri, int cpri, int remote)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If the new priority is not better than the current priority there is
|
|
|
|
* nothing to do.
|
|
|
|
*/
|
|
|
|
if (pri >= cpri)
|
|
|
|
return (0);
|
|
|
|
/*
|
|
|
|
* Always preempt idle.
|
|
|
|
*/
|
|
|
|
if (cpri >= PRI_MIN_IDLE)
|
|
|
|
return (1);
|
|
|
|
/*
|
|
|
|
* If preemption is disabled don't preempt others.
|
|
|
|
*/
|
|
|
|
if (preempt_thresh == 0)
|
|
|
|
return (0);
|
|
|
|
/*
|
|
|
|
* Preempt if we exceed the threshold.
|
|
|
|
*/
|
|
|
|
if (pri <= preempt_thresh)
|
|
|
|
return (1);
|
|
|
|
/*
|
2011-01-13 14:22:27 +00:00
|
|
|
* If we're interactive or better and there is non-interactive
|
|
|
|
* or worse running preempt only remote processors.
|
2008-03-10 01:32:01 +00:00
|
|
|
*/
|
2011-01-13 14:22:27 +00:00
|
|
|
if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT)
|
2008-03-10 01:32:01 +00:00
|
|
|
return (1);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Add a thread to the actual run-queue. Keeps transferable counts up to
|
|
|
|
* date with what is actually on the run-queue. Selects the correct
|
|
|
|
* queue position for timeshare threads.
|
|
|
|
*/
|
2003-11-15 07:32:07 +00:00
|
|
|
static __inline void
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_runq_add(struct tdq *tdq, struct thread *td, int flags)
|
2003-11-15 07:32:07 +00:00
|
|
|
{
|
2008-03-20 05:51:16 +00:00
|
|
|
struct td_sched *ts;
|
2008-03-10 22:48:27 +00:00
|
|
|
u_char pri;
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
2008-03-20 05:51:16 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2008-03-10 03:15:19 +00:00
|
|
|
|
2008-03-20 05:51:16 +00:00
|
|
|
pri = td->td_priority;
|
|
|
|
ts = td->td_sched;
|
|
|
|
TD_SET_RUNQ(td);
|
|
|
|
if (THREAD_CAN_MIGRATE(td)) {
|
2006-12-29 10:37:07 +00:00
|
|
|
tdq->tdq_transferable++;
|
2006-12-06 06:34:57 +00:00
|
|
|
ts->ts_flags |= TSF_XFERABLE;
|
2003-12-11 03:57:10 +00:00
|
|
|
}
|
2011-01-13 14:22:27 +00:00
|
|
|
if (pri < PRI_MIN_BATCH) {
|
2008-03-10 22:48:27 +00:00
|
|
|
ts->ts_runq = &tdq->tdq_realtime;
|
2011-01-13 14:22:27 +00:00
|
|
|
} else if (pri <= PRI_MAX_BATCH) {
|
2008-03-10 22:48:27 +00:00
|
|
|
ts->ts_runq = &tdq->tdq_timeshare;
|
2011-01-13 14:22:27 +00:00
|
|
|
KASSERT(pri <= PRI_MAX_BATCH && pri >= PRI_MIN_BATCH,
|
2007-01-04 08:56:25 +00:00
|
|
|
("Invalid priority %d on timeshare runq", pri));
|
|
|
|
/*
|
|
|
|
* This queue contains only priorities between MIN and MAX
|
|
|
|
* realtime. Use the whole queue to represent these values.
|
|
|
|
*/
|
2007-08-03 23:38:46 +00:00
|
|
|
if ((flags & (SRQ_BORROWING|SRQ_PREEMPTED)) == 0) {
|
2011-12-19 20:01:21 +00:00
|
|
|
pri = RQ_NQS * (pri - PRI_MIN_BATCH) / PRI_BATCH_RANGE;
|
2007-01-04 08:56:25 +00:00
|
|
|
pri = (pri + tdq->tdq_idx) % RQ_NQS;
|
2007-01-04 12:16:19 +00:00
|
|
|
/*
|
|
|
|
* This effectively shortens the queue by one so we
|
|
|
|
* can have a one slot difference between idx and
|
|
|
|
* ridx while we wait for threads to drain.
|
|
|
|
*/
|
|
|
|
if (tdq->tdq_ridx != tdq->tdq_idx &&
|
|
|
|
pri == tdq->tdq_ridx)
|
2007-03-17 18:13:32 +00:00
|
|
|
pri = (unsigned char)(pri - 1) % RQ_NQS;
|
2007-01-04 08:56:25 +00:00
|
|
|
} else
|
2007-01-04 12:16:19 +00:00
|
|
|
pri = tdq->tdq_ridx;
|
2008-03-20 05:51:16 +00:00
|
|
|
runq_add_pri(ts->ts_runq, td, pri, flags);
|
2008-03-10 22:48:27 +00:00
|
|
|
return;
|
2007-01-04 08:56:25 +00:00
|
|
|
} else
|
2008-03-10 03:15:19 +00:00
|
|
|
ts->ts_runq = &tdq->tdq_idle;
|
2008-03-20 05:51:16 +00:00
|
|
|
runq_add(ts->ts_runq, td, flags);
|
2008-03-10 03:15:19 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Remove a thread from a run-queue. This typically happens when a thread
|
|
|
|
* is selected to run. Running threads are not on the queue and the
|
|
|
|
* transferable count does not reflect them.
|
|
|
|
*/
|
2003-11-15 07:32:07 +00:00
|
|
|
static __inline void
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_runq_rem(struct tdq *tdq, struct thread *td)
|
2003-11-15 07:32:07 +00:00
|
|
|
{
|
2008-03-20 05:51:16 +00:00
|
|
|
struct td_sched *ts;
|
|
|
|
|
|
|
|
ts = td->td_sched;
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
|
|
|
KASSERT(ts->ts_runq != NULL,
|
2008-03-20 05:51:16 +00:00
|
|
|
("tdq_runq_remove: thread %p null ts_runq", td));
|
2006-12-06 06:34:57 +00:00
|
|
|
if (ts->ts_flags & TSF_XFERABLE) {
|
2006-12-29 10:37:07 +00:00
|
|
|
tdq->tdq_transferable--;
|
2006-12-06 06:34:57 +00:00
|
|
|
ts->ts_flags &= ~TSF_XFERABLE;
|
2003-12-11 03:57:10 +00:00
|
|
|
}
|
2007-01-04 12:16:19 +00:00
|
|
|
if (ts->ts_runq == &tdq->tdq_timeshare) {
|
|
|
|
if (tdq->tdq_idx != tdq->tdq_ridx)
|
2008-03-20 05:51:16 +00:00
|
|
|
runq_remove_idx(ts->ts_runq, td, &tdq->tdq_ridx);
|
2007-01-04 12:16:19 +00:00
|
|
|
else
|
2008-03-20 05:51:16 +00:00
|
|
|
runq_remove_idx(ts->ts_runq, td, NULL);
|
2007-01-04 12:16:19 +00:00
|
|
|
} else
|
2008-03-20 05:51:16 +00:00
|
|
|
runq_remove(ts->ts_runq, td);
|
2003-11-15 07:32:07 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Load is maintained for all threads RUNNING and ON_RUNQ. Add the load
|
|
|
|
* for this thread to the referenced thread queue.
|
|
|
|
*/
|
2003-04-11 03:47:14 +00:00
|
|
|
static void
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_load_add(struct tdq *tdq, struct thread *td)
|
2003-04-11 03:47:14 +00:00
|
|
|
{
|
2007-07-17 22:53:23 +00:00
|
|
|
|
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
2008-03-20 05:51:16 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2008-04-04 01:04:43 +00:00
|
|
|
|
2006-12-29 10:37:07 +00:00
|
|
|
tdq->tdq_load++;
|
2009-11-03 16:46:52 +00:00
|
|
|
if ((td->td_flags & TDF_NOLOAD) == 0)
|
2006-12-29 10:37:07 +00:00
|
|
|
tdq->tdq_sysload++;
|
2009-01-17 07:17:57 +00:00
|
|
|
KTR_COUNTER0(KTR_SCHED, "load", tdq->tdq_loadname, tdq->tdq_load);
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE2(sched, , , load_change, (int)TDQ_ID(tdq), tdq->tdq_load);
|
2003-02-03 05:30:07 +00:00
|
|
|
}
|
2003-04-11 03:47:14 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Remove the load from a thread that is transitioning to a sleep state or
|
|
|
|
* exiting.
|
|
|
|
*/
|
2003-04-03 00:29:28 +00:00
|
|
|
static void
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_load_rem(struct tdq *tdq, struct thread *td)
|
2003-02-03 05:30:07 +00:00
|
|
|
{
|
2007-07-17 22:53:23 +00:00
|
|
|
|
2008-03-20 05:51:16 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
|
|
|
KASSERT(tdq->tdq_load != 0,
|
2007-08-03 23:38:46 +00:00
|
|
|
("tdq_load_rem: Removing with 0 load on queue %d", TDQ_ID(tdq)));
|
2008-04-04 01:04:43 +00:00
|
|
|
|
2006-12-29 10:37:07 +00:00
|
|
|
tdq->tdq_load--;
|
2009-11-03 16:46:52 +00:00
|
|
|
if ((td->td_flags & TDF_NOLOAD) == 0)
|
2008-04-04 01:04:43 +00:00
|
|
|
tdq->tdq_sysload--;
|
2009-01-17 07:17:57 +00:00
|
|
|
KTR_COUNTER0(KTR_SCHED, "load", tdq->tdq_loadname, tdq->tdq_load);
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE2(sched, , , load_change, (int)TDQ_ID(tdq), tdq->tdq_load);
|
2003-02-03 05:30:07 +00:00
|
|
|
}
|
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
/*
|
|
|
|
* Set lowpri to its exact value by searching the run-queue and
|
|
|
|
* evaluating curthread. curthread may be passed as an optimization.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
tdq_setlowpri(struct tdq *tdq, struct thread *ctd)
|
|
|
|
{
|
|
|
|
struct thread *td;
|
|
|
|
|
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
|
|
|
if (ctd == NULL)
|
|
|
|
ctd = pcpu_find(TDQ_ID(tdq))->pc_curthread;
|
2008-03-20 05:51:16 +00:00
|
|
|
td = tdq_choose(tdq);
|
|
|
|
if (td == NULL || td->td_priority > ctd->td_priority)
|
2008-03-02 08:20:59 +00:00
|
|
|
tdq->tdq_lowpri = ctd->td_priority;
|
|
|
|
else
|
|
|
|
tdq->tdq_lowpri = td->td_priority;
|
|
|
|
}
|
|
|
|
|
2003-04-11 03:47:14 +00:00
|
|
|
#ifdef SMP
|
2008-03-02 08:20:59 +00:00
|
|
|
struct cpu_search {
|
2009-06-23 22:12:37 +00:00
|
|
|
cpuset_t cs_mask;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
u_int cs_prefer;
|
|
|
|
int cs_pri; /* Min priority for low. */
|
|
|
|
int cs_limit; /* Max load for low, min load for high. */
|
|
|
|
int cs_cpu;
|
|
|
|
int cs_load;
|
2008-03-02 08:20:59 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
#define CPU_SEARCH_LOWEST 0x1
|
|
|
|
#define CPU_SEARCH_HIGHEST 0x2
|
|
|
|
#define CPU_SEARCH_BOTH (CPU_SEARCH_LOWEST|CPU_SEARCH_HIGHEST)
|
|
|
|
|
2009-06-23 22:12:37 +00:00
|
|
|
#define CPUSET_FOREACH(cpu, mask) \
|
|
|
|
for ((cpu) = 0; (cpu) <= mp_maxid; (cpu)++) \
|
Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
|
|
|
if (CPU_ISSET(cpu, &mask))
|
2008-03-02 08:20:59 +00:00
|
|
|
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
static __inline int cpu_search(const struct cpu_group *cg, struct cpu_search *low,
|
2008-03-02 08:20:59 +00:00
|
|
|
struct cpu_search *high, const int match);
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
int cpu_search_lowest(const struct cpu_group *cg, struct cpu_search *low);
|
|
|
|
int cpu_search_highest(const struct cpu_group *cg, struct cpu_search *high);
|
|
|
|
int cpu_search_both(const struct cpu_group *cg, struct cpu_search *low,
|
2008-03-02 08:20:59 +00:00
|
|
|
struct cpu_search *high);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Search the tree of cpu_groups for the lowest or highest loaded cpu
|
|
|
|
* according to the match argument. This routine actually compares the
|
|
|
|
* load on all paths through the tree and finds the least loaded cpu on
|
|
|
|
* the least loaded path, which may differ from the least loaded cpu in
|
|
|
|
* the system. This balances work among caches and busses.
|
2003-06-09 00:39:09 +00:00
|
|
|
*
|
2008-03-02 08:20:59 +00:00
|
|
|
* This inline is instantiated in three forms below using constants for the
|
|
|
|
* match argument. It is reduced to the minimum set for each case. It is
|
|
|
|
* also recursive to the depth of the tree.
|
|
|
|
*/
|
2008-03-14 15:22:38 +00:00
|
|
|
static __inline int
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpu_search(const struct cpu_group *cg, struct cpu_search *low,
|
2008-03-02 08:20:59 +00:00
|
|
|
struct cpu_search *high, const int match)
|
|
|
|
{
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
struct cpu_search lgroup;
|
|
|
|
struct cpu_search hgroup;
|
|
|
|
cpuset_t cpumask;
|
|
|
|
struct cpu_group *child;
|
|
|
|
struct tdq *tdq;
|
2012-04-09 18:24:58 +00:00
|
|
|
int cpu, i, hload, lload, load, total, rnd, *rndptr;
|
2008-03-02 08:20:59 +00:00
|
|
|
|
|
|
|
total = 0;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpumask = cg->cg_mask;
|
|
|
|
if (match & CPU_SEARCH_LOWEST) {
|
|
|
|
lload = INT_MAX;
|
|
|
|
lgroup = *low;
|
|
|
|
}
|
|
|
|
if (match & CPU_SEARCH_HIGHEST) {
|
2012-04-09 18:24:58 +00:00
|
|
|
hload = INT_MIN;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
hgroup = *high;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Iterate through the child CPU groups and then remaining CPUs. */
|
2012-04-09 18:24:58 +00:00
|
|
|
for (i = cg->cg_children, cpu = mp_maxid; i >= 0; ) {
|
|
|
|
if (i == 0) {
|
|
|
|
while (cpu >= 0 && !CPU_ISSET(cpu, &cpumask))
|
|
|
|
cpu--;
|
|
|
|
if (cpu < 0)
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
break;
|
|
|
|
child = NULL;
|
|
|
|
} else
|
2012-04-09 18:24:58 +00:00
|
|
|
child = &cg->cg_child[i - 1];
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
|
2012-04-09 18:24:58 +00:00
|
|
|
if (match & CPU_SEARCH_LOWEST)
|
|
|
|
lgroup.cs_cpu = -1;
|
|
|
|
if (match & CPU_SEARCH_HIGHEST)
|
|
|
|
hgroup.cs_cpu = -1;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
if (child) { /* Handle child CPU group. */
|
|
|
|
CPU_NAND(&cpumask, &child->cg_mask);
|
2008-03-02 08:20:59 +00:00
|
|
|
switch (match) {
|
|
|
|
case CPU_SEARCH_LOWEST:
|
|
|
|
load = cpu_search_lowest(child, &lgroup);
|
|
|
|
break;
|
|
|
|
case CPU_SEARCH_HIGHEST:
|
|
|
|
load = cpu_search_highest(child, &hgroup);
|
|
|
|
break;
|
|
|
|
case CPU_SEARCH_BOTH:
|
|
|
|
load = cpu_search_both(child, &lgroup, &hgroup);
|
|
|
|
break;
|
|
|
|
}
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
} else { /* Handle child CPU. */
|
|
|
|
tdq = TDQ_CPU(cpu);
|
|
|
|
load = tdq->tdq_load * 256;
|
2012-04-09 18:24:58 +00:00
|
|
|
rndptr = DPCPU_PTR(randomval);
|
|
|
|
rnd = (*rndptr = *rndptr * 69069 + 5) >> 26;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
if (match & CPU_SEARCH_LOWEST) {
|
|
|
|
if (cpu == low->cs_prefer)
|
|
|
|
load -= 64;
|
|
|
|
/* If that CPU is allowed and get data. */
|
2012-04-09 18:24:58 +00:00
|
|
|
if (tdq->tdq_lowpri > lgroup.cs_pri &&
|
|
|
|
tdq->tdq_load <= lgroup.cs_limit &&
|
|
|
|
CPU_ISSET(cpu, &lgroup.cs_mask)) {
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
lgroup.cs_cpu = cpu;
|
|
|
|
lgroup.cs_load = load - rnd;
|
2008-03-02 08:20:59 +00:00
|
|
|
}
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
}
|
|
|
|
if (match & CPU_SEARCH_HIGHEST)
|
2012-04-09 18:24:58 +00:00
|
|
|
if (tdq->tdq_load >= hgroup.cs_limit &&
|
|
|
|
tdq->tdq_transferable &&
|
|
|
|
CPU_ISSET(cpu, &hgroup.cs_mask)) {
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
hgroup.cs_cpu = cpu;
|
|
|
|
hgroup.cs_load = load - rnd;
|
2008-03-02 08:20:59 +00:00
|
|
|
}
|
|
|
|
}
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
total += load;
|
|
|
|
|
|
|
|
/* We have info about child item. Compare it. */
|
|
|
|
if (match & CPU_SEARCH_LOWEST) {
|
2012-04-09 18:24:58 +00:00
|
|
|
if (lgroup.cs_cpu >= 0 &&
|
2012-03-03 11:50:48 +00:00
|
|
|
(load < lload ||
|
|
|
|
(load == lload && lgroup.cs_load < low->cs_load))) {
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
lload = load;
|
|
|
|
low->cs_cpu = lgroup.cs_cpu;
|
|
|
|
low->cs_load = lgroup.cs_load;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (match & CPU_SEARCH_HIGHEST)
|
2012-04-09 18:24:58 +00:00
|
|
|
if (hgroup.cs_cpu >= 0 &&
|
2012-03-03 11:50:48 +00:00
|
|
|
(load > hload ||
|
|
|
|
(load == hload && hgroup.cs_load > high->cs_load))) {
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
hload = load;
|
|
|
|
high->cs_cpu = hgroup.cs_cpu;
|
|
|
|
high->cs_load = hgroup.cs_load;
|
|
|
|
}
|
2012-04-09 18:24:58 +00:00
|
|
|
if (child) {
|
|
|
|
i--;
|
|
|
|
if (i == 0 && CPU_EMPTY(&cpumask))
|
|
|
|
break;
|
|
|
|
} else
|
|
|
|
cpu--;
|
2008-03-02 08:20:59 +00:00
|
|
|
}
|
|
|
|
return (total);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cpu_search instantiations must pass constants to maintain the inline
|
|
|
|
* optimization.
|
2003-06-09 00:39:09 +00:00
|
|
|
*/
|
2008-03-02 08:20:59 +00:00
|
|
|
int
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpu_search_lowest(const struct cpu_group *cg, struct cpu_search *low)
|
2008-03-02 08:20:59 +00:00
|
|
|
{
|
|
|
|
return cpu_search(cg, low, NULL, CPU_SEARCH_LOWEST);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpu_search_highest(const struct cpu_group *cg, struct cpu_search *high)
|
2008-03-02 08:20:59 +00:00
|
|
|
{
|
|
|
|
return cpu_search(cg, NULL, high, CPU_SEARCH_HIGHEST);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpu_search_both(const struct cpu_group *cg, struct cpu_search *low,
|
2008-03-02 08:20:59 +00:00
|
|
|
struct cpu_search *high)
|
|
|
|
{
|
|
|
|
return cpu_search(cg, low, high, CPU_SEARCH_BOTH);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the cpu with the least load via the least loaded path that has a
|
|
|
|
* lowpri greater than pri pri. A pri of -1 indicates any priority is
|
|
|
|
* acceptable.
|
|
|
|
*/
|
|
|
|
static inline int
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
sched_lowest(const struct cpu_group *cg, cpuset_t mask, int pri, int maxload,
|
|
|
|
int prefer)
|
2008-03-02 08:20:59 +00:00
|
|
|
{
|
|
|
|
struct cpu_search low;
|
|
|
|
|
|
|
|
low.cs_cpu = -1;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
low.cs_prefer = prefer;
|
2008-03-02 08:20:59 +00:00
|
|
|
low.cs_mask = mask;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
low.cs_pri = pri;
|
|
|
|
low.cs_limit = maxload;
|
2008-03-02 08:20:59 +00:00
|
|
|
cpu_search_lowest(cg, &low);
|
|
|
|
return low.cs_cpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the cpu with the highest load via the highest loaded path.
|
|
|
|
*/
|
|
|
|
static inline int
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
sched_highest(const struct cpu_group *cg, cpuset_t mask, int minload)
|
2008-03-02 08:20:59 +00:00
|
|
|
{
|
|
|
|
struct cpu_search high;
|
|
|
|
|
|
|
|
high.cs_cpu = -1;
|
|
|
|
high.cs_mask = mask;
|
|
|
|
high.cs_limit = minload;
|
|
|
|
cpu_search_highest(cg, &high);
|
|
|
|
return high.cs_cpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Simultaneously find the highest and lowest loaded cpu reachable via
|
|
|
|
* cg.
|
|
|
|
*/
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
static inline void
|
|
|
|
sched_both(const struct cpu_group *cg, cpuset_t mask, int *lowcpu, int *highcpu)
|
2008-03-02 08:20:59 +00:00
|
|
|
{
|
|
|
|
struct cpu_search high;
|
|
|
|
struct cpu_search low;
|
|
|
|
|
|
|
|
low.cs_cpu = -1;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
low.cs_prefer = -1;
|
|
|
|
low.cs_pri = -1;
|
|
|
|
low.cs_limit = INT_MAX;
|
2008-03-02 08:20:59 +00:00
|
|
|
low.cs_mask = mask;
|
|
|
|
high.cs_cpu = -1;
|
|
|
|
high.cs_limit = -1;
|
|
|
|
high.cs_mask = mask;
|
|
|
|
cpu_search_both(cg, &low, &high);
|
|
|
|
*lowcpu = low.cs_cpu;
|
|
|
|
*highcpu = high.cs_cpu;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
static void
|
2008-03-02 08:20:59 +00:00
|
|
|
sched_balance_group(struct cpu_group *cg)
|
2003-06-09 00:39:09 +00:00
|
|
|
{
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpuset_t hmask, lmask;
|
|
|
|
int high, low, anylow;
|
2003-06-09 00:39:09 +00:00
|
|
|
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
CPU_FILL(&hmask);
|
2008-03-02 08:20:59 +00:00
|
|
|
for (;;) {
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
high = sched_highest(cg, hmask, 1);
|
|
|
|
/* Stop if there is no more CPU with transferrable threads. */
|
|
|
|
if (high == -1)
|
2008-03-02 08:20:59 +00:00
|
|
|
break;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
CPU_CLR(high, &hmask);
|
|
|
|
CPU_COPY(&hmask, &lmask);
|
|
|
|
/* Stop if there is no more CPU left for low. */
|
|
|
|
if (CPU_EMPTY(&lmask))
|
2008-03-02 08:20:59 +00:00
|
|
|
break;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
anylow = 1;
|
|
|
|
nextlow:
|
|
|
|
low = sched_lowest(cg, lmask, -1,
|
|
|
|
TDQ_CPU(high)->tdq_load - 1, high);
|
|
|
|
/* Stop if we looked well and found no less loaded CPU. */
|
|
|
|
if (anylow && low == -1)
|
|
|
|
break;
|
|
|
|
/* Go to next high if we found no less loaded CPU. */
|
|
|
|
if (low == -1)
|
|
|
|
continue;
|
|
|
|
/* Transfer thread from high to low. */
|
|
|
|
if (sched_balance_pair(TDQ_CPU(high), TDQ_CPU(low))) {
|
|
|
|
/* CPU that got thread can no longer be a donor. */
|
|
|
|
CPU_CLR(low, &hmask);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If failed, then there is no threads on high
|
|
|
|
* that can run on this low. Drop low from low
|
|
|
|
* mask and look for different one.
|
|
|
|
*/
|
|
|
|
CPU_CLR(low, &lmask);
|
|
|
|
anylow = 0;
|
|
|
|
goto nextlow;
|
|
|
|
}
|
2003-06-09 00:39:09 +00:00
|
|
|
}
|
2003-12-12 07:33:51 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2009-12-28 23:12:12 +00:00
|
|
|
sched_balance(void)
|
2003-12-12 07:33:51 +00:00
|
|
|
{
|
2007-10-02 00:36:06 +00:00
|
|
|
struct tdq *tdq;
|
2003-12-12 07:33:51 +00:00
|
|
|
|
2007-10-02 00:36:06 +00:00
|
|
|
/*
|
|
|
|
* Select a random time between .5 * balance_interval and
|
|
|
|
* 1.5 * balance_interval.
|
|
|
|
*/
|
2008-03-02 08:20:59 +00:00
|
|
|
balance_ticks = max(balance_interval / 2, 1);
|
|
|
|
balance_ticks += random() % balance_interval;
|
2007-07-17 22:53:23 +00:00
|
|
|
if (smp_started == 0 || rebalance == 0)
|
|
|
|
return;
|
2007-10-02 00:36:06 +00:00
|
|
|
tdq = TDQ_SELF();
|
|
|
|
TDQ_UNLOCK(tdq);
|
2008-03-02 08:20:59 +00:00
|
|
|
sched_balance_group(cpu_top);
|
2007-10-02 00:36:06 +00:00
|
|
|
TDQ_LOCK(tdq);
|
2003-12-12 07:33:51 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Lock two thread queues using their address to maintain lock order.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
tdq_lock_pair(struct tdq *one, struct tdq *two)
|
|
|
|
{
|
|
|
|
if (one < two) {
|
|
|
|
TDQ_LOCK(one);
|
|
|
|
TDQ_LOCK_FLAGS(two, MTX_DUPOK);
|
|
|
|
} else {
|
|
|
|
TDQ_LOCK(two);
|
|
|
|
TDQ_LOCK_FLAGS(one, MTX_DUPOK);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-10-02 00:36:06 +00:00
|
|
|
/*
|
|
|
|
* Unlock two thread queues. Order is not important here.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
tdq_unlock_pair(struct tdq *one, struct tdq *two)
|
|
|
|
{
|
|
|
|
TDQ_UNLOCK(one);
|
|
|
|
TDQ_UNLOCK(two);
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Transfer load between two imbalanced thread queues.
|
|
|
|
*/
|
2008-03-02 08:20:59 +00:00
|
|
|
static int
|
2006-12-06 06:34:57 +00:00
|
|
|
sched_balance_pair(struct tdq *high, struct tdq *low)
|
2003-12-12 07:33:51 +00:00
|
|
|
{
|
2008-03-02 08:20:59 +00:00
|
|
|
int moved;
|
2011-10-06 11:48:13 +00:00
|
|
|
int cpu;
|
2003-12-12 07:33:51 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
tdq_lock_pair(high, low);
|
2008-03-02 08:20:59 +00:00
|
|
|
moved = 0;
|
2003-11-15 07:32:07 +00:00
|
|
|
/*
|
|
|
|
* Determine what the imbalance is and then adjust that to how many
|
2006-12-29 10:37:07 +00:00
|
|
|
* threads we actually have to give up (transferable).
|
2003-11-15 07:32:07 +00:00
|
|
|
*/
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
if (high->tdq_transferable != 0 && high->tdq_load > low->tdq_load &&
|
|
|
|
(moved = tdq_move(high, low)) > 0) {
|
2007-09-22 02:20:14 +00:00
|
|
|
/*
|
2011-10-06 11:48:13 +00:00
|
|
|
* In case the target isn't the current cpu IPI it to force a
|
|
|
|
* reschedule with the new workload.
|
2007-09-22 02:20:14 +00:00
|
|
|
*/
|
2011-10-06 11:48:13 +00:00
|
|
|
cpu = TDQ_ID(low);
|
|
|
|
sched_pin();
|
|
|
|
if (cpu != PCPU_GET(cpuid))
|
|
|
|
ipi_cpu(cpu, IPI_PREEMPT);
|
|
|
|
sched_unpin();
|
2007-07-17 22:53:23 +00:00
|
|
|
}
|
2007-10-02 00:36:06 +00:00
|
|
|
tdq_unlock_pair(high, low);
|
2008-03-02 08:20:59 +00:00
|
|
|
return (moved);
|
2003-06-09 00:39:09 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Move a thread from one thread queue to another.
|
|
|
|
*/
|
2008-03-02 08:20:59 +00:00
|
|
|
static int
|
2007-07-17 22:53:23 +00:00
|
|
|
tdq_move(struct tdq *from, struct tdq *to)
|
2003-06-09 00:39:09 +00:00
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct td_sched *ts;
|
2007-07-17 22:53:23 +00:00
|
|
|
struct thread *td;
|
|
|
|
struct tdq *tdq;
|
|
|
|
int cpu;
|
2006-12-06 06:34:57 +00:00
|
|
|
|
2007-10-02 00:36:06 +00:00
|
|
|
TDQ_LOCK_ASSERT(from, MA_OWNED);
|
|
|
|
TDQ_LOCK_ASSERT(to, MA_OWNED);
|
|
|
|
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq = from;
|
2007-07-17 22:53:23 +00:00
|
|
|
cpu = TDQ_ID(to);
|
2008-03-20 05:51:16 +00:00
|
|
|
td = tdq_steal(tdq, cpu);
|
|
|
|
if (td == NULL)
|
2008-03-02 08:20:59 +00:00
|
|
|
return (0);
|
2008-03-20 05:51:16 +00:00
|
|
|
ts = td->td_sched;
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Although the run queue is locked the thread may be blocked. Lock
|
2007-10-02 00:36:06 +00:00
|
|
|
* it to clear this and acquire the run-queue lock.
|
2007-07-17 22:53:23 +00:00
|
|
|
*/
|
|
|
|
thread_lock(td);
|
2007-10-02 00:36:06 +00:00
|
|
|
/* Drop recursive lock on from acquired via thread_lock(). */
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_UNLOCK(from);
|
|
|
|
sched_rem(td);
|
2007-01-19 21:56:08 +00:00
|
|
|
ts->ts_cpu = cpu;
|
2007-07-17 22:53:23 +00:00
|
|
|
td->td_lock = TDQ_LOCKPTR(to);
|
|
|
|
tdq_add(to, td, SRQ_YIELDING);
|
2008-03-02 08:20:59 +00:00
|
|
|
return (1);
|
2003-06-09 00:39:09 +00:00
|
|
|
}
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* This tdq has idled. Try to steal a thread from another cpu and switch
|
|
|
|
* to it.
|
|
|
|
*/
|
2003-12-11 03:57:10 +00:00
|
|
|
static int
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq_idled(struct tdq *tdq)
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
{
|
2008-03-02 08:20:59 +00:00
|
|
|
struct cpu_group *cg;
|
2006-12-06 06:34:57 +00:00
|
|
|
struct tdq *steal;
|
2009-06-23 22:12:37 +00:00
|
|
|
cpuset_t mask;
|
2008-03-02 08:20:59 +00:00
|
|
|
int thresh;
|
2007-07-17 22:53:23 +00:00
|
|
|
int cpu;
|
2003-12-11 03:57:10 +00:00
|
|
|
|
2007-10-08 23:50:39 +00:00
|
|
|
if (smp_started == 0 || steal_idle == 0)
|
|
|
|
return (1);
|
2009-06-23 22:12:37 +00:00
|
|
|
CPU_FILL(&mask);
|
|
|
|
CPU_CLR(PCPU_GET(cpuid), &mask);
|
2008-03-02 08:20:59 +00:00
|
|
|
/* We don't want to be preempted while we're iterating. */
|
2007-07-17 22:53:23 +00:00
|
|
|
spinlock_enter();
|
2008-03-02 08:20:59 +00:00
|
|
|
for (cg = tdq->tdq_cg; cg != NULL; ) {
|
2009-04-29 03:15:43 +00:00
|
|
|
if ((cg->cg_flags & CG_FLAG_THREAD) == 0)
|
2008-03-02 08:20:59 +00:00
|
|
|
thresh = steal_thresh;
|
|
|
|
else
|
|
|
|
thresh = 1;
|
|
|
|
cpu = sched_highest(cg, mask, thresh);
|
|
|
|
if (cpu == -1) {
|
|
|
|
cg = cg->cg_parent;
|
|
|
|
continue;
|
2003-12-11 03:57:10 +00:00
|
|
|
}
|
2008-03-02 08:20:59 +00:00
|
|
|
steal = TDQ_CPU(cpu);
|
2009-06-23 22:12:37 +00:00
|
|
|
CPU_CLR(cpu, &mask);
|
2007-10-02 00:36:06 +00:00
|
|
|
tdq_lock_pair(tdq, steal);
|
2008-03-02 08:20:59 +00:00
|
|
|
if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
|
|
|
|
tdq_unlock_pair(tdq, steal);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If a thread was added while interrupts were disabled don't
|
|
|
|
* steal one here. If we fail to acquire one due to affinity
|
|
|
|
* restrictions loop again with this cpu removed from the
|
|
|
|
* set.
|
|
|
|
*/
|
|
|
|
if (tdq->tdq_load == 0 && tdq_move(steal, tdq) == 0) {
|
|
|
|
tdq_unlock_pair(tdq, steal);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
spinlock_exit();
|
|
|
|
TDQ_UNLOCK(steal);
|
2008-04-17 04:20:10 +00:00
|
|
|
mi_switch(SW_VOL | SWT_IDLE, NULL);
|
2008-03-02 08:20:59 +00:00
|
|
|
thread_unlock(curthread);
|
|
|
|
|
|
|
|
return (0);
|
2003-12-11 03:57:10 +00:00
|
|
|
}
|
2007-07-17 22:53:23 +00:00
|
|
|
spinlock_exit();
|
2003-12-11 03:57:10 +00:00
|
|
|
return (1);
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Notify a remote cpu of new work. Sends an IPI if criteria are met.
|
|
|
|
*/
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
static void
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_notify(struct tdq *tdq, struct thread *td)
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
{
|
2008-11-18 05:41:34 +00:00
|
|
|
struct thread *ctd;
|
2007-01-25 23:51:59 +00:00
|
|
|
int pri;
|
2007-01-19 21:56:08 +00:00
|
|
|
int cpu;
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
|
2008-03-10 01:32:01 +00:00
|
|
|
if (tdq->tdq_ipipending)
|
|
|
|
return;
|
2008-03-20 05:51:16 +00:00
|
|
|
cpu = td->td_sched->ts_cpu;
|
|
|
|
pri = td->td_priority;
|
2008-11-18 05:41:34 +00:00
|
|
|
ctd = pcpu_find(cpu)->pc_curthread;
|
|
|
|
if (!sched_shouldpreempt(pri, ctd->td_priority, 1))
|
2007-01-05 23:45:38 +00:00
|
|
|
return;
|
2008-11-18 05:41:34 +00:00
|
|
|
if (TD_IS_IDLETHREAD(ctd)) {
|
2008-04-25 05:18:50 +00:00
|
|
|
/*
|
|
|
|
* If the MD code has an idle wakeup routine try that before
|
|
|
|
* falling back to IPI.
|
|
|
|
*/
|
2010-09-10 13:24:47 +00:00
|
|
|
if (!tdq->tdq_cpu_idle || cpu_idle_wakeup(cpu))
|
2008-04-25 05:18:50 +00:00
|
|
|
return;
|
2008-04-17 09:56:01 +00:00
|
|
|
}
|
2008-03-10 01:32:01 +00:00
|
|
|
tdq->tdq_ipipending = 1;
|
2010-08-06 15:36:59 +00:00
|
|
|
ipi_cpu(cpu, IPI_PREEMPT);
|
2007-07-17 22:53:23 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Steals load from a timeshare queue. Honors the rotating queue head
|
|
|
|
* index.
|
|
|
|
*/
|
2008-03-20 05:51:16 +00:00
|
|
|
static struct thread *
|
2008-03-02 08:20:59 +00:00
|
|
|
runq_steal_from(struct runq *rq, int cpu, u_char start)
|
2007-07-17 22:53:23 +00:00
|
|
|
{
|
|
|
|
struct rqbits *rqb;
|
|
|
|
struct rqhead *rqh;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
struct thread *td, *first;
|
2007-07-17 22:53:23 +00:00
|
|
|
int bit;
|
|
|
|
int pri;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
rqb = &rq->rq_status;
|
|
|
|
bit = start & (RQB_BPW -1);
|
|
|
|
pri = 0;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
first = NULL;
|
2007-07-17 22:53:23 +00:00
|
|
|
again:
|
|
|
|
for (i = RQB_WORD(start); i < RQB_LEN; bit = 0, i++) {
|
|
|
|
if (rqb->rqb_bits[i] == 0)
|
|
|
|
continue;
|
|
|
|
if (bit != 0) {
|
|
|
|
for (pri = bit; pri < RQB_BPW; pri++)
|
|
|
|
if (rqb->rqb_bits[i] & (1ul << pri))
|
|
|
|
break;
|
|
|
|
if (pri >= RQB_BPW)
|
|
|
|
continue;
|
|
|
|
} else
|
|
|
|
pri = RQB_FFS(rqb->rqb_bits[i]);
|
|
|
|
pri += (i << RQB_L2BPW);
|
|
|
|
rqh = &rq->rq_queues[pri];
|
2008-03-20 05:51:16 +00:00
|
|
|
TAILQ_FOREACH(td, rqh, td_runq) {
|
|
|
|
if (first && THREAD_CAN_MIGRATE(td) &&
|
|
|
|
THREAD_CAN_SCHED(td, cpu))
|
|
|
|
return (td);
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
first = td;
|
2007-07-17 22:53:23 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
if (start != 0) {
|
|
|
|
start = 0;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
if (first && THREAD_CAN_MIGRATE(first) &&
|
|
|
|
THREAD_CAN_SCHED(first, cpu))
|
|
|
|
return (first);
|
2007-07-17 22:53:23 +00:00
|
|
|
return (NULL);
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Steals load from a standard linear queue.
|
|
|
|
*/
|
2008-03-20 05:51:16 +00:00
|
|
|
static struct thread *
|
2008-03-02 08:20:59 +00:00
|
|
|
runq_steal(struct runq *rq, int cpu)
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
{
|
|
|
|
struct rqhead *rqh;
|
|
|
|
struct rqbits *rqb;
|
2008-03-20 05:51:16 +00:00
|
|
|
struct thread *td;
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
int word;
|
|
|
|
int bit;
|
|
|
|
|
|
|
|
rqb = &rq->rq_status;
|
|
|
|
for (word = 0; word < RQB_LEN; word++) {
|
|
|
|
if (rqb->rqb_bits[word] == 0)
|
|
|
|
continue;
|
|
|
|
for (bit = 0; bit < RQB_BPW; bit++) {
|
2003-12-07 09:57:51 +00:00
|
|
|
if ((rqb->rqb_bits[word] & (1ul << bit)) == 0)
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
continue;
|
|
|
|
rqh = &rq->rq_queues[bit + (word << RQB_L2BPW)];
|
2008-03-20 05:51:16 +00:00
|
|
|
TAILQ_FOREACH(td, rqh, td_runq)
|
|
|
|
if (THREAD_CAN_MIGRATE(td) &&
|
|
|
|
THREAD_CAN_SCHED(td, cpu))
|
|
|
|
return (td);
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Attempt to steal a thread in priority order from a thread queue.
|
|
|
|
*/
|
2008-03-20 05:51:16 +00:00
|
|
|
static struct thread *
|
2008-03-02 08:20:59 +00:00
|
|
|
tdq_steal(struct tdq *tdq, int cpu)
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
{
|
2008-03-20 05:51:16 +00:00
|
|
|
struct thread *td;
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
2008-03-20 05:51:16 +00:00
|
|
|
if ((td = runq_steal(&tdq->tdq_realtime, cpu)) != NULL)
|
|
|
|
return (td);
|
|
|
|
if ((td = runq_steal_from(&tdq->tdq_timeshare,
|
|
|
|
cpu, tdq->tdq_ridx)) != NULL)
|
|
|
|
return (td);
|
2008-03-02 08:20:59 +00:00
|
|
|
return (runq_steal(&tdq->tdq_idle, cpu));
|
2003-12-11 03:57:10 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Sets the thread lock and ts_cpu to match the requested cpu. Unlocks the
|
2007-10-02 00:36:06 +00:00
|
|
|
* current lock and returns with the assigned queue locked.
|
2007-07-17 22:53:23 +00:00
|
|
|
*/
|
|
|
|
static inline struct tdq *
|
2008-03-20 05:51:16 +00:00
|
|
|
sched_setcpu(struct thread *td, int cpu, int flags)
|
2003-12-11 03:57:10 +00:00
|
|
|
{
|
|
|
|
|
2008-03-20 05:51:16 +00:00
|
|
|
struct tdq *tdq;
|
2007-07-17 22:53:23 +00:00
|
|
|
|
2008-03-20 05:51:16 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2007-07-17 22:53:23 +00:00
|
|
|
tdq = TDQ_CPU(cpu);
|
2008-03-20 05:51:16 +00:00
|
|
|
td->td_sched->ts_cpu = cpu;
|
|
|
|
/*
|
|
|
|
* If the lock matches just return the queue.
|
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
if (td->td_lock == TDQ_LOCKPTR(tdq))
|
|
|
|
return (tdq);
|
|
|
|
#ifdef notyet
|
2007-01-19 21:56:08 +00:00
|
|
|
/*
|
2007-09-22 02:20:14 +00:00
|
|
|
* If the thread isn't running its lockptr is a
|
2007-07-17 22:53:23 +00:00
|
|
|
* turnstile or a sleepqueue. We can just lock_set without
|
|
|
|
* blocking.
|
2007-01-19 21:56:08 +00:00
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
if (TD_CAN_RUN(td)) {
|
|
|
|
TDQ_LOCK(tdq);
|
|
|
|
thread_lock_set(td, TDQ_LOCKPTR(tdq));
|
|
|
|
return (tdq);
|
|
|
|
}
|
|
|
|
#endif
|
2007-01-19 21:56:08 +00:00
|
|
|
/*
|
2007-07-17 22:53:23 +00:00
|
|
|
* The hard case, migration, we need to block the thread first to
|
|
|
|
* prevent order reversals with other cpus locks.
|
2007-01-19 21:56:08 +00:00
|
|
|
*/
|
2010-01-23 15:54:21 +00:00
|
|
|
spinlock_enter();
|
2007-07-17 22:53:23 +00:00
|
|
|
thread_lock_block(td);
|
|
|
|
TDQ_LOCK(tdq);
|
2007-08-03 23:38:46 +00:00
|
|
|
thread_lock_unblock(td, TDQ_LOCKPTR(tdq));
|
2010-01-23 15:54:21 +00:00
|
|
|
spinlock_exit();
|
2007-07-17 22:53:23 +00:00
|
|
|
return (tdq);
|
2007-01-19 21:56:08 +00:00
|
|
|
}
|
|
|
|
|
2008-04-17 04:20:10 +00:00
|
|
|
SCHED_STAT_DEFINE(pickcpu_intrbind, "Soft interrupt binding");
|
|
|
|
SCHED_STAT_DEFINE(pickcpu_idle_affinity, "Picked idle cpu based on affinity");
|
|
|
|
SCHED_STAT_DEFINE(pickcpu_affinity, "Picked cpu based on affinity");
|
|
|
|
SCHED_STAT_DEFINE(pickcpu_lowest, "Selected lowest load");
|
|
|
|
SCHED_STAT_DEFINE(pickcpu_local, "Migrated to current cpu");
|
|
|
|
SCHED_STAT_DEFINE(pickcpu_migration, "Selection may have caused migration");
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
static int
|
2008-03-20 05:51:16 +00:00
|
|
|
sched_pickcpu(struct thread *td, int flags)
|
2007-07-17 22:53:23 +00:00
|
|
|
{
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
struct cpu_group *cg, *ccg;
|
2008-03-20 05:51:16 +00:00
|
|
|
struct td_sched *ts;
|
2007-07-17 22:53:23 +00:00
|
|
|
struct tdq *tdq;
|
2009-06-23 22:12:37 +00:00
|
|
|
cpuset_t mask;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
int cpu, pri, self;
|
2007-01-19 21:56:08 +00:00
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
self = PCPU_GET(cpuid);
|
2008-03-20 05:51:16 +00:00
|
|
|
ts = td->td_sched;
|
2007-01-19 21:56:08 +00:00
|
|
|
if (smp_started == 0)
|
|
|
|
return (self);
|
2007-07-19 20:03:15 +00:00
|
|
|
/*
|
|
|
|
* Don't migrate a running thread from sched_switch().
|
|
|
|
*/
|
2008-03-02 08:20:59 +00:00
|
|
|
if ((flags & SRQ_OURSELF) || !THREAD_CAN_MIGRATE(td))
|
|
|
|
return (ts->ts_cpu);
|
2007-01-19 21:56:08 +00:00
|
|
|
/*
|
2008-03-02 08:20:59 +00:00
|
|
|
* Prefer to run interrupt threads on the processors that generate
|
|
|
|
* the interrupt.
|
2007-01-19 21:56:08 +00:00
|
|
|
*/
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
pri = td->td_priority;
|
2008-03-02 08:20:59 +00:00
|
|
|
if (td->td_priority <= PRI_MAX_ITHD && THREAD_CAN_SCHED(td, self) &&
|
2008-04-17 04:20:10 +00:00
|
|
|
curthread->td_intr_nesting_level && ts->ts_cpu != self) {
|
|
|
|
SCHED_STAT_INC(pickcpu_intrbind);
|
2008-03-02 08:20:59 +00:00
|
|
|
ts->ts_cpu = self;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
if (TDQ_CPU(self)->tdq_lowpri > pri) {
|
|
|
|
SCHED_STAT_INC(pickcpu_affinity);
|
|
|
|
return (ts->ts_cpu);
|
|
|
|
}
|
2008-04-17 04:20:10 +00:00
|
|
|
}
|
2007-01-19 21:56:08 +00:00
|
|
|
/*
|
2008-03-02 08:20:59 +00:00
|
|
|
* If the thread can run on the last cpu and the affinity has not
|
|
|
|
* expired or it is idle run it there.
|
2007-01-19 21:56:08 +00:00
|
|
|
*/
|
2008-03-02 08:20:59 +00:00
|
|
|
tdq = TDQ_CPU(ts->ts_cpu);
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cg = tdq->tdq_cg;
|
|
|
|
if (THREAD_CAN_SCHED(td, ts->ts_cpu) &&
|
|
|
|
tdq->tdq_lowpri >= PRI_MIN_IDLE &&
|
|
|
|
SCHED_AFFINITY(ts, CG_SHARE_L2)) {
|
|
|
|
if (cg->cg_flags & CG_FLAG_THREAD) {
|
|
|
|
CPUSET_FOREACH(cpu, cg->cg_mask) {
|
|
|
|
if (TDQ_CPU(cpu)->tdq_lowpri < PRI_MIN_IDLE)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
cpu = INT_MAX;
|
|
|
|
if (cpu > mp_maxid) {
|
2008-04-17 04:20:10 +00:00
|
|
|
SCHED_STAT_INC(pickcpu_idle_affinity);
|
2008-03-02 08:20:59 +00:00
|
|
|
return (ts->ts_cpu);
|
2008-04-17 04:20:10 +00:00
|
|
|
}
|
2004-12-26 22:56:08 +00:00
|
|
|
}
|
2003-12-11 03:57:10 +00:00
|
|
|
/*
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
* Search for the last level cache CPU group in the tree.
|
|
|
|
* Skip caches with expired affinity time and SMT groups.
|
|
|
|
* Affinity to higher level caches will be handled less aggressively.
|
2003-12-11 03:57:10 +00:00
|
|
|
*/
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
for (ccg = NULL; cg != NULL; cg = cg->cg_parent) {
|
|
|
|
if (cg->cg_flags & CG_FLAG_THREAD)
|
|
|
|
continue;
|
|
|
|
if (!SCHED_AFFINITY(ts, cg->cg_level))
|
|
|
|
continue;
|
|
|
|
ccg = cg;
|
|
|
|
}
|
|
|
|
if (ccg != NULL)
|
|
|
|
cg = ccg;
|
2008-03-02 08:20:59 +00:00
|
|
|
cpu = -1;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
/* Search the group for the less loaded idle CPU we can run now. */
|
2009-06-23 22:12:37 +00:00
|
|
|
mask = td->td_cpuset->cs_mask;
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
if (cg != NULL && cg != cpu_top &&
|
|
|
|
CPU_CMP(&cg->cg_mask, &cpu_top->cg_mask) != 0)
|
|
|
|
cpu = sched_lowest(cg, mask, max(pri, PRI_MAX_TIMESHARE),
|
|
|
|
INT_MAX, ts->ts_cpu);
|
|
|
|
/* Search globally for the less loaded CPU we can run now. */
|
|
|
|
if (cpu == -1)
|
|
|
|
cpu = sched_lowest(cpu_top, mask, pri, INT_MAX, ts->ts_cpu);
|
|
|
|
/* Search globally for the less loaded CPU. */
|
2008-03-02 08:20:59 +00:00
|
|
|
if (cpu == -1)
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
cpu = sched_lowest(cpu_top, mask, -1, INT_MAX, ts->ts_cpu);
|
2012-03-03 11:50:48 +00:00
|
|
|
KASSERT(cpu != -1, ("sched_pickcpu: Failed to find a cpu."));
|
2007-07-19 20:03:15 +00:00
|
|
|
/*
|
2008-03-02 08:20:59 +00:00
|
|
|
* Compare the lowest loaded cpu to current cpu.
|
2007-07-19 20:03:15 +00:00
|
|
|
*/
|
2008-03-10 01:32:01 +00:00
|
|
|
if (THREAD_CAN_SCHED(td, self) && TDQ_CPU(self)->tdq_lowpri > pri &&
|
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
2012-02-27 10:31:54 +00:00
|
|
|
TDQ_CPU(cpu)->tdq_lowpri < PRI_MIN_IDLE &&
|
|
|
|
TDQ_CPU(self)->tdq_load <= TDQ_CPU(cpu)->tdq_load + 1) {
|
2008-04-17 04:20:10 +00:00
|
|
|
SCHED_STAT_INC(pickcpu_local);
|
2008-03-10 01:32:01 +00:00
|
|
|
cpu = self;
|
2008-04-17 04:20:10 +00:00
|
|
|
} else
|
|
|
|
SCHED_STAT_INC(pickcpu_lowest);
|
|
|
|
if (cpu != ts->ts_cpu)
|
|
|
|
SCHED_STAT_INC(pickcpu_migration);
|
2007-07-17 22:53:23 +00:00
|
|
|
return (cpu);
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
}
|
2008-03-02 08:20:59 +00:00
|
|
|
#endif
|
2003-02-03 05:30:07 +00:00
|
|
|
|
2003-07-08 06:19:40 +00:00
|
|
|
/*
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
* Pick the highest priority task we have and return it.
|
2003-07-08 06:19:40 +00:00
|
|
|
*/
|
2008-03-20 05:51:16 +00:00
|
|
|
static struct thread *
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq_choose(struct tdq *tdq)
|
2003-02-03 05:30:07 +00:00
|
|
|
{
|
2008-03-20 05:51:16 +00:00
|
|
|
struct thread *td;
|
2003-02-03 05:30:07 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
2008-03-20 05:51:16 +00:00
|
|
|
td = runq_choose(&tdq->tdq_realtime);
|
|
|
|
if (td != NULL)
|
|
|
|
return (td);
|
|
|
|
td = runq_choose_from(&tdq->tdq_timeshare, tdq->tdq_ridx);
|
|
|
|
if (td != NULL) {
|
2011-01-13 14:22:27 +00:00
|
|
|
KASSERT(td->td_priority >= PRI_MIN_BATCH,
|
2007-01-04 08:56:25 +00:00
|
|
|
("tdq_choose: Invalid priority on timeshare queue %d",
|
2008-03-20 05:51:16 +00:00
|
|
|
td->td_priority));
|
|
|
|
return (td);
|
2007-01-04 08:56:25 +00:00
|
|
|
}
|
2008-03-20 05:51:16 +00:00
|
|
|
td = runq_choose(&tdq->tdq_idle);
|
|
|
|
if (td != NULL) {
|
|
|
|
KASSERT(td->td_priority >= PRI_MIN_IDLE,
|
2007-01-04 08:56:25 +00:00
|
|
|
("tdq_choose: Invalid priority on idle queue %d",
|
2008-03-20 05:51:16 +00:00
|
|
|
td->td_priority));
|
|
|
|
return (td);
|
2003-02-03 05:30:07 +00:00
|
|
|
}
|
|
|
|
|
2007-01-04 08:56:25 +00:00
|
|
|
return (NULL);
|
2003-04-02 06:46:43 +00:00
|
|
|
}
|
2003-01-29 07:00:51 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Initialize a thread queue.
|
|
|
|
*/
|
2003-01-29 07:00:51 +00:00
|
|
|
static void
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq_setup(struct tdq *tdq)
|
2003-01-29 07:00:51 +00:00
|
|
|
{
|
2007-07-17 22:53:23 +00:00
|
|
|
|
2007-08-03 23:38:46 +00:00
|
|
|
if (bootverbose)
|
|
|
|
printf("ULE: setup cpu %d\n", TDQ_ID(tdq));
|
2007-01-04 08:56:25 +00:00
|
|
|
runq_init(&tdq->tdq_realtime);
|
|
|
|
runq_init(&tdq->tdq_timeshare);
|
2006-12-29 10:37:07 +00:00
|
|
|
runq_init(&tdq->tdq_idle);
|
2008-03-02 08:20:59 +00:00
|
|
|
snprintf(tdq->tdq_name, sizeof(tdq->tdq_name),
|
|
|
|
"sched lock %d", (int)TDQ_ID(tdq));
|
|
|
|
mtx_init(&tdq->tdq_lock, tdq->tdq_name, "sched lock",
|
2007-08-03 23:38:46 +00:00
|
|
|
MTX_SPIN | MTX_RECURSE);
|
2009-01-17 07:17:57 +00:00
|
|
|
#ifdef KTR
|
|
|
|
snprintf(tdq->tdq_loadname, sizeof(tdq->tdq_loadname),
|
|
|
|
"CPU %d load", (int)TDQ_ID(tdq));
|
|
|
|
#endif
|
2007-08-03 23:38:46 +00:00
|
|
|
}
|
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
#ifdef SMP
|
2007-08-03 23:38:46 +00:00
|
|
|
static void
|
|
|
|
sched_setup_smp(void)
|
|
|
|
{
|
|
|
|
struct tdq *tdq;
|
|
|
|
int i;
|
2003-04-11 03:47:14 +00:00
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
cpu_top = smp_topo();
|
2010-06-11 18:46:34 +00:00
|
|
|
CPU_FOREACH(i) {
|
2008-03-02 08:20:59 +00:00
|
|
|
tdq = TDQ_CPU(i);
|
2007-08-03 23:38:46 +00:00
|
|
|
tdq_setup(tdq);
|
2008-03-02 08:20:59 +00:00
|
|
|
tdq->tdq_cg = smp_topo_find(cpu_top, i);
|
|
|
|
if (tdq->tdq_cg == NULL)
|
|
|
|
panic("Can't find cpu group for %d\n", i);
|
2003-07-04 19:59:00 +00:00
|
|
|
}
|
2008-03-02 08:20:59 +00:00
|
|
|
balance_tdq = TDQ_SELF();
|
|
|
|
sched_balance();
|
2007-08-03 23:38:46 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Setup the thread queues and initialize the topology based on MD
|
|
|
|
* information.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
sched_setup(void *dummy)
|
|
|
|
{
|
|
|
|
struct tdq *tdq;
|
|
|
|
|
|
|
|
tdq = TDQ_SELF();
|
|
|
|
#ifdef SMP
|
2008-03-02 07:58:42 +00:00
|
|
|
sched_setup_smp();
|
2003-07-04 19:59:00 +00:00
|
|
|
#else
|
2007-08-03 23:38:46 +00:00
|
|
|
tdq_setup(tdq);
|
2003-06-09 00:39:09 +00:00
|
|
|
#endif
|
2007-07-17 22:53:23 +00:00
|
|
|
|
|
|
|
/* Add thread0's load since it's running. */
|
|
|
|
TDQ_LOCK(tdq);
|
2007-08-03 23:38:46 +00:00
|
|
|
thread0.td_lock = TDQ_LOCKPTR(TDQ_SELF());
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_load_add(tdq, &thread0);
|
2008-03-02 08:20:59 +00:00
|
|
|
tdq->tdq_lowpri = thread0.td_priority;
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_UNLOCK(tdq);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
2012-08-10 19:02:49 +00:00
|
|
|
* This routine determines time constants after stathz and hz are setup.
|
2007-07-17 22:53:23 +00:00
|
|
|
*/
|
2005-12-19 08:26:09 +00:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
sched_initticks(void *dummy)
|
|
|
|
{
|
2007-07-17 22:53:23 +00:00
|
|
|
int incr;
|
|
|
|
|
2005-12-19 08:26:09 +00:00
|
|
|
realstathz = stathz ? stathz : hz;
|
2012-08-10 19:02:49 +00:00
|
|
|
sched_slice = realstathz / 10; /* ~100ms */
|
2012-08-11 20:24:39 +00:00
|
|
|
hogticks = imax(1, (2 * hz * sched_slice + realstathz / 2) /
|
|
|
|
realstathz);
|
2005-12-19 08:26:09 +00:00
|
|
|
|
|
|
|
/*
|
2007-01-04 08:56:25 +00:00
|
|
|
* tickincr is shifted out by 10 to avoid rounding errors due to
|
2007-01-04 12:16:19 +00:00
|
|
|
* hz not being evenly divisible by stathz on all platforms.
|
2007-01-04 08:56:25 +00:00
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
incr = (hz << SCHED_TICK_SHIFT) / realstathz;
|
2007-01-04 08:56:25 +00:00
|
|
|
/*
|
|
|
|
* This does not work for values of stathz that are more than
|
|
|
|
* 1 << SCHED_TICK_SHIFT * hz. In practice this does not happen.
|
2005-12-19 08:26:09 +00:00
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
if (incr == 0)
|
|
|
|
incr = 1;
|
|
|
|
tickincr = incr;
|
2007-01-19 21:56:08 +00:00
|
|
|
#ifdef SMP
|
2007-10-02 00:36:06 +00:00
|
|
|
/*
|
|
|
|
* Set the default balance interval now that we know
|
|
|
|
* what realstathz is.
|
|
|
|
*/
|
|
|
|
balance_interval = realstathz;
|
2007-01-19 21:56:08 +00:00
|
|
|
affinity = SCHED_AFFINITY_DEFAULT;
|
|
|
|
#endif
|
2012-03-09 19:09:08 +00:00
|
|
|
if (sched_idlespinthresh < 0)
|
2012-08-11 20:24:39 +00:00
|
|
|
sched_idlespinthresh = imax(16, 2 * hz / realstathz);
|
2005-12-19 08:26:09 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* This is the core of the interactivity algorithm. Determines a score based
|
|
|
|
* on past behavior. It is the ratio of sleep time to run time scaled to
|
|
|
|
* a [0, 100] integer. This is the voluntary sleep time of a process, which
|
|
|
|
* differs from the cpu usage because it does not account for time spent
|
|
|
|
* waiting on a run-queue. Would be prettier if we had floating point.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
sched_interact_score(struct thread *td)
|
|
|
|
{
|
|
|
|
struct td_sched *ts;
|
|
|
|
int div;
|
|
|
|
|
|
|
|
ts = td->td_sched;
|
|
|
|
/*
|
|
|
|
* The score is only needed if this is likely to be an interactive
|
|
|
|
* task. Don't go through the expense of computing it if there's
|
|
|
|
* no chance.
|
|
|
|
*/
|
|
|
|
if (sched_interact <= SCHED_INTERACT_HALF &&
|
|
|
|
ts->ts_runtime >= ts->ts_slptime)
|
|
|
|
return (SCHED_INTERACT_HALF);
|
|
|
|
|
|
|
|
if (ts->ts_runtime > ts->ts_slptime) {
|
|
|
|
div = max(1, ts->ts_runtime / SCHED_INTERACT_HALF);
|
|
|
|
return (SCHED_INTERACT_HALF +
|
|
|
|
(SCHED_INTERACT_HALF - (ts->ts_slptime / div)));
|
|
|
|
}
|
|
|
|
if (ts->ts_slptime > ts->ts_runtime) {
|
|
|
|
div = max(1, ts->ts_slptime / SCHED_INTERACT_HALF);
|
|
|
|
return (ts->ts_runtime / div);
|
|
|
|
}
|
|
|
|
/* runtime == slptime */
|
|
|
|
if (ts->ts_runtime)
|
|
|
|
return (SCHED_INTERACT_HALF);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This can happen if slptime and runtime are 0.
|
|
|
|
*/
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
|
|
|
* Scale the scheduling priority according to the "interactivity" of this
|
|
|
|
* process.
|
|
|
|
*/
|
2003-04-11 03:47:14 +00:00
|
|
|
static void
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_priority(struct thread *td)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
int score;
|
2003-01-26 05:23:15 +00:00
|
|
|
int pri;
|
|
|
|
|
2011-01-11 22:13:19 +00:00
|
|
|
if (PRI_BASE(td->td_pri_class) != PRI_TIMESHARE)
|
2003-04-11 03:47:14 +00:00
|
|
|
return;
|
2003-04-02 06:46:43 +00:00
|
|
|
/*
|
2007-01-04 08:56:25 +00:00
|
|
|
* If the score is interactive we place the thread in the realtime
|
|
|
|
* queue with a priority that is less than kernel and interrupt
|
|
|
|
* priorities. These threads are not subject to nice restrictions.
|
2003-04-02 06:46:43 +00:00
|
|
|
*
|
2007-07-17 22:53:23 +00:00
|
|
|
* Scores greater than this are placed on the normal timeshare queue
|
2007-01-04 08:56:25 +00:00
|
|
|
* where the priority is partially decided by the most recent cpu
|
|
|
|
* utilization and the rest is decided by nice value.
|
2007-09-22 02:20:14 +00:00
|
|
|
*
|
|
|
|
* The nice value of the process has a linear effect on the calculated
|
|
|
|
* score. Negative nice values make it easier for a thread to be
|
|
|
|
* considered interactive.
|
2003-04-02 06:46:43 +00:00
|
|
|
*/
|
2009-10-15 11:41:12 +00:00
|
|
|
score = imax(0, sched_interact_score(td) + td->td_proc->p_nice);
|
2007-01-04 08:56:25 +00:00
|
|
|
if (score < sched_interact) {
|
2011-01-13 14:22:27 +00:00
|
|
|
pri = PRI_MIN_INTERACT;
|
|
|
|
pri += ((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) /
|
2011-01-10 20:48:10 +00:00
|
|
|
sched_interact) * score;
|
2011-01-13 14:22:27 +00:00
|
|
|
KASSERT(pri >= PRI_MIN_INTERACT && pri <= PRI_MAX_INTERACT,
|
2007-01-24 18:18:43 +00:00
|
|
|
("sched_priority: invalid interactive priority %d score %d",
|
|
|
|
pri, score));
|
2007-01-04 08:56:25 +00:00
|
|
|
} else {
|
|
|
|
pri = SCHED_PRI_MIN;
|
|
|
|
if (td->td_sched->ts_ticks)
|
2011-12-29 16:17:16 +00:00
|
|
|
pri += min(SCHED_PRI_TICKS(td->td_sched),
|
|
|
|
SCHED_PRI_RANGE);
|
2007-01-04 08:56:25 +00:00
|
|
|
pri += SCHED_PRI_NICE(td->td_proc->p_nice);
|
2011-01-13 14:22:27 +00:00
|
|
|
KASSERT(pri >= PRI_MIN_BATCH && pri <= PRI_MAX_BATCH,
|
2007-07-17 22:53:23 +00:00
|
|
|
("sched_priority: invalid priority %d: nice %d, "
|
|
|
|
"ticks %d ftick %d ltick %d tick pri %d",
|
|
|
|
pri, td->td_proc->p_nice, td->td_sched->ts_ticks,
|
|
|
|
td->td_sched->ts_ftick, td->td_sched->ts_ltick,
|
|
|
|
SCHED_PRI_TICKS(td->td_sched)));
|
2007-01-04 08:56:25 +00:00
|
|
|
}
|
|
|
|
sched_user_prio(td, pri);
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2003-04-02 06:46:43 +00:00
|
|
|
return;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2003-11-02 03:36:33 +00:00
|
|
|
/*
|
|
|
|
* This routine enforces a maximum limit on the amount of scheduling history
|
2007-07-17 22:53:23 +00:00
|
|
|
* kept. It is called after either the slptime or runtime is adjusted. This
|
|
|
|
* function is ugly due to integer math.
|
2003-11-02 03:36:33 +00:00
|
|
|
*/
|
2003-06-17 06:39:51 +00:00
|
|
|
static void
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_interact_update(struct thread *td)
|
2003-06-17 06:39:51 +00:00
|
|
|
{
|
2007-01-05 23:45:38 +00:00
|
|
|
struct td_sched *ts;
|
2007-01-24 18:18:43 +00:00
|
|
|
u_int sum;
|
2003-11-02 03:36:33 +00:00
|
|
|
|
2007-01-05 23:45:38 +00:00
|
|
|
ts = td->td_sched;
|
2007-07-17 22:53:23 +00:00
|
|
|
sum = ts->ts_runtime + ts->ts_slptime;
|
2003-11-02 03:36:33 +00:00
|
|
|
if (sum < SCHED_SLP_RUN_MAX)
|
|
|
|
return;
|
2007-01-05 23:45:38 +00:00
|
|
|
/*
|
|
|
|
* This only happens from two places:
|
|
|
|
* 1) We have added an unusual amount of run time from fork_exit.
|
|
|
|
* 2) We have added an unusual amount of sleep time from sched_sleep().
|
|
|
|
*/
|
|
|
|
if (sum > SCHED_SLP_RUN_MAX * 2) {
|
2007-07-17 22:53:23 +00:00
|
|
|
if (ts->ts_runtime > ts->ts_slptime) {
|
|
|
|
ts->ts_runtime = SCHED_SLP_RUN_MAX;
|
|
|
|
ts->ts_slptime = 1;
|
2007-01-05 23:45:38 +00:00
|
|
|
} else {
|
2007-07-17 22:53:23 +00:00
|
|
|
ts->ts_slptime = SCHED_SLP_RUN_MAX;
|
|
|
|
ts->ts_runtime = 1;
|
2007-01-05 23:45:38 +00:00
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
2003-11-02 03:36:33 +00:00
|
|
|
/*
|
|
|
|
* If we have exceeded by more than 1/5th then the algorithm below
|
|
|
|
* will not bring us back into range. Dividing by two here forces
|
2004-08-10 07:52:21 +00:00
|
|
|
* us into the range of [4/5 * SCHED_INTERACT_MAX, SCHED_INTERACT_MAX]
|
2003-11-02 03:36:33 +00:00
|
|
|
*/
|
2004-04-04 19:12:56 +00:00
|
|
|
if (sum > (SCHED_SLP_RUN_MAX / 5) * 6) {
|
2007-07-17 22:53:23 +00:00
|
|
|
ts->ts_runtime /= 2;
|
|
|
|
ts->ts_slptime /= 2;
|
2003-11-02 03:36:33 +00:00
|
|
|
return;
|
|
|
|
}
|
2007-07-17 22:53:23 +00:00
|
|
|
ts->ts_runtime = (ts->ts_runtime / 5) * 4;
|
|
|
|
ts->ts_slptime = (ts->ts_slptime / 5) * 4;
|
2003-11-02 03:36:33 +00:00
|
|
|
}
|
2003-10-27 06:47:05 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Scale back the interactivity history when a child thread is created. The
|
|
|
|
* history is inherited from the parent but the thread may behave totally
|
|
|
|
* differently. For example, a shell spawning a compiler process. We want
|
|
|
|
* to learn that the compiler is behaving badly very quickly.
|
|
|
|
*/
|
2003-11-02 03:36:33 +00:00
|
|
|
static void
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_interact_fork(struct thread *td)
|
2003-11-02 03:36:33 +00:00
|
|
|
{
|
|
|
|
int ratio;
|
|
|
|
int sum;
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
sum = td->td_sched->ts_runtime + td->td_sched->ts_slptime;
|
2003-11-02 03:36:33 +00:00
|
|
|
if (sum > SCHED_SLP_RUN_FORK) {
|
|
|
|
ratio = sum / SCHED_SLP_RUN_FORK;
|
2007-07-17 22:53:23 +00:00
|
|
|
td->td_sched->ts_runtime /= ratio;
|
|
|
|
td->td_sched->ts_slptime /= ratio;
|
2003-06-17 06:39:51 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2004-09-05 02:09:54 +00:00
|
|
|
/*
|
2007-07-17 22:53:23 +00:00
|
|
|
* Called from proc0_init() to setup the scheduler fields.
|
2004-09-05 02:09:54 +00:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
schedinit(void)
|
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2004-09-05 02:09:54 +00:00
|
|
|
/*
|
|
|
|
* Set up the scheduler specific parts of proc0.
|
|
|
|
*/
|
|
|
|
proc0.p_sched = NULL; /* XXX */
|
2006-12-06 06:34:57 +00:00
|
|
|
thread0.td_sched = &td_sched0;
|
2007-01-04 08:56:25 +00:00
|
|
|
td_sched0.ts_ltick = ticks;
|
2007-01-05 08:50:38 +00:00
|
|
|
td_sched0.ts_ftick = ticks;
|
2008-03-10 03:15:19 +00:00
|
|
|
td_sched0.ts_slice = sched_slice;
|
2004-09-05 02:09:54 +00:00
|
|
|
}
|
|
|
|
|
2003-04-11 03:47:14 +00:00
|
|
|
/*
|
|
|
|
* This is only somewhat accurate since given many processes of the same
|
|
|
|
* priority they will switch when their slices run out, which will be
|
2007-01-04 08:56:25 +00:00
|
|
|
* at most sched_slice stathz ticks.
|
2003-04-11 03:47:14 +00:00
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
int
|
|
|
|
sched_rr_interval(void)
|
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2012-08-10 19:02:49 +00:00
|
|
|
/* Convert sched_slice from stathz to hz. */
|
2012-08-11 20:24:39 +00:00
|
|
|
return (imax(1, (sched_slice * hz + realstathz / 2) / realstathz));
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Update the percent cpu tracking information when it is requested or
|
|
|
|
* the total history exceeds the maximum. We keep a sliding history of
|
|
|
|
* tick counts that slowly decays. This is less precise than the 4BSD
|
|
|
|
* mechanism since it happens with less regular and frequent events.
|
|
|
|
*/
|
- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
2003-10-31 11:16:04 +00:00
|
|
|
static void
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(struct td_sched *ts, int run)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2012-03-13 08:18:54 +00:00
|
|
|
int t = ticks;
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2012-03-13 08:18:54 +00:00
|
|
|
if (t - ts->ts_ltick >= SCHED_TICK_TARG) {
|
2006-12-06 06:34:57 +00:00
|
|
|
ts->ts_ticks = 0;
|
2012-03-13 08:18:54 +00:00
|
|
|
ts->ts_ftick = t - SCHED_TICK_TARG;
|
|
|
|
} else if (t - ts->ts_ftick >= SCHED_TICK_MAX) {
|
|
|
|
ts->ts_ticks = (ts->ts_ticks / (ts->ts_ltick - ts->ts_ftick)) *
|
|
|
|
(ts->ts_ltick - (t - SCHED_TICK_TARG));
|
|
|
|
ts->ts_ftick = t - SCHED_TICK_TARG;
|
|
|
|
}
|
|
|
|
if (run)
|
|
|
|
ts->ts_ticks += (t - ts->ts_ltick) << SCHED_TICK_SHIFT;
|
|
|
|
ts->ts_ltick = t;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Adjust the priority of a thread. Move it to the appropriate run-queue
|
|
|
|
* if necessary. This is the back-end for several priority related
|
|
|
|
* functions.
|
|
|
|
*/
|
2007-01-04 08:56:25 +00:00
|
|
|
static void
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
sched_thread_priority(struct thread *td, u_char prio)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct td_sched *ts;
|
2008-03-10 03:15:19 +00:00
|
|
|
struct tdq *tdq;
|
|
|
|
int oldpri;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2009-01-17 07:17:57 +00:00
|
|
|
KTR_POINT3(KTR_SCHED, "thread", sched_tdname(td), "prio",
|
|
|
|
"prio:%d", td->td_priority, "new prio:%d", prio,
|
|
|
|
KTR_ATTR_LINKED, sched_tdname(curthread));
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE3(sched, , , change_pri, td, td->td_proc, prio);
|
2009-01-17 07:17:57 +00:00
|
|
|
if (td != curthread && prio > td->td_priority) {
|
|
|
|
KTR_POINT3(KTR_SCHED, "thread", sched_tdname(curthread),
|
|
|
|
"lend prio", "prio:%d", td->td_priority, "new prio:%d",
|
|
|
|
prio, KTR_ATTR_LINKED, sched_tdname(td));
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE4(sched, , , lend_pri, td, td->td_proc, prio,
|
|
|
|
curthread);
|
2009-01-17 07:17:57 +00:00
|
|
|
}
|
2006-12-06 06:34:57 +00:00
|
|
|
ts = td->td_sched;
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
if (td->td_priority == prio)
|
|
|
|
return;
|
2008-03-19 07:36:37 +00:00
|
|
|
/*
|
|
|
|
* If the priority has been elevated due to priority
|
|
|
|
* propagation, we may have to move ourselves to a new
|
|
|
|
* queue. This could be optimized to not re-add in some
|
|
|
|
* cases.
|
|
|
|
*/
|
2007-01-04 12:16:19 +00:00
|
|
|
if (TD_ON_RUNQ(td) && prio < td->td_priority) {
|
2007-01-04 08:56:25 +00:00
|
|
|
sched_rem(td);
|
|
|
|
td->td_priority = prio;
|
2007-07-17 22:53:23 +00:00
|
|
|
sched_add(td, SRQ_BORROWING);
|
2008-03-10 03:15:19 +00:00
|
|
|
return;
|
|
|
|
}
|
2008-03-19 07:36:37 +00:00
|
|
|
/*
|
|
|
|
* If the thread is currently running we may have to adjust the lowpri
|
|
|
|
* information so other cpus are aware of our current priority.
|
|
|
|
*/
|
2008-03-10 03:15:19 +00:00
|
|
|
if (TD_IS_RUNNING(td)) {
|
2008-03-19 07:36:37 +00:00
|
|
|
tdq = TDQ_CPU(ts->ts_cpu);
|
|
|
|
oldpri = td->td_priority;
|
|
|
|
td->td_priority = prio;
|
2008-03-02 08:20:59 +00:00
|
|
|
if (prio < tdq->tdq_lowpri)
|
|
|
|
tdq->tdq_lowpri = prio;
|
|
|
|
else if (tdq->tdq_lowpri == oldpri)
|
|
|
|
tdq_setlowpri(tdq, td);
|
2008-03-19 07:36:37 +00:00
|
|
|
return;
|
2008-03-10 03:15:19 +00:00
|
|
|
}
|
2008-03-19 07:36:37 +00:00
|
|
|
td->td_priority = prio;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
/*
|
|
|
|
* Update a thread's priority when it is lent another thread's
|
|
|
|
* priority.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
sched_lend_prio(struct thread *td, u_char prio)
|
|
|
|
{
|
|
|
|
|
|
|
|
td->td_flags |= TDF_BORROWING;
|
|
|
|
sched_thread_priority(td, prio);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Restore a thread's priority when priority propagation is
|
|
|
|
* over. The prio argument is the minimum priority the thread
|
|
|
|
* needs to have to satisfy other possible priority lending
|
|
|
|
* requests. If the thread's regular priority is less
|
|
|
|
* important than prio, the thread will keep a priority boost
|
|
|
|
* of prio.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
sched_unlend_prio(struct thread *td, u_char prio)
|
|
|
|
{
|
|
|
|
u_char base_pri;
|
|
|
|
|
|
|
|
if (td->td_base_pri >= PRI_MIN_TIMESHARE &&
|
|
|
|
td->td_base_pri <= PRI_MAX_TIMESHARE)
|
2006-10-26 21:42:22 +00:00
|
|
|
base_pri = td->td_user_pri;
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
else
|
|
|
|
base_pri = td->td_base_pri;
|
|
|
|
if (prio >= base_pri) {
|
2004-12-30 22:17:00 +00:00
|
|
|
td->td_flags &= ~TDF_BORROWING;
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
sched_thread_priority(td, base_pri);
|
|
|
|
} else
|
|
|
|
sched_lend_prio(td, prio);
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Standard entry for setting the priority to an absolute value.
|
|
|
|
*/
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
void
|
|
|
|
sched_prio(struct thread *td, u_char prio)
|
|
|
|
{
|
|
|
|
u_char oldprio;
|
|
|
|
|
|
|
|
/* First, update the base priority. */
|
|
|
|
td->td_base_pri = prio;
|
|
|
|
|
|
|
|
/*
|
2004-12-30 22:17:00 +00:00
|
|
|
* If the thread is borrowing another thread's priority, don't
|
Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
2004-12-30 20:52:44 +00:00
|
|
|
* ever lower the priority.
|
|
|
|
*/
|
|
|
|
if (td->td_flags & TDF_BORROWING && td->td_priority < prio)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Change the real priority. */
|
|
|
|
oldprio = td->td_priority;
|
|
|
|
sched_thread_priority(td, prio);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the thread is on a turnstile, then let the turnstile update
|
|
|
|
* its state.
|
|
|
|
*/
|
|
|
|
if (TD_ON_LOCK(td) && oldprio != prio)
|
|
|
|
turnstile_adjust(td, oldprio);
|
|
|
|
}
|
2004-12-30 22:17:00 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Set the base user priority, does not effect current running priority.
|
|
|
|
*/
|
2006-08-25 06:12:53 +00:00
|
|
|
void
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_user_prio(struct thread *td, u_char prio)
|
2006-08-25 06:12:53 +00:00
|
|
|
{
|
|
|
|
|
2006-10-26 21:42:22 +00:00
|
|
|
td->td_base_user_pri = prio;
|
2010-12-09 02:42:02 +00:00
|
|
|
if (td->td_lend_user_pri <= prio)
|
|
|
|
return;
|
2006-10-26 21:42:22 +00:00
|
|
|
td->td_user_pri = prio;
|
2006-08-25 06:12:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
sched_lend_user_prio(struct thread *td, u_char prio)
|
|
|
|
{
|
|
|
|
|
2007-12-11 08:25:36 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2010-12-09 02:42:02 +00:00
|
|
|
td->td_lend_user_pri = prio;
|
2010-12-29 09:26:46 +00:00
|
|
|
td->td_user_pri = min(prio, td->td_base_user_pri);
|
|
|
|
if (td->td_priority > td->td_user_pri)
|
|
|
|
sched_prio(td, td->td_user_pri);
|
|
|
|
else if (td->td_priority != td->td_user_pri)
|
|
|
|
td->td_flags |= TDF_NEEDRESCHED;
|
2006-08-25 06:12:53 +00:00
|
|
|
}
|
|
|
|
|
2007-08-03 23:38:46 +00:00
|
|
|
/*
|
|
|
|
* Handle migration from sched_switch(). This happens only for
|
|
|
|
* cpu binding.
|
|
|
|
*/
|
|
|
|
static struct mtx *
|
|
|
|
sched_switch_migrate(struct tdq *tdq, struct thread *td, int flags)
|
|
|
|
{
|
|
|
|
struct tdq *tdn;
|
|
|
|
|
|
|
|
tdn = TDQ_CPU(td->td_sched->ts_cpu);
|
|
|
|
#ifdef SMP
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_load_rem(tdq, td);
|
2007-08-03 23:38:46 +00:00
|
|
|
/*
|
|
|
|
* Do the lock dance required to avoid LOR. We grab an extra
|
|
|
|
* spinlock nesting to prevent preemption while we're
|
|
|
|
* not holding either run-queue lock.
|
|
|
|
*/
|
|
|
|
spinlock_enter();
|
2010-01-23 15:54:21 +00:00
|
|
|
thread_lock_block(td); /* This releases the lock on tdq. */
|
Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are nomore shared even in the
HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
even with different locks, with the contribution of tdq_lock_pair().
An explanation is here:
(hypotesis: a thread needs to migrate on another CPU, thread1 is doing
sched_switch_migrate() and thread2 is the one handling the sched_switch()
request or in other words, thread1 is the thread that needs to migrate
and thread2 is a thread that is going to be preempted, most likely an
idle thread. Also, 'old' is referred to the context (in terms of
run-queue and CPU) thread1 is leaving and 'new' is referred to the
context thread1 is going into. Finally, thread3 is doing tdq_idletd()
or sched_balance() and definitively doing tdq_lock_pair())
* thread1 blocks its td_lock. Now td_lock is 'blocked'
* thread1 drops its old runqueue lock
* thread1 acquires the new runqueue lock
* thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
through tdq_notify() to the new CPU
* thread1 drops the new lock
* thread3, scanning the runqueues, locks the old lock
* thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
pointing to the new runqueue
* thread3 wants to acquire the new runqueue lock, but it can't because
it is held by thread2 so it spins
* thread1 wants to acquire old lock, but as long as it is held by
thread3 it can't
* thread2 going further, at some point wants to switchin in thread1,
but it will wait forever because thread1->td_lock is in blocked state
This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.
Reviewed by: jeff
Reported by: des, others
Tested by: des, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
|
|
|
|
2007-08-03 23:38:46 +00:00
|
|
|
/*
|
Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are nomore shared even in the
HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
even with different locks, with the contribution of tdq_lock_pair().
An explanation is here:
(hypotesis: a thread needs to migrate on another CPU, thread1 is doing
sched_switch_migrate() and thread2 is the one handling the sched_switch()
request or in other words, thread1 is the thread that needs to migrate
and thread2 is a thread that is going to be preempted, most likely an
idle thread. Also, 'old' is referred to the context (in terms of
run-queue and CPU) thread1 is leaving and 'new' is referred to the
context thread1 is going into. Finally, thread3 is doing tdq_idletd()
or sched_balance() and definitively doing tdq_lock_pair())
* thread1 blocks its td_lock. Now td_lock is 'blocked'
* thread1 drops its old runqueue lock
* thread1 acquires the new runqueue lock
* thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
through tdq_notify() to the new CPU
* thread1 drops the new lock
* thread3, scanning the runqueues, locks the old lock
* thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
pointing to the new runqueue
* thread3 wants to acquire the new runqueue lock, but it can't because
it is held by thread2 so it spins
* thread1 wants to acquire old lock, but as long as it is held by
thread3 it can't
* thread2 going further, at some point wants to switchin in thread1,
but it will wait forever because thread1->td_lock is in blocked state
This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.
Reviewed by: jeff
Reported by: des, others
Tested by: des, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
|
|
|
* Acquire both run-queue locks before placing the thread on the new
|
|
|
|
* run-queue to avoid deadlocks created by placing a thread with a
|
|
|
|
* blocked lock on the run-queue of a remote processor. The deadlock
|
|
|
|
* occurs when a third processor attempts to lock the two queues in
|
|
|
|
* question while the target processor is spinning with its own
|
|
|
|
* run-queue lock held while waiting for the blocked lock to clear.
|
2007-08-03 23:38:46 +00:00
|
|
|
*/
|
Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are nomore shared even in the
HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
even with different locks, with the contribution of tdq_lock_pair().
An explanation is here:
(hypotesis: a thread needs to migrate on another CPU, thread1 is doing
sched_switch_migrate() and thread2 is the one handling the sched_switch()
request or in other words, thread1 is the thread that needs to migrate
and thread2 is a thread that is going to be preempted, most likely an
idle thread. Also, 'old' is referred to the context (in terms of
run-queue and CPU) thread1 is leaving and 'new' is referred to the
context thread1 is going into. Finally, thread3 is doing tdq_idletd()
or sched_balance() and definitively doing tdq_lock_pair())
* thread1 blocks its td_lock. Now td_lock is 'blocked'
* thread1 drops its old runqueue lock
* thread1 acquires the new runqueue lock
* thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
through tdq_notify() to the new CPU
* thread1 drops the new lock
* thread3, scanning the runqueues, locks the old lock
* thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
pointing to the new runqueue
* thread3 wants to acquire the new runqueue lock, but it can't because
it is held by thread2 so it spins
* thread1 wants to acquire old lock, but as long as it is held by
thread3 it can't
* thread2 going further, at some point wants to switchin in thread1,
but it will wait forever because thread1->td_lock is in blocked state
This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.
Reviewed by: jeff
Reported by: des, others
Tested by: des, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
|
|
|
tdq_lock_pair(tdn, tdq);
|
|
|
|
tdq_add(tdn, td, flags);
|
|
|
|
tdq_notify(tdn, td);
|
|
|
|
TDQ_UNLOCK(tdn);
|
2007-08-03 23:38:46 +00:00
|
|
|
spinlock_exit();
|
|
|
|
#endif
|
|
|
|
return (TDQ_LOCKPTR(tdn));
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
2010-01-23 15:54:21 +00:00
|
|
|
* Variadic version of thread_lock_unblock() that does not assume td_lock
|
|
|
|
* is blocked.
|
2007-07-17 22:53:23 +00:00
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
thread_unblock_switch(struct thread *td, struct mtx *mtx)
|
|
|
|
{
|
|
|
|
atomic_store_rel_ptr((volatile uintptr_t *)&td->td_lock,
|
|
|
|
(uintptr_t)mtx);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Switch threads. This function has to handle threads coming in while
|
|
|
|
* blocked for some reason, running, or idle. It also must deal with
|
|
|
|
* migrating a thread from one queue to another as running threads may
|
|
|
|
* be assigned elsewhere via binding.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
2004-09-10 21:04:38 +00:00
|
|
|
sched_switch(struct thread *td, struct thread *newtd, int flags)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2006-12-29 12:55:32 +00:00
|
|
|
struct tdq *tdq;
|
2006-12-06 06:34:57 +00:00
|
|
|
struct td_sched *ts;
|
2007-07-17 22:53:23 +00:00
|
|
|
struct mtx *mtx;
|
2007-08-03 23:38:46 +00:00
|
|
|
int srqflag;
|
2012-08-09 19:26:13 +00:00
|
|
|
int cpuid, preempted;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2008-03-19 07:36:37 +00:00
|
|
|
KASSERT(newtd == NULL, ("sched_switch: Unsupported newtd argument"));
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
cpuid = PCPU_GET(cpuid);
|
|
|
|
tdq = TDQ_CPU(cpuid);
|
2007-01-04 08:56:25 +00:00
|
|
|
ts = td->td_sched;
|
2007-08-03 23:38:46 +00:00
|
|
|
mtx = td->td_lock;
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(ts, 1);
|
2007-07-17 22:53:23 +00:00
|
|
|
ts->ts_rltick = ticks;
|
2004-08-12 07:56:33 +00:00
|
|
|
td->td_lastcpu = td->td_oncpu;
|
2003-04-10 17:35:44 +00:00
|
|
|
td->td_oncpu = NOCPU;
|
2012-08-09 19:26:13 +00:00
|
|
|
preempted = !(td->td_flags & TDF_SLICEEND);
|
|
|
|
td->td_flags &= ~(TDF_NEEDRESCHED | TDF_SLICEEND);
|
2005-04-08 03:37:53 +00:00
|
|
|
td->td_owepreempt = 0;
|
2008-04-17 09:56:01 +00:00
|
|
|
tdq->tdq_switchcnt++;
|
2003-12-11 04:00:49 +00:00
|
|
|
/*
|
2007-07-17 22:53:23 +00:00
|
|
|
* The lock pointer in an idle thread should never change. Reset it
|
|
|
|
* to CAN_RUN as well.
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
*/
|
2007-03-08 06:44:34 +00:00
|
|
|
if (TD_IS_IDLETHREAD(td)) {
|
2007-07-17 22:53:23 +00:00
|
|
|
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
|
2004-12-26 22:56:08 +00:00
|
|
|
TD_SET_CAN_RUN(td);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
} else if (TD_IS_RUNNING(td)) {
|
2007-07-17 22:53:23 +00:00
|
|
|
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
|
2012-08-09 19:26:13 +00:00
|
|
|
srqflag = preempted ?
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
SRQ_OURSELF|SRQ_YIELDING|SRQ_PREEMPTED :
|
2007-08-03 23:38:46 +00:00
|
|
|
SRQ_OURSELF|SRQ_YIELDING;
|
2010-09-02 16:23:05 +00:00
|
|
|
#ifdef SMP
|
2010-09-01 20:32:47 +00:00
|
|
|
if (THREAD_CAN_MIGRATE(td) && !THREAD_CAN_SCHED(td, ts->ts_cpu))
|
|
|
|
ts->ts_cpu = sched_pickcpu(td, 0);
|
2010-09-02 16:23:05 +00:00
|
|
|
#endif
|
2007-08-03 23:38:46 +00:00
|
|
|
if (ts->ts_cpu == cpuid)
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_runq_add(tdq, td, srqflag);
|
2010-09-01 20:32:47 +00:00
|
|
|
else {
|
|
|
|
KASSERT(THREAD_CAN_MIGRATE(td) ||
|
|
|
|
(ts->ts_flags & TSF_BOUND) != 0,
|
|
|
|
("Thread %p shouldn't migrate", td));
|
2007-08-03 23:38:46 +00:00
|
|
|
mtx = sched_switch_migrate(tdq, td, srqflag);
|
2010-09-01 20:32:47 +00:00
|
|
|
}
|
2007-07-17 22:53:23 +00:00
|
|
|
} else {
|
|
|
|
/* This thread must be going to sleep. */
|
|
|
|
TDQ_LOCK(tdq);
|
2010-01-23 15:54:21 +00:00
|
|
|
mtx = thread_lock_block(td);
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_load_rem(tdq, td);
|
2007-07-17 22:53:23 +00:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* We enter here with the thread blocked and assigned to the
|
|
|
|
* appropriate cpu run-queue or sleep-queue and with the current
|
|
|
|
* thread-queue locked.
|
|
|
|
*/
|
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED | MA_NOTRECURSED);
|
|
|
|
newtd = choosethread();
|
|
|
|
/*
|
|
|
|
* Call the MD code to switch contexts if necessary.
|
|
|
|
*/
|
2005-04-19 04:01:25 +00:00
|
|
|
if (td != newtd) {
|
|
|
|
#ifdef HWPMC_HOOKS
|
|
|
|
if (PMC_PROC_IS_USING_PMCS(td->td_proc))
|
|
|
|
PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_OUT);
|
|
|
|
#endif
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE2(sched, , , off_cpu, td, td->td_proc);
|
2007-12-15 23:13:31 +00:00
|
|
|
lock_profile_release_lock(&TDQ_LOCKPTR(tdq)->lock_object);
|
2007-10-02 01:30:18 +00:00
|
|
|
TDQ_LOCKPTR(tdq)->mtx_lock = (uintptr_t)newtd;
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(newtd->td_sched, 0);
|
2008-05-25 01:44:58 +00:00
|
|
|
|
|
|
|
#ifdef KDTRACE_HOOKS
|
|
|
|
/*
|
|
|
|
* If DTrace has set the active vtime enum to anything
|
|
|
|
* other than INACTIVE (0), then it should have set the
|
|
|
|
* function to call.
|
|
|
|
*/
|
|
|
|
if (dtrace_vtime_active)
|
|
|
|
(*dtrace_vtime_switch_func)(newtd);
|
|
|
|
#endif
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
cpu_switch(td, newtd, mtx);
|
|
|
|
/*
|
|
|
|
* We may return from cpu_switch on a different cpu. However,
|
|
|
|
* we always return with td_lock pointing to the current cpu's
|
|
|
|
* run queue lock.
|
|
|
|
*/
|
|
|
|
cpuid = PCPU_GET(cpuid);
|
|
|
|
tdq = TDQ_CPU(cpuid);
|
2007-12-15 23:13:31 +00:00
|
|
|
lock_profile_obtain_lock_success(
|
|
|
|
&TDQ_LOCKPTR(tdq)->lock_object, 0, 0, __FILE__, __LINE__);
|
2012-05-15 01:30:25 +00:00
|
|
|
|
|
|
|
SDT_PROBE0(sched, , , on_cpu);
|
2005-04-19 04:01:25 +00:00
|
|
|
#ifdef HWPMC_HOOKS
|
|
|
|
if (PMC_PROC_IS_USING_PMCS(td->td_proc))
|
|
|
|
PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_IN);
|
|
|
|
#endif
|
2012-05-15 01:30:25 +00:00
|
|
|
} else {
|
2007-07-17 22:53:23 +00:00
|
|
|
thread_unblock_switch(td, mtx);
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE0(sched, , , remain_cpu);
|
|
|
|
}
|
2008-03-02 08:20:59 +00:00
|
|
|
/*
|
|
|
|
* Assert that all went well and return.
|
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED|MA_NOTRECURSED);
|
|
|
|
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
|
|
|
|
td->td_oncpu = cpuid;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Adjust thread priorities as a result of a nice request.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
2004-06-16 00:26:31 +00:00
|
|
|
sched_nice(struct proc *p, int nice)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
|
|
|
struct thread *td;
|
|
|
|
|
2004-06-16 00:26:31 +00:00
|
|
|
PROC_LOCK_ASSERT(p, MA_OWNED);
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2004-06-16 00:26:31 +00:00
|
|
|
p->p_nice = nice;
|
2006-10-26 21:42:22 +00:00
|
|
|
FOREACH_THREAD_IN_PROC(p, td) {
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_lock(td);
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_priority(td);
|
2007-01-04 08:56:25 +00:00
|
|
|
sched_prio(td, td->td_base_user_pri);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_unlock(td);
|
2004-06-16 00:26:31 +00:00
|
|
|
}
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Record the sleep time for the interactivity scorer.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
2008-03-12 06:31:06 +00:00
|
|
|
sched_sleep(struct thread *td, int prio)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2007-09-21 04:10:23 +00:00
|
|
|
td->td_slptick = ticks;
|
2009-12-31 18:52:58 +00:00
|
|
|
if (TD_IS_SUSPENDED(td) || prio >= PSOCK)
|
2008-03-12 06:31:06 +00:00
|
|
|
td->td_flags |= TDF_CANSWAP;
|
2011-01-14 17:06:54 +00:00
|
|
|
if (PRI_BASE(td->td_pri_class) != PRI_TIMESHARE)
|
|
|
|
return;
|
2008-04-04 01:16:18 +00:00
|
|
|
if (static_boost == 1 && prio)
|
2008-03-12 06:31:06 +00:00
|
|
|
sched_prio(td, prio);
|
2008-04-04 01:16:18 +00:00
|
|
|
else if (static_boost && td->td_priority > static_boost)
|
|
|
|
sched_prio(td, static_boost);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Schedule a thread to resume execution and record how long it voluntarily
|
|
|
|
* slept. We also update the pctcpu, interactivity, and priority.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
|
|
|
sched_wakeup(struct thread *td)
|
|
|
|
{
|
2007-01-25 19:14:11 +00:00
|
|
|
struct td_sched *ts;
|
2007-07-17 22:53:23 +00:00
|
|
|
int slptick;
|
2007-01-04 08:56:25 +00:00
|
|
|
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2007-01-25 19:14:11 +00:00
|
|
|
ts = td->td_sched;
|
2008-03-12 06:31:06 +00:00
|
|
|
td->td_flags &= ~TDF_CANSWAP;
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
2007-01-04 08:56:25 +00:00
|
|
|
* If we slept for more than a tick update our interactivity and
|
|
|
|
* priority.
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2007-09-21 04:10:23 +00:00
|
|
|
slptick = td->td_slptick;
|
|
|
|
td->td_slptick = 0;
|
2007-07-17 22:53:23 +00:00
|
|
|
if (slptick && slptick != ticks) {
|
2012-03-13 08:18:54 +00:00
|
|
|
ts->ts_slptime += (ticks - slptick) << SCHED_TICK_SHIFT;
|
2007-01-05 23:45:38 +00:00
|
|
|
sched_interact_update(td);
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(ts, 0);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
2007-01-25 19:14:11 +00:00
|
|
|
/* Reset the slice value after we sleep. */
|
|
|
|
ts->ts_slice = sched_slice;
|
2007-01-23 08:50:34 +00:00
|
|
|
sched_add(td, SRQ_BORING);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Penalize the parent for creating a new child and initialize the child's
|
|
|
|
* priority.
|
|
|
|
*/
|
|
|
|
void
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_fork(struct thread *td, struct thread *child)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(td->td_sched, 1);
|
2006-12-06 06:34:57 +00:00
|
|
|
sched_fork_thread(td, child);
|
2007-01-04 08:56:25 +00:00
|
|
|
/*
|
|
|
|
* Penalize the parent and child for forking.
|
|
|
|
*/
|
|
|
|
sched_interact_fork(child);
|
|
|
|
sched_priority(child);
|
2007-07-17 22:53:23 +00:00
|
|
|
td->td_sched->ts_runtime += tickincr;
|
2007-01-04 08:56:25 +00:00
|
|
|
sched_interact_update(td);
|
|
|
|
sched_priority(td);
|
2006-12-06 06:34:57 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Fork a new thread, may be within the same process.
|
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
void
|
|
|
|
sched_fork_thread(struct thread *td, struct thread *child)
|
|
|
|
{
|
|
|
|
struct td_sched *ts;
|
|
|
|
struct td_sched *ts2;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2008-03-20 03:06:33 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2007-01-04 08:56:25 +00:00
|
|
|
/*
|
|
|
|
* Initialize child.
|
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
ts = td->td_sched;
|
|
|
|
ts2 = child->td_sched;
|
2008-03-20 03:06:33 +00:00
|
|
|
child->td_lock = TDQ_LOCKPTR(TDQ_SELF());
|
|
|
|
child->td_cpuset = cpuset_ref(td->td_cpuset);
|
2006-12-06 06:34:57 +00:00
|
|
|
ts2->ts_cpu = ts->ts_cpu;
|
2008-03-20 03:06:33 +00:00
|
|
|
ts2->ts_flags = 0;
|
2007-01-04 08:56:25 +00:00
|
|
|
/*
|
2011-01-06 22:24:00 +00:00
|
|
|
* Grab our parents cpu estimation information.
|
2007-01-04 08:56:25 +00:00
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
ts2->ts_ticks = ts->ts_ticks;
|
|
|
|
ts2->ts_ltick = ts->ts_ltick;
|
|
|
|
ts2->ts_ftick = ts->ts_ftick;
|
2011-01-06 22:24:00 +00:00
|
|
|
/*
|
|
|
|
* Do not inherit any borrowed priority from the parent.
|
|
|
|
*/
|
|
|
|
child->td_priority = child->td_base_pri;
|
2007-01-04 08:56:25 +00:00
|
|
|
/*
|
|
|
|
* And update interactivity score.
|
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
ts2->ts_slptime = ts->ts_slptime;
|
|
|
|
ts2->ts_runtime = ts->ts_runtime;
|
2007-01-04 08:56:25 +00:00
|
|
|
ts2->ts_slice = 1; /* Attempt to quickly learn interactivity. */
|
2009-01-17 07:17:57 +00:00
|
|
|
#ifdef KTR
|
|
|
|
bzero(ts2->ts_name, sizeof(ts2->ts_name));
|
|
|
|
#endif
|
2003-04-11 03:47:14 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Adjust the priority class of a thread.
|
|
|
|
*/
|
2003-04-11 03:47:14 +00:00
|
|
|
void
|
2006-10-26 21:42:22 +00:00
|
|
|
sched_class(struct thread *td, int class)
|
2003-04-11 03:47:14 +00:00
|
|
|
{
|
|
|
|
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2006-10-26 21:42:22 +00:00
|
|
|
if (td->td_pri_class == class)
|
2003-04-11 03:47:14 +00:00
|
|
|
return;
|
2006-10-26 21:42:22 +00:00
|
|
|
td->td_pri_class = class;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return some of the child's priority and interactivity to the parent.
|
|
|
|
*/
|
|
|
|
void
|
2006-12-06 06:55:59 +00:00
|
|
|
sched_exit(struct proc *p, struct thread *child)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
struct thread *td;
|
2006-10-26 21:42:22 +00:00
|
|
|
|
2009-01-17 07:17:57 +00:00
|
|
|
KTR_STATE1(KTR_SCHED, "thread", sched_tdname(child), "proc exit",
|
2011-08-26 18:00:07 +00:00
|
|
|
"prio:%d", child->td_priority);
|
2008-03-19 06:19:01 +00:00
|
|
|
PROC_LOCK_ASSERT(p, MA_OWNED);
|
2007-01-04 08:56:25 +00:00
|
|
|
td = FIRST_THREAD_IN_PROC(p);
|
|
|
|
sched_exit_thread(td, child);
|
2006-12-06 06:34:57 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Penalize another thread for the time spent on this one. This helps to
|
|
|
|
* worsen the priority and interactivity of processes which schedule batch
|
|
|
|
* jobs such as make. This has little effect on the make process itself but
|
|
|
|
* causes new processes spawned by it to receive worse scores immediately.
|
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
void
|
2006-12-06 06:55:59 +00:00
|
|
|
sched_exit_thread(struct thread *td, struct thread *child)
|
2006-12-06 06:34:57 +00:00
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2009-01-17 07:17:57 +00:00
|
|
|
KTR_STATE1(KTR_SCHED, "thread", sched_tdname(child), "thread exit",
|
2011-08-26 18:00:07 +00:00
|
|
|
"prio:%d", child->td_priority);
|
2007-01-04 08:56:25 +00:00
|
|
|
/*
|
|
|
|
* Give the child's runtime to the parent without returning the
|
|
|
|
* sleep time as a penalty to the parent. This causes shells that
|
|
|
|
* launch expensive things to mark their children as expensive.
|
|
|
|
*/
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_lock(td);
|
2007-07-17 22:53:23 +00:00
|
|
|
td->td_sched->ts_runtime += child->td_sched->ts_runtime;
|
2006-12-06 06:55:59 +00:00
|
|
|
sched_interact_update(td);
|
2007-01-04 08:56:25 +00:00
|
|
|
sched_priority(td);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_unlock(td);
|
2006-12-06 06:34:57 +00:00
|
|
|
}
|
|
|
|
|
2008-03-10 01:32:01 +00:00
|
|
|
void
|
|
|
|
sched_preempt(struct thread *td)
|
|
|
|
{
|
|
|
|
struct tdq *tdq;
|
|
|
|
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE2(sched, , , surrender, td, td->td_proc);
|
|
|
|
|
2008-03-10 01:32:01 +00:00
|
|
|
thread_lock(td);
|
|
|
|
tdq = TDQ_SELF();
|
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
|
|
|
tdq->tdq_ipipending = 0;
|
|
|
|
if (td->td_priority > tdq->tdq_lowpri) {
|
2008-04-17 04:20:10 +00:00
|
|
|
int flags;
|
|
|
|
|
|
|
|
flags = SW_INVOL | SW_PREEMPT;
|
2008-03-10 01:32:01 +00:00
|
|
|
if (td->td_critnest > 1)
|
|
|
|
td->td_owepreempt = 1;
|
2008-04-17 04:20:10 +00:00
|
|
|
else if (TD_IS_IDLETHREAD(td))
|
|
|
|
mi_switch(flags | SWT_REMOTEWAKEIDLE, NULL);
|
2008-03-10 01:32:01 +00:00
|
|
|
else
|
2008-04-17 04:20:10 +00:00
|
|
|
mi_switch(flags | SWT_REMOTEPREEMPT, NULL);
|
2008-03-10 01:32:01 +00:00
|
|
|
}
|
|
|
|
thread_unlock(td);
|
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Fix priorities on return to user-space. Priorities may be elevated due
|
|
|
|
* to static priorities in msleep() or similar.
|
|
|
|
*/
|
2006-12-06 06:34:57 +00:00
|
|
|
void
|
|
|
|
sched_userret(struct thread *td)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* XXX we cheat slightly on the locking here to avoid locking in
|
|
|
|
* the usual case. Setting td_priority here is essentially an
|
|
|
|
* incomplete workaround for not setting it properly elsewhere.
|
|
|
|
* Now that some interrupt handlers are threads, not setting it
|
|
|
|
* properly elsewhere can clobber it in the window between setting
|
|
|
|
* it here and returning to user mode, so don't waste time setting
|
|
|
|
* it perfectly here.
|
|
|
|
*/
|
|
|
|
KASSERT((td->td_flags & TDF_BORROWING) == 0,
|
|
|
|
("thread with borrowed priority returning to userland"));
|
|
|
|
if (td->td_priority != td->td_user_pri) {
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_lock(td);
|
2006-12-06 06:34:57 +00:00
|
|
|
td->td_priority = td->td_user_pri;
|
|
|
|
td->td_base_pri = td->td_user_pri;
|
2008-03-10 01:32:01 +00:00
|
|
|
tdq_setlowpri(TDQ_SELF(), td);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_unlock(td);
|
2006-12-06 06:34:57 +00:00
|
|
|
}
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Handle a stathz tick. This is really only relevant for timeshare
|
|
|
|
* threads.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
2003-10-16 08:39:15 +00:00
|
|
|
sched_clock(struct thread *td)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct tdq *tdq;
|
|
|
|
struct td_sched *ts;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2007-01-04 12:16:19 +00:00
|
|
|
tdq = TDQ_SELF();
|
2007-10-02 00:36:06 +00:00
|
|
|
#ifdef SMP
|
|
|
|
/*
|
|
|
|
* We run the long term load balancer infrequently on the first cpu.
|
|
|
|
*/
|
|
|
|
if (balance_tdq == tdq) {
|
|
|
|
if (balance_ticks && --balance_ticks == 0)
|
|
|
|
sched_balance();
|
|
|
|
}
|
|
|
|
#endif
|
2008-04-17 09:56:01 +00:00
|
|
|
/*
|
|
|
|
* Save the old switch count so we have a record of the last ticks
|
|
|
|
* activity. Initialize the new switch count based on our load.
|
|
|
|
* If there is some activity seed it to reflect that.
|
|
|
|
*/
|
|
|
|
tdq->tdq_oldswitchcnt = tdq->tdq_switchcnt;
|
2008-04-25 05:18:50 +00:00
|
|
|
tdq->tdq_switchcnt = tdq->tdq_load;
|
2004-08-10 07:52:21 +00:00
|
|
|
/*
|
2007-01-04 12:16:19 +00:00
|
|
|
* Advance the insert index once for each tick to ensure that all
|
|
|
|
* threads get a chance to run.
|
2004-08-10 07:52:21 +00:00
|
|
|
*/
|
2007-01-04 12:16:19 +00:00
|
|
|
if (tdq->tdq_idx == tdq->tdq_ridx) {
|
|
|
|
tdq->tdq_idx = (tdq->tdq_idx + 1) % RQ_NQS;
|
|
|
|
if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
|
|
|
|
tdq->tdq_ridx = tdq->tdq_idx;
|
|
|
|
}
|
|
|
|
ts = td->td_sched;
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(ts, 1);
|
2008-01-05 04:47:31 +00:00
|
|
|
if (td->td_pri_class & PRI_FIFO_BIT)
|
2003-10-27 06:47:05 +00:00
|
|
|
return;
|
2011-01-11 22:13:19 +00:00
|
|
|
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
|
2008-01-05 04:47:31 +00:00
|
|
|
/*
|
|
|
|
* We used a tick; charge it to the thread so
|
|
|
|
* that we can compute our interactivity.
|
|
|
|
*/
|
|
|
|
td->td_sched->ts_runtime += tickincr;
|
|
|
|
sched_interact_update(td);
|
2008-03-10 03:15:19 +00:00
|
|
|
sched_priority(td);
|
2008-01-05 04:47:31 +00:00
|
|
|
}
|
2012-08-10 19:02:49 +00:00
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
/*
|
2012-08-10 19:02:49 +00:00
|
|
|
* Force a context switch if the current thread has used up a full
|
|
|
|
* time slice (default is 100ms).
|
2003-01-26 05:23:15 +00:00
|
|
|
*/
|
2012-08-10 19:02:49 +00:00
|
|
|
if (!TD_IS_IDLETHREAD(td) && --ts->ts_slice <= 0) {
|
|
|
|
ts->ts_slice = sched_slice;
|
|
|
|
td->td_flags |= TDF_NEEDRESCHED | TDF_SLICEEND;
|
|
|
|
}
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
2012-03-13 08:18:54 +00:00
|
|
|
* Called once per hz tick.
|
2007-07-17 22:53:23 +00:00
|
|
|
*/
|
|
|
|
void
|
Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
2010-09-13 07:25:35 +00:00
|
|
|
sched_tick(int cnt)
|
2007-07-17 22:53:23 +00:00
|
|
|
{
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return whether the current CPU has runnable tasks. Used for in-kernel
|
|
|
|
* cooperative idle threads.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
int
|
|
|
|
sched_runnable(void)
|
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct tdq *tdq;
|
2003-06-08 00:47:33 +00:00
|
|
|
int load;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2003-06-08 00:47:33 +00:00
|
|
|
load = 1;
|
|
|
|
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq = TDQ_SELF();
|
2003-10-27 06:47:05 +00:00
|
|
|
if ((curthread->td_flags & TDF_IDLETD) != 0) {
|
2006-12-29 10:37:07 +00:00
|
|
|
if (tdq->tdq_load > 0)
|
2003-10-27 06:47:05 +00:00
|
|
|
goto out;
|
|
|
|
} else
|
2006-12-29 10:37:07 +00:00
|
|
|
if (tdq->tdq_load - 1 > 0)
|
2003-10-27 06:47:05 +00:00
|
|
|
goto out;
|
2003-06-08 00:47:33 +00:00
|
|
|
load = 0;
|
|
|
|
out:
|
|
|
|
return (load);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Choose the highest priority thread to run. The thread is removed from
|
|
|
|
* the run-queue while running however the load remains. For SMP we set
|
|
|
|
* the tdq in the global idle bitmask if it idles here.
|
|
|
|
*/
|
2007-01-23 08:50:34 +00:00
|
|
|
struct thread *
|
2003-01-28 09:28:20 +00:00
|
|
|
sched_choose(void)
|
|
|
|
{
|
2008-03-20 05:51:16 +00:00
|
|
|
struct thread *td;
|
2007-07-17 22:53:23 +00:00
|
|
|
struct tdq *tdq;
|
2003-01-28 09:28:20 +00:00
|
|
|
|
2006-12-06 06:34:57 +00:00
|
|
|
tdq = TDQ_SELF();
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
2008-03-20 05:51:16 +00:00
|
|
|
td = tdq_choose(tdq);
|
|
|
|
if (td) {
|
|
|
|
tdq_runq_rem(tdq, td);
|
2008-04-04 01:16:18 +00:00
|
|
|
tdq->tdq_lowpri = td->td_priority;
|
2008-03-20 05:51:16 +00:00
|
|
|
return (td);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
2008-04-04 01:16:18 +00:00
|
|
|
tdq->tdq_lowpri = PRI_MAX_IDLE;
|
2008-03-02 08:20:59 +00:00
|
|
|
return (PCPU_GET(idlethread));
|
2007-01-23 08:50:34 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Set owepreempt if necessary. Preemption never happens directly in ULE,
|
|
|
|
* we always request it once we exit a critical section.
|
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
sched_setpreempt(struct thread *td)
|
2007-01-23 08:50:34 +00:00
|
|
|
{
|
|
|
|
struct thread *ctd;
|
|
|
|
int cpri;
|
|
|
|
int pri;
|
|
|
|
|
2008-03-10 01:32:01 +00:00
|
|
|
THREAD_LOCK_ASSERT(curthread, MA_OWNED);
|
|
|
|
|
2007-01-23 08:50:34 +00:00
|
|
|
ctd = curthread;
|
|
|
|
pri = td->td_priority;
|
|
|
|
cpri = ctd->td_priority;
|
2008-03-10 01:32:01 +00:00
|
|
|
if (pri < cpri)
|
|
|
|
ctd->td_flags |= TDF_NEEDRESCHED;
|
2007-01-23 08:50:34 +00:00
|
|
|
if (panicstr != NULL || pri >= cpri || cold || TD_IS_INHIBITED(ctd))
|
2007-07-17 22:53:23 +00:00
|
|
|
return;
|
2008-03-10 01:32:01 +00:00
|
|
|
if (!sched_shouldpreempt(pri, cpri, 0))
|
2007-07-17 22:53:23 +00:00
|
|
|
return;
|
|
|
|
ctd->td_owepreempt = 1;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
2008-03-10 03:15:19 +00:00
|
|
|
* Add a thread to a thread queue. Select the appropriate runq and add the
|
|
|
|
* thread to it. This is the internal function called when the tdq is
|
|
|
|
* predetermined.
|
2007-07-17 22:53:23 +00:00
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
2007-07-17 22:53:23 +00:00
|
|
|
tdq_add(struct tdq *tdq, struct thread *td, int flags)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
2007-01-23 08:50:34 +00:00
|
|
|
KASSERT((td->td_inhibitors == 0),
|
|
|
|
("sched_add: trying to run inhibited thread"));
|
|
|
|
KASSERT((TD_CAN_RUN(td) || TD_IS_RUNNING(td)),
|
|
|
|
("sched_add: bad thread state"));
|
2007-09-17 05:31:39 +00:00
|
|
|
KASSERT(td->td_flags & TDF_INMEM,
|
|
|
|
("sched_add: thread swapped out"));
|
2007-07-17 22:53:23 +00:00
|
|
|
|
|
|
|
if (td->td_priority < tdq->tdq_lowpri)
|
|
|
|
tdq->tdq_lowpri = td->td_priority;
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_runq_add(tdq, td, flags);
|
|
|
|
tdq_load_add(tdq, td);
|
2007-07-17 22:53:23 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Select the target thread queue and add a thread to it. Request
|
|
|
|
* preemption or IPI a remote processor if required.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
sched_add(struct thread *td, int flags)
|
|
|
|
{
|
|
|
|
struct tdq *tdq;
|
2007-01-19 21:56:08 +00:00
|
|
|
#ifdef SMP
|
2007-07-17 22:53:23 +00:00
|
|
|
int cpu;
|
|
|
|
#endif
|
2009-01-17 07:17:57 +00:00
|
|
|
|
|
|
|
KTR_STATE2(KTR_SCHED, "thread", sched_tdname(td), "runq add",
|
|
|
|
"prio:%d", td->td_priority, KTR_ATTR_LINKED,
|
|
|
|
sched_tdname(curthread));
|
|
|
|
KTR_POINT1(KTR_SCHED, "thread", sched_tdname(curthread), "wokeup",
|
|
|
|
KTR_ATTR_LINKED, sched_tdname(td));
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE4(sched, , , enqueue, td, td->td_proc, NULL,
|
|
|
|
flags & SRQ_PREEMPTED);
|
2007-07-17 22:53:23 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
|
|
|
/*
|
|
|
|
* Recalculate the priority before we select the target cpu or
|
|
|
|
* run-queue.
|
|
|
|
*/
|
|
|
|
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE)
|
|
|
|
sched_priority(td);
|
|
|
|
#ifdef SMP
|
|
|
|
/*
|
|
|
|
* Pick the destination cpu and if it isn't ours transfer to the
|
|
|
|
* target cpu.
|
|
|
|
*/
|
2008-03-20 05:51:16 +00:00
|
|
|
cpu = sched_pickcpu(td, flags);
|
|
|
|
tdq = sched_setcpu(td, cpu, flags);
|
2007-07-17 22:53:23 +00:00
|
|
|
tdq_add(tdq, td, flags);
|
2008-03-10 03:15:19 +00:00
|
|
|
if (cpu != PCPU_GET(cpuid)) {
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_notify(tdq, td);
|
2007-01-19 21:56:08 +00:00
|
|
|
return;
|
|
|
|
}
|
2007-07-17 22:53:23 +00:00
|
|
|
#else
|
|
|
|
tdq = TDQ_SELF();
|
|
|
|
TDQ_LOCK(tdq);
|
|
|
|
/*
|
|
|
|
* Now that the thread is moving to the run-queue, set the lock
|
|
|
|
* to the scheduler's lock.
|
|
|
|
*/
|
|
|
|
thread_lock_set(td, TDQ_LOCKPTR(tdq));
|
|
|
|
tdq_add(tdq, td, flags);
|
2007-01-19 21:56:08 +00:00
|
|
|
#endif
|
2007-07-17 22:53:23 +00:00
|
|
|
if (!(flags & SRQ_YIELDING))
|
|
|
|
sched_setpreempt(td);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Remove a thread from a run-queue without running it. This is used
|
|
|
|
* when we're stealing a thread from a remote queue. Otherwise all threads
|
|
|
|
* exit by calling sched_exit_thread() and sched_throw() themselves.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
void
|
2003-10-16 08:39:15 +00:00
|
|
|
sched_rem(struct thread *td)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct tdq *tdq;
|
2003-10-16 08:39:15 +00:00
|
|
|
|
2009-01-17 07:17:57 +00:00
|
|
|
KTR_STATE1(KTR_SCHED, "thread", sched_tdname(td), "runq rem",
|
|
|
|
"prio:%d", td->td_priority);
|
2012-05-15 01:30:25 +00:00
|
|
|
SDT_PROBE3(sched, , , dequeue, td, td->td_proc, NULL);
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq = TDQ_CPU(td->td_sched->ts_cpu);
|
2007-07-17 22:53:23 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
|
|
|
|
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
|
2007-01-23 08:50:34 +00:00
|
|
|
KASSERT(TD_ON_RUNQ(td),
|
2006-12-06 06:34:57 +00:00
|
|
|
("sched_rem: thread not on run queue"));
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_runq_rem(tdq, td);
|
|
|
|
tdq_load_rem(tdq, td);
|
2007-01-23 08:50:34 +00:00
|
|
|
TD_SET_CAN_RUN(td);
|
2008-03-02 08:20:59 +00:00
|
|
|
if (td->td_priority == tdq->tdq_lowpri)
|
|
|
|
tdq_setlowpri(tdq, NULL);
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Fetch cpu utilization information. Updates on demand.
|
|
|
|
*/
|
2003-01-26 05:23:15 +00:00
|
|
|
fixpt_t
|
2003-10-16 08:39:15 +00:00
|
|
|
sched_pctcpu(struct thread *td)
|
2003-01-26 05:23:15 +00:00
|
|
|
{
|
|
|
|
fixpt_t pctcpu;
|
2006-12-06 06:34:57 +00:00
|
|
|
struct td_sched *ts;
|
2003-01-26 05:23:15 +00:00
|
|
|
|
|
|
|
pctcpu = 0;
|
2006-12-06 06:34:57 +00:00
|
|
|
ts = td->td_sched;
|
|
|
|
if (ts == NULL)
|
2003-10-20 19:55:21 +00:00
|
|
|
return (0);
|
2003-01-26 05:23:15 +00:00
|
|
|
|
2010-06-03 16:02:11 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2012-03-13 08:18:54 +00:00
|
|
|
sched_pctcpu_update(ts, TD_IS_RUNNING(td));
|
2006-12-06 06:34:57 +00:00
|
|
|
if (ts->ts_ticks) {
|
2003-01-26 05:23:15 +00:00
|
|
|
int rtick;
|
|
|
|
|
|
|
|
/* How many rtick per second ? */
|
2007-01-04 08:56:25 +00:00
|
|
|
rtick = min(SCHED_TICK_HZ(ts) / SCHED_TICK_SECS, hz);
|
|
|
|
pctcpu = (FSCALE * ((FSCALE * rtick)/hz)) >> FSHIFT;
|
2003-01-26 05:23:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return (pctcpu);
|
|
|
|
}
|
|
|
|
|
2008-03-02 08:20:59 +00:00
|
|
|
/*
|
|
|
|
* Enforce affinity settings for a thread. Called after adjustments to
|
|
|
|
* cpumask.
|
|
|
|
*/
|
2008-03-02 07:19:35 +00:00
|
|
|
void
|
|
|
|
sched_affinity(struct thread *td)
|
|
|
|
{
|
2008-03-02 08:20:59 +00:00
|
|
|
#ifdef SMP
|
|
|
|
struct td_sched *ts;
|
|
|
|
|
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
|
|
|
ts = td->td_sched;
|
|
|
|
if (THREAD_CAN_SCHED(td, ts->ts_cpu))
|
|
|
|
return;
|
2009-03-14 11:41:36 +00:00
|
|
|
if (TD_ON_RUNQ(td)) {
|
|
|
|
sched_rem(td);
|
|
|
|
sched_add(td, SRQ_BORING);
|
|
|
|
return;
|
|
|
|
}
|
2008-03-02 08:20:59 +00:00
|
|
|
if (!TD_IS_RUNNING(td))
|
|
|
|
return;
|
|
|
|
/*
|
2010-09-01 20:32:47 +00:00
|
|
|
* Force a switch before returning to userspace. If the
|
|
|
|
* target thread is not running locally send an ipi to force
|
|
|
|
* the issue.
|
2008-03-02 08:20:59 +00:00
|
|
|
*/
|
2010-09-21 19:12:22 +00:00
|
|
|
td->td_flags |= TDF_NEEDRESCHED;
|
2010-09-01 20:32:47 +00:00
|
|
|
if (td != curthread)
|
|
|
|
ipi_cpu(ts->ts_cpu, IPI_PREEMPT);
|
2008-03-02 08:20:59 +00:00
|
|
|
#endif
|
2008-03-02 07:19:35 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Bind a thread to a target cpu.
|
|
|
|
*/
|
2003-11-04 07:45:41 +00:00
|
|
|
void
|
|
|
|
sched_bind(struct thread *td, int cpu)
|
|
|
|
{
|
2006-12-06 06:34:57 +00:00
|
|
|
struct td_sched *ts;
|
2003-11-04 07:45:41 +00:00
|
|
|
|
2007-08-03 23:38:46 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED|MA_NOTRECURSED);
|
2010-05-21 17:15:56 +00:00
|
|
|
KASSERT(td == curthread, ("sched_bind: can only bind curthread"));
|
2006-12-06 06:34:57 +00:00
|
|
|
ts = td->td_sched;
|
2007-01-20 09:03:43 +00:00
|
|
|
if (ts->ts_flags & TSF_BOUND)
|
2007-01-20 17:03:33 +00:00
|
|
|
sched_unbind(td);
|
2010-09-01 20:32:47 +00:00
|
|
|
KASSERT(THREAD_CAN_MIGRATE(td), ("%p must be migratable", td));
|
2006-12-06 06:34:57 +00:00
|
|
|
ts->ts_flags |= TSF_BOUND;
|
2007-01-20 09:03:43 +00:00
|
|
|
sched_pin();
|
2003-12-11 03:57:10 +00:00
|
|
|
if (PCPU_GET(cpuid) == cpu)
|
2003-11-04 07:45:41 +00:00
|
|
|
return;
|
2007-01-20 09:03:43 +00:00
|
|
|
ts->ts_cpu = cpu;
|
2003-11-04 07:45:41 +00:00
|
|
|
/* When we return from mi_switch we'll be on the correct cpu. */
|
2004-07-03 16:57:51 +00:00
|
|
|
mi_switch(SW_VOL, NULL);
|
2003-11-04 07:45:41 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Release a bound thread.
|
|
|
|
*/
|
2003-11-04 07:45:41 +00:00
|
|
|
void
|
|
|
|
sched_unbind(struct thread *td)
|
|
|
|
{
|
2007-01-04 08:56:25 +00:00
|
|
|
struct td_sched *ts;
|
|
|
|
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2010-05-21 17:15:56 +00:00
|
|
|
KASSERT(td == curthread, ("sched_unbind: can only bind curthread"));
|
2007-01-04 08:56:25 +00:00
|
|
|
ts = td->td_sched;
|
2007-01-20 09:03:43 +00:00
|
|
|
if ((ts->ts_flags & TSF_BOUND) == 0)
|
|
|
|
return;
|
2007-01-04 08:56:25 +00:00
|
|
|
ts->ts_flags &= ~TSF_BOUND;
|
|
|
|
sched_unpin();
|
2003-11-04 07:45:41 +00:00
|
|
|
}
|
|
|
|
|
2005-04-19 04:01:25 +00:00
|
|
|
int
|
|
|
|
sched_is_bound(struct thread *td)
|
|
|
|
{
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
THREAD_LOCK_ASSERT(td, MA_OWNED);
|
2006-12-06 06:34:57 +00:00
|
|
|
return (td->td_sched->ts_flags & TSF_BOUND);
|
2005-04-19 04:01:25 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Basic yield call.
|
|
|
|
*/
|
2006-06-15 06:37:39 +00:00
|
|
|
void
|
|
|
|
sched_relinquish(struct thread *td)
|
|
|
|
{
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_lock(td);
|
2008-04-17 04:20:10 +00:00
|
|
|
mi_switch(SW_VOL | SWT_RELINQUISH, NULL);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
thread_unlock(td);
|
2006-06-15 06:37:39 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* Return the total system load.
|
|
|
|
*/
|
2004-02-01 02:48:36 +00:00
|
|
|
int
|
|
|
|
sched_load(void)
|
|
|
|
{
|
|
|
|
#ifdef SMP
|
|
|
|
int total;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
total = 0;
|
2010-06-11 18:46:34 +00:00
|
|
|
CPU_FOREACH(i)
|
2008-03-02 08:20:59 +00:00
|
|
|
total += TDQ_CPU(i)->tdq_sysload;
|
2004-02-01 02:48:36 +00:00
|
|
|
return (total);
|
|
|
|
#else
|
2006-12-29 10:37:07 +00:00
|
|
|
return (TDQ_SELF()->tdq_sysload);
|
2004-02-01 02:48:36 +00:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2003-01-26 05:23:15 +00:00
|
|
|
int
|
|
|
|
sched_sizeof_proc(void)
|
|
|
|
{
|
|
|
|
return (sizeof(struct proc));
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
sched_sizeof_thread(void)
|
|
|
|
{
|
|
|
|
return (sizeof(struct thread) + sizeof(struct td_sched));
|
|
|
|
}
|
Add scheduler CORE, the work I have done half a year ago, recent,
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.
Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
timeslice and interaction detecting algorithm are based
on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.
2006-06-13 13:12:56 +00:00
|
|
|
|
2009-04-29 23:04:31 +00:00
|
|
|
#ifdef SMP
|
|
|
|
#define TDQ_IDLESPIN(tdq) \
|
|
|
|
((tdq)->tdq_cg != NULL && ((tdq)->tdq_cg->cg_flags & CG_FLAG_THREAD) == 0)
|
|
|
|
#else
|
|
|
|
#define TDQ_IDLESPIN(tdq) 1
|
|
|
|
#endif
|
|
|
|
|
2007-01-23 08:50:34 +00:00
|
|
|
/*
|
|
|
|
* The actual idle process.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
sched_idletd(void *dummy)
|
|
|
|
{
|
|
|
|
struct thread *td;
|
2007-07-17 22:53:23 +00:00
|
|
|
struct tdq *tdq;
|
2008-04-17 09:56:01 +00:00
|
|
|
int switchcnt;
|
|
|
|
int i;
|
2007-01-23 08:50:34 +00:00
|
|
|
|
2009-04-29 03:15:43 +00:00
|
|
|
mtx_assert(&Giant, MA_NOTOWNED);
|
2007-01-23 08:50:34 +00:00
|
|
|
td = curthread;
|
2007-07-17 22:53:23 +00:00
|
|
|
tdq = TDQ_SELF();
|
|
|
|
for (;;) {
|
|
|
|
#ifdef SMP
|
2008-04-17 09:56:01 +00:00
|
|
|
if (tdq_idled(tdq) == 0)
|
|
|
|
continue;
|
2007-07-17 22:53:23 +00:00
|
|
|
#endif
|
2008-04-17 09:56:01 +00:00
|
|
|
switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
|
|
|
|
/*
|
|
|
|
* If we're switching very frequently, spin while checking
|
|
|
|
* for load rather than entering a low power state that
|
2009-04-29 03:15:43 +00:00
|
|
|
* may require an IPI. However, don't do any busy
|
|
|
|
* loops while on SMT machines as this simply steals
|
|
|
|
* cycles from cores doing useful work.
|
2008-04-17 09:56:01 +00:00
|
|
|
*/
|
2009-04-29 23:04:31 +00:00
|
|
|
if (TDQ_IDLESPIN(tdq) && switchcnt > sched_idlespinthresh) {
|
2008-04-17 09:56:01 +00:00
|
|
|
for (i = 0; i < sched_idlespins; i++) {
|
|
|
|
if (tdq->tdq_load)
|
|
|
|
break;
|
|
|
|
cpu_spinwait();
|
|
|
|
}
|
|
|
|
}
|
2009-04-29 03:15:43 +00:00
|
|
|
switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
|
2010-09-10 13:24:47 +00:00
|
|
|
if (tdq->tdq_load == 0) {
|
|
|
|
tdq->tdq_cpu_idle = 1;
|
|
|
|
if (tdq->tdq_load == 0) {
|
Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
2010-09-13 07:25:35 +00:00
|
|
|
cpu_idle(switchcnt > sched_idlespinthresh * 4);
|
2010-09-10 13:24:47 +00:00
|
|
|
tdq->tdq_switchcnt++;
|
|
|
|
}
|
|
|
|
tdq->tdq_cpu_idle = 0;
|
|
|
|
}
|
2008-04-17 09:56:01 +00:00
|
|
|
if (tdq->tdq_load) {
|
|
|
|
thread_lock(td);
|
|
|
|
mi_switch(SW_VOL | SWT_IDLE, NULL);
|
|
|
|
thread_unlock(td);
|
|
|
|
}
|
2007-07-17 22:53:23 +00:00
|
|
|
}
|
Add scheduler CORE, the work I have done half a year ago, recent,
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.
Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
timeslice and interaction detecting algorithm are based
on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.
2006-06-13 13:12:56 +00:00
|
|
|
}
|
2007-01-04 08:56:25 +00:00
|
|
|
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
/*
|
|
|
|
* A CPU is entering for the first time or a thread is exiting.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
sched_throw(struct thread *td)
|
|
|
|
{
|
2007-10-02 01:30:18 +00:00
|
|
|
struct thread *newtd;
|
2007-07-17 22:53:23 +00:00
|
|
|
struct tdq *tdq;
|
|
|
|
|
|
|
|
tdq = TDQ_SELF();
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
if (td == NULL) {
|
2007-07-17 22:53:23 +00:00
|
|
|
/* Correct spinlock nesting and acquire the correct lock. */
|
|
|
|
TDQ_LOCK(tdq);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
spinlock_exit();
|
2012-01-03 21:03:28 +00:00
|
|
|
PCPU_SET(switchtime, cpu_ticks());
|
|
|
|
PCPU_SET(switchticks, ticks);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
} else {
|
2007-07-17 22:53:23 +00:00
|
|
|
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
|
2008-03-20 05:51:16 +00:00
|
|
|
tdq_load_rem(tdq, td);
|
2007-12-15 23:13:31 +00:00
|
|
|
lock_profile_release_lock(&TDQ_LOCKPTR(tdq)->lock_object);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
}
|
|
|
|
KASSERT(curthread->td_md.md_spinlock_count == 1, ("invalid count"));
|
2007-10-02 01:30:18 +00:00
|
|
|
newtd = choosethread();
|
|
|
|
TDQ_LOCKPTR(tdq)->mtx_lock = (uintptr_t)newtd;
|
|
|
|
cpu_throw(td, newtd); /* doesn't return */
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
}
|
|
|
|
|
2007-07-17 22:53:23 +00:00
|
|
|
/*
|
|
|
|
* This is called from fork_exit(). Just acquire the correct locks and
|
|
|
|
* let fork do the rest of the work.
|
|
|
|
*/
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
void
|
2007-06-12 07:47:09 +00:00
|
|
|
sched_fork_exit(struct thread *td)
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
{
|
2007-07-17 22:53:23 +00:00
|
|
|
struct td_sched *ts;
|
|
|
|
struct tdq *tdq;
|
|
|
|
int cpuid;
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Finish setting up thread glue so that it begins execution in a
|
2007-07-17 22:53:23 +00:00
|
|
|
* non-nested critical section with the scheduler lock held.
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
*/
|
2007-07-17 22:53:23 +00:00
|
|
|
cpuid = PCPU_GET(cpuid);
|
|
|
|
tdq = TDQ_CPU(cpuid);
|
|
|
|
ts = td->td_sched;
|
|
|
|
if (TD_IS_IDLETHREAD(td))
|
|
|
|
td->td_lock = TDQ_LOCKPTR(tdq);
|
|
|
|
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
|
|
|
|
td->td_oncpu = cpuid;
|
2007-10-02 01:30:18 +00:00
|
|
|
TDQ_LOCK_ASSERT(tdq, MA_OWNED | MA_NOTRECURSED);
|
2007-12-15 23:13:31 +00:00
|
|
|
lock_profile_obtain_lock_success(
|
|
|
|
&TDQ_LOCKPTR(tdq)->lock_object, 0, 0, __FILE__, __LINE__);
|
Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
|
|
|
}
|
|
|
|
|
2009-01-17 07:17:57 +00:00
|
|
|
/*
|
|
|
|
* Create on first use to catch odd startup conditons.
|
|
|
|
*/
|
|
|
|
char *
|
|
|
|
sched_tdname(struct thread *td)
|
|
|
|
{
|
|
|
|
#ifdef KTR
|
|
|
|
struct td_sched *ts;
|
|
|
|
|
|
|
|
ts = td->td_sched;
|
|
|
|
if (ts->ts_name[0] == '\0')
|
|
|
|
snprintf(ts->ts_name, sizeof(ts->ts_name),
|
|
|
|
"%s tid %d", td->td_name, td->td_tid);
|
|
|
|
return (ts->ts_name);
|
|
|
|
#else
|
|
|
|
return (td->td_name);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2012-03-08 19:41:05 +00:00
|
|
|
#ifdef KTR
|
|
|
|
void
|
|
|
|
sched_clear_tdname(struct thread *td)
|
|
|
|
{
|
|
|
|
struct td_sched *ts;
|
|
|
|
|
|
|
|
ts = td->td_sched;
|
|
|
|
ts->ts_name[0] = '\0';
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
#ifdef SMP
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build the CPU topology dump string. Is recursively called to collect
|
|
|
|
* the topology tree.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
sysctl_kern_sched_topology_spec_internal(struct sbuf *sb, struct cpu_group *cg,
|
|
|
|
int indent)
|
|
|
|
{
|
Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
|
|
|
char cpusetbuf[CPUSETBUFSIZ];
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
int i, first;
|
|
|
|
|
|
|
|
sbuf_printf(sb, "%*s<group level=\"%d\" cache-level=\"%d\">\n", indent,
|
2010-09-18 11:16:43 +00:00
|
|
|
"", 1 + indent / 2, cg->cg_level);
|
Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
|
|
|
sbuf_printf(sb, "%*s <cpu count=\"%d\" mask=\"%s\">", indent, "",
|
|
|
|
cg->cg_count, cpusetobj_strprint(cpusetbuf, &cg->cg_mask));
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
first = TRUE;
|
|
|
|
for (i = 0; i < MAXCPU; i++) {
|
Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
|
|
|
if (CPU_ISSET(i, &cg->cg_mask)) {
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
if (!first)
|
|
|
|
sbuf_printf(sb, ", ");
|
|
|
|
else
|
|
|
|
first = FALSE;
|
|
|
|
sbuf_printf(sb, "%d", i);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
sbuf_printf(sb, "</cpu>\n");
|
|
|
|
|
|
|
|
if (cg->cg_flags != 0) {
|
2010-07-15 13:46:30 +00:00
|
|
|
sbuf_printf(sb, "%*s <flags>", indent, "");
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
if ((cg->cg_flags & CG_FLAG_HTT) != 0)
|
2010-06-10 11:01:17 +00:00
|
|
|
sbuf_printf(sb, "<flag name=\"HTT\">HTT group</flag>");
|
2010-06-10 11:48:14 +00:00
|
|
|
if ((cg->cg_flags & CG_FLAG_THREAD) != 0)
|
|
|
|
sbuf_printf(sb, "<flag name=\"THREAD\">THREAD group</flag>");
|
2009-04-29 03:15:43 +00:00
|
|
|
if ((cg->cg_flags & CG_FLAG_SMT) != 0)
|
2010-06-10 11:48:14 +00:00
|
|
|
sbuf_printf(sb, "<flag name=\"SMT\">SMT group</flag>");
|
2010-07-15 13:46:30 +00:00
|
|
|
sbuf_printf(sb, "</flags>\n");
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (cg->cg_children > 0) {
|
|
|
|
sbuf_printf(sb, "%*s <children>\n", indent, "");
|
|
|
|
for (i = 0; i < cg->cg_children; i++)
|
|
|
|
sysctl_kern_sched_topology_spec_internal(sb,
|
|
|
|
&cg->cg_child[i], indent+2);
|
|
|
|
sbuf_printf(sb, "%*s </children>\n", indent, "");
|
|
|
|
}
|
|
|
|
sbuf_printf(sb, "%*s</group>\n", indent, "");
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sysctl handler for retrieving topology dump. It's a wrapper for
|
|
|
|
* the recursive sysctl_kern_smp_topology_spec_internal().
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
sysctl_kern_sched_topology_spec(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
struct sbuf *topo;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
KASSERT(cpu_top != NULL, ("cpu_top isn't initialized"));
|
|
|
|
|
2008-11-02 23:11:20 +00:00
|
|
|
topo = sbuf_new(NULL, NULL, 500, SBUF_AUTOEXTEND);
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
if (topo == NULL)
|
|
|
|
return (ENOMEM);
|
|
|
|
|
|
|
|
sbuf_printf(topo, "<groups>\n");
|
|
|
|
err = sysctl_kern_sched_topology_spec_internal(topo, cpu_top, 1);
|
|
|
|
sbuf_printf(topo, "</groups>\n");
|
|
|
|
|
|
|
|
if (err == 0) {
|
|
|
|
sbuf_finish(topo);
|
|
|
|
err = SYSCTL_OUT(req, sbuf_data(topo), sbuf_len(topo));
|
|
|
|
}
|
|
|
|
sbuf_delete(topo);
|
|
|
|
return (err);
|
|
|
|
}
|
2010-10-29 13:31:10 +00:00
|
|
|
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
#endif
|
|
|
|
|
2012-08-10 19:02:49 +00:00
|
|
|
static int
|
|
|
|
sysctl_kern_quantum(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
int error, new_val, period;
|
|
|
|
|
|
|
|
period = 1000000 / realstathz;
|
|
|
|
new_val = period * sched_slice;
|
|
|
|
error = sysctl_handle_int(oidp, &new_val, 0, req);
|
2012-08-11 20:24:39 +00:00
|
|
|
if (error != 0 || req->newptr == NULL)
|
2012-08-10 19:02:49 +00:00
|
|
|
return (error);
|
|
|
|
if (new_val <= 0)
|
|
|
|
return (EINVAL);
|
2012-08-11 20:24:39 +00:00
|
|
|
sched_slice = imax(1, (new_val + period / 2) / period);
|
|
|
|
hogticks = imax(1, (2 * hz * sched_slice + realstathz / 2) /
|
|
|
|
realstathz);
|
2012-08-10 19:02:49 +00:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-03-20 05:51:16 +00:00
|
|
|
SYSCTL_NODE(_kern, OID_AUTO, sched, CTLFLAG_RW, 0, "Scheduler");
|
2007-07-17 22:53:23 +00:00
|
|
|
SYSCTL_STRING(_kern_sched, OID_AUTO, name, CTLFLAG_RD, "ULE", 0,
|
2007-01-04 08:56:25 +00:00
|
|
|
"Scheduler name");
|
2012-08-10 19:02:49 +00:00
|
|
|
SYSCTL_PROC(_kern_sched, OID_AUTO, quantum, CTLTYPE_INT | CTLFLAG_RW,
|
|
|
|
NULL, 0, sysctl_kern_quantum, "I",
|
2012-08-11 20:24:39 +00:00
|
|
|
"Quantum for timeshare threads in microseconds");
|
2007-07-17 22:53:23 +00:00
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, slice, CTLFLAG_RW, &sched_slice, 0,
|
2012-08-11 20:24:39 +00:00
|
|
|
"Quantum for timeshare threads in stathz ticks");
|
2007-07-17 22:53:23 +00:00
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, interact, CTLFLAG_RW, &sched_interact, 0,
|
2012-08-11 20:24:39 +00:00
|
|
|
"Interactivity score threshold");
|
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, preempt_thresh, CTLFLAG_RW,
|
|
|
|
&preempt_thresh, 0,
|
|
|
|
"Maximal (lowest) priority for preemption");
|
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, static_boost, CTLFLAG_RW, &static_boost, 0,
|
|
|
|
"Assign static kernel priorities to sleeping threads");
|
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, idlespins, CTLFLAG_RW, &sched_idlespins, 0,
|
|
|
|
"Number of times idle thread will spin waiting for new work");
|
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, idlespinthresh, CTLFLAG_RW,
|
|
|
|
&sched_idlespinthresh, 0,
|
|
|
|
"Threshold before we will permit idle thread spinning");
|
2007-01-19 21:56:08 +00:00
|
|
|
#ifdef SMP
|
2007-07-17 22:53:23 +00:00
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, affinity, CTLFLAG_RW, &affinity, 0,
|
|
|
|
"Number of hz ticks to keep thread affinity for");
|
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, balance, CTLFLAG_RW, &rebalance, 0,
|
|
|
|
"Enables the long-term load balancer");
|
2007-10-02 00:36:06 +00:00
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, balance_interval, CTLFLAG_RW,
|
|
|
|
&balance_interval, 0,
|
2012-08-10 19:02:49 +00:00
|
|
|
"Average period in stathz ticks to run the long-term balancer");
|
2007-07-17 22:53:23 +00:00
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, steal_idle, CTLFLAG_RW, &steal_idle, 0,
|
|
|
|
"Attempts to steal work from other cores before idling");
|
2007-07-19 20:03:15 +00:00
|
|
|
SYSCTL_INT(_kern_sched, OID_AUTO, steal_thresh, CTLFLAG_RW, &steal_thresh, 0,
|
2012-08-11 20:24:39 +00:00
|
|
|
"Minimum load on remote CPU before we'll steal");
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
SYSCTL_PROC(_kern_sched, OID_AUTO, topology_spec, CTLTYPE_STRING |
|
2012-08-10 19:02:49 +00:00
|
|
|
CTLFLAG_RD, NULL, 0, sysctl_kern_sched_topology_spec, "A",
|
Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
2008-10-29 13:36:23 +00:00
|
|
|
"XML dump of detected CPU topology");
|
2007-01-19 21:56:08 +00:00
|
|
|
#endif
|
2007-01-04 08:56:25 +00:00
|
|
|
|
2007-09-21 04:10:23 +00:00
|
|
|
/* ps compat. All cpu percentages from ULE are weighted. */
|
2007-09-22 02:20:14 +00:00
|
|
|
static int ccpu = 0;
|
2007-01-04 08:56:25 +00:00
|
|
|
SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
|