freebsd-dev

Author	SHA1	Message	Date
George V. Neville-Neil	8f2ba63493	Point args[0] not at the thread that is ending but at the one that is starting. This is in line with practice in OpenSolaris. Note that this change is only in ULE and not in the 4BSD scheduler. Once this change settles in (MFC timeout has expired) we'll try it out on 4BSD as well. PR: 177706 Submitted by: Tiwei Bie MFC after: 1 month	2013-04-15 17:21:02 +00:00
Alexander Motin	2fd4047f32	Fix bug in r242852 that prevented CPU from becoming idle if kernel built without SMP support.	2012-11-15 14:10:51 +00:00
Alexander Motin	2c27cb3a34	Several optimizations to sched_idletd(): - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shootdown). If current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under high I/O rate, when TLB shootdowns cause numerous CPU wakeups, on 24-CPU system load stealing code may consume up to 25% of all CPU time without giving any benefits. - Change code that implements spinning for load to restart spin in case of context switch. Previous code periodically called cpu_idle() even under high interrupt/context switch rate. - Raise spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power. Reviewed by: jeff@	2012-11-10 07:02:57 +00:00
Jeff Roberson	5e5c387373	- Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice. Discussed with: mav	2012-11-08 01:46:47 +00:00
Attilio Rao	4ceaf45de5	Rework the known mutexes to benefit from staying on their own cache line by using struct mtx_padalign instead of manual frobbing. The sole exceptions are the nvme and sfxge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately. Reviewed by: jimharris, alc	2012-10-31 18:07:18 +00:00
Attilio Rao	a049aa05c9	tdq_lock_pair() already does spinlock_enter() so migration is not possible in sched_balance_pair(). Remove redundant sched_pin(). Reviewed by: marius, jeff	2012-10-30 12:25:52 +00:00
Jim Harris	39f819e2fc	Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle. This enables CPU searches (which read tdq_load) to operate independently of any contention on the spinlock. Some scheduler-intensive workloads running on an 8C single-socket SNB Xeon show considerable improvement with this change (2-3% perf improvement, 5-6% decrease in CPU util). Sponsored by: Intel Reviewed by: jeff	2012-10-24 18:36:41 +00:00
Eitan Adler	db702c59cf	remove duplicate semicolons where possible. Approved by: cperciva MFC after: 1 week	2012-10-22 03:00:37 +00:00
Andriy Gapon	e87fc7cf7b	sched_ule: fix inverted condition in reporting of priority lending via ktr Reviewed by: kan MFC after: 1 week	2012-09-14 19:55:28 +00:00
John Baldwin	ba96d2d816	Mark the idle threads as non-sleepable and also assert that an idle thread never blocks on a turnstile.	2012-08-22 20:01:38 +00:00
Alexander Motin	37f4e0254f	Some more minor tunings inspired by bde@.	2012-08-11 20:24:39 +00:00
Alexander Motin	bf89d544d0	Allow idle threads to steal second threads from other cores on systems with 8 or more cores to improve utilization. None of my tests on 2xXeon (2x6x2) system showed any slowdown from the mentioned "excess thrashing". At the same time, in pbzip2 test with number of threads more than number of CPUs I see up to 10% speedup with SMT disabled and up to 5% with SMT enabled. Thinking about thrashing I was trying to limit that stealing within same last level cache, but got only worse results. Present code anyway prefers to steal threads from topologically closer cores. Sponsored by: iXsystems, Inc.	2012-08-11 15:08:19 +00:00
Alexander Motin	579895df01	Some minor tunings/cleanups inspired by bde@ after previous commits: - remove extra dynamic variable initializations; - restore (4BSD) and implement (ULE) hogticks variable setting; - make sched_rr_interval() more tolerant to options; - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice; - tune some sysctl descriptions; - make some style fixes.	2012-08-10 19:02:49 +00:00
Alexander Motin	3d7f41175d	Rework r220198 change (by fabient). I believe it solves the problem from the wrong direction. Before it, if preemption and end of time slice happened at the same time, thread was put to the head of the queue as for only preemption. It could cause single thread to run for indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes delayed context switch every time preemption happens, even when not needed. Solve problem by introducing scheduler-specific thread flag TDF_SLICEEND, set when thread's time slice is over and it should be put to the tail of queue. Using SW_PREEMPT flag for that purpose as it was before is just not informative enough to work correctly. On my tests this reduces run time deviation by 2-3 times (improves fairness) in cases when several threads share one CPU. Reviewed by: fabient MFC after: 2 months Sponsored by: iXsystems, Inc.	2012-08-09 19:26:13 +00:00
Rafal Jaworowski	17f4cae4a5	Let us manage differences of Book-E PowerPC variations i.e. vendor / implementation specific vs. the common architecture definition. Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under BOOKE_PPC4XX are not used in the code yet. This change set is not supposed to affect existing E500 support, it's just another reorg step before bringing support for E500mc, E5500 and PPC465. Obtained from: AppliedMicro, Freescale, Semihalf	2012-05-27 10:25:20 +00:00
Ryan Stone	b3e9e682cf	Implement the DTrace sched provider. This implementation aims to be compatible with the sched provider implemented by Solaris and its open-source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages. Note that for compatibility with scripts originally written for Solaris, several probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire. Also, I have added two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace. Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *. Sponsored by: Sandvine Incorporated MFC after: 2 weeks	2012-05-15 01:30:25 +00:00
Alexander Motin	70801abe8f	Microoptimize cpu_search(). According to profiling, it makes one take 6% of CPU time on hackbench with its million of context switches per second, instead of 8% before.	2012-04-09 18:24:58 +00:00
Alexander Motin	7295465e33	Rewrite thread CPU usage percentage math to not depend on periodic calls with HZ rate through the sched_tick() calls from hardclock(). Potentially it can be used to improve precision, but now it is just minus one more reason to call hardclock() for every HZ tick on every active CPU. SCHED_4BSD never used sched_tick(), but keep it in place for now, as at least SCHED_FBFS existing in patches out of the tree depends on it. MFC after: 1 month	2012-03-13 08:18:54 +00:00
Alexander Motin	b3f40a4107	Make kern.sched.idlespinthresh default value adaptive depending on HZ. Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.	2012-03-09 19:09:08 +00:00
John Baldwin	44ad547522	Add a new sched_clear_name() method to the scheduler interface to clear the cached name used for KTR_SCHED traces when a thread's name changes. This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve(). MFC after: 2 weeks	2012-03-08 19:41:05 +00:00
Alexander Motin	6022f0bcb3	Fix bug of r232207, when cpu_search() could prefer CPU group with best load, but with no CPU matching given limitations. It caused kernel panics in some cases when thread was bound to specific CPUs with cpuset(1).	2012-03-03 11:50:48 +00:00
Alexander Motin	36acfc6507	Rework CPU load balancing in SCHED_ULE: - In sched_pickcpu() be more careful taking previous CPU on SMT systems. Do it only if all other logical CPUs of that physical one are idle to avoid extra resource sharing. - In sched_pickcpu() change general logic of CPU selection. First look for idle CPU, sharing last level cache with previously used one, skipping SMT CPU groups. If none found, search all CPUs for the least loaded one, where the thread with its priority can run now. If none found, search just for the least loaded CPU. - Make cpu_search() compare lowest/highest CPU load when comparing CPU groups with equal load. That allows to differentiate 1+1 and 2+0 loads. - Make cpu_search() prefer specified (previous) CPU or group if load is equal. This improves cache affinity for more complicated topologies. - Randomize CPU selection if above factors are equal. Previous code tended to prefer CPUs with lower IDs, causing unneeded collisions. - Rework periodic balancer in sched_balance_group(). With cpu_search() more intelligent now, make balancing process flat, removing recursion over the topology tree. That fixes double swap problem and makes load distribution more even and predictable. All together this gives 10-15% performance improvement in many tests on CPUs with SMT, such as Core i7, when number of threads is less than number of logical CPUs. In some tests it also gives positive effect to systems without SMT. Reviewed by: jeff Tested by: flo, hackers@ MFC after: 1 month Sponsored by: iXsystems, Inc.	2012-02-27 10:31:54 +00:00
John Baldwin	7e3a96ea37	Some small fixes to CPU accounting for threads: - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread. - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicking mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible. Reviewed by: bde, kib MFC after: 1 week	2012-01-03 21:03:28 +00:00
John Baldwin	0c0d27d5dd	Cap the priority calculated from the current thread's running tick count at SCHED_PRI_RANGE to prevent overflows in the priority value. This can happen due to irregularities with clock interrupts under certain virtualization environments. Tested by: Larry Rosenman ler lerctr org MFC after: 2 weeks	2011-12-29 16:17:16 +00:00
Andriy Gapon	167057914b	ule: ensure that batch timeshare threads are scheduled fairly With the previous code, if the range of priorities for timeshare batch threads was greater than RQ_NQS, then the threads with low priorities in the part of the range above RQ_NQS would be scheduled to the run-queues as if they had high priorities at the beginning of the range. In other words, threads with a nice level of +N could be scheduled as if they had a nice level of -M. Reported by: George Mitchell <george@m5p.com> Reviewed by: jhb Tested by: George Mitchell <george@m5p.com> (earlier version) MFC after: 1 week	2011-12-19 20:01:21 +00:00
Marius Strobl	880bf8b9bd	- Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to itself, which sparc64 hardware doesn't support. One way to solve this would be to directly call sched_preempt() instead of issuing a self-IPI. However, quoting jhb@: "On the other hand, you can probably just skip the IPI entirely if we are going to send it to the current CPU. Presumably, once this routine finishes, the current CPU will exit softclock (or will do so "soon") and will then pick the next thread to run based on the adjustments made in this routine, so there's no need to IPI the CPU running this routine anyway. I think this is the better solution. Right now what is probably happening on other platforms is as soon as this routine finishes the CPU processes its self-IPI and causes mi_switch() which will just switch back to the softclock thread it is already running." - With r226054 and the above change in place, sparc64 now no longer is incompatible with ULE and vice versa. However, powerpc/E500 still is. Submitted by: jhb Reviewed by: jeff	2011-10-06 11:48:13 +00:00
Xin LI	cd39bb098e	Fix format strings for KTR_STATE in 4BSD and ULE schedulers. Submitted by: Ivan Klymenko <fidaj@ukr.net> PR: kern/159904, kern/159905 MFC after: 2 weeks Approved by: re (kib)	2011-08-26 18:00:07 +00:00
Attilio Rao	6338c57958	Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace pollution. That is a step further in the direction of building correct policies for userland and modules on how to deal with the number of maxcpus at runtime. Reported by: jhb Reviewed and tested by: pluknet Approved by: re (kib)	2011-07-19 16:50:55 +00:00
Attilio Rao	71a19bdc64	Commit the support for removing cpumask_t and replacing it directly with cpuset_t objects. That is going to offer the underlying support for a simple bump of MAXCPU and then support for number of cpus > 32 (as it is today). Right now, cpumask_t is an int, 32 bits on all our supported architectures. cpuset_t on the other side is implemented as an array of longs, and easily extendible by definition. The architectures touched by this commit are the following: - amd64 - i386 - pc98 - arm - ia64 - XEN while the others are still missing. Userland is believed to be fully converted with the changes contained here. Some technical notes: - This commit may be considered an ABI nop for all the architectures different from amd64 and ia64 (and sparc64 in the future) - per-cpu members, which are now converted to cpuset_t, need to be accessed avoiding migration, because the size of cpuset_t should be considered unknown - size of cpuset_t objects is different from kernel and userland (this is primarily done in order to leave some more space in userland to cope with KBI extensions). If you need to access kernel cpuset_t from the userland please refer to example in this patch on how to do that correctly (kgdb may be a good source, for example). - Support for other architectures is going to be added soon - Only MAXCPU for amd64 is bumped now The patch has been tested by sbruno and Nicholas Esborn on opteron 4 x 12 pack CPUs. More testing on big SMP is expected to come soon. pluknet tested the patch with his 8-ways on both amd64 and i386. Tested by: pluknet, sbruno, gianni, Nicholas Esborn Reviewed by: jeff, jhb, sbruno	2011-05-05 14:39:14 +00:00
Fabien Thomas	586cb6ec77	Clearing the flag when preempting will let the preempted thread run too much time. This can finish in a scheduler deadlock with ping-pong between two threads. One sample of this is: - device lapic (to have a preemption point on critical_exit()) - options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq) - running a cpu intensive task (that does not enter the kernel) - only one CPU on SMP or no SMP. As requested by jhb@ 4BSD have received the same type of fix instead of propagating the flag to the new thread. Reviewed by: jhb, jeff MFC after: 1 month	2011-03-31 13:59:47 +00:00
John Baldwin	2dc29adb9f	Rework realtime priority support: - Move the realtime priority range up above kernel sleep priorities and just below interrupt thread priorities. - Contract the interrupt and kernel sleep priority ranges a bit so that the timesharing priority band can be increased. The new timeshare range is now slightly larger than the old realtime + timeshare ranges. - Change the ULE scheduler to no longer use realtime priorities for interactive threads. Instead, the larger timeshare range is now split into separate subranges for interactive and non-interactive ("batch") threads. The end result is that interactive threads and non-interactive threads still use the same priority ranges as before, but realtime threads now have a separate, dedicated priority range. - Do not modify the priority of non-timeshare threads in sched_sleep() or via cv_broadcastpri(). Realtime and idle priority threads will no longer have their priorities affected by sleeping in the kernel. Reviewed by: jeff	2011-01-14 17:06:54 +00:00
John Baldwin	12d56c0f63	Introduce two new helper macros to define the priority ranges used for interactive timeshare threads (PRI__INTERACTIVE) and non-interactive timeshare threads (PRI__BATCH) and use these instead of PRI__REALTIME and PRI__TIMESHARE. No functional change. Reviewed by: jeff	2011-01-13 14:22:27 +00:00
John Baldwin	c9a8cba456	Always use PRI_BASE() when checking the base type of a thread's priority class. MFC after: 2 weeks	2011-01-11 22:13:19 +00:00
John Baldwin	789200082c	Fix two harmless off-by-one errors. Reviewed by: jeff MFC after: 2 weeks	2011-01-10 20:48:10 +00:00
John Baldwin	22d19207e9	- Move sched_fork() later in fork() after the various sections of the new thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone. - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy the bcopy() of the thread copy region. - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread. MFC after: 2 weeks	2011-01-06 22:24:00 +00:00
David Xu	c8e368a933	- Follow r216313, the sched_unlend_user_prio is no longer needed, always use sched_lend_user_prio to set lent priority. - Improve pthread priority-inherit mutex, when a contender's priority is lowered, repropagate priorities, this may cause mutex owner's priority to be lowered, in old code, mutex owner's priority is rise-only.	2010-12-29 09:26:46 +00:00
David Xu	acbe332a58	MFp4: It is possible a lower priority thread lending priority to higher priority thread, in old code, it is ignored, however the lending should always be recorded, add field td_lend_user_pri to fix the problem, if a thread does not have borrowed priority, its value is PRI_MAX. MFC after: 1 week	2010-12-09 02:42:02 +00:00
Edward Tomasz Napierala	4220337804	Remove unused variables.	2010-11-13 11:54:04 +00:00
Attilio Rao	9f518f2068	Fix typos. Submitted by: gianni MFC after: 3 days	2010-11-10 21:06:49 +00:00
David Xu	444528c026	Use integer for size of cpuset, as it won't be bigger than INT_MAX, This is requested by bge. Also move the sysctl into file kern_cpuset.c, because it should always be there, it is independent of thread scheduler.	2010-11-01 00:42:25 +00:00
David Xu	b67cc292dc	Add sysctl kern.sched.cpusetsize to export the size of kernel cpuset, also add sysconf() key _SC_CPUSET_SIZE to get sysctl value. Submitted by: gcooper	2010-10-29 13:31:10 +00:00
John Baldwin	a8103ae8ca	Comment nit, set TDF_NEEDRESCHED after the comment describing why it is done rather than before. MFC after: 1 week	2010-09-21 19:12:22 +00:00
Andriy Gapon	19b8a6dbc1	kern.sched.topology_spec sysctl: use step of 1 for group levels numeration This is just a cosmetic change for prettier output. 'indent' variable/parameter serves two purposes: it specifies whitespace indentation level and also implies cpu group level/depth. It would have been better to split those two uses, but for now just a simple change. MFC after: 1 week	2010-09-18 11:16:43 +00:00
Alexander Motin	a157e42516	Refactor timer management code with priority to one-shot operation mode. The main goal of this is to generate timer interrupts only when there is some work to do. When CPU is busy interrupts are generated at full rate of hz + stathz to fulfill scheduler and timekeeping requirements. But when CPU is idle, only minimum set of interrupts (down to 8 interrupts per second per CPU now), needed to handle scheduled callouts is executed. This allows significantly increase idle CPU sleep time, increasing effect of static power-saving technologies. Also it should reduce host CPU load on virtualized systems, when guest system is idle. There is set of tunables, also available as writable sysctls, allowing to control wanted event timer subsystem behavior: kern.eventtimer.timer - allows to choose event timer hardware to use. On x86 there is up to 4 different kinds of timers. Depending on whether chosen timer is per-CPU, behavior of other options slightly differs. kern.eventtimer.periodic - allows to choose periodic and one-shot operation mode. In periodic mode, current timer hardware taken as the only source of time for time events. This mode is quite alike to previous kernel behavior. One-shot mode instead uses currently selected time counter hardware to schedule all needed events one by one and program timer to generate interrupt exactly in specified time. Default value depends on chosen timer capabilities, but one-shot mode is preferred, until other is forced by user or hardware. kern.eventtimer.singlemul - in periodic mode specifies how many times higher timer frequency should be, to not strictly alias hardclock() and statclock() events. Default values are 2 and 4, but could be reduced to 1 if extra interrupts are unwanted. kern.eventtimer.idletick - makes each CPU to receive every timer interrupt independently of whether they busy or not. By default this option is disabled. If chosen timer is per-CPU and runs in periodic mode, this option has no effect - all interrupts are generated. As soon as this patch modifies cpu_idle() on some platforms, I have also refactored one on x86. Now it makes use of MONITOR/MWAIT instructions (if supported) under high sleep/wakeup rate, as fast alternative to other methods. It allows SMP scheduler to wake up sleeping CPUs much faster without using IPI, significantly increasing performance on some highly task-switching loads. Tested by: many (on i386, amd64, sparc64 and powerpc) H/W donated by: Gheorghe Ardelean Sponsored by: iXsystems, Inc.	2010-09-13 07:25:35 +00:00
Alexander Motin	9f9ad565a1	Do not IPI CPU that is already spinning for load. It doubles effect of spinning (comparing to MWAIT) on some heavily switching test loads.	2010-09-10 13:24:47 +00:00
Matthew D Fleming	ba4932b5a2	Fix UP build. MFC after: 2 weeks	2010-09-02 16:23:05 +00:00
Matthew D Fleming	0f7a0ebd59	Fix a bug with sched_affinity() where it checks td_pinned of another thread in a racy manner, which can lead to attempting to migrate a thread that is pinned to a CPU. Instead, have sched_switch() determine which CPU a thread should run on if the current one is not allowed. KASSERT in sched_bind() that the thread is not yet pinned to a CPU. KASSERT in sched_switch() that only migratable threads or those moving due to a sched_bind() are changing CPUs. sched_affinity code came from jhb@. MFC after: 2 weeks	2010-09-01 20:32:47 +00:00
John Baldwin	8c7a92bd4a	Remove unused KTRACE includes.	2010-08-19 16:41:27 +00:00
John Baldwin	d9d8d1449d	Add a new ipi_cpu() function to the MI IPI API that can be used to send an IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that constructed a mask for a single CPU with calls to ipi_cpu() instead. This will matter more in the future when we transition from cpumask_t to cpuset_t for CPU masks in which case building a CPU mask is more expensive. Submitted by: peter, sbruno Reviewed by: rookie Obtained from: Yahoo! (x86) MFC after: 1 month	2010-08-06 15:36:59 +00:00
Ivan Voras	611daf7e62	A cosmetic change - don't output empty <flags>.	2010-07-15 13:46:30 +00:00
John Baldwin	3aa6d94e0c	Update several places that iterate over CPUs to use CPU_FOREACH().	2010-06-11 18:46:34 +00:00
Ivan Voras	a401f2d098	Unconfuse THREAD and SMT flags	2010-06-10 11:48:14 +00:00
Ivan Voras	5368befb66	Cosmetic change to XML - less ugly newlines	2010-06-10 11:01:17 +00:00
John Baldwin	3da35a0a52	Assert that the thread lock is held in sched_pctcpu() instead of recursively acquiring it. All of the current callers already hold the lock. MFC after: 1 month	2010-06-03 16:02:11 +00:00
John Baldwin	1d7830edd5	Assert that the thread passed to sched_bind() and sched_unbind() is curthread as those routines are only supported for curthread currently. MFC after: 1 month	2010-05-21 17:15:56 +00:00
Randall Stewart	4542827d4d	This pushes all of JC's patches that I have in place. I am now able to run 32 cores ok.. but I still will hang on buildworld with a NFS problem. I suspect I am missing a patch for the netlogic rge driver. JC check and see if I am missing anything except your core-mask changes Obtained from: JC	2010-05-16 19:43:48 +00:00
Attilio Rao	b0b9dee5c9	- Fix a race in sched_switch() of sched_4bsd. In the case of the thread being on a sleepqueue or a turnstile, the sched_lock was acquired (without the aid of the td_lock interface) and the td_lock was dropped. This was going to break locking rules on other threads willing to access the thread (via the td_lock interface) and modify its flags (allowed as long as the container lock was different from the one used in sched_switch). In order to prevent this situation, while sched_lock is acquired there the td_lock gets blocked. [0] - Merge the ULE's internal function thread_block_switch() into the global thread_lock_block() and make the former semantic as the default for thread_lock_block(). This means that thread_lock_block() will not disable interrupts when called (and consequently thread_unlock_block() will not re-enable them when called). This should be done manually when necessary. Note, however, that ULE's thread_unblock_switch() is not reaped because it does reflect a difference in semantic due in ULE (the td_lock may not be necessarily still blocked_lock when calling this). While asymmetric, it does describe a remarkable difference in semantic that is good to keep in mind. [0] Reported by: Kohji Okuno <okuno dot kohji at jp dot panasonic dot com> Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com> MFC: 2 weeks	2010-01-23 15:54:21 +00:00
Konstantin Belousov	17c4c3563c	Allow swap out of the kernel stack for the thread with priority greater than or equal to PSOCK, not less than or equal to. Higher priority has lesser numerical value. Existing test does not allow for swapout of the thread waiting for advisory lock, for exiting child or sleeping for timeout. On the other hand, high-priority waiters of VFS/VM events can be swapped out. Tested by: pho Reviewed by: jhb MFC after: 1 week	2009-12-31 18:52:58 +00:00
Ed Schouten	62375ca8c1	Don't forget to use `void' for sched_balance(). It has no arguments.	2009-12-28 23:12:12 +00:00
Ivan Voras	cbc4ea28e2	Make ULE process usage (%CPU) accounting usable again by keeping track of the last tick we incremented on. Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru Reviewed by: jeff (who thinks there should be a better way in the future) Approved by: gnn (mentor) MFC after: 3 weeks	2009-11-24 19:57:41 +00:00
Attilio Rao	1b9d701fee	Split P_NOLOAD into a per-thread flag (TDF_NOLOAD). This improvements aims for avoiding further cache-misses in scheduler specific functions which need to keep track of average thread running time and further locking in places setting for this flag. Reported by: jeff (originally), kris (currently) Reviewed by: jhb Tested by: Giuseppe Cocomazzi <sbudella at email dot it>	2009-11-03 16:46:52 +00:00
John Baldwin	a0f1535205	Fix a sign bug in the handling of nice priorities when computing the interactive score for a thread. Submitted by: Taku YAMAMOTO taku of tackymt.homeip.net Reviewed by: jeff MFC after: 3 days	2009-10-15 11:41:12 +00:00
Attilio Rao	435068aab7	Fix sched_switch_migrate(): - In 8.x and above the run-queue locks are no more shared even in the HTT case, so remove the special case. - The deadlock explained in the removed comment here is still possible even with different locks, with the contribution of tdq_lock_pair(). An explanation is here: (hypothesis: a thread needs to migrate on another CPU, thread1 is doing sched_switch_migrate() and thread2 is the one handling the sched_switch() request or in other words, thread1 is the thread that needs to migrate and thread2 is a thread that is going to be preempted, most likely an idle thread. Also, 'old' is referred to the context (in terms of run-queue and CPU) thread1 is leaving and 'new' is referred to the context thread1 is going into. Finally, thread3 is doing tdq_idletd() or sched_balance() and definitively doing tdq_lock_pair()) * thread1 blocks its td_lock. Now td_lock is 'blocked' * thread1 drops its old runqueue lock * thread1 acquires the new runqueue lock * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT through tdq_notify() to the new CPU * thread1 drops the new lock * thread3, scanning the runqueues, locks the old lock * thread2 received the IPI_PREEMPT and does thread_lock() with td_lock pointing to the new runqueue * thread3 wants to acquire the new runqueue lock, but it can't because it is held by thread2 so it spins * thread1 wants to acquire old lock, but as long as it is held by thread3 it can't * thread2 going further, at some point wants to switchin in thread1, but it will wait forever because thread1->td_lock is in blocked state This deadlock has been manifested mostly on 7.x and reported several times on mailing lists under the voice 'spinlock held too long'. Many thanks to des@ for having worked hard on producing suitable textdumps and Jeff for help on the comment wording. Reviewed by: jeff Reported by: des, others Tested by: des, Giovanni Trematerra <giovanni dot trematerra at gmail dot com> (STABLE_7 based version)	2009-09-15 16:56:17 +00:00
Jeff Roberson	c76ee82799	- Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE supports arbitrary numbers of cpus rather than being limited by cpumask_t to the number of bits in a long.	2009-06-23 22:12:37 +00:00
Jeff Roberson	09c8a4cc21	- Fix non-SMP build by encapsulating idle spin logic in a macro. Pointy hat to: me	2009-04-29 23:04:31 +00:00
Jeff Roberson	113dda8a7c	- Fix the FBSDID line.	2009-04-29 03:26:30 +00:00
Jeff Roberson	7b55ab0534	- Remove the bogus idle thread state code. This may have a race in it and it only optimized out an ipi or mwait in very few cases. - Skip the adaptive idle code when running on SMT or HTT cores. This just wastes cpu time that could be used on a busy thread on the same core. - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use CG_FLAG_THREAD to mean SMT or HTT. Sponsored by: Nokia	2009-04-29 03:15:43 +00:00
Jeff Roberson	53a6c8b3ac	- Fix an error that occurs when mp_ncpu is an odd number. steal_thresh is calculated as 0 which causes errors elsewhere. Submitted by: KOIE Hidetaka <koie@suri.co.jp> - When sched_affinity() is called with a thread that is not curthread we need to handle the ON_RUNQ() case by adding the thread to the correct run queue. Submitted by: Justin Teller <justin.teller@gmail.com> MFC after: 1 Week	2009-03-14 11:41:36 +00:00
Jeff Roberson	0d2cf8374a	- Use __XSTRING where I want the define to be expanded. This resulted in sizeof("MAXCPU") being used to calculate a string length rather than something more reasonable such as sizeof("32"). This shouldn't have caused any ill effect until we run on machines with 1000000 or more cpus.	2009-01-25 07:35:10 +00:00
Jeff Roberson	8f51ad55e7	- Implement generic macros for producing KTR records that are compatible with src/tools/sched/schedgraph.py. This allows developers to quickly create a graphical view of ktr data for any resource in the system. - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly identifying records associated with a thread or cpu. - Reimplement the KTR_SCHED traces using the new generic facility. Obtained from: attilio Discussed with: jhb Sponsored by: Nokia	2009-01-17 07:17:57 +00:00
Ivan Voras	59d9578919	Add missing newlines to flags tags of CPU topology, for prettier output. Reviewed by: jeff (original version) Approved by: gnn (mentor) (original version)	2008-12-23 16:19:59 +00:00
John Baldwin	02f0ff6d92	When checking to see if another CPU is running its idle thread, examine the thread running on the other CPU instead of the thread being placed on the run queue. Reported by: Ravi Murty @ Intel Reviewed by: jeff	2008-11-18 05:41:34 +00:00
Ivan Voras	aa880b9018	Increase the initial sbuf size for CPU topology dump to something more usable for newer CPUs. The new value allows 2 x quad core configuration dumps to fit within the initial buffer without reallocations. Approved by: gnn (mentor) (older version) Pointed out by: rdivacky	2008-11-02 23:11:20 +00:00
Ivan Voras	07095abf5d	Introduce a new sysctl, kern.sched.topology_spec, that returns an XML dump of detected ULE CPU topology. This dump can be used to check the topology detection and for general system information. An example of CPU topology dump is: kern.sched.topology_spec: <groups> <group level="1" cache-level="0"> <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu> <flags></flags> <children> <group level="2" cache-level="0"> <cpu count="4" mask="0xf">0, 1, 2, 3</cpu> <flags></flags> </group> <group level="2" cache-level="0"> <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu> <flags></flags> </group> </children> </group> </groups> Reviewed by: jeff Approved by: gnn (mentor)	2008-10-29 13:36:23 +00:00
Jeff Roberson	e980fff622	- Check whether we've recorded this tick in ts_ticks on another cpu in sched_tick() to prevent multiple increments for one tick. This pushes the value out of range and breaks priority calculation. Reviewed by: kib Found by: pho/nokia Sponsored by: Nokia MFC after: 3 days	2008-07-19 05:13:47 +00:00
John Birrell	6f5f25e521	Add the vtime (virtual time) hooks for DTrace.	2008-05-25 01:44:58 +00:00
Jeff Roberson	6c47aaae12	- Add an integer argument to idle to indicate how likely we are to wake from idle over the next tick. - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are suspended in cpu specific states. This function can fail and cause the scheduler to fall back to another mechanism (ipi). - Implement support for mwait in cpu_idle() on i386/amd64 machines that support it. mwait is a higher performance way to synchronize cpus as compared to hlt & ipis. - Allow selecting the idle routine by name via sysctl machdep.idle. This replaces machdep.cpu_idle_hlt. Only idle routines supported by the current machine are permitted. Sponsored by: Nokia	2008-04-25 05:18:50 +00:00
Jeff Roberson	1690c6c1be	- Add a metric to describe how busy a processor has been over the last two ticks by counting the number of switches and the load when sched_clock() is called. - If the busy metric exceeds a threshold allow the idle thread to spin waiting for new work for a brief period to avoid using IPIs. This reduces the cost on the sender and receiver as well as reducing wakeup latency considerably when it works. Sponsored by: Nokia	2008-04-17 09:56:01 +00:00
Jeff Roberson	8df78c41d6	- Make SCHED_STATS more generic by adding a wrapper to create the variables and sysctl nodes. - In reset walk the children of kern_sched_stats and reset the counters via the oid_arg1 pointer. This allows us to add arbitrary counters to the tree and still reset them properly. - Define a set of switch types to be passed with flags to mi_switch(). These types are named SWT_*. These types correspond to SCHED_STATS counters and are automatically handled in this way. - Make the new SWT_ types more specific than the older switch stats. There are now stats for idle switches, remote idle wakeups, remote preemption ithreads idling, etc. - Add switch statistics for ULE's pickcpu algorithm. These stats include how much migration there is, how often affinity was successful, how often threads were migrated to the local cpu on wakeup, etc. Sponsored by: Nokia	2008-04-17 04:20:10 +00:00
Marcel Moolenaar	495168ba8d	Support and switch to the ULE scheduler: o Implement IPI_PREEMPT, o Set td_lock for the thread being switched out, o For ULE & SMP, loop while td_lock points to blocked_lock for the thread being switched in, o Enable ULE by default in GENERIC and SKI,	2008-04-15 05:02:42 +00:00
Jeff Roberson	0502fe2e43	- Allow static_boost to specify no boost with '0', traditional kernel fixed pri boost with '1' or any priority less than the current thread's priority with a value greater than two. Default the boost to PRI_MIN_TIMESHARE to prevent regular user-space threads from starving threads in the kernel. This prevents these user-threads from also being scheduled as if they are high fixed-priority kernel threads. - Restore the setting of lowpri in tdq_choose(). It has to be either here or in sched_switch(). I accidentally removed it from both places. Tested by: kris	2008-04-04 01:16:18 +00:00
Jeff Roberson	03d17db7d5	- Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't do this either. Simply check P_NOLOAD. It'd be nice if this was in a thread flag so we didn't have an extra cache miss every time we add and remove a thread from the run-queue.	2008-04-04 01:04:43 +00:00
Jeff Roberson	9727e63745	- Restore runq to manipulating threads directly by putting runq links and rqindex back in struct thread. - Compile kern_switch.c independently again and stop #include'ing it from schedulers. - Remove the ts_thread backpointers and convert most code to go from struct thread to struct td_sched. - Cleanup the ts_flags #define garbage that was causing us to sometimes do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD. - Export the kern.sched sysctl node in sysctl.h	2008-03-20 05:51:16 +00:00
Jeff Roberson	8b16c208e6	- ULE and 4BSD share only one line of code from sched_newthread() so implement the required pieces in sched_fork_thread(). The td_sched pointer is already setup by thread_init anyway.	2008-03-20 03:06:33 +00:00
Jeff Roberson	6d55b3ec9c	- Remove some dead code and comments related to KSE. - Don't set tdq_lowpri on every switch, it should be precisely maintained now. - Add some comments to sched_thread_priority().	2008-03-19 07:36:37 +00:00
Jeff Roberson	374ae2a393	- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.	2008-03-19 06:19:01 +00:00
Robert Watson	237fdd787b	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink	2008-03-16 10:58:09 +00:00
John Baldwin	d628fbfa98	Make the function prototype for cpu_search() match the declaration so that this still compiles with gcc3.	2008-03-14 15:22:38 +00:00
Jeff Roberson	6617724c5f	Remove kernel support for M:N threading. While the KSE project was quite successful in bringing threading to FreeBSD, the M:N approach taken by the kse library was never developed to its full potential. Backwards compatibility will be provided via libmap.conf for dynamically linked binaries and static binaries will be broken.	2008-03-12 10:12:01 +00:00
Jeff Roberson	c5aa6b581d	- Pass the priority argument from sleep() into sleepq and down into sched_sleep(). This removes extra thread_lock() acquisition and allows the scheduler to decide what to do with the static boost. - Change the priority arguments to cv_ to match sleepq/msleep/etc. where 0 means no priority change. Catch -1 in cv_broadcastpri() and convert it to 0 for now. - Set a flag when sleeping in a way that is compatible with swapping since direct priority comparisons are meaningless now. - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which controls the boost behavior. Turning it off gives better performance in some workloads but needs more investigation. - While we're modifying sleepq, change signal and broadcast to both return with the lock held as the lock was held on enter. Reviewed by: jhb, peter	2008-03-12 06:31:06 +00:00
Jeff Roberson	c143ac21af	- Fix the invalid priority panics people are seeing by forcing tdq_runq_add to select the runq rather than hoping we set it properly when we adjusted the priority. This involves the same number of branches as before so should perform identically without the extra fragility. Tested by: bz Reviewed by: bz	2008-03-10 22:48:27 +00:00
Jeff Roberson	7217d8d1ee	- Don't rely on a side effect of sched_prio() to set the initial ts_runq for thread0. Set it directly in sched_setup(). This fixes traps on boot seen on some machines. Reported by: phk	2008-03-10 09:50:29 +00:00
Jeff Roberson	73daf66f41	Reduce ULE context switch time by over 25%. - Only calculate timeshare priorities once per tick or when a thread is woken from sleeping. - Keep the ts_runq pointer valid after all priority changes. - Call tdq_runq_add() directly from sched_switch() without passing in via tdq_add(). We don't need to adjust loads or runqs anymore. - Sort tdq and ts_sched according to utilization to improve cache behavior. Sponsored by: Nokia	2008-03-10 03:15:19 +00:00
Jeff Roberson	ff256d9c47	- Add an implementation of sched_preempt() that avoids excessive IPIs. - Normalize the preemption/ipi setting code by introducing sched_shouldpreempt() so the logic is identical and not repeated between tdq_notify() and sched_setpreempt(). - In tdq_notify() don't set NEEDRESCHED as we may not actually own the thread lock; this could have caused us to lose td_flags settings. - Garbage collect some tunables that are no longer relevant.	2008-03-10 01:32:01 +00:00
Jeff Roberson	62fa74d95a	Add support for the new cpu topology api: - When searching for affinity search backwards in the tree from the last cpu we ran on while the thread still has affinity for the group. This can take advantage of knowledge of shared L2 or L3 caches among a group of cores. - When searching for the least loaded cpu find the least loaded cpu via the least loaded path through the tree. This load balances system bus links, individual cache levels, and hyper-threaded/SMT cores. - Make the periodic balancer recursively balance the highest and lowest loaded cpu across each link. Add support for cpusets: - Convert the cpuset to a simple native cpumask_t while the kernel still only supports cpumask. - Pass the derived cpumask down through the cpu_search functions to restrict the result cpus. - Make the various steal functions resilient to failure since all threads can not run on all cpus any longer. General improvements: - Precisely track the lowest priority thread on every runq with tdq_setlowpri(). Before it was more advisory but this ended up having pathological behaviors. - Remove many #ifdef SMP conditions to simplify the code. - Get rid of the old cumbersome tdq_group. This is more naturally expressed via the cpu_group tree. Sponsored by: Nokia Testing by: kris	2008-03-02 08:20:59 +00:00
Jeff Roberson	81aa71755b	- Replace the old smp cpu topology specification with a new, more flexible tree structure that encodes the level of cache sharing and other properties. - Provide several convenience functions for creating one and two level cpu trees as well as a default flat topology. The system now always has some topology. - On i386 and amd64 create a separate level in the hierarchy for HTT and multi-core cpus. This will allow the scheduler to intelligently load balance non-uniform cores. Presently we don't detect what level of the cache hierarchy is shared at each level in the topology. - Add a mechanism for testing common topologies that have more information than the MD code is able to provide via the kern.smp.topology tunable. This should be considered a debugging tool only and not a stable api. Sponsored by: Nokia	2008-03-02 07:58:42 +00:00
Jeff Roberson	885d51a38a	- Add a new sched_affinity() api to be used in the upcoming cpuset implementation. - Add empty implementations of sched_affinity() to 4BSD and ULE. Sponsored by: Nokia	2008-03-02 07:19:35 +00:00
Jeff Roberson	317da70593	- sched_prio() should only adjust tdq_lowpri if the thread is running or on a run-queue. If the priority is numerically raised only change lowpri if we're certain it will be correct. Some slop is allowed; however, previously we could erroneously raise lowpri for an idle cpu that a thread had recently run on, which led to errors in load balancing decisions.	2008-01-23 03:10:18 +00:00
Jeff Roberson	a755f21484	- When executing the 'tryself' branch in sched_pickcpu() look at the lowest priority on the queue for the current cpu vs curthread's priority. In the case that curthread is waking up many threads of a lower priority as would happen with a turnstile_broadcast() or wakeup() of many threads this prevents them from all ending up on the current cpu. - In sched_add() make the relationship between a scheduled ithread and the current cpu advisory rather than strict. Only give the ithread affinity for the current cpu if it's actually being scheduled from a hardware interrupt. This prevents it from migrating when it simply blocks on a lock. Sponsored by: Nokia	2008-01-15 09:03:09 +00:00
Jeff Roberson	fd0b8c783d	- Restore timeslicing code for all but SCHED_FIFO priority classes. Reported by: Peter Jeremy <peterjeremy@optushome.com.au>	2008-01-05 04:47:31 +00:00
Wojciech A. Koszek	731016fe36	Make SCHED_ULE buildable with gcc3. Reviewed by: cognet (mentor), jeffr Approved by: cognet (mentor), jeffr	2007-12-21 23:30:18 +00:00
Jeff Roberson	eea4f254fe	- Re-implement lock profiling in such a way that it no longer breaks the ABI when enabled. There is no longer an embedded lock_profile_object in each lock. Instead a list of lock_profile_objects is kept per-thread for each lock it may own. The cnt_hold statistic is now always 0 to facilitate this. - Support shared locking by tracking individual lock instances and statistics in the per-thread per-instance lock_profile_object. - Make the lock profiling hash table a per-cpu singly linked list with a per-cpu static lock_prof allocator. This removes the need for an array of spinlocks and reduces cache contention between cores. - Use a separate hash for spinlocks and other locks so that only a critical_enter() is required and not a spinlock_enter() to modify the per-cpu tables. - Count time spent spinning in the lock statistics. - Remove the LOCK_PROFILE_SHARED option as it is always supported now. - Specifically drop and release the scheduler locks in both schedulers since we track owners now. In collaboration with: Kip Macy Sponsored by: Nokia	2007-12-15 23:13:31 +00:00
David Xu	435806d31b	Fix LOR of thread lock and umtx's priority propagation mutex due to the reworking of scheduler lock. MFC: after 3 days	2007-12-11 08:25:36 +00:00
Julian Elischer	431f890614	Generally we are interested in what thread did something as opposed to what process. Since threads by default have the name of the process unless over-written with more useful information, just print the thread name instead.	2007-11-14 06:21:24 +00:00
Peter Grehan	cbdd62ad04	Cut over to ULE on PowerPC kern/sched_ule.c - Add __powerpc__ to the list of supported architectures powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by updating the old thread's lock. Note: uniprocessor-only, will require modification for MP support. powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of old thread's lock, making the call a no-op. Reviewed by: marcel, jeffr (slightly older version)	2007-10-23 00:52:25 +00:00
Sam Leffler	58590eb06b	ULE works fine on arm; allow it to be used Reviewed by: jeff, cognet, imp MFC after: 1 week	2007-10-16 19:25:26 +00:00
Jeff Roberson	88f530cc25	- Bail out of tdq_idled if !smp_started or idle stealing is disabled. This fixes a bug on UP machines with SMP kernels where the idle thread constantly switches after trying to steal work from the local cpu. - Make the idle stealing code more robust against self selection. - Prefer to steal from the cpu with the highest load that has at least one transferable thread. Before we selected the cpu with the highest transferable count which excludes bound threads. Collaborated with: csjp Approved by: re	2007-10-08 23:50:39 +00:00
Jeff Roberson	05dc0eb204	- Restore historical sched_yield() behavior by changing sched_relinquish() to simply switch rather than lowering priority and switching. This allows threads of equal priority to run but not lesser priority. Discussed with: davidxu Reported by: NIIMI Satoshi <sa2c@sa2c.net> Approved by: re	2007-10-08 23:45:24 +00:00
Jeff Roberson	59c6813475	- Reassign the thread queue lock to newtd prior to switching. Assigning after the switch leads to a race where the outgoing thread still owns the local queue lock while another cpu may switch it in. This race is only possible on machines where cpu_switch can take significantly longer on different cpus which in practice means HTT machines with unfair thread scheduling algorithms. Found by: kris (of course) Approved by: re	2007-10-02 01:30:18 +00:00
Jeff Roberson	7fcf154aef	- Move the rebalancer back into hardclock to prevent potential softclock starvation caused by unbalanced interrupt loads. - Change the rebalancer to work on stathz ticks but retain randomization. - Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than complex sequences of locks to avoid deadlock. Reported by: kris Approved by: re	2007-10-02 00:36:06 +00:00
Jeff Roberson	02e2d6b445	- Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default value for kern.sched.preempt_thresh appropriately. It can still be adjusted at runtime. ULE will still use IPI_PREEMPT in certain migration situations. - Assert that we're not trying to compile ULE on an unsupported architecture. To date, I believe only i386 and amd64 have implemented the third cpu switch argument required. Approved by: re	2007-09-27 16:39:27 +00:00
Jeff Roberson	e270652ba3	- Bound the interactivity score so that it cannot become negative. Approved by: re	2007-09-24 00:28:54 +00:00
Jeff Roberson	a5423ea313	- Improve grammar. s/it's/its/. - Improve the long-term load balancer by always IPIing exactly once. Previously the delay after rebalancing could cause problems with uneven workloads. - Allow nice to have a linear effect on the interactivity score. This allows negatively niced programs to stay interactive longer. It may be useful with very expensive Xorg servers under high loads. In general it should not be necessary to alter the nice level to improve interactive response. We may also want to consider never allowing positively niced processes to become interactive at all. - Initialize ccpu to 0 rather than 0.0. The decimal point was leftover from when the code was copied from 4bsd. ccpu is 0 in ULE because ULE only exports weighted cpu values. Reported by: Steve Kargl (Load balancing problem) Approved by: re	2007-09-22 02:20:14 +00:00
Jeff Roberson	54b0e65f84	- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This changes the units from seconds to the value of 'ticks' when swapped in/out. ULE does not have a periodic timer that scans all threads in the system and as such maintaining a per-second counter is difficult. - Change computations requiring the unit in seconds to subtract ticks and divide by hz. This does make the wraparound condition hz times more frequent but this is still in the range of several months to years and the adverse effects are minimal. Approved by: re	2007-09-21 04:10:23 +00:00
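The tick-based bookkeeping in the entry above can be illustrated with a small sketch. It is not kernel code: the 32-bit counter width and the helper names `elapsed_seconds` and `wrap_period_seconds` are assumptions made for illustration. It shows the "subtract ticks and divide by hz" computation and why a tick counter wraps hz times sooner than a per-second counter of the same width.

```python
TICK_BITS = 32  # assumed width of the tick counter (illustrative only)


def elapsed_seconds(now_ticks, then_ticks, hz):
    """Whole seconds between two tick samples.

    The subtraction is done modulo 2**TICK_BITS, so a single wrap of the
    counter between the two samples still yields the right difference.
    """
    delta = (now_ticks - then_ticks) % (1 << TICK_BITS)
    return delta // hz


def wrap_period_seconds(hz):
    """Seconds until a TICK_BITS-wide counter advancing hz times a second wraps."""
    return (1 << TICK_BITS) // hz


if __name__ == "__main__":
    print(elapsed_seconds(5000, 2000, 1000))
    print(wrap_period_seconds(1) // wrap_period_seconds(100))
```

The modulo subtraction mirrors what fixed-width integer arithmetic does implicitly, which is why a single wrap between samples does not corrupt the result.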
Jeff Roberson	b61ce5b0e6	- Move all of the PS_ flags into either p_flag or td_flags. - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or previously the sched_lock. These bugs have existed for some time. - Allow swapout to try each thread in a process individually and then swapin the whole process if any of these fail. This allows us to move most scheduler related swap flags into td_flags. - Keep ki_sflag for backwards compat but change all in source tools to use the new and more correct location of P_INMEM. Reported by: pho Reviewed by: attilio, kib Approved by: re (kensmith)	2007-09-17 05:31:39 +00:00
Jeff Roberson	9862717afe	- Set steal_thresh to log2(ncpus). This improves idle-time load balancing on 2cpu machines by reducing it to 1 by default. This improves loaded operation on 8cpu machines by increasing it to 3 where the extra idle time is not as critical. Approved by: re	2007-08-20 06:34:20 +00:00
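A rough sketch of the log2(ncpus) rule from the entry above, written in Python rather than kernel C. The function names are hypothetical. The second helper is an assumption about how a later-reported problem (a threshold of 0 for odd CPU counts) could arise: taking the index of the lowest set bit instead of the highest gives 0 for every odd number.

```python
def floor_log2(n):
    """Index of the highest set bit, i.e. floor(log2(n)) for n >= 1."""
    return n.bit_length() - 1


def lowest_set_bit_index(n):
    """Index of the lowest set bit; equals log2(n) only for powers of two."""
    return (n & -n).bit_length() - 1


if __name__ == "__main__":
    for ncpus in (2, 3, 8):
        print(ncpus, floor_log2(ncpus), lowest_set_bit_index(ncpus))
```

For power-of-two CPU counts the two helpers agree, which would explain why such a difference could go unnoticed on typical 2, 4, or 8 cpu machines.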
Jeff Roberson	3a78f9658b	- Fix one line that erroneously crept in my last commit. Approved by: re	2007-08-04 01:21:28 +00:00
Jeff Roberson	c47f202b45	- Share scheduler locks between hyper-threaded cores to protect the tdq_group structure. Hyper-threaded cores won't really benefit from separate locks anyway. - Separate out the migration case from sched_switch to simplify the main switch code. We only migrate here if called via sched_bind(). - When preempted place the preempted thread back in the same queue at the head. - Improve the cpu group and topology infrastructure. Tested by: many on current@ Approved by: re	2007-08-03 23:38:46 +00:00
Jeff Roberson	28994a5852	- Refine the load balancer to improve buildkernel times on dual core machines. - Leave the long-term load balancer running by default once per second. - Enable stealing load from the idle thread only when the remote processor has more than two transferable tasks. Setting this to one further improves buildworld. Setting it higher improves mysql. - Remove the bogus pick_zero option. I had not intended to commit this. - Entirely disallow migration for threads with SRQ_YIELDING set. This balances out the extra migration allowed for with the load balancers. It also makes pick_pri perform better as I had anticipated. Tested by: Dmitry Morozovsky <marck@rinet.ru> Approved by: re	2007-07-19 20:03:15 +00:00
Jeff Roberson	08c9a16c4f	- When newtd is specified to sched_switch() it was not being initialized properly. We have to temporarily unlock the TDQ lock so we can lock the thread and add it to the run queue. This is used only for KSE. - When we add a thread from the tdq_move() via sched_balance() we need to ipi the target if it's sitting in the idle thread or it'll never run. Reported by: Rene Landan Approved by: re	2007-07-19 19:51:45 +00:00
Jeff Roberson	ae7a6b38d5	ULE 3.0: Fine grain scheduler locking and affinity improvements. This has been in development for over 6 months as SCHED_SMP. - Implement one spin lock per thread-queue. Threads assigned to a run-queue point to this lock via td_lock. - Improve the facility for assigning threads to CPUs now that sched_lock contention no longer dominates scheduling decisions on larger SMP machines. - Re-write idle time stealing in an attempt to make it less damaging to general performance. This is still disabled by default. See kern.sched.steal_idle. - Call the long-term load balancer from a callout rather than sched_clock() so there are no locks held. This is disabled by default. See kern.sched.balance. - Parameterize many scheduling decisions via sysctls. Try to document these via sysctl descriptions. - General structural and naming cleanups. - Document each function with comments. Tested by: current@ amd64, x86, UP, SMP. Approved by: re	2007-07-17 22:53:23 +00:00
Jeff Roberson	dda713dfb8	- Fix an off by one error in sched_pri_range. - In tdq_choose() only assert that a thread does not have too high a priority (low value) for the queue we removed it from. This will catch bugs in priority elevation. It's not a serious error for the thread to have too low a priority as we don't change queues in this case as an optimization. Reported by: kris	2007-06-15 19:33:58 +00:00
Jeff Roberson	fe54587ffa	- Move some common code out of sched_fork_exit() and back into fork_exit().	2007-06-12 07:47:09 +00:00
Jeff Roberson	710eacdc5f	- Placing the 'volatile' on the right side of the * in the td_lock declaration removes the need for __DEVOLATILE(). Pointed out by: tegge	2007-06-06 03:40:47 +00:00
Jeff Roberson	95e3a0bca3	- Better fix for previous error; use DEVOLATILE on the td_lock pointer it can actually sometimes be something other than sched_lock even on schedulers which rely on a global scheduler lock. Tested by: kan	2007-06-05 04:12:46 +00:00
Jeff Roberson	c219b097af	- Pass &sched_lock as the third argument to cpu_switch() as this will always be the correct lock and we don't get volatile warnings this way. Pointed out by: kan	2007-06-05 03:46:54 +00:00
Jeff Roberson	36b369163b	- Define TDQ_ID() for the !SMP case. - Default pick_pri to off. It is not faster in most cases.	2007-06-05 02:53:51 +00:00
Jeff Roberson	7b20fb19fb	Commit 1/14 of sched_lock decomposition. - Move all scheduler locking into the schedulers utilizing a technique similar to solaris's container locking. - A per-process spinlock is now used to protect the queue of threads, thread count, suspension count, p_sflags, and other process related scheduling fields. - The new thread lock is actually a pointer to a spinlock for the container that the thread is currently owned by. The container may be a turnstile, sleepqueue, or run queue. - thread_lock() is now used to protect access to thread related scheduling fields. thread_unlock() unlocks the lock and thread_set_lock() implements the transition from one lock to another. - A new "blocked_lock" is used in cases where it is not safe to hold the actual thread's lock yet we must prevent access to the thread. - sched_throw() and sched_fork_exit() are introduced to allow the schedulers to fix-up locking at these points. - Add some minor infrastructure for optionally exporting scheduler statistics that were invaluable in solving performance problems with this patch. Generally these statistics allow you to differentiate between different causes of context switches. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)	2007-06-04 23:50:30 +00:00
Kip Macy	fb1e3ccd7e	Schedule the ithread on the same cpu as the interrupt Tested by: kmacy Submitted by: jeffr	2007-04-20 05:45:46 +00:00
Jeff Roberson	52bc574cc7	- Handle the case where slptime == runtime. Submitted by: Antoine Brodin	2007-03-17 23:32:48 +00:00
Jeff Roberson	4499aff6ec	- Cast the intermediate value in priority computation back down to unsigned char. Weirdly, casting the 1 constant to u_char still produces a signed integer result that is then used in the % computation. This avoids that mess altogether and causes a 0 pri to turn into 255 % 64 as we expect. Reported by: kkenn (about 4 times, thanks)	2007-03-17 18:13:32 +00:00
Julian Elischer	486a941418	Instead of doing comparisons using the pcpu area to see if a thread is an idle thread, just see if it has the IDLETD flag set. That flag will probably move to the pflags word as it's permanent and never changes for the life of the system so it doesn't need locking.	2007-03-08 06:44:34 +00:00
Kip Macy	fe68a91631	general LOCK_PROFILING cleanup - only collect timestamps when a lock is contested - this reduces the overhead of collecting profiles from 20x to 5x - remove unused function from subr_lock.c - generalize cnt_hold and cnt_lock statistics to be kept for all locks - NOTE: rwlock profiling generates invalid statistics (and most likely always has) someone familiar with that should review	2007-02-26 08:26:44 +00:00
Jeff Roberson	ed0e8f2fe9	- Change types for recent runq additions to u_char rather than int. - Fix these types in ULE as well. This fixes bugs in priority index calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64. Reported by: kkenn using pho's stress2.	2007-02-08 01:52:25 +00:00
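The inequality quoted in the entry above, (int)-1 % 64 != (uint)-1 % 64, can be checked with a small Python emulation of C integer behaviour. The helpers `c_rem` and `as_unsigned` are hypothetical names introduced here. Python's own `%` differs from C's, so the C remainder, which takes the sign of the dividend, is modelled explicitly.

```python
def c_rem(a, b):
    """Remainder as C's % computes it: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def as_unsigned(value, bits):
    """Reinterpret an integer as an unsigned value of the given bit width."""
    return value % (1 << bits)


if __name__ == "__main__":
    print(c_rem(-1, 64))                  # signed int case
    print(c_rem(as_unsigned(-1, 32), 64)) # 32-bit unsigned case
    print(c_rem(as_unsigned(-1, 8), 64))  # u_char case: 255 % 64
```

A negative remainder is not a valid index into a 64-entry array, while the unsigned variants both land on the last slot, which is consistent with the log entry's switch to u_char.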
Jeff Roberson	fc3a97dcb7	- Implement much more intelligent ipi sending. This algorithm tries to minimize IPIs and rescheduling when scheduling like tasks while keeping latency low for important threads. 1) An idle thread is running. 2) The current thread is worse than realtime and the new thread is better than realtime. Realtime to realtime doesn't preempt. 3) The new thread's priority is less than the threshold.	2007-01-25 23:51:59 +00:00
Jeff Roberson	1461899028	- Get rid of the unused DIDRUN flag. This was really only present to support sched_4bsd. - Rename the KTR level for non schedgraph parsed events. They take event space from things we'd like to graph. - Reset our slice value after we sleep. The slice is simply there to prevent starvation among equal priorities. A thread which had almost exhausted its slice and then slept doesn't need to be rescheduled a tick after it wakes up. - Set the maximum slice value to a more conservative 100ms now that it is more accurately enforced.	2007-01-25 19:14:11 +00:00
Jeff Roberson	9a93305a2e	- With a sleep time over 2097 seconds hzticks and slptime could end up negative. Use unsigned integers for sleep and run time so this doesn't disturb sched_interact_score(). This should fix the invalid interactive priority panics reported by several users.	2007-01-24 18:18:43 +00:00
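The 2097-second figure in the entry above is consistent with a 32-bit signed overflow, which the following sketch reproduces. It assumes hz = 1000 and a 10-bit fixed-point shift applied to the tick count. Both values are assumptions chosen because they reproduce the figure, not details taken from this log, and the helper names are hypothetical.

```python
HZ = 1000    # assumed clock rate (ticks per second)
SHIFT = 10   # assumed fixed-point shift applied to tick counts


def to_int32(x):
    """Wrap an integer into the signed 32-bit range, as a C int would."""
    return (x + 2**31) % 2**32 - 2**31


def to_uint32(x):
    """Wrap an integer into the unsigned 32-bit range."""
    return x % 2**32


def scaled_sleep(seconds):
    """Sleep time converted to ticks and then shifted into fixed point."""
    return (seconds * HZ) << SHIFT


if __name__ == "__main__":
    for secs in (2097, 2098):
        raw = scaled_sleep(secs)
        print(secs, to_int32(raw), to_uint32(raw))
```

Under these assumptions the signed value turns negative just past 2097 seconds, while the unsigned representation still holds the correct magnitude.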
Jeff Roberson	7a5e5e2a59	- Catch up to setrunqueue/choosethread/etc. api changes. - Define our own maybe_preempt() as sched_preempt(). We want to be able to preempt idlethread in all cases. - Define our idlethread to require preemption to exit. - Get the cpu estimation tick from sched_tick() so we don't have to worry about errors from a sampling interval that differs from the time domain. This was the source of sched_priority prints/panics and inaccurate pctcpu display in top.	2007-01-23 08:50:34 +00:00
Jeff Roberson	5cea64d54f	- Disable the long-term load balancer. I believe that steal_busy works better and gives more predictable results.	2007-01-20 21:24:05 +00:00
Jeff Roberson	c95d2db298	- We do need to IPI the idlethread on some systems. It may be stuck in a power saving mode otherwise. - If the thread is already bound in sched_bind() unbind it before re-binding it to a new cpu. I don't like these semantics but they are expected by some code in the tree. Patch by jkoshy.	2007-01-20 17:03:33 +00:00
Jeff Roberson	6b2f763f7c	- In tdq_transfer() always set NEEDRESCHED when necessary regardless of the ipi settings. If NEEDRESCHED is set and an ipi is later delivered it will clear it rather than cause extra context switches. However, if we miss setting it we can have terrible latency. - In sched_bind() correctly implement bind. Also be slightly more tolerant of code which calls bind multiple times. However, we don't change binding if another call is made with a different cpu. This does not presently work with hwpmc which I believe should be changed.	2007-01-20 09:03:43 +00:00
Jeff Roberson	7b8bfa0de9	Major revamp of ULE's cpu load balancing: - Switch back to direct modification of remote CPU run queues. This added a lot of complexity with questionable gain. It's easy enough to reimplement if it's shown to help on huge machines. - Re-implement the old tdq_transfer() call as tdq_pickidle(). Change sched_add() so we have selectable cpu choosers and simplify the logic a bit here. - Implement tdq_pickpri() as the new default cpu chooser. This algorithm is similar to Solaris in that it tries to always run the threads with the best priorities. It is actually slightly more complex than solaris's algorithm because we also tend to favor the local cpu over other cpus which has a boost in latency but also potentially enables cache sharing between the waking thread and the woken thread. - Add a bunch of tunables that can be used to measure effects of different load balancing strategies. Most of these will go away once the algorithm is more definite. - Add a new mechanism to steal threads from busy cpus when we idle. This is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The threshold is the required length of a tdq's run queue before another cpu will be able to steal runnable threads. This prevents most queue imbalances that contribute to long latencies.	2007-01-19 21:56:08 +00:00
Jeff Roberson	eddb4efacd	- Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer divide faults in roundup() later if it is able to return 0. For some reason this bug only shows up on my laptop and not my testboxes.	2007-01-06 12:33:43 +00:00
Jeff Roberson	1e516cf534	- Fix the sched_priority() invalid priority bugs. Use roundup() instead of max() when computing the divisor in SCHED_TICK_PRI(). This prevents cases where rounding down would allow the quotient to exceed SCHED_PRI_RANGE. - Garbage collect some unused flags and fields. - Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply duplicated this functionality. - Re-enable the rebalancer by default and fix the sysctl so it can be modified.	2007-01-06 08:44:13 +00:00
Jeff Roberson	9330bbbb61	- Don't IPI unless we're going to interrupt something exiting in the kernel. otherwise we can afford the latency. This makes a significant performance improvement.	2007-01-06 02:34:23 +00:00
Jeff Roberson	155b6ca12b	- Fix a comparison in sched_choose() that caused cpus to be constantly marked idle, thus breaking cpu load balancing. - Change sched_interact_update() to fix cases where the stored history has expanded significantly rather than handling them in the callers. This fixes a case where sched_priority() could compute a bad value. - Add a sysctl to disable the global load balancer for experimentation.	2007-01-05 23:45:38 +00:00
Jeff Roberson	8ab80cf009	- ftick was initialized to -1 for init and any of its children. Fix this by setting ftick = ltick = ticks in schedinit(). - Update the priority when we are pulled off of the run queue and when we are inserted onto the run queue so that it more accurately reflects our present status. This is important for efficient priority propagation functioning. - Move the frequency test into sched_pctcpu_update() so we don't repeat it each time we'd like to call it. - Put some temporary work-around code in sched_priority() in case the tick mechanism produces a bad priority. Eventually this should revert to an assert again.	2007-01-05 08:50:38 +00:00
Jeff Roberson	3f872f85d2	- Only allow the tdq_idx to increase by one each tick rather than up to the most recently chosen index. This significantly improves nice behavior. This allows a lower priority thread to run some multiple of times before the higher priority thread makes it to the front of the queue. A nice +20 cpu hog now only gets ~5% of the cpu when running with a nice 0 cpu hog and about 1.5% with a nice -20 hog. A nice difference of 1 makes a 4% difference in cpu usage between two hogs. - Track a separate insert and removal index. When the removal index is empty it is updated to point at the current insert index. - Don't remove and re-add a thread to the runq when it is being adjusted down in priority. - Pull some conditional code out of sched_tick(). It's looking a bit large now.	2007-01-04 12:16:19 +00:00
Jeff Roberson	e7d50326de	ULE 2.0: - Remove the double queue mechanism for timeshare threads. It was slow due to excess cache lines in play, caused suboptimal scheduling behavior with niced and other non-interactive processes, complicated priority lending, etc. - Use a circular queue with a floating starting index for timeshare threads. Enforces fairness by moving the insertion point closer to threads with worse priorities over time. - Give interactive timeshare threads real-time user-space priorities and place them on the realtime/ithd queue. - Select non-interactive timeshare thread priorities based on their cpu utilization over the last 10 seconds combined with the nice value. This gives us more sane priorities and behavior in a loaded system as compared to the old method of using the interactivity score. The interactive score quickly hit a ceiling if threads were non-interactive and penalized new hog threads. - Use one slice size for all threads. The slice is not currently dynamically set to adjust scheduling behavior of different threads. - Add some new sysctls for scheduling parameters. Bug fixes/Clean up: - Fix zeroing of td_sched after initialization in sched_fork_thread() caused by recent ksegrp removal. - Fix KSE interactivity issues related to frequent forking and exiting of kse threads. We simply disable the penalty for thread creation and exit for kse threads. - Cleanup the cpu estimator by using tickincr here as well. Keep ticks and ltick/ftick in the same frequency. Previously ticks were stathz and others were hz. - Lots of new and updated comments. - Many many others. Tested on: up x86/amd64, 8way amd64.	2007-01-04 08:56:25 +00:00
Jeff Roberson	c02bbb43a0	- More search and replace prettying.	2006-12-29 12:55:32 +00:00

419 Commits