freebsd-skq

Author	SHA1	Message	Date
kib	3c2ec84650	Restore UP build. Reviewed by: truckman Sponsored by: The FreeBSD Foundation	2018-02-23 18:26:31 +00:00
truckman	1e9bbe93d1	Decrease latency by not wrapping the idle loop's potentially lengthy search for a thread to steal inside a critical section. Since this allows the search to be preempted, restart the search if preemption happens since the search results found earlier may no longer be valid. Decrease the latency of starting a thread that may be assigned to this CPU during the search by polling for incoming threads during the search and switching to that thread instead of continuing the search. Test for stale search results and restart the search before going through the expense of calling tdq_lock_pair(). Retry some tests after grabbing the locks since things may have changed while waiting to get both locks. Eliminate special case handling for stealing from an SMT peer that uses 1 as the steal threshold. This can only succeed if a thread has been assigned but our SMT peer has not yet started executing it. This is quite rare and when it happens the other SMT thread is generally waiting for the same tdq lock that we hold. Basically both SMT threads are racing to grab the same spin lock. Add the kern.sched.always_steal knob from a ULE patch by jeff@. Incorporate another idea from Jeff's ULE patch. If the sched_switch() detects that the CPU is about to go idle, try to steal a thread before switching to the idle thread. Since the search for a thread to steal has to be done inside a critical section in this context, limit the impact on latency by adding the knob kern.sched.trysteal_limit to limit the topological distance of the search and don't restart the search if we detect stale results. If this search can't find an stealable thread, the idle loop can do a more complete search. Also poll for threads being assigned to this CPU during the search and switch to them instead of continuing the search. This change is responsibile for the majority of the improvement in parallel buildworld times. In sched_balance_group() change the minimum threshold from stealing a thread from 1 to 2. Poaching a newly assigned thread from a CPU that is waking up hasn't yet switched to that thread from idle is likely very rare and is likely to have the same lock race as is seen when stealing threads in the idle loop. Also use tdq_notify() to kick the destintation CPU instead of always sending an IPI. Update a stale comment, the number of transferable threads is not calculated. Reviewed by: kib (earlier version) Comments by: avg, jeff, mav MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12130	2018-02-23 00:12:51 +00:00
vangyzen	f84320cfc8	sched_ule: update a comment to reflect reality MFC after: 3 days Sponsored by: Dell EMC	2018-02-22 17:09:26 +00:00
wma	86a31a6bb4	Reverting r328320	2018-01-24 13:57:01 +00:00
wma	d8d083c4f2	ULE: provide defaults to ts_cpu Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0. This had no practical effect since a real CPU was chosen immediately by the scheduler. However, on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU when assigned the initial choice of the old one. This caused an attempt to use illegal memory and a crash (or, more usually, a deadlock). Fix this by assigned new threads to the BSP explicitly and add some asserts to see that this problem does not recur. Authored by: Nathan Whitehorn <nwhitehorn@freebsd.org> Submitted by: Wojciech Macek <wma@semihalf.com> Obtained from: Semihalf Differential revision: https://reviews.freebsd.org/D13932	2018-01-24 07:54:05 +00:00
jeff	94c7af8ca2	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
hselasky	7213d95f20	The sched_add() function is not only used when the thread is initially started, but also by the turnstiles to mark a thread as runnable for all locks, for instance sleepqueues do: setrunnable()->sched_wakeup()->sched_add() In r326218 code was added to allow booting from non-zero CPU numbers by setting the ts_cpu field inside the ULE scheduler's sched_add() function. This had an undesired side-effect that prior sched_pin() and sched_bind() calls got disregarded. This patch fixes the initialization of the ts_cpu field for the ULE scheduler to only happen once when the initial thread is constructed during system init. Forking will then later on ensure that a valid ts_cpu value gets copied to all children. Reviewed by: jhb, kib Discussed with: nwhitehorn MFC after: 1 month Differential revision: https://reviews.freebsd.org/D13298 Sponsored by: Mellanox Technologies	2017-11-29 23:28:40 +00:00
pfg	cc22a86800	sys/kern: adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:20:12 +00:00
nwhitehorn	3cd3799a91	Remove some, but not all, assumptions that the BSP is CPU 0 and that CPUs are numbered densely from there to n_cpus. MFC after: 1 month	2017-11-25 23:41:05 +00:00
mjg	8f29191fc8	Don't take Giant for SMP status and cpu topology sysctls. Not only this lock doesn't play any role here, dirtying it slows down other things a little bit as giant-held checks (e.g. DROP_GIANT) are spread all over the kernel. MFC after: 1 week	2017-10-18 22:00:44 +00:00
avg	83025f4a68	move thread switch tracing from mi_switch to sched_switch This is done so that the thread state changes during the switch are not confused with the thread state changes reported when the thread spins on a lock. Here is an example, three consecutive entries for the same thread (from top to bottom): KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)" KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1" KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none The above trace could leave an impression that the final state of the thread was "running". After this change the sleep state will be reported after the "spinning" and "running" states reported for the sched lock. Reviewed by: jhb, markj MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9961	2017-03-23 08:57:04 +00:00
avg	2758f60873	trace thread running state when a thread is run for the first time This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing. MFC after: 10 days	2017-03-11 15:57:41 +00:00
markj	8b1baed602	Fix a ticks comparison in sched_pctcpu_update(). We may fail to reset the %CPU tracking window if a thread does not run for over half of the ticks rollover period, resulting in a bogus %CPU value for the thread until ticks fully rolls over. Handle this by comparing the unsigned difference ticks - ts_ltick with SCHED_TICK_TARG instead. Reviewed by: cem, jeff MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-03-03 20:57:40 +00:00
rstone	c310497b05	Revert r313814 and r313816 Something evidently got mangled in my git tree in between testing and review, as an old and broken version of the patch was apparently submitted to svn. Revert this while I work out what went wrong. Reported by: tuexen Pointy hat to: rstone	2017-02-16 21:18:31 +00:00
rstone	aef5488faa	Fix a typo in my previous commit Somehow in the late stages of testing my sched_ule patch, a character was accidentally deleted from the file. Correct this. While I'm committing anyway, the previous commit message requires some clarification: in the normal case of unlending priority after releasing a mutex, the thread that was doing the lending will be woken up and immediately become the highest-priority thread, and in that case no priority inversion would take place. However, if that thread is pinned to a different CPU, then the currently running thread that just had its priority lowered will not be preempted and then priority inversion can occur. Reported by: O. Hartmann (typo), jhb (scheduler clarification) MFC after: 1 month Pointy hat to: rstone	2017-02-16 20:06:21 +00:00
rstone	943126a914	Check for preemption after lowering a thread's priority When a high-priority thread is waiting for a mutex held by a low-priority thread, it temporarily lends its priority to the low-priority thread to prevent priority inversion. When the mutex is released, the lent priority is revoked and the low-priority thread goes back to its original priority. When the priority of that thread is lowered (through a call to sched_priority()), the schedule was not checking whether there is now a high-priority thread in the run queue. This can cause threads with real-time priority to be starved in the run queue while the low-priority thread finishes its quantum. Fix this by explicitly checking whether preemption is necessary when a thread's priority is lowered. Sponsored by: Dell EMC Isilon Obtained from: Sandvine Inc Differential Revision: https://reviews.freebsd.org/D9518 Reviewed by: Jeff Roberson (ule) MFC after: 1 month	2017-02-16 19:41:13 +00:00
avg	96691e46d1	fix a thread preemption regression in schedulers introduced in r270423 Commit r270423 fixed a regression in sched_yield() that was introduced in earlier changes. Unfortunately, at the same time it introduced an new regression. The problem is that SWT_RELINQUISH (6), like all other SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags & SWT_RELINQUISH) is true in cases where that was not really indended, for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11). A straight forward fix would be to use (flags & SW_TYPE_MASK) == SWT_RELINQUISH, but my impression is that the switch types are designed mostly for gathering statistics, not for influencing scheduling decisions. So, I decided that it would be better to check for SW_PREEMPT flag instead. That's also the same flag that was checked before r239157. I double-checked how that flag is used and I am confident that the flag is set only in the places where we really have the preemption: - critical_exit + td_owepreempt - sched_preempt in the ULE scheduler - sched_preempt in the 4BSD scheduler Reviewed by: kib, mav MFC after: 4 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9230	2017-01-19 18:46:41 +00:00
cem	b2000e56f9	"Buses" is the preferred plural of "bus" Replace archaic "busses" with modern form "buses." Intentionally excluded: * Old/random drivers I didn't recognize * Old hardware in general * Use of "busses" in code as identifiers No functional change. http://grammarist.com/spelling/buses-busses/ PR: 216099 Reported by: bltsrc at mail.ru Sponsored by: Dell EMC Isilon	2017-01-15 17:54:01 +00:00
kib	ef5f88c357	Get rid of struct proc p_sched and struct thread td_sched pointers. p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for the thread0. Avoid useless indirection, instead calculate td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate layout of the dynamic allocation. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D6711	2016-06-05 17:04:03 +00:00
kib	f16910a47e	The struct thread td_estcpu member is only used by the 4BSD scheduler. Move it to the struct td_sched for 4BSD, removing always present field, otherwise unused for ULE. New scheduler method sched_estcpu() returns the estimation for kinfo_proc consumption. As before, it always returns 0 for ULE. Remove sched_tick() scheduler method, unused both by 4BSD and ULE. Update locking comment for the 4BSD struct td_sched, copying it from the same comment for ULE. Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment. Based on some notes from, and reviewed by: bde Sponsored by: The FreeBSD Foundation	2016-04-17 11:04:27 +00:00
gnn	304a5ae6cb	Summary: Add the interactivity equations to the header comment for our interactivity calculation routine. Suggested by: rwatson	2015-08-26 16:36:41 +00:00
jhb	4678a09484	kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are setup. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3193	2015-08-03 20:43:36 +00:00
kib	8b3d38af03	Change the mb() use in the sched_ult tdq_notify() and sched_idletd() to more C11-ish atomic_thread_fence_seq_cst(). Note that on PowerPC, which currently uses lwsync for mb(), the change actually fixes the missed store/load barrier, intended by r271604 []. Reviewed by: alc Noted by: alc [] Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-10 08:54:12 +00:00
pfg	2afd4cc980	Relocate sched_random() within the SMP section. Place sched_random nearer to where it's first used: moving the code nearer to where it is used makes the code easier to read and we can reduce the initial "#ifdef SMP" island. Reword a little the comment and clean some whitespaces while here.	2015-07-07 15:22:29 +00:00
ian	56e931511b	Use sbuf_new_for_sysctl() instead of plain sbuf_new() to ensure sysctl string returned to userland is nulterminated. PR: 195668	2015-03-14 18:42:30 +00:00
imp	25b5dc1900	Put back Andy's void for gcc happiness. Submitted by: jchandra@	2015-02-27 23:14:08 +00:00
imp	c20b5fa0f2	Make sched_random() return an unsigned number, and use uint32_t consistently. This also matches the per-cpu pointer declaration anyway. This changes the tweak we give to the load from -32..31 to be 0..31 which seems more inline with the rest of the code (- rnd and the -= 64). It should also provide the randomness we need, and may fix a signedness bug in the old code (it isn't clear that the effect was intentional as opposed to sloppy, and the right shift of a signed value is undefined to boot). This stores sched_balance() behavior when it used random(). Differential Revision: https://reviews.freebsd.org/D1981	2015-02-27 21:15:12 +00:00
andrew	078d1e1d3a	Fix sched_ule on sparc64, gcc complains sched_random is not a correct prototype. Sponsored by: The FreeBSD Foundation	2015-02-27 15:05:20 +00:00
andrew	7df93e0ecc	sched_random is only called for SMP, only define it there. Sponsored by: The FreeBSD Foundation	2015-02-27 12:38:24 +00:00
imp	ace438b760	Create sched_rand() and move the LCG code into that. Call this when we need randomness in ULE. This removes random() call from the rebalance interval code. Submitted by: Harrison Grundy Differential Revision: https://reviews.freebsd.org/D1968	2015-02-27 02:56:58 +00:00
adrian	9b44fe556b	Update the ULE scheduler + thread and kinfo structs to use int for cpuid rather than u_char. To try and play nice with the ABI, the u_char CPU ID values are clamped at 254. The new fields now contain the full CPU ID, or -1 for no cpu. Differential Revision: D955 Reviewed by: jhb, kib Sponsored by: Norse Corp, Inc.	2014-10-18 19:36:11 +00:00
mav	d4e6695660	Reprase r271616 comments. Submitted by: alc MFC after: 1 month	2014-09-17 17:43:32 +00:00
mav	023a2a140b	Add comments describing r271604 change. MFC after: 3 days	2014-09-15 11:17:36 +00:00
mav	1cfaa7d62b	Add couple memory barries to serialize tdq_cpu_idle and tdq_load accesses. This change fixes transient performance drops in some of my benchmarks, vanishing as soon as I am trying to collect any stats from the scheduler. It looks like reordered access to those variables sometimes caused loss of IPI_PREEMPT, that delayed thread execution until some later interrupt. MFC after: 3 days	2014-09-14 22:13:19 +00:00
mav	f7d8522961	Restore pre-r239157 handling of sched_yield(), when thread time slice was aborted, allowing other threads to run. Without this change thread is just rescheduled again, that was illustrated by provided test tool. PR: 192926 Submitted by: eric@vangyzen.net MFC after: 2 weeks	2014-08-23 17:31:56 +00:00
kib	fd64376ccf	Micro-manage clang to get the expected inlining for cpu_search(). Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline, while cpu_search() gets always_inline. With the attributes set, cpu_search() is inlined in wrappers, and if()s with constant conditionals are optimized. On some tests on many-core machine, the hwpmc reported samples for cpu_search*() are reduced from 25% total to 9%. Submitted by: "Rang, Anton" <anton.rang@isilon.com> MFC after: 1 week	2014-07-03 11:06:27 +00:00
kib	00a919ea8f	Remove write-only local variable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:56:25 +00:00
attilio	060e6c4c4b	Fix GENERIC build.	2014-03-19 00:38:27 +00:00
jeff	216dedc4bf	- Make runq_steal_from more aggressive. Previously it would examine only a single priority queue. If that queue had a thread or threads which could not be migrated we would fail to steal load. This could cause starvation in situations where cores are idle. Submitted by: Doug Kilpatrick <dkilpatrick@isilon.com> Tested by: pho Reviewed by: mav Sponsored by: EMC / Isilon Storage Division	2014-03-08 00:35:06 +00:00
nwhitehorn	a7d50b7d03	ULE works on Book-E since r258002, so remove statements to the contrary.	2014-02-01 20:46:35 +00:00
dim	a95ffecba4	In sys/kern/sched_ule.c, remove static function sched_both(), which is unused since r232207. MFC after: 3 days	2013-12-25 16:25:54 +00:00
jhb	aaed7cd3d5	Fix an off-by-one error in r228960. The maximum priority delta provided by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting priority value (before nice adjustment) is between SCHED_PRI_MIN and SCHED_PRI_MAX, inclusive. Submitted by: kib Reported by: pho MFC after: 1 week	2013-12-03 14:50:12 +00:00
avg	71889a5eff	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks	2013-11-26 08:46:27 +00:00
attilio	7ee4e910ce	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
mav	e372a1314d	Micro-optimize cpu_search(), allowing compiler to use more efficient inline ffsl() implementation, when it is available, instead of homegrown iteration. On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.	2013-09-07 15:16:30 +00:00
gnn	59c782b807	Point args[0] not at the thread that is ending but at the one that is starting. This is in line with practice in OpenSolaris. Note that this change is only in ULE and not in the 4BSD scheduler. Once this change settles in (MFC timeout has expired) we'll try it out on 4BSD as well. PR: 177706 Submitted by: Tiwei Bie MFC after: 1 month	2013-04-15 17:21:02 +00:00
mav	f2dcd36473	Fix bug in r242852 that prevented CPU from becoming idle if kernel built without SMP support.	2012-11-15 14:10:51 +00:00
mav	4edd3bb7df	Several optimizations to sched_idletd(): - Do not try to steal load from other CPUs if there was no contest switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shutdown). If current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under high I/O rate, when TLB shutdowns cause numerous CPU wakeups, on 24-CPU system load stealing code may consume up to 25% of all CPU time without giving any benefits. - Change code that implements spinning for load to restart spin in case of context switch. Previous code periodically called cpu_idle() even under high interrupt/context switch rate. - Rise spinning threshold to 10KHz, where it gives at least some effect that may worth consumed power. Reviewed by: jeff@	2012-11-10 07:02:57 +00:00
jeff	15bd5ad44d	- Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice. Discussed with: mav	2012-11-08 01:46:47 +00:00
attilio	d38d7bb245	Rework the known mutexes to benefit about staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. The sole exception being nvme and sxfge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately. Reviwed by: jimharris, alc	2012-10-31 18:07:18 +00:00

1 2 3 4 5 ...

364 Commits