Commit Graph

355 Commits

Author SHA1 Message Date
Mateusz Guzik
c69a1a50cd Don't take Giant for SMP status and cpu topology sysctls.
Not only this lock doesn't play any role here, dirtying it slows down
other things a little bit as giant-held checks (e.g. DROP_GIANT) are
spread all over the kernel.

MFC after:	1 week
2017-10-18 22:00:44 +00:00
Andriy Gapon
afa0a46cfd move thread switch tracing from mi_switch to sched_switch
This is done so that the thread state changes during the switch
are not confused with the thread state changes reported when the thread
spins on a lock.

Here is an example, three consecutive entries for the same thread (from top to
bottom):

  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)"
  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1"
  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none

The above trace could leave an impression that the final state of
the thread was "running".
After this change the sleep state will be reported after the "spinning"
and "running" states reported for the sched lock.

Reviewed by:	jhb, markj
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9961
2017-03-23 08:57:04 +00:00
Andriy Gapon
28ef18b8c1 trace thread running state when a thread is run for the first time
This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing.

MFC after:	10 days
2017-03-11 15:57:36 +00:00
Mark Johnston
7813302434 Fix a ticks comparison in sched_pctcpu_update().
We may fail to reset the %CPU tracking window if a thread does not run
for over half of the ticks rollover period, resulting in a bogus %CPU
value for the thread until ticks fully rolls over. Handle this by comparing
the unsigned difference ticks - ts_ltick with SCHED_TICK_TARG instead.

Reviewed by:	cem, jeff
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-03-03 20:57:40 +00:00
Ryan Stone
27ee18ad33 Revert r313814 and r313816
Something evidently got mangled in my git tree in between testing and
review, as an old and broken version of the patch was apparently submitted
to svn.  Revert this while I work out what went wrong.

Reported by:	tuexen
Pointy hat to:	rstone
2017-02-16 21:18:31 +00:00
Ryan Stone
3600f4ba35 Fix a typo in my previous commit
Somehow in the late stages of testing my sched_ule patch, a character was
accidentally deleted from the file.  Correct this.

While I'm committing anyway, the previous commit message requires some
clarification: in the normal case of unlending priority after releasing
a mutex, the thread that was doing the lending will be woken up and
immediately become the highest-priority thread, and in that case no
priority inversion would take place.  However, if that thread is pinned
to a different CPU, then the currently running thread that just had its
priority lowered will not be preempted and then priority inversion can
occur.

Reported by:	O. Hartmann (typo), jhb (scheduler clarification)
MFC after:	1 month
Pointy hat to:	rstone
2017-02-16 20:06:21 +00:00
Ryan Stone
09ae7c4814 Check for preemption after lowering a thread's priority
When a high-priority thread is waiting for a mutex held by a
low-priority thread, it temporarily lends its priority to the
low-priority thread to prevent priority inversion.  When the mutex
is released, the lent priority is revoked and the low-priority
thread goes back to its original priority.

When the priority of that thread is lowered (through a call to
sched_priority()), the schedule was not checking whether
there is now a high-priority thread in the run queue.  This can
cause threads with real-time priority to be starved in the run
queue while the low-priority thread finishes its quantum.

Fix this by explicitly checking whether preemption is necessary
when a thread's priority is lowered.

Sponsored by: Dell EMC Isilon
Obtained from: Sandvine Inc
Differential Revision:	https://reviews.freebsd.org/D9518
Reviewed by: Jeff Roberson (ule)
MFC after: 1 month
2017-02-16 19:41:13 +00:00
Andriy Gapon
ad9dadc437 fix a thread preemption regression in schedulers introduced in r270423
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes.  Unfortunately, at the same time it introduced an
new regression.  The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag.  So, (flags &
SWT_RELINQUISH) is true in cases where that was not really indended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).

A straight forward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.

So, I decided that it would be better to check for SW_PREEMPT flag
instead.  That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler

Reviewed by:	kib, mav
MFC after:	4 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9230
2017-01-19 18:46:41 +00:00
Conrad Meyer
db4fcadf52 "Buses" is the preferred plural of "bus"
Replace archaic "busses" with modern form "buses."

Intentionally excluded:
* Old/random drivers I didn't recognize
  * Old hardware in general
* Use of "busses" in code as identifiers

No functional change.

http://grammarist.com/spelling/buses-busses/

PR:		216099
Reported by:	bltsrc at mail.ru
Sponsored by:	Dell EMC Isilon
2017-01-15 17:54:01 +00:00
Konstantin Belousov
93ccd6bf87 Get rid of struct proc p_sched and struct thread td_sched pointers.
p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0.  Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D6711
2016-06-05 17:04:03 +00:00
Konstantin Belousov
ccd0ec4066 The struct thread td_estcpu member is only used by the 4BSD scheduler.
Move it to the struct td_sched for 4BSD, removing always present
field, otherwise unused for ULE.

New scheduler method sched_estcpu() returns the estimation for
kinfo_proc consumption.  As before, it always returns 0 for ULE.

Remove sched_tick() scheduler method, unused both by 4BSD and ULE.

Update locking comment for the 4BSD struct td_sched, copying it from
the same comment for ULE.

Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.

Based on some notes from, and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2016-04-17 11:04:27 +00:00
George V. Neville-Neil
57031f7912 Summary: Add the interactivity equations to the header comment for our
interactivity calculation routine.

Suggested by: rwatson
2015-08-26 16:36:41 +00:00
John Baldwin
92de34df2c kgdb uses td_oncpu to determine if a thread is running and should use
a pcb from stoppcbs[] rather than the thread's PCB.  However, exited threads
retained td_oncpu from the last time they ran, and newborn threads had their
CPU fields cleared to zero during fork and thread creation since they are
in the set of fields zeroed when threads are setup.  To fix, explicitly
update the CPU fields for exiting threads in sched_throw() to reflect the
switch out and reset the CPU fields for new threads in sched_fork_thread()
to NOCPU.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3193
2015-08-03 20:43:36 +00:00
Konstantin Belousov
e8677f3885 Change the mb() use in the sched_ult tdq_notify() and sched_idletd()
to more C11-ish atomic_thread_fence_seq_cst().

Note that on PowerPC, which currently uses lwsync for mb(), the change
actually fixes the missed store/load barrier, intended by r271604 [*].

Reviewed by:	alc
Noted by:	alc [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-07-10 08:54:12 +00:00
Pedro F. Giffuni
9129dd59be Relocate sched_random() within the SMP section.
Place sched_random nearer to where it's first used: moving the
code nearer to where it  is used makes the code easier to read
and we can reduce the initial "#ifdef SMP" island.

Reword a little the comment and clean some whitespaces
while here.
2015-07-07 15:22:29 +00:00
Ian Lepore
b97fa22cd6 Use sbuf_new_for_sysctl() instead of plain sbuf_new() to ensure sysctl
string returned to userland is nulterminated.

PR:           195668
2015-03-14 18:42:30 +00:00
Warner Losh
5837276ce2 Put back Andy's void for gcc happiness.
Submitted by:	jchandra@
2015-02-27 23:14:08 +00:00
Warner Losh
b250ad3499 Make sched_random() return an unsigned number, and use uint32_t
consistently. This also matches the per-cpu pointer declaration
anyway.

This changes the tweak we give to the load from -32..31 to be 0..31
which seems more inline with the rest of the code (- rnd and the -=
64). It should also provide the randomness we need, and may fix a
signedness bug in the old code (it isn't clear that the effect was
intentional as opposed to sloppy, and the right shift of a signed
value is undefined to boot).

This stores sched_balance() behavior when it used random().

Differential Revision: https://reviews.freebsd.org/D1981
2015-02-27 21:15:12 +00:00
Andrew Turner
ccc41f3e66 Fix sched_ule on sparc64, gcc complains sched_random is not a correct
prototype.

Sponsored by:	The FreeBSD Foundation
2015-02-27 15:05:20 +00:00
Andrew Turner
09d0653552 sched_random is only called for SMP, only define it there.
Sponsored by:	The FreeBSD Foundation
2015-02-27 12:38:24 +00:00
Warner Losh
0567b6cc16 Create sched_rand() and move the LCG code into that. Call this when
we need randomness in ULE. This removes random() call from the
rebalance interval code.

Submitted by: Harrison Grundy
Differential Revision: https://reviews.freebsd.org/D1968
2015-02-27 02:56:58 +00:00
Adrian Chadd
e77f9fed15 Update the ULE scheduler + thread and kinfo structs to use int for cpuid
rather than u_char.

To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254.  The new fields now contain the full CPU ID, or -1 for no cpu.

Differential Revision:	D955
Reviewed by:	jhb, kib
Sponsored by:	Norse Corp, Inc.
2014-10-18 19:36:11 +00:00
Alexander Motin
ae9e9b4fda Reprase r271616 comments.
Submitted by:	alc
MFC after:	1 month
2014-09-17 17:43:32 +00:00
Alexander Motin
7965496958 Add comments describing r271604 change.
MFC after:	3 days
2014-09-15 11:17:36 +00:00
Alexander Motin
7e9b58eaaa Add couple memory barries to serialize tdq_cpu_idle and tdq_load accesses.
This change fixes transient performance drops in some of my benchmarks,
vanishing as soon as I am trying to collect any stats from the scheduler.
It looks like reordered access to those variables sometimes caused loss of
IPI_PREEMPT, that delayed thread execution until some later interrupt.

MFC after:	3 days
2014-09-14 22:13:19 +00:00
Alexander Motin
2e7d7bb294 Restore pre-r239157 handling of sched_yield(), when thread time slice was
aborted, allowing other threads to run.  Without this change thread is just
rescheduled again, that was illustrated by provided test tool.

PR:		192926
Submitted by:	eric@vangyzen.net
MFC after:	2 weeks
2014-08-23 17:31:56 +00:00
Konstantin Belousov
2499a5ccef Micro-manage clang to get the expected inlining for cpu_search().
Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline,
while cpu_search() gets always_inline.  With the attributes set,
cpu_search() is inlined in wrappers, and if()s with constant
conditionals are optimized.

On some tests on many-core machine, the hwpmc reported samples for
cpu_search*() are reduced from 25% total to 9%.

Submitted by:	"Rang, Anton" <anton.rang@isilon.com>
MFC after:	1 week
2014-07-03 11:06:27 +00:00
Konstantin Belousov
a288c757d4 Remove write-only local variable.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:56:25 +00:00
Attilio Rao
c149e542a5 Fix GENERIC build. 2014-03-19 00:38:27 +00:00
Jeff Roberson
8bc713f6c5 - Make runq_steal_from more aggressive. Previously it would examine only
a single priority queue.  If that queue had a thread or threads which
   could not be migrated we would fail to steal load.  This could cause
   starvation in situations where cores are idle.

Submitted by:	Doug Kilpatrick <dkilpatrick@isilon.com>
Tested by:	pho
Reviewed by:	mav
Sponsored by:	EMC / Isilon Storage Division
2014-03-08 00:35:06 +00:00
Nathan Whitehorn
a8a9b1c250 ULE works on Book-E since r258002, so remove statements to the contrary. 2014-02-01 20:46:35 +00:00
Dimitry Andric
3371b88c7b In sys/kern/sched_ule.c, remove static function sched_both(), which is
unused since r232207.

MFC after:	3 days
2013-12-25 16:25:54 +00:00
John Baldwin
5457fa234b Fix an off-by-one error in r228960. The maximum priority delta provided
by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting
priority value (before nice adjustment) is between SCHED_PRI_MIN and
SCHED_PRI_MAX, inclusive.

Submitted by:	kib
Reported by:	pho
MFC after:	1 week
2013-12-03 14:50:12 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Alexander Motin
58909b74b9 Micro-optimize cpu_search(), allowing compiler to use more efficient inline
ffsl() implementation, when it is available, instead of homegrown iteration.

On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
2013-09-07 15:16:30 +00:00
George V. Neville-Neil
8f2ba63493 Point args[0] not at the thread that is ending but at the one that
is starting.  This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR:		177706
Submitted by:	Tiwei Bie
MFC after:	1 month
2013-04-15 17:21:02 +00:00
Alexander Motin
2fd4047f32 Fix bug in r242852 that prevented CPU from becoming idle if kernel built
without SMP support.
2012-11-15 14:10:51 +00:00
Alexander Motin
2c27cb3a34 Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there was no contest switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If current CPU was idle, then it is quite unlikely that some
other CPU has load to steal.  Under high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on 24-CPU system load stealing code may consume up to
25% of all CPU time without giving any benefits.
 - Change code that implements spinning for load to restart spin in case of
context switch.  Previous code periodically called cpu_idle() even under
high interrupt/context switch rate.
 - Rise spinning threshold to 10KHz, where it gives at least some effect
that may worth consumed power.

Reviewed by:	jeff@
2012-11-10 07:02:57 +00:00
Jeff Roberson
5e5c387373 - Change ULE to use dynamic slice sizes for the timeshare queue in order
to further reduce latency for threads in this queue.  This should help
   as threads transition from realtime to timeshare.  The latency is
   bound to a max of sched_slice until we have more than sched_slice / 6
   threads runnable.  Then the min slice is allotted to all threads and
   latency becomes (nthreads - 1) * min_slice.

Discussed with: mav
2012-11-08 01:46:47 +00:00
Attilio Rao
4ceaf45de5 Rework the known mutexes to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct mtx_padalign.

The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviwed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
Attilio Rao
a049aa05c9 tdq_lock_pair() already does spinlock_enter() so migration is not
possible in sched_balance_pair(). Remove redundant sched_pin().

Reviewed by:	marius, jeff
2012-10-30 12:25:52 +00:00
Jim Harris
39f819e2fc Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock.  Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).

Sponsored by:	Intel
Reviewed by:	jeff
2012-10-24 18:36:41 +00:00
Eitan Adler
db702c59cf remove duplicate semicolons where possible.
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:00:37 +00:00
Andriy Gapon
e87fc7cf7b sched_ule: fix inverted condition in reporting of priority lending via ktr
Reviewed by:	kan
MFC after:	1 week
2012-09-14 19:55:28 +00:00
John Baldwin
ba96d2d816 Mark the idle threads as non-sleepable and also assert that an idle
thread never blocks on a turnstile.
2012-08-22 20:01:38 +00:00
Alexander Motin
37f4e0254f Some more minor tunings inspired by bde@. 2012-08-11 20:24:39 +00:00
Alexander Motin
bf89d544d0 Allow idle threads to steal second threads from other cores on systems with
8 or more cores to improve utilization.  None of my tests on 2xXeon (2x6x2)
system shown any slowdown from mentioned "excess thrashing".  Same time in
pbzip2 test with number of threads more then number of CPUs I see up to 10%
speedup with SMT disabled and up 5% with SMT enabled.  Thinking about
trashing I was trying to limit that stealing within same last level cache,
but got only worse results.  Present code any way prefers to steal threads
from topologically closer cores.

Sponsored by:	iXsystems, Inc.
2012-08-11 15:08:19 +00:00
Alexander Motin
579895df01 Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
 - restore (4BSD) and implement (ULE) hogticks variable setting;
 - make sched_rr_interval() more tolerant to options;
 - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
 - tune some sysctl descriptions;
 - make some style fixes.
2012-08-10 19:02:49 +00:00
Alexander Motin
3d7f41175d Rework r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.

Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.

On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.

Reviewed by:	fabient
MFC after:	2 months
Sponsored by:	iXsystems, Inc.
2012-08-09 19:26:13 +00:00