to more C11-ish atomic_thread_fence_seq_cst().
Note that on PowerPC, which currently uses lwsync for mb(), the change
actually fixes the missed store/load barrier, intended by r271604 [*].
Reviewed by: alc
Noted by: alc [*]
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Place sched_random nearer to where it's first used: moving the
code nearer to where it is used makes the code easier to read
and we can reduce the initial "#ifdef SMP" island.
Reword a little the comment and clean some whitespaces
while here.
consistently. This also matches the per-cpu pointer declaration
anyway.
This changes the tweak we give to the load from -32..31 to be 0..31
which seems more inline with the rest of the code (- rnd and the -=
64). It should also provide the randomness we need, and may fix a
signedness bug in the old code (it isn't clear that the effect was
intentional as opposed to sloppy, and the right shift of a signed
value is undefined to boot).
This stores sched_balance() behavior when it used random().
Differential Revision: https://reviews.freebsd.org/D1981
we need randomness in ULE. This removes random() call from the
rebalance interval code.
Submitted by: Harrison Grundy
Differential Revision: https://reviews.freebsd.org/D1968
rather than u_char.
To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254. The new fields now contain the full CPU ID, or -1 for no cpu.
Differential Revision: D955
Reviewed by: jhb, kib
Sponsored by: Norse Corp, Inc.
This change fixes transient performance drops in some of my benchmarks,
vanishing as soon as I am trying to collect any stats from the scheduler.
It looks like reordered access to those variables sometimes caused loss of
IPI_PREEMPT, that delayed thread execution until some later interrupt.
MFC after: 3 days
aborted, allowing other threads to run. Without this change thread is just
rescheduled again, that was illustrated by provided test tool.
PR: 192926
Submitted by: eric@vangyzen.net
MFC after: 2 weeks
Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline,
while cpu_search() gets always_inline. With the attributes set,
cpu_search() is inlined in wrappers, and if()s with constant
conditionals are optimized.
On some tests on many-core machine, the hwpmc reported samples for
cpu_search*() are reduced from 25% total to 9%.
Submitted by: "Rang, Anton" <anton.rang@isilon.com>
MFC after: 1 week
a single priority queue. If that queue had a thread or threads which
could not be migrated we would fail to steal load. This could cause
starvation in situations where cores are idle.
Submitted by: Doug Kilpatrick <dkilpatrick@isilon.com>
Tested by: pho
Reviewed by: mav
Sponsored by: EMC / Isilon Storage Division
by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting
priority value (before nice adjustment) is between SCHED_PRI_MIN and
SCHED_PRI_MAX, inclusive.
Submitted by: kib
Reported by: pho
MFC after: 1 week
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
ffsl() implementation, when it is available, instead of homegrown iteration.
On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
is starting. This is in line with practice in OpenSolaris.
Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.
PR: 177706
Submitted by: Tiwei Bie
MFC after: 1 month
- Do not try to steal load from other CPUs if there was no contest switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If current CPU was idle, then it is quite unlikely that some
other CPU has load to steal. Under high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on 24-CPU system load stealing code may consume up to
25% of all CPU time without giving any benefits.
- Change code that implements spinning for load to restart spin in case of
context switch. Previous code periodically called cpu_idle() even under
high interrupt/context switch rate.
- Rise spinning threshold to 10KHz, where it gives at least some effect
that may worth consumed power.
Reviewed by: jeff@
to further reduce latency for threads in this queue. This should help
as threads transition from realtime to timeshare. The latency is
bound to a max of sched_slice until we have more than sched_slice / 6
threads runnable. Then the min slice is allotted to all threads and
latency becomes (nthreads - 1) * min_slice.
Discussed with: mav
cache line in order to avoid manual frobbing but using
struct mtx_padalign.
The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.
Reviwed by: jimharris, alc
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock. Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).
Sponsored by: Intel
Reviewed by: jeff
8 or more cores to improve utilization. None of my tests on 2xXeon (2x6x2)
system shown any slowdown from mentioned "excess thrashing". Same time in
pbzip2 test with number of threads more then number of CPUs I see up to 10%
speedup with SMT disabled and up 5% with SMT enabled. Thinking about
trashing I was trying to limit that stealing within same last level cache,
but got only worse results. Present code any way prefers to steal threads
from topologically closer cores.
Sponsored by: iXsystems, Inc.
- remove extra dynamic variable initializations;
- restore (4BSD) and implement (ULE) hogticks variable setting;
- make sched_rr_interval() more tolerant to options;
- restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
- tune some sysctl descriptions;
- make some style fixes.
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.
Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.
On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.
Reviewed by: fabient
MFC after: 2 months
Sponsored by: iXsystems, Inc.
implementation specific vs. the common architecture definition.
Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under
BOOKE_PPC4XX are not used in the code yet.
This change set is not supposed to affect existing E500 support, it's just
another reorg step before bringing support for E500mc, E5500 and PPC465.
Obtained from: AppliedMicro, Freescale, Semihalf
compatible with the sched provider implemented by Solaris and its open-
source derivatives. Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.
Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire. These probes are defined
to fire when Solaris-specific features perform certain actions. As these
features are not present in FreeBSD, the probes can never fire.
Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change. These probes have been added to make it possible to
collect schedgraph data with DTrace.
Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument. As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.
Sponsored by: Sandvine Incorporated
MFC after: 2 weeks
with HZ rate through the sched_tick() calls from hardclock().
Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.
MFC after: 1 month
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().
MFC after: 2 weeks
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicing mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.
Reviewed by: bde, kib
MFC after: 1 week
at SCHED_PRI_RANGE to prevent overflows in the priority value. This can
happen due to irregularities with clock interrupts under certain
virtualization environments.
Tested by: Larry Rosenman ler lerctr org
MFC after: 2 weeks
With the previous code, if the range of priorities for timeshare batch
threads was greater than RQ_NQS, then the threads with low priorities in
the part of the range above RQ_NQS would be scheduled to the run-queues
as if they had high priorities at the beginning of the range.
In other words, threads with a nice level of +N could be scheduled as
if they had a nice level of -M.
Reported by: George Mitchell <george@m5p.com>
Reviewed by: jhb
Tested by: George Mitchell <george@m5p.com> (earlier version)
MFC after: 1 week
itself, which sparc64 hardware doesn't support. One way to solve this
would be to directly call sched_preempt() instead of issuing a self-IPI.
However, quoting jhb@:
"On the other hand, you can probably just skip the IPI entirely if we are
going to send it to the current CPU. Presumably, once this routine
finishes, the current CPU will exit softlock (or will do so "soon") and
will then pick the next thread to run based on the adjustments made in
this routine, so there's no need to IPI the CPU running this routine
anyway. I think this is the better solution. Right now what is probably
happening on other platforms is as soon as this routine finishes the CPU
processes its self-IPI and causes mi_switch() which will just switch back
to the softclock thread it is already running."
- With r226054 and the the above change in place, sparc64 now no longer is
incompatible with ULE and vice versa. However, powerpc/E500 still is.
Submitted by: jhb
Reviewed by: jeff