Commit Graph

419 Commits

Mitchell Horne
1029dab634 mi_switch(): clean up switch types and their usage
Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.

1. Ensure that every caller provides a type. In most cases, we upgrade
   the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
   a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
   will be documented there.
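
For illustration, a formerly bare yield now looks something like this
(a sketch; exact call sites vary):

    /* Voluntarily yield the CPU, stating why. */
    thread_lock(td);
    mi_switch(SW_VOL | SWT_RELINQUISH);   /* returns with td unlocked */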

Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38184
2023-02-09 12:01:32 -04:00
Konstantin Belousov
c6d31b8306 AST: rework
Make most AST handlers dynamically registered.  This allows
subsystem-specific handler code to live in the subsystem's own files,
instead of making subr_trap.c aware of it.  For instance, the signal
delivery code on return to userspace is now moved to kern_sig.c.

It also allows some handlers to be designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit.  For
instance, ast(), exit1(), and the NFS server no longer need to be aware
of UFS softdep processing.

The dynamic registration also allows third-party modules to register
AST handlers if needed.  There is one caveat with loadable modules: the
code makes no effort to ensure that a module is not unloaded before all
threads have passed through its AST handler.  In fact, this is already
the existing behavior for hwpmc.ko and ufs.ko.  I do not think it is
worth the effort and the runtime overhead to try to fix it.
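
As a sketch of the idea (the handler name, the TDA_MYSUBSYS constant,
and the exact ast_register() signature here are illustrative
assumptions):

    /* A subsystem-local AST handler, registered dynamically. */
    static void
    mysubsys_ast(struct thread *td, int tda)
    {
        /* Work deferred to the return-to-userspace path. */
    }

    ast_register(TDA_MYSUBSYS, ASTR_ASTF_REQUIRED, 0, mysubsys_ast);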

Reviewed by:	markj
Tested by:	emaste (arm64), pho
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:09 +03:00
Mark Johnston
bd980ca847 sched_ule: Ensure we hold the thread lock when modifying td_flags
The load balancer may force a running thread to reschedule and pick a
new CPU.  To do this it sets some flags in the thread running on a
loaded CPU.  But the code assumed that a running thread's lock is the
same as that of the corresponding runqueue, and there are small windows
where this is not true.  In this case, we can end up with non-atomic
modifications to td_flags.

Since this load balancing is best-effort, simply give up if the thread's
lock doesn't match; in this case the thread is about to enter the
scheduler anyway.
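
The guard amounts to something like this (a sketch, not the committed
diff):

    /* Best-effort balancing: leave the thread alone unless it is still
     * locked by this runqueue's lock, keeping td_flags updates
     * serialized. */
    if (td->td_lock != TDQ_LOCKPTR(tdq))
        return;
    td->td_flags |= TDF_NEEDRESCHED;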

Reviewed by:	kib
Reported by:	glebius
Fixes:		e745d729be ("sched_ule(4): Improve long-term load balancer.")
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35821
2022-07-18 15:52:27 -04:00
Mateusz Guzik
6eeba7dbd6 ule: unbreak UP builds
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-07-16 12:45:09 +00:00
John Baldwin
954cffe95d ule: Simplistic time-sharing for interrupt threads.
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.
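
Roughly, in sketch form (field and macro names follow sched_ule
conventions but are assumptions here):

    /* Per tick: an ithread that consumed its whole slice is demoted
     * one queue and a preemption is requested. */
    if (PRI_BASE(td->td_pri_class) == PRI_ITHD && --ts->ts_slice <= 0) {
        sched_prio(td, td->td_base_pri + RQ_PPQ);
        td->td_owepreempt = 1;
    }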

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35644
2022-07-14 13:13:57 -07:00
John Baldwin
fea89a2804 Add sched_ithread_prio to set the base priority of an interrupt thread.
Use it instead of sched_prio when setting the priority of an interrupt
thread.
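
Call sites change along these lines (sketch):

    /* Before: sched_prio(td, pri);
     * After: a dedicated entry point for ithread base priority. */
    sched_ithread_prio(td, pri);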

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35642
2022-07-14 13:13:10 -07:00
Mark Johnston
6cbc4ceb7a sched_ule: Use the correct atomic_load variant for tdq_lowpri
Reported by:	tuexen
Fixes:	11484ad8a2 ("sched_ule: Use explicit atomic accesses for tdq fields")
2022-07-14 15:34:02 -04:00
Mark Johnston
11484ad8a2 sched_ule: Use explicit atomic accesses for tdq fields
Different fields in the tdq have different synchronization protocols.
Some are constant, some are accessed only while holding the tdq lock,
some are modified with the lock held but accessed without the lock, some
are accessed only on the tdq's CPU, and some are not synchronized by the
lock at all.

Convert ULE to stop using volatile and instead use atomic_load_* and
atomic_store_* to provide the desired semantics for lockless accesses.
This makes the intent of the code more explicit, gives more freedom to
the compiler when accesses do not need to be qualified, and lets KCSAN
intercept unlocked accesses.

Thus:
- Introduce macros to provide unlocked accessors for certain fields.
- Use atomic_load/store for all accesses of tdq_cpu_idle, which is not
  synchronized by the mutex.
- Use atomic_load/store for accesses of the switch count, which is
  updated by sched_clock().
- Add some comments to fields of struct tdq describing how accesses are
  synchronized.

No functional change intended.
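
For instance, an unlocked accessor can be wrapped so the relaxed access
is explicit (the macro name is an assumption):

    /* tdq_cpu_idle is read without the tdq lock held. */
    #define TDQ_CPU_IDLE(tdq)   atomic_load_int(&(tdq)->tdq_cpu_idle)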

Reviewed by:	mav, kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35737
2022-07-14 10:45:33 -04:00
Mark Johnston
0927ff7814 sched_ule: Enable preemption of curthread in the load balancer
The load balancer executes from statclock and periodically tries to move
threads among CPUs in order to balance load.  It may move a thread to
the current CPU (the load balancer always runs on CPU 0).  When it
does so, it may need to schedule preemption of the interrupted thread.
Use sched_setpreempt() to do so, same as sched_add().
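
In sketch form (the exact signature is an assumption):

    /* After moving td to the current CPU, request preemption the same
     * way sched_add() would. */
    sched_setpreempt(td->td_priority);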

PR:		264867
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35744
2022-07-14 10:27:58 -04:00
Mark Johnston
6d3f74a14a sched_ule: Fix racy loads of pc_curthread
Thread switching used to be atomic with respect to the current CPU's
tdq lock.  Since commit 686bcb5c14 that is no longer the case.  Now
sched_switch() does this:

1.  lock tdq (might already be locked)
2.  maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3.  unlock tdq
4.  switch CPU context, update curthread

Some code paths in ULE will load pc_curthread from a remote CPU with
that CPU's tdq lock held, usually to inspect its priority.  But, as of
the aforementioned commit this is racy.

The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue.  If the new
thread's priority is higher (lower) than the currently running thread's
priority, then we deliver an IPI.  But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above.  If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliver an IPI,
leaving a high priority thread stuck on the runqueue for longer than it
should.  This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.

Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread.  This ensures
that tdq_notify() has the correct priority value to compare against.
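
The shape of the fix, as a sketch:

    lowpri = tdq->tdq_lowpri;      /* snapshot before the addition */
    tdq_runq_add(tdq, td, flags);  /* may update tdq_lowpri */
    tdq_notify(tdq, lowpri);       /* IPI decision uses the snapshot */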

The other two uses of pc_curthread are susceptible to the same race.  To
fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
value for curthread.  Thus, introduce a new tdq_curthread field to the
tdq which gets updated any time a new thread is selected to run on the
CPU.  Because this field is synchronized by the thread lock, its
priority reflects the correct lowpri value for the tdq.

PR:		264867
Fixes:		686bcb5c14 ("schedlock 4/4")
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35736
2022-07-14 10:27:51 -04:00
Mark Johnston
ba71333f60 sched_ule: Fix a typo in a comment
PR:		226107
MFC after:	1 week
2022-07-11 15:58:43 -04:00
Mark Johnston
ef80894c9d sched_ule: Purge an obsolete comment
The referenced bitmask was removed in commit 62fa74d95a.

MFC after:	 1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
35dd6d6cb5 sched_ule: Eliminate a superfluous local variable in tdq_move()
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
John Baldwin
4aec198420 sched_ule: Inline value of ts in sched_thread_priority.
This avoids a set-but-unused warning in kernels without SMP, where
TDQ_CPU() doesn't use its argument.
2022-04-13 16:08:23 -07:00
Gordon Bergling
15b5c347f1 sched_ule(4): Fix two typos in source code comments
- s/conditons/conditions/
- s/unconditonally/unconditionally/

MFC after:	3 days
2021-11-19 19:13:28 +01:00
Kyle Evans
6a8ea6d174 sched: split sched_ap_entry() out of sched_throw()
sched_throw() can no longer take a NULL thread; APs enter through
sched_ap_entry() instead.  This completely removes branching in the
common case and cleans up both paths.  No functional change intended.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D32829
2021-11-05 15:45:51 -05:00
Kyle Evans
589aed00e3 sched: separate out schedinit_ap()
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).

Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
set up in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:

- Setup curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler

cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because of
the violated assumption noted above, leading to locking heartburn in
sched_setpreempt().

Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.
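
The adjusted bring-up then looks something like (a sketch of
init_secondary() on a typical platform):

    schedinit_ap();         /* curthread's td_lock is now correct */
    cpu_initclocks_ap();    /* callouts may now sched_add() safely */
    /* ... signal smp_started ... */
    sched_throw(NULL);      /* enter the scheduler; does not return */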

Reviewed by:	alfredo, kib, markj
Triage help from:	andrew, markj
Smoke-tested by:	alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision:	https://reviews.freebsd.org/D32797
2021-11-03 15:54:59 -05:00
Alexander Motin
1c119e173d sched_ule(4): Fix possible significance loss.
Before this change, a kern.sched.interact sysctl setting above 32 gave
all interactive threads the identical priority PRI_MIN_INTERACT, due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero.  Setting the sysctl lower instead reduced the range of used
priority levels by up to half, which is not great either.

Changing the order of operations fixes the issue, always using the full
range of priorities, while overflow is impossible there since both the
score and priority values are small.  While there, make the variables
unsigned, as they really are.
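
A sketch of the reordered integer math:

    /* Old order: score * ((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) /
     *     sched_interact) -- the divide truncates to 0 whenever
     *     sched_interact exceeds the range size, collapsing all scores.
     * New order: multiply first, divide last. */
    pri = PRI_MIN_INTERACT +
        score * (PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact;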

MFC after:	1 month
2021-10-02 00:09:45 -04:00
Alexander Motin
08063e9f98 sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be caused an infinite loop with interrupts disabled in the load
stealing code if steal_thresh was set below 2.  Such a configuration
should not generally be used, but it appears some people use it to work
around other problems.

To fix the problem, explicitly pass to sched_highest() the minimum
number of transferable threads supported by the caller, instead of
guessing.

MFC after:	25 days
2021-09-26 12:03:05 -04:00
Alexander Motin
ef50d5fbc3 x86: Add NUMA nodes into CPU topology.
Depending on hardware, NUMA nodes may match last-level caches, or they
may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).  This
information is provided by ACPI instead of CPUID, and it is provided
for each CPU individually rather than as mask widths, but this code
should be able to properly handle all of the above cases.

This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs over remote ones when the node
does not match the LLC.  Later we may think about how to handle it
better on the sched_pickcpu() side.

MFC after:	1 month
2021-09-23 14:31:38 -04:00
Alexander Motin
8db1669959 Fix build without SMP.
MFC after:	1 month
2021-09-21 22:13:33 -04:00
Alexander Motin
e745d729be sched_ule(4): Improve long-term load balancer.
Before this change the long-term load balancer was unable to migrate
running threads, only ones waiting on run queues.  But with the growing
number of CPU cores it is now quite typical for a system to not have
many waiting threads.  At the same time, if by some coincidence two
long-running CPU-bound threads end up sharing the same physical CPU
core, they can suffer the SMT penalty indefinitely, and the load
balancer cannot help.

Improve that by teaching the load balancer to hint running threads to
migrate, marking them with TDF_NEEDRESCHED and the new TDF_PICKCPU
flag and making sched_pickcpu() search for a better CPU later, when it
is convenient.
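
In sketch form (with the thread lock held; the exact marking site is an
assumption):

    /* Hint a running thread to pick a new CPU at its next reschedule. */
    td->td_flags |= TDF_NEEDRESCHED | TDF_PICKCPU;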

Also fix the CPU search logic when balancing, to limit round-robin
migrations in the case of almost equal load across a group of physical
cores.  The previous code bounced threads across the whole system,
which should be pretty bad for caches and NUMA affinity, while the
additional fairness was almost invisible, diminishing with the number
of cores in the group.

MFC after:	1 month
2021-09-21 18:19:20 -04:00
Alexander Motin
bd84094a51 sched_ule(4): Fix interactive threads stealing.
In scenarios where the first thread in the queue can migrate to the
specified CPU but later ones cannot, runq_steal_from() incorrectly
returned NULL.

MFC after:	2 weeks
2021-09-21 16:03:32 -04:00
Alexander Motin
ca34553b6f sched_ule(4): Pre-seed sched_random().
I don't think it changes anything, but why not.

While there, make cpu_search_highest() use all 8 lower load bits for
noise, since it does not use cs_prefer and the code is not shared
with cpu_search_lowest() any more.

MFC after:	1 month
2021-08-02 10:55:28 -04:00
Alexander Motin
8bb173fb5b sched_ule(4): Use trylock when stealing load.
On some load patterns it is possible for several CPUs to try to steal a
thread from the same CPU despite the randomization introduced.  That
can cause significant lock contention when an idle thread holding one
queue lock spins to acquire another.  Using trylock on the remote queue
both reduces the contention and makes the lock ordering easier to
handle.  If we can't get the lock inside tdq_trysteal() we just return,
allowing tdq_idled() to handle it.  If it happens in tdq_idled(), we
repeat the search for load, skipping this CPU.
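
The pattern is roughly (a sketch; lock-macro names assumed):

    /* Don't spin on the victim queue's lock from the idle path; back
     * off and let the caller search again. */
    if (mtx_trylock_spin(TDQ_LOCKPTR(steal)) == 0)
        return;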

On a 2-socket 80-thread Xeon system I am observing a dramatic reduction
of lock spinning time when doing random uncached 4KB reads from 12
ZVOLs, while IOPS increase from 327K to 403K.

MFC after:	1 month
2021-08-01 22:42:01 -04:00
Alexander Motin
2668bb2add sched_ule(4): Reduce duplicate search for load.
When sched_highest() called for some CPU group returns nothing, the
idle thread calls it for the parent CPU group.  But the parent CPU
group also includes the group we've just searched, so unless there is a
race going on, it is unlikely we will find anything new this time.

Avoid the double search when the parent group has only two subgroups
(the most prominent case).  Instead of escalating to the parent group,
run the next search over the sibling subgroup, and escalate two levels
up only if that fails too.  With more than two siblings the difference
is less significant, while searching the parent group can result in a
better decision if we find several candidate CPUs.

On a 2-socket 40-core Xeon system I am measuring a ~25% reduction of
CPU time spent inside cpu_search_highest() in both SMT (2x20x2) and
non-SMT (2x20) cases.

MFC after:	1 month
2021-08-01 22:07:51 -04:00
Dmitry Chagin
af29f39958 umtx: Split umtx.h on two counterparts.
To prevent umtx.h from being polluted by future changes, split it into
two headers:
umtx.h - the ABI header for userspace;
umtxvar.h - the kernel stuff.

While here, fix umtx_key_match style.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31248
MFC after:		2 weeks
2021-07-29 12:41:29 +03:00
Alexander Motin
aefe0a8c32 Refactor/optimize cpu_search_*().
Remove cpu_search_both(), unused for many years.  Without it there is
little sense in the trick of compiling the common cpu_search() into
separate cpu_search_lowest() and cpu_search_highest(), so split them
completely, making the code more readable.  While there, separate the
iteration over child groups from the iteration over CPUs; sharing it
complicated the code for very little deduplication.

Stop passing cpuset_t arguments by value and avoid some manipulations.
Since the MAXCPU bump from 64 to 256, what used to be a single register
has turned into a 32-byte memory array, requiring memory allocation and
accesses.  Splitting struct cpu_search into parameter and result parts
reduces stack usage even further, since the former can be passed
through unchanged on recursion.

Remove CPU_FFS() from the hot paths by precalculating the first and
last CPU of each CPU group in advance during initialization.  Again,
this was not a problem with 64 CPUs, but with 256 FFS needs much more
code.
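
For example, the per-group bounds can be precomputed once at topology
initialization (field names and CPU_FLS() availability are assumptions):

    /* Remember the group's first and last CPU so hot paths need no
     * CPU_FFS() calls. */
    cg->cg_first = CPU_FFS(&cg->cg_mask) - 1;
    cg->cg_last = CPU_FLS(&cg->cg_mask) - 1;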

With these changes, on an 80-thread system doing ~260K uncached ZFS
reads per second, I observe a ~30% reduction of time spent in
cpu_search_*().

MFC after:	1 month
2021-07-28 22:00:29 -04:00
wiklam
43521b46fc Correcting comment about "sched_interact_score".
Reviewed by:	jrtc@, imp@
Pull Request:	https://github.com/freebsd/freebsd-src/pull/431
Sponsored by:	Netflix
2021-06-02 21:50:57 -06:00
Mateusz Guzik
b77594bbbf sched: fix an incorrect comparison in sched_lend_user_prio_cond
Compare with sched_lend_user_prio.
2020-11-15 01:54:44 +00:00
Pawel Biernacki
b05ca4290c sys/: Document a few more sysctls.
Submitted by:	Antranig Vartanian <antranigv@freebsd.am>
Reviewed by:	kaktus
Commented by:	jhb
Approved by:	kib (mentor)
Sponsored by:	illuria security
Differential Revision:	https://reviews.freebsd.org/D23759
2020-03-02 15:30:52 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that
are still not MPSAFE (or already are but aren't properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Mark Johnston
e489450589 Fix the !SMP case in sched_add() after r355779.
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23492
2020-02-03 22:49:05 +00:00
Mateusz Guzik
3ff65f71cb Remove duplicated empty lines from kern/*.c
No functional changes.
2020-01-30 20:05:05 +00:00
Mark Johnston
a89c2c8c34 Revert r357050.
It seems to have introduced a couple of regressions.

Reported by:	cy, pho
2020-01-24 14:58:02 +00:00
Mark Johnston
1bfca40c57 Set td_oncpu before dropping the thread lock during a switch.
After r355784 we no longer hold a thread's thread lock when switching it
out.  Preserve the previous synchronization protocol for td_oncpu by
setting it together with td_state, before dropping the thread lock
during a switch.

Reported and tested by:	pho
Reviewed by:	kib
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D23270
2020-01-23 16:24:51 +00:00
Jeff Roberson
1eb13fce84 Block the thread lock in sched_throw() and use cpu_switch() to unblock
it.  The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.

Reported/Tested by:	rlibby
Discussed with:	rlibby, jhb
2020-01-23 03:36:50 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests
2020-01-12 06:07:54 +00:00
Jeff Roberson
d8d5f03610 Fix a bug in r355784. I missed a sched_add() call that needed to reacquire
the thread lock.

Reported by:	mjg
2019-12-19 18:22:11 +00:00
Jeff Roberson
686bcb5c14 schedlock 4/4
Don't hold the scheduler lock while doing context switches.  Instead we
unlock after selecting the new thread and switch within a spinlock
section, leaving interrupts and preemption disabled to prevent local
concurrency.  This means that mi_switch() is entered with the thread
locked but returns without it.  This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on a
blocked lock in switch.
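
The resulting contract, as a sketch:

    thread_lock(td);
    mi_switch(SW_VOL | SWT_RELINQUISH);  /* consumes the thread lock */
    /* Resumed here later; the thread lock is no longer held. */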

This change has not been made to 4BSD, but in principle it would be
more straightforward there.

Discussed with:	markj
Reviewed by:	kib
Tested by:	pho
Differential Revision: https://reviews.freebsd.org/D22778
2019-12-15 21:26:50 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads and reduces the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00
Ryan Libby
9825eadf2c bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too.  The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.
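
Semantics in sketch form (two-argument form of that era assumed):

    /* dst = a & ~b: "A but not B". */
    CPU_COPY(&a, &dst);
    CPU_ANDNOT(&dst, &b);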

Discussed with:	jeff
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22791
2019-12-13 09:32:16 +00:00
Mark Johnston
7789ab32b3 Rename tdq_ipipending and clear it in sched_switch().
This fixes a regression after r355311.  Specifically, sched_preempt()
may trigger a context switch by calling thread_lock(), since
thread_lock() calls critical_exit() in its slow path and the interrupted
thread may have already been marked for preemption.  This would happen
before tdq_ipipending is cleared, blocking further preemption IPIs.  The
CPU can be left in this state indefinitely if the interrupted thread
migrates.

Rename tdq_ipipending to tdq_owepreempt.  Any switch satisfies a remote
preemption request, so clear tdq_owepreempt in sched_switch() instead of
sched_preempt() to avoid subtle problems of the sort described above.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22758
2019-12-12 02:43:24 +00:00
Jeff Roberson
c3cccf95bf Handle multiple clock interrupts simultaneously in sched_clock().
Reviewed by:	kib, markj, mav
Differential Revision:	https://reviews.freebsd.org/D22625
2019-12-08 01:17:38 +00:00
Alexander Motin
61322a0a8a Mark some more hot global variables with __read_mostly.
MFC after:	1 week
2019-12-04 21:26:03 +00:00
Jeff Roberson
e15046952d Initialize the idle thread's lock sooner so it's not evaluated on every fork
exit and we can rely on it elsewhere.

Reviewed by:	mav, kib, jhb, markj
Differential Revision:	https://reviews.freebsd.org/D22624
2019-12-02 22:35:45 +00:00
Alexander Motin
176dd236dc Microoptimize sched_pickcpu() CPU affinity on SMT.
Use of CPU_FFS() to implement CPUSET_FOREACH() saves up to ~0.5% of CPU
time on a 72-thread SMT system doing 80K IOPS to NVMe from one thread.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-26 00:35:06 +00:00
Alexander Motin
c55dc51c37 Microoptimize sched_pickcpu() after r352658.
I noticed that I missed an intr check at one more SCHED_AFFINITY() use,
so instead of adding one more branch I prefer to remove a few.

The profiler shows the function's CPU time reduction from 0.24% to 0.16%.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-25 19:29:09 +00:00
Alexander Motin
bb3dfc6ae9 Fix wrong assertion in r352658.
MFC after:	1 month
2019-09-25 11:58:54 +00:00
Alexander Motin
c9205e3500 Fix/improve interrupt threads scheduling.
Doing some tests with very high interrupt rates I've noticed that one
of the conditions I added in r232207 to make interrupt threads in most
cases run on the local CPU never worked as expected (it worked only if
the thread was previously executed on some other CPU, which is quite
the opposite).  It caused additional CPU usage for a full CPU search
and could schedule interrupt threads to some other CPU.

This patch removes that code and instead reuses the existing
non-interrupt code path with some tweaks for the interrupt case:
 - On SMT systems, if the current thread is idle, don't look at other
hardware threads.  Even if they are busy, it may take more time to do a
full search and bounce the interrupt thread to another core than to
execute it locally, even while sharing CPU resources.  It is the other
threads that should migrate, not the bound interrupts.
 - Try hard to keep interrupt threads within the LLC of their original
CPU.  This reduces scheduling cost and supposedly improves cache and
memory locality.

On a test system with 72 threads doing 2.2M IOPS to NVMe this saves a
few percent of CPU time while adding a few percent to IOPS.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-24 20:01:20 +00:00