critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any affect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.
Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.
This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).
Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
in the window between the beginning of panic() and entering the debugger,
it's possible to receive interrupts. If we receive an interrupt, don't
preempt if panicstr != NULL, as the system is in the process of failing, and
the preempting thread is likely to stumble over the failure. The typical
scenario is during the printf() in panic() prior to entering the debugger,
but when running with a slower console type such as serial console.
It could be that the panic string should be passed to the debugger to print,
so that it can run from the debugger's environment rather than a regular
kernel printf.
Glanced at by: jhb
This generates a KTR event for each critical section entered and exited.
It would be desirable to also log the filename and line number of the
source entering or exiting the critical section, but this requires
hacking up the critical section API, so I've not done that yet.
critical_exit as the process is getting scheduled to run. This is subotimal
but for now avoid the LOR between the scheduler and the sleepq systems.
This is a 5.3 candidate.
Submitted by: davidxu
MFC After: 3 days
Implement preemption between threads in the same ksegp in out of slot
situations to prevent priority inversion.
Tested by: pho
Reviewed by: jhb, julian
Approved by: sam (mentor)
MFC: ASAP
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called. Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code. I have retained
vm_proc_dispose(), since I consider its disuse a bug.
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c). This
is a possible MT5 candidate.
is an effective band-aid for at least some of the scheduler corruption seen
recently. The real fix will involve protecting threads while they are
inconsistent, and will come later.
Submitted by: julian
is here so that we can gather stats on the nature of the recent rash of
hard lockups, and in this particular case panic the machine instead of
letting it deadlock forever.
dereference curthread. It is called only from critical_{enter,exit}(),
which already dereferences curthread. This doesn't seem to affect SMP
performance in my benchmarks, but improves MySQL transaction throughput
by about 1% on UP on my Xeon.
Head nodding: jhb, bmilekic
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).
o Make debugging code conditional upon KDB instead of DDB.
o Call kdb_enter() instead of Debugger().
o Call kdb_backtrace() instead of db_print_backtrace() or backtrace().
kern_mutex.c:
o Replace checks for db_active with checks for kdb_active and make
them unconditional.
kern_shutdown.c:
o s/DDB_UNATTENDED/KDB_UNATTENDED/g
o s/DDB_TRACE/KDB_TRACE/g
o Save the TID of the thread doing the kernel dump so the debugger
knows which thread to select as the current when debugging the
kernel core file.
o Clear kdb_active instead of db_active and do so unconditionally.
o Remove backtrace() implementation.
kern_synch.c:
o Call kdb_reenter() instead of db_error().
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switch to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.
This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.
Approved by: scottl (with his re@ hat)