fully initialed when the pmap layer tries to call sched_pini() early in the
boot and results in an quick panic. Use ke_pinned instead as was originally
done with Tor's patch.
Approved by: julian
scheduler specific extension to it. Put it in the extension as
the implimentation details of how the pinning is done needn't be visible
outside the scheduler.
Submitted by: tegge (of course!) (with changes)
MFC after: 3 days
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theorotically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happenned.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)
Reviewed by: peter
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).
sched_add() rather than just doing it in sched_wakeup(). The old
ithread preemption code used to set NEEDRESCHED unconditionally if it
didn't preempt which masked this bug in SCHED_4BSD.
Noticed by: jake
Reported by: kensmith, marcel
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switch to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.
This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.
Approved by: scottl (with his re@ hat)
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to choose to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
sense with sched_4bsd as it does with sched_ule.
- Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
not a thread should be accounted for in sched_tdcnt.
of sched_load(). This variable tracks the number of running and runnable
non ithd threads. This removes the need to traverse the proc table and
discover how many threads are runnable.
sched_cpu() locks an sx lock (allproc_lock) which can sleep if it fails to
acquire the lock, it is not safe to execute this in a callout handler from
softclock().
in the log message for kern_sched.c 1.83 (which should have been
repo-copied to preserve history for this file), the (4BSD) scheduler
algorithm only works right if stathz is nearly 128 Hz. The old
commit lock said 64 Hz; the scheduler actually wants nearly 16 Hz
but there was a scale factor of 4 to give the requirement of 64 Hz,
and rev.1.83 changed the scale factor so that the requirement became
128 Hz. The change of the scale factor was incomplete in the SMP
case. Then scheduling ticks are provided by smp_ncpu CPUs, and the
scheduler cannot tell the difference between this and 1 CPU providing
scheduling ticks smp_ncpu times faster, so we need another scale
factor of smp_ncp or an algorithm change.
This quick fix uses the scale factor without even trying to optimize
the runtime divisions required for this as is done for the other
scale factor.
The main algorithmic problem is the clamp on the scheduling tick counts.
This was 295; it is now approximately 295 * smp_ncpu. When the limit
is reached, threads get free timeslices and scheduling becomes very
unfair to the threads that don't hit the limit. The limit can be
reached and maintained in the worst case if the load average is larger
than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with
2 CPUs before this change), so there are algorithmic problems even for
a load average of 1. Fortunately, the worst case isn't common enough
for the problem to be very noticeable (it is mainly for niced CPU hogs
competing with less nice CPU hogs).
since there is no direct association between M:N thread and kse,
sometimes, a thread does not have a kse, in that case, return a pctcpu
from its last kse, it is not perfect, but gives a good number to be
displayed.
begin with sched_lock held but not recursed, so this variable was
always 0.
Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch(). Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up. Only
sched_lock.mtx_lock really needs a fixup.
Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up. This
assertion much match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.
- Update some stale comments.
- Sort a couple of includes.
- Only set 'newcpu' in updatepri() if we use it.
- No functional changes.
Obtained from: bde (via an old diff I got a long time ago)