Specifically, a critical_exit() call that drops the nesting level to zero
has a brief window where the pending preemption flag is set and the
nesting level is set to zero. This is done purposefully to avoid races
where a preemption scheduled by an interrupt could be lost otherwise (see
revision 144777). However, this does mean that if an interrupt fires
during this window and enters and exits a critical section, it may preempt
from the interrupt context. This is generally fine as the interrupt code
is careful to arrange critical sections so that they are not exited until
it is safe to preempt (e.g. interrupts EOI'd and masked if necessary).
However, the SMP rendezvous IPI handler does not quite follow this rule,
and in general a rendezvous can never be preempted. Rendezvous handlers
are also not permitted to schedule threads to execute, so they will not
typically trigger preemptions. SMP rendezvous handlers may use
spinlocks (carefully) such as the rm_cleanIPI() handler used in rmlocks,
but using a spinlock also enters and exits a critical section. If the
interrupted top-half code is in the brief window of critical_exit() where
the nesting level is zero but a preemption is pending, then releasing the
spinlock can trigger a preemption. Because we know that SMP rendezvous
handlers can never schedule a thread, we know that a critical_exit() in
an SMP rendezvous handler will only preempt in this edge case. We also
know that the top-half thread will happily handle the deferred preemption
once the SMP rendezvous has completed, so the preemption will not be lost.
This makes it safe to employ a workaround where we use a nested critical
section in the SMP rendezvous code itself around rendezvous action
routines to prevent any preemptions during an SMP rendezvous. The
workaround intentionally avoids checking for a deferred preemption
when leaving the critical section on the assumption that if there is a
pending preemption it will be handled by the interrupted top-half code.
Submitted by: mlaier (variation specific to rm_cleanIPI())
Obtained from: Isilon
MFC after: 1 week
last CPU to to finish the rendezvous action may become visible to
different CPUs at different times. As a result, the CPU that initiated
the rendezvous may exit the rendezvous and drop the lock allowing another
rendezvous to be initiated on the same CPU or a different CPU. In that
case the exit sentinel may be cleared before all CPUs have noticed causing
those CPUs to hang forever.
Workaround this by using a generation count to notice when this race
occurs and to exit the rendezvous in that case.
The problem was independently diagnosted by mlaier@ and avg@ as well.
Submitted by: neel
Reviewed by: avg, mlaier
Obtained from: NetApp
MFC after: 1 week
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
This is based on the same approach as used in panic().
In theory parallel execution of generic_stop_cpus() could lead to two CPUs
stopping each other and everyone else, and thus a total system halt.
Also, in theory, we should have some smarter locking here, because two
(or more CPUs) could be stopping unrelated sets of CPUs.
But in practice, it seems, this function is only used to stop
"all other" CPUs.
Additionally, I took this opportunity to make amd64-specific suspend_cpus()
function use generic_stop_cpus() instead of rolling out essentially
duplicate code.
This code is based on code by Sandvine Incorporated.
Suggested by: mdf
Reviewed by: jhb, jkim (earlier version)
MFC after: 2 weeks
number of CPUs detection.
However, that was not mention at all, the problem was not reported, the
patch has not been MFCed and the fix is mostly improper.
Fix the original overflow (caused when 32 CPUs must be detected) by
just using a different mathematical computation (it also makes more
explicit the size of operands involved, which is good in the moment
waiting for a more complete support for a large number of CPUs).
PR: kern/148698
Submitted by: Joe Landers <jlanders at vmware dot com>
Tested by: gianni
MFC after: 10 days
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
am now able to run 32 cores ok.. but I still will hang
on buildworld with a NFS problem. I suspect I am missing
a patch for the netlogic rge driver.
JC check and see if I am missing anything except your
core-mask changes
Obtained from: JC
has proven to have a good effect when entering KDB by using a NMI,
but it completely violates all the good rules about interrupts
disabled while holding a spinlock in other occasions. This can be the
cause of deadlocks on events where a normal IPI_STOP is expected.
* Adds an new IPI called IPI_STOP_HARD on all the supported architectures.
This IPI is responsible for sending a stop message among CPUs using a
privileged channel when disponible. In other cases it just does match a
normal IPI_STOP.
Right now the IPI_STOP_HARD functionality uses a NMI on ia32 and amd64
architectures, while on the other has a normal IPI_STOP effect. It is
responsibility of maintainers to eventually implement an hard stop
when necessary and possible.
* Use the new IPI facility in order to implement a new userend SMP kernel
function called stop_cpus_hard(). That is specular to stop_cpu() but
it does use the privileged channel for the stopping facility.
* Let KDB use the newly introduced function stop_cpus_hard() and leave
stop_cpus() for all the other cases
* Disable interrupts on CPU0 when starting the process of APs suspension.
* Style cleanup and comments adding
This patch should fix the reboot/shutdown deadlocks many users are
constantly reporting on mailing lists.
Please don't forget to update your config file with the STOP_NMI
option removal
Reviewed by: jhb
Tested by: pho, bz, rink
Approved by: re (kib)
and it only optimized out an ipi or mwait in very few cases.
- Skip the adaptive idle code when running on SMT or HTT cores. This
just wastes cpu time that could be used on a busy thread on the same
core.
- Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
CG_FLAG_THREAD to mean SMT or HTT.
Sponsored by: Nokia
This code is heavily inspired by Takanori Watanabe's experimental SMP patch
for i386 and large portion was shamelessly cut and pasted from Peter Wemm's
AP boot code.
doing it on every CPU.
- Use CPU_ABSENT() rather than pcpu_find() to determine if a CPU is not
present.
- Count up to mp_maxid rather than MAXCPU when iterating over CPUs to
match the rest of the code in the kernel.
MFC after: 1 week
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a seperate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.
Sponsored by: Nokia
lock optimized for almost exclusive reader access. (see also rmlock.9)
TODO:
Convert to per cpu variables linkerset as soon as it is available.
Optimize UP (single processor) case.
topology foo functions.
Working at the patch for topology problems in ia32/amd64 evicted some
problems regarding functions ordering in the SI_SUB_CPU family of
SYSINIT'ed subsystems.
In order to avoid problems with new modified to involved functions, a
correct ordering is not semantically specified for SI_SUB_CPU functions
(for a larger view of the issue please visit:
http://lists.freebsd.org/pipermail/freebsd-current/2007-July/075409.html )
Discussed with: peter
Tested by: kris, Rui Paulo <rpaulo@FreeBSD.org>
Approved by: jeff
Approved by: re
- Use cpu_spinwait() in the spin loops in stop_cpus(), restart_cpus(), and
smp_rendezvous_action().
- Remove unneeded acq memory barriers in stop_cpus(), restart_cpus(), and
smp_rendezvous_action().
- Add an additional synch point in smp_rendezvous() to ensure that all the
CPUs will always see an up-to-date value of smp_rv_setup_func.
Reviewed by: attilio
Approved by: re (kensmith)
Tested on: alpha, amd64, i386, sparc64 SMP (for several years)
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permenent and never chenges for the life of the
system so it doesn't need locking.
IPI_STOP IPIs.
- Change the i386 and amd64 MD IPI code to send an NMI if STOP_NMI is
enabled if an attempt is made to send an IPI_STOP IPI. If the kernel
option is enabled, there is also a sysctl to change the behavior at
runtime (debug.stop_cpus_with_nmi which defaults to enabled). This
includes removing stop_cpus_nmi() and making ipi_nmi_selected() a
private function for i386 and amd64.
- Fix ipi_all(), ipi_all_but_self(), and ipi_self() on i386 and amd64 to
properly handle bitmapped IPIs as well as IPI_STOP IPIs when STOP_NMI is
enabled.
- Fix ipi_nmi_handler() to execute the restart function on the first CPU
that is restarted making use of atomic_readandclear() rather than
assuming that the BSP is always included in the set of restarted CPUs.
Also, the NMI handler didn't clear the function pointer meaning that
subsequent stop and restarts could execute the function again.
- Define a new macro HAVE_STOPPEDPCBS on i386 and amd64 to control the use
of stoppedpcbs[] and always enable it for i386 and amd64 instead of
being dependent on KDB_STOP_NMI. It works fine in both the NMI and
non-NMI cases.
a regular IPI vector, but this vector is blocked when interrupts are disabled.
With "options KDB_STOP_NMI" and debug.kdb.stop_cpus_with_nmi set, KDB will
send an NMI to each CPU instead. The code also has a context-stuffing
feature which helps ddb extract the state of processes running on the
stopped CPUs.
KDB_STOP_NMI is only useful with SMP and complains if SMP is not defined.
This feature only applies to i386 and amd64 at the moment, but could be
used on other architectures with the appropriate MD bits.
Submitted by: ups
We were obtaining different spin mutexes (which disable interrupts after
aquisition) and spin waiting for delivery. For example, KSE processes
do LDT operations which use smp_rendezvous, while other parts of the
system are doing things like tlb shootdowns with a different mutex.
This patch uses the common smp_rendezvous mutex for all MD home-grown
IPIs that spinwait for delivery. Having the single mutex means that
the spinloop to aquire it will enable interrupts periodically, thus
avoiding the cross-ipi deadlock.
Obtained from: dwhite, alc
Reviewed by: jhb
long as there are still explicit uses of int, whether in types or
in function names (such as atomic_set_int() in sched_ule.c), we can
not change cpumask_t to be anything other than u_int. See also the
commit log for sys/sys/types.h, revision 1.84.