idle_td is dereferenced without thread-locking it to make its contents is
invariant, and was accessed without telling the compiler that its contents
is invariant. Some compilers optimized accesses to the supposedly invariant
contents by moving the critical checks for changes outside of the loop that
waits for changes. Fix this using atomic ops.
This bug only showed up for the following configuration: a Turion2
system, amd64 kernels, compiled by gcc, and SCHED_4BSD. clang fails
to do the optimization with all CFLAGS that I tried, because it doesn't
fully optimize the '__asm __volatile' for cpu_spinwait() although this
asm has no memory clobber. gcc only does the optimization with most
CFLAGS. I mostly used -Os with all compilers. i386 works because gcc
-m32 -Os only moves 1 or the 2 accesses outside of the loop.
Non-Turion2 systems and SCHED_ULE worked due to different timing (when
all APs start before the BP checks them outside of the loop).
Reviewed by: kib