Instead of spuriously re-reading the lock value, read it once.
This change also has a side effect of fixing a performance bug:
on failed _mtx_obtain_lock, it was possible that re-read would find
the lock is unowned, but in this case the primitive would make a trip
through turnstile code.
This is diff reduction to a variant which uses atomic_fcmpset.
Discussed with: jhb (previous version)
Tested by: pho (previous version)