* Avoid choosing an arena until it's certain that an arena is needed
for allocation.
* Convert division/multiplication to bitshifting where possible.
* Avoid accessing TLS variables in single-threaded code.
* Reduce the amount of pointer dereferencing.
* Move lock acquisition in critical paths to only protect the the code
that requires synchronization, and completely remove locking where
possible.