Other minor optimizations. I measured a ~30% speedup in strcoll() for 50-char
strings, a ~40% speedup for 100-char strings, and no measurable speedup for 1M strings.
Collation is still terribly slow. To make it reasonably fast,
__collate_substitute() should be killed.