mode.
Don't manually unroll the 2 inner loops. On Haswell, doing so gave a
speedup of about 0.5% (about 4 cycles per iteration out of 1400), but
hard-coded a limit of width 9 and made better better optimizations
harder to see. gcc-4.2.1 -O does the unrolling anyway, unless tricked
with a volatile hack. gcc's unrolling is not very good and gives a
a speedup of about half as much (about 2 cycles per iteration). (All
timing on i386.)
Manual unrolling was only feasible because the inner loop only iterates
once or twice. Usually twice, but a dynamic check is needed to decide,
and was not moved from the second-innermost loop manually or by gcc.
This commit basically adds another dynamic check in the inner loop.
Cursor widths of 10-17 require 3 iterations in the inner loop and this
is not so easy to unroll -- even gcc stops at 2.