In my eagerness to eliminate a branch which is taken once per 2^38
bytes of keystream, I forgot that the state words are in host order.
Thus, the counter increment code worked fine on little-endian
machines, but not on big-endian ones. Switch to a simpler (branchful)
solution.