implementations inspired by the ones in DragonFly. Unlike the
DragonFly versions, these have a small data cache footprint, and my
tests show that they're never slower than the old code except when the
charset or the span is 0 or 1 characters. This implementation is
generally faster than DragonFly until either the charset or the span
gets in the ballpark of 32 to 64 characters.