f0ddecd745
Borrow the trick from memset and memmove and use scale/index/base addressing to avoid branches. If a mismatch is found, the routine has to calculate the difference; make sure there are never more than 8 bytes to inspect when it does. This replaces the previous loop, which would operate over up to 16 bytes, with an unrolled list of 8 tests. Speed varies a lot, but this is a net win over the previous routine, with probably a lot more to gain. Validated with the glibc test suite.
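
For illustration only, the idea can be sketched in C roughly as follows. This is not the committed assembly: the names cmp_8_to_16 and diff8 are hypothetical, and the 8 to 16 byte size range is an assumption chosen to keep the example short. Instead of branching on the exact length, it loads the first 8 bytes and the last 8 bytes (at offset n - 8), letting the two windows overlap; in assembly the tail address can be formed by a single base + index + displacement operand. Only when a window differs does it walk at most 8 bytes to compute the difference.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: the 8-byte windows at a and b are known to differ.
 * Unrolled list of 8 tests; returns the difference of the first
 * mismatching byte pair.
 */
static int diff8(const unsigned char *a, const unsigned char *b)
{
    if (a[0] != b[0]) return a[0] - b[0];
    if (a[1] != b[1]) return a[1] - b[1];
    if (a[2] != b[2]) return a[2] - b[2];
    if (a[3] != b[3]) return a[3] - b[3];
    if (a[4] != b[4]) return a[4] - b[4];
    if (a[5] != b[5]) return a[5] - b[5];
    if (a[6] != b[6]) return a[6] - b[6];
    return a[7] - b[7];
}

/*
 * Sketch: compare n bytes where 8 <= n <= 16 (assumed range).
 * No branch on the exact value of n: the tail window starts at n - 8
 * and may overlap the head window.
 */
static int cmp_8_to_16(const void *p, const void *q, size_t n)
{
    const unsigned char *a = p;
    const unsigned char *b = q;
    uint64_t x, y;

    /* Head window: bytes 0..7. memcpy keeps the unaligned load well defined. */
    memcpy(&x, a, 8);
    memcpy(&y, b, 8);
    if (x != y)
        return diff8(a, b);

    /* Tail window: bytes n-8..n-1. Any overlap with the head is already equal,
     * so the first mismatch here is the first mismatch overall. */
    memcpy(&x, a + n - 8, 8);
    memcpy(&y, b + n - 8, 8);
    if (x != y)
        return diff8(a + n - 8, b + n - 8);

    return 0;
}
```

The mismatch path is deliberately kept out of the fast path: equal buffers cost two 8-byte compares regardless of n, and the byte-by-byte work is bounded at 8 tests.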