The previous code neglected to use primitives which can find the end
of the string without having to branch on every character.
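
For illustration only, a minimal sketch of the general idea, not the
code from this commit: scan one 64-bit word at a time, build a mask of
zero bytes, and locate the first one with a count-trailing-zeros
primitive. The name sketch_strlen is hypothetical. The sketch assumes a
little-endian machine and the GCC/Clang builtin __builtin_ctzll.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration; assumes little-endian and GCC/Clang. */
static size_t
sketch_strlen(const char *str)
{
	const uint64_t ones = 0x0101010101010101ULL;
	const uint64_t highs = 0x8080808080808080ULL;
	const char *p = str;
	uint64_t word, mask;

	/* Go byte by byte until p is 8-byte aligned. */
	while (((uintptr_t)p & 7) != 0) {
		if (*p == '\0')
			return ((size_t)(p - str));
		p++;
	}

	/*
	 * An aligned 8-byte load never crosses a page boundary, but it
	 * may read past the terminator. That is outside strict ISO C;
	 * real implementations rely on platform guarantees for this.
	 */
	for (;;) {
		memcpy(&word, p, sizeof(word));
		/* The lowest set bit of mask marks the first zero byte. */
		mask = (word - ones) & ~word & highs;
		if (mask != 0)
			break;
		p += sizeof(word);
	}

	/* One bit scan instead of a branch per remaining character. */
	return ((size_t)(p - str) + (size_t)(__builtin_ctzll(mask) / 8));
}
```

Bits above the first zero byte in the mask can be false positives
caused by borrow propagation. Counting from the low end ignores them,
which is why the sketch is limited to little-endian.
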
While here, augment the somewhat misleading commentary -- strlen as
implemented here leaves performance on the table, especially in
userspace. Every arch should get a dedicated variant instead.
In the meantime, this commit lessens the problem.
Tested with the glibc test suite.
Naive test just calling strlen in a loop on Haswell (ops/s):
$(perl -e "print 'A' x 3"):
    before: 211198039
    after:  338626619
$(perl -e "print 'A' x 100"):
    before:  83151997
    after:   98285919
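
The benchmark program itself is not part of this message. A rough
sketch of what such a naive loop might look like follows. The helper
name strlen_ops_per_sec and the use of clock_gettime are assumptions
for illustration, not the harness that produced the numbers above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper; not the benchmark used for this commit. */
static double
strlen_ops_per_sec(const char *s, unsigned long iters)
{
	/*
	 * Calling through a volatile pointer keeps the compiler from
	 * folding the call or hoisting it out of the loop.
	 */
	size_t (*volatile fn)(const char *) = strlen;
	volatile size_t sink = 0;
	struct timespec t0, t1;
	unsigned long i;
	double secs;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		sink += fn(s);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return (secs > 0.0 ? (double)iters / secs : 0.0);
}
```

The string would be passed in as an argument, for example the output of
the perl one-liners above, and the result printed as ops/s. The figure
includes loop and indirect-call overhead, so it is only useful for
before/after comparison on the same machine.
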