blog posting [1].
- Use a word-sized test on an unaligned pointer before falling
back to the byte-by-byte path.
A memory page boundary is always an integral multiple of the
word alignment boundary. Therefore, if we can access the memory
referenced by pointer p, then (p & ~word mask) must also be
accessible.
- Better utilization of the concurrency available on multi-issue
processors.
The previous implementation utilized a formula that must be
executed sequentially. However, the ~, & and - operations can
actually be calculated at the same time when their operands are
different and unrelated.
The original Hacker's Delight formula also offers consistent
performance regardless of whether the input contains
characters with their highest bit set, as it catches real
nul characters only.
These two optimizations have shown further improvements over the
previous implementation in microbenchmarks on i386 and amd64 CPUs,
including Pentium 4, Core 2 Duo and i7.
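
The two points above can be illustrated with a minimal sketch in C.
This is not the committed code: the names sketch_strlen, load_word,
word_has_nul, WORD_ONES and WORD_HIGHS are hypothetical, and the
sketch assumes a flat address space in which masking the low bits
of a pointer yields the enclosing aligned word. The test is written
as two independent terms, (w - 0x01..01) and (~w & 0x80..80), which
a multi-issue CPU may evaluate in parallel before the final AND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper constants, not taken from the commit. */
#define WORD_ONES  (~(uintptr_t)0 / 0xff)	/* 0x0101...01 */
#define WORD_HIGHS (WORD_ONES * 0x80)		/* 0x8080...80 */

/* Load one word from an aligned address without aliasing issues. */
static uintptr_t
load_word(const char *wp)
{
	uintptr_t w;

	memcpy(&w, wp, sizeof(w));
	return (w);
}

/* Hacker's Delight test: nonzero iff some byte of w is zero. */
static int
word_has_nul(uintptr_t w)
{
	uintptr_t va, vb;

	va = w - WORD_ONES;	/* independent of vb */
	vb = ~w & WORD_HIGHS;	/* independent of va */
	return ((va & vb) != 0);
}

size_t
sketch_strlen(const char *str)
{
	const char *p, *wp;
	uintptr_t w;

	assert(str != NULL);

	/*
	 * Round str down to a word boundary.  The aligned word lies
	 * in the same page as str, so it is accessible if str is.
	 */
	wp = (const char *)((uintptr_t)str &
	    ~(uintptr_t)(sizeof(w) - 1));
	w = load_word(wp);
	wp += sizeof(w);

	/*
	 * The first word may contain zero bytes located before str,
	 * so a hit here is confirmed byte by byte starting at str.
	 */
	if (word_has_nul(w)) {
		for (p = str; p < wp; p++)
			if (*p == '\0')
				return ((size_t)(p - str));
	}

	/* Scan the rest of the string one aligned word at a time. */
	for (;; wp += sizeof(w)) {
		w = load_word(wp);
		if (word_has_nul(w)) {
			/* A nul is known to be inside this word. */
			for (p = wp; ; p++)
				if (*p == '\0')
					return ((size_t)(p - str));
		}
	}
}
```

Like any word-at-a-time scan, the sketch may read bytes of the
aligned word that lie outside the string itself; it relies on the
page-boundary argument above to make those reads harmless.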
[1] http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/03/08#strlen_1
MFC after: 1 month
reducing branches and doing word-sized operation.
The idea is taken from J.T. Conklin's x86_64 optimized version of strlen(3)
for NetBSD, and reimplemented in C by me.
Discussed on: -arch@