regression manages to do it)... We use a packed struct to coerce
gcc/clang into producing unaligned loads (there is not packed pointer
attribute, otherwise this would be easier)...
use _storeu_ and _loadu_ when using the structure is overkill...
be better at using types properly... Since we allocate our own key
schedule and make sure it's aligned, use the __m128i type in various
arguments to functions...
clang ignores __aligned on prototypes and gcc errors on them, leave them
in comments to document that these function arguments are require to be
aligned...
about all that changes is movdqa -> movdqu from reading the diff of the
disassembly output...
Noticed by: symbolics at gmx.com
MFC after: 3 days