Add an optimised version of the in-flight flow matching algorithm
using SIMD instructions. This should give up to 1.5x over the scalar
versions performance.
Falls back to scalar version if SSE4.2 not available
Signed-off-by: David Hunt <david.hunt@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>