Bruce Evans 1dd21062e5 Rearranged the polynomial evaluation some more to reduce dependencies.
Instead of echoing the code in a comment, try to describe why we split
up the evaluation in a special way.

The new optimization is mostly to move the evaluation of w = z*z later
so that everything else (except z = x*x) doesn't have to wait for w.
On Athlons, FP multiplication has a latency of 4 cycles so this
optimization saves 4 cycles per call provided no new dependencies are
introduced.  Tweaking the other terms to reduce dependencies saves
a couple more cycles in some cases (more on AXP than on A64; up to 8
cycles out of 56 altogether in some cases).  The previous version had
a similar optimization for s = z*x.  Special optimizations like these
probably have a larger effect than the simple 2-way vectorization
permitted (but not activated by gcc) in the old version, since 2-way
vectorization is not enough and the polynomial's degree is so small
in the float case that non-vectorizable dependencies dominate.
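
For illustration, here is a minimal sketch of the kind of split described
above, next to a plain Horner evaluation.  The function names and the
grouping of terms are assumptions made for this example, and the
coefficients are the Taylor coefficients of tan(x), used as placeholders
for the minimax coefficients a real kernel would use.  The point is only
that r, t, s and u depend on z alone, so they need not wait for w = z*z.

```c
#include <math.h>

/*
 * Placeholder coefficients: Taylor series of tan(x),
 * tan(x) = x + x^3/3 + 2x^5/15 + 17x^7/315 + ...
 * These are for illustration only, not tuned minimax values.
 */
static const double T[6] = {
	1.0 / 3.0,
	2.0 / 15.0,
	17.0 / 315.0,
	62.0 / 2835.0,
	1382.0 / 155925.0,
	21844.0 / 6081075.0,
};

/* Horner form: every multiply-add waits for the previous one. */
double
tan_poly_horner(double x)
{
	double z, p;

	z = x * x;
	p = T[5];
	p = T[4] + z * p;
	p = T[3] + z * p;
	p = T[2] + z * p;
	p = T[1] + z * p;
	p = T[0] + z * p;
	return (x + (x * z) * p);
}

/*
 * Split form: r, t, s and u depend only on z (and x), so they can be
 * in flight while w = z*z is still being computed.  w is needed only
 * in the final combination.
 */
double
tan_poly_split(double x)
{
	double z, r, t, w, s, u;

	z = x * x;
	r = T[4] + z * T[5];
	t = T[2] + z * T[3];
	w = z * z;		/* issued late; nothing above waits for it */
	s = z * x;
	u = T[0] + z * T[1];
	return ((x + s * u) + (s * w) * (t + w * r));
}
```

Both functions evaluate the same polynomial, so their results should
differ only by rounding.  Any cycle savings depend on the CPU and on
how the compiler schedules the code.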

On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now
takes 34-55 cycles (was 39-59 cycles).
2005-11-28 11:46:20 +00:00