bde 558fb238b1 Moved all the optimizations for |x| <= 9pi/2 from
__ieee754_rem_pio2f() to its 3 callers and manually inlined them.

On Athlons, with favourable compiler flags and optimizations and
favourable pipeline conditions, this gives a speedup of 30-40 cycles
for cosf(), sinf() and tanf() on the range pi/4 < |x| <= 9pi/4, so
these functions are now significantly faster than the hardware trig
functions in many cases.  E.g., in a benchmark with uniformly distributed
x in [-2pi, 2pi], A64 hardware fcos took 72-129 cycles and cosf() took
37-55 cycles.  Out-of-order execution is needed to get both of these
times.  The optimizations in this commit apparently work more by
removing 1 serialization point than by reducing latency.
2005-11-19 02:38:27 +00:00