Bruce Evans a00672cff9 Use a better method of scaling by 2**k. Instead of adding to the
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it.  This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-write at an
unfortunate time for pipelines.
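
The two scaling methods described above can be sketched as follows. This
is only an illustration, not the code from this commit: the function
names are hypothetical, it assumes IEEE 754 binary64 doubles, and it
assumes that k and the scaled result both stay in the normal exponent
range (no overflow, underflow or subnormals).

```c
#include <stdint.h>
#include <string.h>

/*
 * Old method (sketch): add k directly to the exponent field of the
 * reduced result.  This is a read-modify-write of x's own bits, so it
 * cannot start until x has been computed.
 */
static double
scale_by_exponent_add(double x, int k)
{
	uint64_t bits;

	memcpy(&bits, &x, sizeof(bits));
	/* The exponent field starts at bit 52; unsigned wraparound handles k < 0. */
	bits += (uint64_t)(int64_t)k << 52;
	memcpy(&x, &bits, sizeof(x));
	return (x);
}

/*
 * New method (sketch): construct 2**k from k alone, then multiply.
 * Building twopk does not depend on x, so it can overlap with the
 * computation of the reduced result.
 */
static double
scale_by_multiply(double x, int k)
{
	/* Biased exponent 0x3ff + k, zero mantissa: exactly 2**k. */
	uint64_t bits = (uint64_t)(0x3ff + k) << 52;
	double twopk;

	memcpy(&twopk, &bits, sizeof(twopk));
	return (x * twopk);
}
```

Within the stated range both functions return the same value. The
difference is only in the dependency chain: the multiply needs x at the
very last step, while the exponent adjustment has to read x, modify it
and write it back.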

In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%.  I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up).  In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles.  The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%.  The manual
scheduling for plain exp[f] is harder and not as tuned.

Details specific to expm1*:
- the saving is closer to 12 cycles than to 40 for expm1* on i386 (A64).
  For some reason it is much larger for negative args.
- also convert to __FBSDID().
2008-02-07 09:42:19 +00:00