|x| or |y| and b is |y| or |x|) when mixing NaN arg(s).
hypot*() had its own foot shooting for mixing NaNs -- it swaps the
args so that |x| in bits is largest, but does this before quieting
signaling NaNs, so on amd64 (where the result of adding NaNs depends
on the order) it gets inconsistent results if setting the quiet bit
makes a difference, just like a similar ia64 and i387 hardware comparison.
The usual fix (see e_powf.c 1.13 for more details) of mixing using
(a+0.0)+-(b+0.0) doesn't work on amd64 if the args are swapped (since
the rder makes a difference with SSE). Fortunately, the original args
are unchanged and don't need to be swapped when we let the hardware
decide the mixing after quieting them, but we need to take their
absolute value.
hypotf() doesn't seem to have any real bugs masked by this non-bug.
On amd64, its maximum error in 2^32 trials on amd64 is now 0.8422 ulps,
and on i386 the maximum error is unchanged and about the same, except
with certain CFLAGS it magically drops to 0.5 (perfect rounding).
Convert to __FBSDID().
be into 12+24 bits of precision for extra-precision multiplication,
but was into 13+24 bits. On i386 with -O1 the bug was hidden by
accidental extra precision, but on amd64, in 2^32 trials the bug
caused about 200000 errors of more than 1 ulp, with a maximum error
of about 80 ulps. Now the maximum error in 2^32 trials on amd64
is 0.8573 ulps. It is still 0.8316 ulps on i386 with -O1.
The nearby decomposition of 1/ln2 and the decomposition of 2/(3ln2) in
the double precision version seem to be sub-optimal but not broken.
This uses 2 tricks to improve consistency so that more serious problems
aren't hidden in simple regression tests by noise for the NaNs:
- for a signaling NaN, adding 0.0 generates the invalid exception and
converts to a quiet NaN, and doesn't have too many effects for other
types of args (it converts -0 to +0 in some rounding modes, but that
hopefully doesn't change the result after adding the NaN arg). This
avoids some inconsistencies on i386 and ia64. On these arches, the
result of an operation on 2 NaNs is apparently the largest or the
smallest of the NaNs as bits (consistently largest or smallest for
each arch, but the opposite). I forget which way the comparison
goes and if the sign bit affects it. The quiet bit is is handled
poorly by not always setting it before the comparision or ignoring
it. Thus if one of the args was originally a signaling NaN and the
other was originally a quiet NaN, then the result depends too much
on whether the signaling NaN has been quieted at this point, which
in turn depends on optimizations and promotions. E.g., passing float
signaling NaNs to double functions must quiet them on conversion;
on i387, loading a signaling NaN of type float or double (but not
long double) into a register involves a conversion, so it quiets
signaling NaNs, so if the addition has 2 register operands than it
only sees quiet NaNs, but if the addition has a memory operand then
it sees a signaling NaN iff it is in the memory operand.
- subtraction instead of addition is used to avoid a dubious optimization
in old versions of gcc. For SSE operations, mixing of NaNs apparently
always gives the target operand. This is not as good as the i387
and ia64 behaviour. It doesn't mix NaNs at all, and makes addition
not quite commutative. Old versions of gcc sometimes rewrite x+y
to y+x and thus give different results (in bits) for NaNs. gcc-3.3.3
rewrites x+y to y+x for one of pow() and powf() but not the other,
so starting from float NaN args x and y, powf(x, y) was almost always
different from pow(x, y).
These tricks won't give consistency of 2-arg float and double functions
with long double ones on amd64, since long double ones use the i387
which has different semantics from SSE.
Convert to __FBSDID().
and trunc() to the corresponding long double functions. This is not
just an optimization for these arches. The full long double functions
have a wrong value for `huge', and the arches without full long doubles
depended on it being wrong.
This has the side effect of confusing gcc-4.2.1's optimizer into more
often doing the right thing. When it does the wrong thing here, it
seems to be mainly making too many copies of x with dependency chains.
This effect is tiny on amd64, but in some cases on i386 it is enormous.
E.g., on i386 (A64) with -O1, the current version of exp2() should
take about 50 cycles, but took 83 cycles before this change and 66
cycles after this change. exp2f() with -O1 only speeded up from 51
to 47 cycles. (exp2f() should take about 40 cycles, on an Athlon in
either i386 or amd64 mode, and now takes 42 on amd64). exp2l() with
-O1 slowed down from 155 cycles to 123 for some args; this is unimportant
since the i386 exp2l() is a fake; the wrong thing for it seems to
involve branch misprediction.
faster on all machines tested (old Celeron (P2), A64 (amd64 and i386)
and ia64) except on ia64 when compiled with -O1. It takes 2 more
multiplications, so it will be slower on old machines. The speedup
is about 8 cycles = 17% on A64 (amd64 and i386) with best CFLAGS
and some parallelism in the caller.
Move the evaluation of 2**k up a bit so that it doesn't compete too
much with the new polynomial evaluation. Unlike the previous
optimization, this rearrangement cannot change the result, so compilers
and CPU schedulers can do it, but they don't do it quite right yet.
This saves a whole 1 or 2 cycles on A64.
when the result is +-0. IEEE754 requires (in all rounding modes) that
if the result is +-0 then its sign is the same as that of the first
arg, but in round-towards-minus-infinity mode an uncorrected implementation
detail always reversed the sign. (The detail is that x-x with x's
sign positive gives -0 in this mode only, but the algorithm assumed
that x-x always has positive sign for such x.)
remquo() and remquof() seem to need the same fix, but I cannot test them
yet.
Use long doubles when mixing NaN args. This trick improves consistency
of results on at least amd64, so that more serious problems like the
above aren't hidden in simple regression tests by noise for the NaNs.
On amd64, hardware remainder should be used since it is about 10 times
faster than software remainder and is already used for remquo(), but
it involves using the i387 even for floats and doubles, and the i387
does NaN mixing which is better than but inconsistent with SSE NaN mixing.
Software remainder() would probably have been inconsistent with
software remainderl() for the same reason if the latter existed.
Signaling NaNs cause further inconsistencies on at least ia64 and i386.
Use __FBSDID().
exp2(i/TBLSIZE) * p(z) instead of only for the final multiplication
and addition. This fixes the code to match the comment that the maximum
error is 0.5010 ulps (except on machines that evaluate float expressions
in extra precision, e.g., i386's, where the evaluation was already
in extra precision).
Fix and expand the comment about use of double precision.
The relative roundoff error from evaluating p(z) in non-extra precision
was about 16 times larger than in exp2() because the interval length
is 16 times smaller. Its maximum was at least P1 * (1.0 ulps) *
max(|z|) ~= log(2) * 1.0 * 1/32 ~= 0.0217 ulps (1.0 ulps from the
addition in (1 + P1*z) with a cancelation error when z ~= -1/32). The
actual final maximum was 0.5313 ulps, of which 0.0303 ulps must have
come from the additional roundoff error in p(z). I can't explain why
the additional roundoff error was almost 3/2 times larger than the rough
estimate.
precision. The new polynomial has degree 4 instead of 10, and a maximum
error of 2**-30.04 ulps instead of 2**-33.15. This doesn't affect the
final error significantly; the maximum error was and is about 0.5015
ulps on i386 -O1, and the number of cases with an error of > 0.5 ulps
is increased from 13851 to 14407.
Note that the error is only this close to 0.5 ulps due to excessive
extra precision caused by compiler bugs on i386. The extra precision
could be obtained intentionally, and is useful for keeping the error
of the hyperbolic float functions below 1 ulp, since these functions
are implemented using expm1f. My recent change for scaling by 2**k
had the unintentional side effect of retaining extra precision for
longer, so callers of expm1f see errors of more like 0.0015 ulps than
0.5015 ulps, and for the hyperbolic functions this reduces the maximum
error from nearly about 2 ulps to about 0.75 ulps.
This is about 10% faster on i386 (A64). expm1* is still very slow,
but now the float version is actually significantly faster. The
algorithm is very sophisticated but not very good except on machines
with fast division.
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it. This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-wrtite at an
unfortunate time for pipelines.
In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%. I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up). In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles. The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%. The manual
scheduling for plain exp[f] is harder and not as tuned.
Details specific to expm1*:
- the saving is closer to 12 cycles than to 40 for expm1* on i386 (A64).
For some reason it is much larger for negative args.
- also convert to __FBSDID().
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it. This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-wrtite at an
unfortunate time for pipelines.
In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%. I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up). In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles. The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%. The manual
scheduling for plain exp[f] is harder and not as tuned.
This change ld128/s_exp2l.c has not been tested.
that is specialized for float precision. The new polynomial has degree
5 instead of 11, and a maximum error of 2**-27.74 ulps instead
of 2**-30.64. This doesn't affect the final error significantly; the
maximum error was and is about 0.9101 ulps on amd64 -01 and the number
of cases with an error of > 0.5 ulps is actually reduced by epsilon
despite the larger error in the polynomial.
This is about 15% faster on amd64 (A64), i386 (A64) and ia64. The asm
version is still used instead of this on i386 since it is faster and
more accurate.
cases which are used mainly by regression tests.
As usual, the cutoff for tiny args was not correctly translated to
float precision. It was 2**-54 but 2**-24 works. It must be about
2**-precision, since the error from approximating log(1+x) by x is
about the same as |x|. Exhaustive testing shows that 2**-24 gives
perfect rounding in round-to-nearest mode.
Similarly for the cutoff for being small, except this is not used by
so many other functions. It was 2**-29 but 2**-15 works. It must be
a bit smaller than sqrt(2**-precision), since the error from
approximating log(1+x) by x-x*x/2 is about the same as x*x. Exhaustive
testing shows that 2**-15 gives a maximum error of 0.5052 ulps in
round-to-nearest-mode. The algorithm for the general case is only good
for 0.8388 ulps, so this is sufficient (but it loses slightly on i386 --
then extra precision gives 0.5032 ulps for the general case).
While investigating this, I noticed that optimizing the usual case by
falling into a middle case involving a simple polynomial evaluation
(return x-x*x/2 instead of x here) is not such a good idea since it
gives an enormous pessimization of tinier args on machines for which
denormals are slow. Float x*x/2 is denormal when |x| ~< 2**-64 and
x*x/2 is evaluated in float precision, so it can easily be denormal
for normal x. This is even more interesting for general polynomial
evaluations. Multiplying out large powers of x is normally a good
optimization since it reduces dependencies, but it creates denormals
starting with quite large x.
forget to translate "float" to "double".
ucbtest didn't detect the bug, but exhaustive testing of the float
case relative to the double case eventually did. The bug only affects
args x with |x| ~> 2**19*(pi/2) on non-i386 (i386 is broken in a
different way for large args).
it should never have existed and it has not been used for many years
(floats are reduced faster using doubles). All relevant changes (just
the workaround for broken assignment) have been merged to the double
version.
there is a problem with non-floats (when i386 defaults to extra
precision). This essentially restores yesterday's behaviour for doubles
on i386 (since generic rint() isn't used and everywhere else assumed
working assignment), but for arches that use the generic rint() it
finishes restoring some of 1995's behaviour (don't waste time doing
unnecessary store/load).
variable hack for exp2f() only.
The volatile variable had a surprisingly large cost for exp2f() -- 19
cycles or 15% on i386 in the worst case observed. This is only partly
explained by there being several references to the variable, only one
of which benefited from it being volatile. Arches that have working
assignment are likely to benefit even more from not having any volatile
variable.
exp2() now has a chance of working with extra precision on i386.
exp2() has even more references to the variable, so it would have been
pessimized more by simply declaring the variable as volatile. Even
the temporary volatile variable for STRICT_ASSIGN costs 5-10% on i386,
(A64) so I will change STRICT_ASSIGN() to do an ordinary assignment
until i386 defaults to extra precision.
instead of a volatile cast hack for the float version only. The cast
hack broke with gcc-4, but this was harmless since the float version
hasn't been used for a few years. Merge from the float version so
that the double version has a chance of working on i386 with extra
precision.
See k_rem_pio2f.c rev.1.8 for the original hack.
Convert to _FBSDID().
hack for log1pf() only. The cast hack broke with gcc-4, resulting in
~1 million errors of more than 1 ulp, with a maximum error of ~1.5 ulps.
Now the maximum error for log1pf() on i386 is 0.5034 ulps again (this
depends on extra precision), and log1p() has a chance of working with
extra precision.
See s_log1pf.c 1.8 for the original hack. (It claims only 62343 large
errors).
Convert to _FBSDID(). Another thing broken with gcc-4 is the static
const hack used for rcsids.
around assignments not working for gcc on i386. Now volatile hacks
for rint() and rintf() don't needlessly pessimize so many arches
and the remaining pessimizations (for arm and powerpc) can be avoided
centrally.
This cleans up after s_rint.c 1.3 and 1.13 and s_rintf.c 1.3 and 1.9:
- s_rint.c 1.13 broke 1.3 by only using a volatile cast hack in 1 place
when it was needed in 2 places, and the volatile cast hack stopped
working with gcc-4. These bugs only affected correctness tests on
i386 since i386 normally uses asm rint() and doesn't support the
extra precision mode that would break assignments of doubles.
- s_rintf.c 1.9 improved(?) on 1.3 by using a volatile variable hack
instead of an extra-precision variable hack, but it declared 2
variables as volatile when only 1 variable needed to be volatile.
This only affected speed tests on i386 since i386 uses asm rintf().
long doubles (i386, amd64, ia64) and one for machines with 128-bit
long doubles (sparc64). Other platforms use the double version.
I've only done runtime testing on i386.
Thanks to bde@ for helpful discussions and bugfixes.
assignments and casts don't clip extra precision, if any. The
implementation is to assign to a temporary volatile variable and read
the result back to assign to the original lvalue.
lib/msun currently 2 different hard-coded hacks to avoid the problem
in just a few places and needs it in a few more places. One variant
uses volatile for the original lvalue. This works but is slower than
necessary. Another temporarily casts the lvalue to volatile. This
broke with gcc-4.2.1 or earlier (gcc now stores to the lvalue but
doesn't load from it).
my original implementation made both use the same code. Unfortunately,
this meant libm depended on a vendor header at compile time and previously-
unexposed vendor bits in libc at runtime.
Hence, I just wrote my own version of the relevant vendor routine. As it
turns out, mine has a factor of 8 fewer of lines of code, and is a bit more
readable anyway. The strtod() and *scanf() routines still use vendor code.
Reviewed by: bde
someone thought it would be a good idea to copy z_abs() to libm in 1994.
However, it's never been declared or documented anywhere, and I'm
reasonably confident that nobody uses it.
Discussed with: bde, deischen, kan
I hope that this and the i386 version of it will not be needed, but
this is currently about 16 cycles or 36% faster than the C version,
and the i386 version is about 8 cycles or 19% faster than the C
version, due to poor optimization of the C version.
adds two new directories in msun: ld80 and ld128. These are for
long double functions specific to the 80-bit long double format
used on x86-derived architectures, and the 128-bit format used on
sparc64, respectively.