pi/4 <= |x| <= 3pi/4. Use the same branch ladder as for float precision.
Remove the optimization for |x| near pi/2 and don't do it near the
multiples of pi/2 in the newly optimized range, since it requires
fairly large code to handle only relativley few cases. Ifdef out
optimization for |x| <= pi/4 since this case can't occur because it
is done in callers.
On amd64 (A64), for cos() and sin() with uniformly distributed args,
no cache misses, some parallelism in the caller, and good but not great
CC and CFLAGS, etc., this saves about 40 cycles or 38% in the newly
optimized range, or about 27% on average across the range |x| <= 2pi
(~65 cycles for most args, while the A64 hardware fcos and fsin take
~75 cycles for half the args and 125 cycles for the other half). The
speedup for tan() is much smaller, especially relatively. The speedup
on i386 (A64) is slightly smaller, especially relatively. i386 is
still much slower than amd64 here (unlike in the float case where it
is slightly faster).
saves an average of about 8 cycles or 5% on A64 (amd64 and i386 --
more in cycles but about the same percentage on i386, and more with
old versions of gcc) with good CFLAGS and some parallelism in the
caller. As usual, it takes a couple more multiplications so it will
be slower on old machines.
Convert to __FBSDID().
Maybe. In the meantime, my workarounds for trying to coax UTC without
timegm() are getting uglier and uglier. Apparently, some systems
don't support setenv()/unsetenv(), so you can't set the TZ env var and
hope thereby to coax mktime() into generating UTC. Without that, I
don't see a really good alternative to just giving up and converting to
localtime with mktime(). (I suppose I should research the Perl library
approach for computing an inverse function to gmtime(); that might
actually be simpler than this growing list of hacks.)
now returns a value, which supports such convenient
constructs as:
if (assert(NULL != foo())) {
}
Also be careful to setlocale("C") for each new test to
avoid locale pollution.
Also a couple of minor portability enhancements.
* If the platform can't restore char nodes, block nodes, or fifos,
don't try and just return error.
* Include O_BINARY in most open() calls (define O_BINARY to 0 if the
platform doesn't provide a definition already)
* Refactor the ownership restore to more cleanly support platforms
that don't have any form of {l,f,}chown() call.
* Comment a lingering issue with older Unix-like systems that allow
root to hose the filesystem. I don't (yet) have a good solution for
this, but I expect it will require adding more redundant stat()
calls. <sigh>
MFC after: 14 days
optimization of about 10% for cos(x), sin(x) and tan(x) on
|x| < 2**19*pi/2. We didn't do this before because __ieee754__rem_pio2()
is too large and complicated for gcc-3.3 to inline very well. We don't
do this for float precision because it interferes with optimization
of the usual (?) case (|x| < 9pi/4) which is manually inlined for float
precision only.
This has some rough edges:
- some static data is duplicated unnecessarily. There isn't much after
the recent move of large tables to k_rem_pio2.c, and some static data
is duplicated to good affect (all the data static const, so that the
compiler can evaluate expressions like 2*pio2 at compile time and
generate even more static data for the constant for this).
- extern inline is used (for the same reason as in previous inlining of
k_cosf.c etc.), but C99 apparently doesn't allow extern inline
functions with static data, and gcc will eventually warn about this.
Convert to __FBSDID().
Indent __ieee754_rem_pio2()'s declaration consistently (its style was
made inconsistent with fdlibm a while ago, so complete this).
Fix __ieee754_rem_pio2()'s return type to match its prototype. Someone
changed too many ints to int32_t's when fixing the assumption that all
ints are int32_t's.
reallocation, when junk filling is enabled. Junk filling must occur
prior to shrinking, since any deallocated trailing pages are immediately
available for use by other threads.
Reported by: Mats Palmgren <mats.palmgren@bredband.net>
allocation patterns, number of CPUs, and MALLOC_OPTIONS settings indicate
that lazy deallocation has the potential to worsen throughput dramatically.
Performance degradation occurs when multiple threads try to clear the lazy
free cache simultaneously. Various experiments to avoid this bottleneck
failed to completely solve this problem, while adding yet more complexity.
Bruce for putting lots of effort into these; getting them right isn't
easy, and they went through many iterations.
Submitted by: Steve Kargl <sgk@apl.washington.edu> with revisions from bde
is a violation of RFC 1034 [STD 13], it is accepted by certain name servers
as well as other popular operating systems' resolver library.
Bugs are mine.
Obtained from: ume
MFC after: 2 weeks
of disk names, where you must free each pointer, as well as the array
by hand. [1]
- Destaticize "disks" in Disk_Names, it has no reasons to be static.
PR: kern/96077 [1]
PR: kern/114110 [1]
MFC after: 1 month
Approved by: rwatson (mentor)
|x| or |y| and b is |y| or |x|) when mixing NaN arg(s).
hypot*() had its own foot shooting for mixing NaNs -- it swaps the
args so that |x| in bits is largest, but does this before quieting
signaling NaNs, so on amd64 (where the result of adding NaNs depends
on the order) it gets inconsistent results if setting the quiet bit
makes a difference, just like a similar ia64 and i387 hardware comparison.
The usual fix (see e_powf.c 1.13 for more details) of mixing using
(a+0.0)+-(b+0.0) doesn't work on amd64 if the args are swapped (since
the rder makes a difference with SSE). Fortunately, the original args
are unchanged and don't need to be swapped when we let the hardware
decide the mixing after quieting them, but we need to take their
absolute value.
hypotf() doesn't seem to have any real bugs masked by this non-bug.
On amd64, its maximum error in 2^32 trials on amd64 is now 0.8422 ulps,
and on i386 the maximum error is unchanged and about the same, except
with certain CFLAGS it magically drops to 0.5 (perfect rounding).
Convert to __FBSDID().
be into 12+24 bits of precision for extra-precision multiplication,
but was into 13+24 bits. On i386 with -O1 the bug was hidden by
accidental extra precision, but on amd64, in 2^32 trials the bug
caused about 200000 errors of more than 1 ulp, with a maximum error
of about 80 ulps. Now the maximum error in 2^32 trials on amd64
is 0.8573 ulps. It is still 0.8316 ulps on i386 with -O1.
The nearby decomposition of 1/ln2 and the decomposition of 2/(3ln2) in
the double precision version seem to be sub-optimal but not broken.
This uses 2 tricks to improve consistency so that more serious problems
aren't hidden in simple regression tests by noise for the NaNs:
- for a signaling NaN, adding 0.0 generates the invalid exception and
converts to a quiet NaN, and doesn't have too many effects for other
types of args (it converts -0 to +0 in some rounding modes, but that
hopefully doesn't change the result after adding the NaN arg). This
avoids some inconsistencies on i386 and ia64. On these arches, the
result of an operation on 2 NaNs is apparently the largest or the
smallest of the NaNs as bits (consistently largest or smallest for
each arch, but the opposite). I forget which way the comparison
goes and if the sign bit affects it. The quiet bit is is handled
poorly by not always setting it before the comparision or ignoring
it. Thus if one of the args was originally a signaling NaN and the
other was originally a quiet NaN, then the result depends too much
on whether the signaling NaN has been quieted at this point, which
in turn depends on optimizations and promotions. E.g., passing float
signaling NaNs to double functions must quiet them on conversion;
on i387, loading a signaling NaN of type float or double (but not
long double) into a register involves a conversion, so it quiets
signaling NaNs, so if the addition has 2 register operands than it
only sees quiet NaNs, but if the addition has a memory operand then
it sees a signaling NaN iff it is in the memory operand.
- subtraction instead of addition is used to avoid a dubious optimization
in old versions of gcc. For SSE operations, mixing of NaNs apparently
always gives the target operand. This is not as good as the i387
and ia64 behaviour. It doesn't mix NaNs at all, and makes addition
not quite commutative. Old versions of gcc sometimes rewrite x+y
to y+x and thus give different results (in bits) for NaNs. gcc-3.3.3
rewrites x+y to y+x for one of pow() and powf() but not the other,
so starting from float NaN args x and y, powf(x, y) was almost always
different from pow(x, y).
These tricks won't give consistency of 2-arg float and double functions
with long double ones on amd64, since long double ones use the i387
which has different semantics from SSE.
Convert to __FBSDID().
and trunc() to the corresponding long double functions. This is not
just an optimization for these arches. The full long double functions
have a wrong value for `huge', and the arches without full long doubles
depended on it being wrong.
This has the side effect of confusing gcc-4.2.1's optimizer into more
often doing the right thing. When it does the wrong thing here, it
seems to be mainly making too many copies of x with dependency chains.
This effect is tiny on amd64, but in some cases on i386 it is enormous.
E.g., on i386 (A64) with -O1, the current version of exp2() should
take about 50 cycles, but took 83 cycles before this change and 66
cycles after this change. exp2f() with -O1 only speeded up from 51
to 47 cycles. (exp2f() should take about 40 cycles, on an Athlon in
either i386 or amd64 mode, and now takes 42 on amd64). exp2l() with
-O1 slowed down from 155 cycles to 123 for some args; this is unimportant
since the i386 exp2l() is a fake; the wrong thing for it seems to
involve branch misprediction.
faster on all machines tested (old Celeron (P2), A64 (amd64 and i386)
and ia64) except on ia64 when compiled with -O1. It takes 2 more
multiplications, so it will be slower on old machines. The speedup
is about 8 cycles = 17% on A64 (amd64 and i386) with best CFLAGS
and some parallelism in the caller.
Move the evaluation of 2**k up a bit so that it doesn't compete too
much with the new polynomial evaluation. Unlike the previous
optimization, this rearrangement cannot change the result, so compilers
and CPU schedulers can do it, but they don't do it quite right yet.
This saves a whole 1 or 2 cycles on A64.
when the result is +-0. IEEE754 requires (in all rounding modes) that
if the result is +-0 then its sign is the same as that of the first
arg, but in round-towards-minus-infinity mode an uncorrected implementation
detail always reversed the sign. (The detail is that x-x with x's
sign positive gives -0 in this mode only, but the algorithm assumed
that x-x always has positive sign for such x.)
remquo() and remquof() seem to need the same fix, but I cannot test them
yet.
Use long doubles when mixing NaN args. This trick improves consistency
of results on at least amd64, so that more serious problems like the
above aren't hidden in simple regression tests by noise for the NaNs.
On amd64, hardware remainder should be used since it is about 10 times
faster than software remainder and is already used for remquo(), but
it involves using the i387 even for floats and doubles, and the i387
does NaN mixing which is better than but inconsistent with SSE NaN mixing.
Software remainder() would probably have been inconsistent with
software remainderl() for the same reason if the latter existed.
Signaling NaNs cause further inconsistencies on at least ia64 and i386.
Use __FBSDID().