freebsd-skq

Author	SHA1	Message	Date
bde	4417000483	Minor cleanups and optimizations: - Remove dead code that I forgot to remove in the previous commit. - Calculate the sum of the lower terms of the polynomial (divided by x**5) in a single expression (sum of odd terms) + (sum of even terms) with parentheses to control grouping. This is clearer and happens to give better instruction scheduling for a tiny optimization (an average of about 0.5 cycles/call on Athlons). - Calculate the final sum in a single expression with parentheses to control grouping too. Change the grouping from first_term + (second_term + sum_of_lower_terms) to (first_term + second_term) + sum_of_lower_terms. Normally the first grouping must be used for accuracy, but extra precision makes any grouping give a correct result so we can group for efficiency. This is a larger optimization (average 3-4 cycles/call or 5%). - Use parentheses to indicate that the C order of left to right evaluation is what is wanted (for efficiency) in a multiplication too. The old fdlibm code has several optimizations related to these. 2 involve doing an extra operation that can be done almost in parallel on some superscalar machines but are pessimizations on sequential machines. Others involve statement ordering or expression grouping. All of these except the ordering for combining the sums of the odd and even terms seem to be ideal for Athlons, but parallelism is still limited so all of these optimizations combined together with the ones in this commit save only ~6-8 cycles (~10%). On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now takes 39-59 cycles. I don't know of any more optimizations for tanf() short of writing it all in asm with very MD instruction scheduling. Hardware fsin takes 122-138 cycles. Most of the optimizations for tanf() don't work very well for tan[l](). fdlibm tan() now takes 145-365 cycles.	2005-11-24 13:48:40 +00:00
ru	5bd42d0a34	Fix prototype.	2005-11-24 11:29:11 +00:00
ru	af47fb2f88	Fix prototypes.	2005-11-24 11:26:36 +00:00
ru	5633435ae3	Fix prototypes.	2005-11-24 11:14:06 +00:00
ru	46b5b6bcde	Fix prototypes.	2005-11-24 10:54:47 +00:00
ru	3cf38aeba7	Fix prototype.	2005-11-24 10:43:35 +00:00
ru	f815813dd1	Fix prototype.	2005-11-24 10:32:39 +00:00
ru	ae11cb5ef9	Fix prototypes.	2005-11-24 10:30:44 +00:00
ru	7b90f188c4	Fix prototypes.	2005-11-24 10:06:05 +00:00
joel	7eed0b9958	s/5.5/6.0/ in HISTORY section. Discussed with: ru	2005-11-24 09:25:10 +00:00
ru	d9eedd9185	Make SYNOPSIS compile. Attn peter@: this manpage wasn't synced with your code changes.	2005-11-24 07:48:19 +00:00
ru	a615d0b31e	Fix prototypes. Attn davidxu@: most likely, the description should also be tweaked after your undocumented changes that changed these prototypes.	2005-11-24 07:33:35 +00:00
ru	bf558bda27	Fix prototypes.	2005-11-24 07:12:01 +00:00
ru	e82db33c27	Keep up with const poisoning in uuid.h,v 1.3.	2005-11-24 07:04:20 +00:00
ru	07d744857c	Fix prototype.	2005-11-24 06:56:21 +00:00
bde	caae9bf081	Optimized by eliminating the special case for 0.67434 <= |x| < pi/4. A single polynomial approximation for tan(x) works in infinite precision up to |x| < pi/2, but in finite precision, to restrict the accumulated roundoff error to < 1 ulp, |x| must be restricted to less than about sqrt(0.5/((1.5+1.5)/3)) ~= 0.707. We restricted it a bit more to give a safety margin including some slop for optimizations. Now that we use double precision for the calculations, the accumulated roundoff error is in double-precision ulps so it can easily be made almost 2**29 times smaller than a single-precision ulp. Near x = pi/4 its maximum is about 0.5+(1.5+1.5)x**2/3 ~= 1.117 double-precision ulps. The minimax polynomial needs to be different to work for the larger interval. I didn't increase its degree; the old degree is just large enough to keep the final error less than 1 ulp and increasing the degree would be a pessimization. The maximum error is now ~0.80 ulps instead of ~0.53 ulps. The speedup from this optimization for uniformly distributed args in [-2pi, 2pi] is 28-43% on athlons, depending on how badly gcc selected and scheduled the instructions in the old version. The old version has some int-to-float conversions that are apparently difficult to schedule well, but gcc-3.3 somehow did everything ~10 cycles or ~10% faster than gcc-3.4, with the difference especially large on AXPs. On A64s, the problem seems to be related to documented penalties for moving single precision data to undead xmm registers. With this version, the speed in cycles is almost independent of the athlon and gcc version despite the large differences in instruction selection to use the FPU on AXPs and SSE on A64s.	2005-11-24 02:04:26 +00:00
ru	11d4f09966	Fix prototype.	2005-11-23 20:34:37 +00:00
ru	642fd4337d	Fix prototype.	2005-11-23 20:26:58 +00:00
ru	869e65f881	Fix prototypes.	2005-11-23 16:44:23 +00:00
ru	5e1264a066	There's no longer^Wyet <sys/capability.h>.	2005-11-23 16:24:39 +00:00
ru	f0442273f1	Fix inet6_opt_get_val() prototype.	2005-11-23 16:07:54 +00:00
ru	07eeed1e1c	Make SYNOPSIS compile.	2005-11-23 15:55:38 +00:00
ru	906caa442c	Make SYNOPSIS compile after imp@'s changes.	2005-11-23 15:44:42 +00:00
ru	baae9ec455	Make SYNOPSIS compile.	2005-11-23 15:41:36 +00:00
bde	1e3150891d	Use only double precision for "kernel" tanf (except for returning float). This is a minor interface change. The function is renamed from __kernel_tanf() to __kernel_tandf() so that misuses of it will cause link errors and not crashes. This version is a routine translation with no special optimizations for accuracy or efficiency. It gives an unimportant increase in accuracy, from ~0.9 ulps to 0.5285 ulps. Almost all of the error is from the minimax polynomial (~0.03 ulps) and the final rounding step (< 0.5 ulps). It gives strange differences in efficiency in the -5 to +10% range, with -O1 fairly consistently becoming faster and -O2 slower on AXP and A64 with gcc-3.3 and gcc-3.4.	2005-11-23 14:27:56 +00:00
ru	11e07dda30	Add missing includes.	2005-11-23 10:49:07 +00:00
bde	89ac9def6a	Simplified setting up args for __kernel_rem_pio2(). We already have x with a 24-bit fraction, so we don't need a loop to split it into up to 3 terms with 24-bit fractions.	2005-11-23 03:03:09 +00:00
bde	67ff03dd57	Quick fix for stack buffer overrun in rev.1.13. Oops. The prec == 1 arg to __kernel_rem_pio2() gives 53-bit (double) precision, not single precision and/or the array dimension like I thought. prec == 2 is used in e_rem_pio2.c for double precision although it is documented to be for 64-bit (extended) precision, and I just reduced it by 1 thinking that this would give the value suitable for 24-bit (float) precision. Reducing it 1 more to the documented value for float precision doesn't actually work (it gives errors of ~0.75 ulps in the reduced arg, but errors of much less than 0.5 ulps are needed; the bug seems to be in kernel_rem_pio2.c). Keep using a value 1 larger than the documented value but supply an array large enough to hold the extra unused result from this. The bug can also be fixed quickly by increasing init_jk[0] in k_rem_pio2.c from 2 to 3. This gives behaviour identical to using prec == 1 except it doesn't create the extra result. It isn't clear how the precision bug affects higher precisions. 113-bit (quad) is the largest precision, so there is no way to use a large precision to fix it.	2005-11-23 02:06:06 +00:00
ru	92462f1576	Tidy up markup and fix two bugs.	2005-11-21 17:18:34 +00:00
bde	d8a5fc0b49	Mess up the "kernel" float trig function .c files with ifdefs so that they can be #included in other .c files to give inline functions, and use them to inline the functions in most callers (not in e_lgammaf_r.c). __kernel_tanf() is too large and complicated for gcc to inline very well. On Athlons, this gives a speed increase under favourable pipeline conditions of about 10% overall (larger for AXP, smaller for A64). E.g., on AXP, sinf() on uniformly distributed args in [-2Pi, 2Pi] now takes 30-56 cycles; it used to take 45-61 cycles; hardware fsin takes 65-129.	2005-11-21 04:57:12 +00:00
bde	d96648954f	Use double precision to simplify and optimize a long division. On athlons, this gives a speedup of 10-20% for tanf() on uniformly distributed args in [-2Pi, 2Pi]. (It only directly applies for 43% of the args and gives a 16-20% speedup for these (more for AXP than A64) and this gives an overall speedup of 10-12% which is all that it should; however, it gives an overall speedup of 17-20% with gcc-3.3 on AXP-A64 by mysteriously affecting cases where it isn't executed.) I originally intended to use double precision for all internals of float trig functions and will probably still do this, but benchmarking showed that converting to double precision and back is a pessimization in cases where a simple float precision calculation works, so it may be optimal to switch precisions only when using extra precision is much simpler.	2005-11-21 00:38:21 +00:00
bde	01155bb235	Restored a cleanup in rev.1.9 that was lost in rev.1.10.	2005-11-20 20:17:04 +00:00
simon	ac5e3a71fd	Do not explicitly state how many bytes an argument list can be in the description of E2BIG, since it's now larger on some platforms. MFC after: 3 days	2005-11-19 11:30:55 +00:00
marcel	d7ead39c65	o Include <sys/time.h> o Make this ILP32/LP64 clean: cast pointers to long o Code conditional upon DEBUG must also be conditional upon _LIBC_R_	2005-11-19 04:47:06 +00:00
marcel	3886f95485	o Include <string.h> o Make this ILP32/LP64 clean: cast pointers to long.	2005-11-19 04:45:15 +00:00
marcel	bfb066610e	Fix typo: s/_LIBC_R/_LIBC_R_/	2005-11-19 04:43:29 +00:00
bde	558fb238b1	Moved all the optimizations for |x| <= 9pi/2 from __ieee754_rem_pio2f() to its 3 callers and manually inline them. On Athlons, with favourable compiler flags and optimizations and favourable pipeline conditions, this gives a speedup of 30-40 cycles for cosf(), sinf() and tanf() on the range pi/4 < |x| <= 9pi/4, so these functions are now significantly faster than the hardware trig functions in many cases. E.g., in a benchmark with uniformly distributed x in [-2pi, 2pi], A64 hardware fcos took 72-129 cycles and cosf() took 37-55 cycles. Out-of-order execution is needed to get both of these times. The optimizations in this commit apparently work more by removing 1 serialization point than by reducing latency.	2005-11-19 02:38:27 +00:00
andre	e76b2aa5e3	Document CLOCK_UPTIME which returns the current uptime in SI seconds. At the moment it is just an alias for CLOCK_MONOTONIC which reports the same number. Sponsored by: TCP/IP Optimization Fundraise 2005	2005-11-18 17:13:22 +00:00
ru	6e1cf27cb4	Fix markup, grammar and spelling.	2005-11-18 14:21:28 +00:00
ru	0a30497782	Fix up markup.	2005-11-18 11:54:14 +00:00
ru	271d9041b2	Fix up markup etc. in recently born manpage.	2005-11-18 11:53:23 +00:00
bde	63ac8a6c5f	Removed an unused declaration which was so old that it wasn't a prototype and thus just broke building at any nonzero WARNS level. Fixed nearby style bugs.	2005-11-18 05:03:12 +00:00
ru	928d297eeb	-mdoc sweep.	2005-11-17 13:00:00 +00:00
bde	5fa6749138	Minor cleanups: s_cosf.c and s_sinf.c: Use a non-bogus magic constant for the threshold of pi/4. It was 2 ulps smaller than pi/4 rounded down, but its value is not critical so it should be the result of natural rounding. s_cosf.c and s_tanf.c: Use a literal 0.0 instead of an unnecessary variable initialized to [(float)]0.0. Let the function prototype convert to 0.0F. Improved wording in some comments. Attempted to improve indentation of comments.	2005-11-17 03:53:22 +00:00
bde	c2a2c2b30d	Rearranged the optimizations for special cases to reduce the average number of branches. Use a non-bogus magic constant for the threshold of pi/4. It was 2 ulps smaller than pi/4 rounded down, but its value is not critical so it should be the result of natural rounding. Use "<=" comparisons with rounded-down thresholds for all small multiples of pi/4. Cleaned up previous commit: - use static const variables instead of expressions for multiples of pi/2 to ensure that they are evaluated at compile time. gcc currently evaluates them at compile time but C99 compilers are not required to do so. We want compile time evaluation for optimization and don't care about side effects. - use M_PI_2 instead of a magic constant for pi/2. We need magic constants related to pi/2 elsewhere but not here since we just want pi/2 rounded to double and even prefer it to be rounded in the default rounding mode. We can depend on the compiler being C99ish enough to round M_PI_2 correctly just as much as we depended on it handling hex constants correctly. This also fixes a harmless rounding error in the hex constant. - keep using expressions n*<value for pi/2> in the initializers for the static const variables. 2*M_PI_2 and 4*M_PI_2 are obviously rounded in the same way as the corresponding infinite precision expressions for multiples of pi/2, and 3*M_PI_2 happens to be rounded like this, so we don't need magic constants for the multiples. - fixed and/or updated some comments.	2005-11-17 02:20:04 +00:00
ume	92c433a722	The KAME's getipnodebyaddr() code honor the MULTI_PTRS_ARE_ALIASES define also, but res_config.h was not included into libc/net/name6.c. So getipnodebyaddr() ignored the multiple PTRs. PR: kern/88241 Submitted by: Dan Lukes <dan__at__obluda.cz> MFC after: 3 days	2005-11-15 03:40:15 +00:00
rwatson	c2c82599c8	Add symlinks for kvm access methods for memstat(3). MFC after: 3 days	2005-11-13 13:42:03 +00:00
bde	f63f109c0b	Fixed some magic numbers. The threshold for not being tiny was too small. Use the usual 2**-12 threshold. This change is not just an optimization, since the general code that we fell into has accuracy problems even for tiny x. Avoiding it fixes 21366 args with errors of more than 1 ulp, with a maximum error of 1.167 ulps. The magic number 22 is log(DBL_EPSILON)/2 plus slop. This is bogus for float precision. Use 9 (~log(FLT_EPSILON)/2 plus less slop than for double precision). The code for handling the interval [2**-28, 9_was_22] has accuracy problems even for [9, 22], so this change happens to fix errors of more than 1 ulp in about 217000 cases. It leaves such errors in about 21074000 cases, with a max error of 1.242 ulps. The threshold for switching from returning exp(x)/2 to returning exp(x/2)^2/2 was a little smaller than necessary. As for coshf(), this was not quite harmless since the exp(x/2)^2/2 case is inaccurate, and fixing it avoids accuracy problems in 26 cases, leaving problems in 2*19997 cases. Fixed naming errors in pseudo-code in comments.	2005-11-13 00:41:46 +00:00
bde	3f7e4f1538	Fixed some magic numbers. The threshold for not being tiny was confusing and too small. Use the usual 2**-12 threshold and simplify the algorithm slightly so that this threshold works (now use the threshold for sinhf() instead of one for 1+expm1()). This is just a small optimization. The magic number 22 is log(DBL_EPSILON)/2 plus slop. This is bogus for float precision. Use 9 (~log(FLT_EPSILON)/2 plus less slop than for double precision). The threshold for switching from returning exp(x)/2 to returning exp(x/2)^2/2 was a little smaller than necessary. This was not quite harmless since the exp(x/2)^2/2 case is inaccurate. Fixing it happens to avoid accuracy problems for 26 of the 2151 args that were handled by the exp(x)/2 case. This leaves accuracy problems for about 219997 args near the overflow threshold (~89); the maximum error there is 2.5029 ulps. There are also accuracy problems for args in +-[0.5ln2, 9] -- 2188885 args with errors of more than 1 ulp, with a maximum error of 1.384 ulps. Fixed a syntax error and naming errors in pseudo-code in comments.	2005-11-13 00:08:23 +00:00
bde	1bfd712b60	Improved comments for the minimax polynomial. Removed an unused variable. Fixed some wrong comments and some nearby misformatting.	2005-11-12 20:06:04 +00:00
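
The expression grouping described in commit 4417000483 can be sketched in C roughly as follows. This is only an illustration, not the code from that commit: the function names tan_split() and tan_horner() are hypothetical, and the coefficients are the plain Taylor coefficients of tan(x) rather than the minimax coefficients the real __kernel_tandf() uses. All arithmetic is in double, which is the extra precision the message relies on when it regroups the final sum.

```c
#include <math.h>

/*
 * Taylor coefficients of tan(x) for x^3 .. x^13.
 * Assumption for illustration only: the real kernel uses minimax
 * coefficients fitted to its interval, not these.
 */
static const double T[] = {
    1.0 / 3.0,            /* x^3  */
    2.0 / 15.0,           /* x^5  */
    17.0 / 315.0,         /* x^7  */
    62.0 / 2835.0,        /* x^9  */
    1382.0 / 155925.0,    /* x^11 */
    21844.0 / 6081075.0   /* x^13 */
};

/* Lower terms summed as (odd-indexed terms) + (even-indexed terms). */
static double tan_split(double x)
{
    double z = x * x;     /* x^2 */
    double w = z * z;     /* x^4 */
    double s = z * x;     /* x^3 */

    /* Sum of the terms from x^5 upward, divided by x^5. */
    double r = (T[1] + w * (T[3] + w * T[5])) + z * (T[2] + w * T[4]);

    /* (first_term + second_term) + sum_of_lower_terms */
    return (x + s * T[0]) + (s * z) * r;
}

/* Plain Horner evaluation of the same polynomial, for comparison. */
static double tan_horner(double x)
{
    double z = x * x;

    return x + (z * x) * (T[0] + z * (T[1] + z * (T[2] + z * (T[3] +
        z * (T[4] + z * T[5])))));
}
```

The two independent parenthesized sums in tan_split() give the compiler and an out-of-order CPU shorter dependency chains than the single Horner chain, which is the scheduling effect the commit message refers to. Whether that actually saves cycles depends on the machine and compiler.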


10055 Commits