IDEAS ABOUT THINGS TO WORK ON

* mpq_cmp: Maybe the most sensible thing to do would be to multiply the,
  say, 4 most significant limbs of each operand and compare them.  If that
  is not sufficient, do the same for 8 limbs, etc.

* Write mpi, the Multiple Precision Interval Arithmetic layer.

* Write `mpX_eval' that takes lambda-like expressions and a list of
  operands.

* As a general rule, recognize special operand values in mpz and mpf, and
  use shortcuts for speed.  Examples: Recognize (small or all) 2^n in
  multiplication and division.  Recognize small bases in mpz_pow_ui.

* Implement lazy allocation?  mpz->d == 0 would mean no allocation made yet.

* Maybe store one-limb numbers according to Per Bothner's idea:

      struct {
        mp_ptr d;
        union {
          mp_limb val;   /* if (d == NULL). */
          mp_size size;  /* Length of data array, if (d != NULL). */
        } u;
      };

  Problem: We can't normalize to that format unless we free the space
  pointed to by d, and therefore small values will not be stored in a
  canonical way.

* Document complexity of all functions.

* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
  mpz_fits_signedint_p, etc.  (See the sketch below.)

  mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).

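  A minimal sketch of one such predicate, using only the documented
  mpz_cmp_si call (the real function could instead inspect the size field
  directly; the name follows the proposal above):

      #include <limits.h>
      #include <gmp.h>

      /* Sketch: Z fits a signed long iff LONG_MIN <= Z <= LONG_MAX. */
      int
      mpz_fits_signedlong_p (mpz_t z)
      {
        return mpz_cmp_si (z, LONG_MIN) >= 0
               && mpz_cmp_si (z, LONG_MAX) <= 0;
      }
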
* Better random number generators.  There should be fast (like mpz_random),
  very good (mpz_veryrandom), and special purpose (like mpz_random2).  Sizes
  in *bits*, not in limbs.

* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
  If the mpz_xinit routine remembers the address of the variable we could
  walk-and-mark the list of remembered variables, and free the space
  occupied by the remembered variables that didn't get marked.  Fairly
  standard.

* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
  etc, to call __umul_ppmm, __udiv_qrnnd.  A typical definition for
  umul_ppmm would be

      #define umul_ppmm(ph, pl, m0, m1) \
        { unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph; }

  In order to maintain just one version of longlong.h (gmp and gcc), this
  has to be done outside of longlong.h.

Bennet Yee at CMU proposes:
* mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
* A function mpfatal that is called for exceptions.  Let the user override
  a default definition.

* Make all computation mpz_* functions return a signed int indicating
  whether the result was zero, positive, or negative?

* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
  mpz_ldb, various bit string operations.  Also mpz_@_si for most @??

* Add macros for looping efficiently over a number's limbs (a sketch of a
  possible expansion follows):

      MPZ_LOOP_OVER_LIMBS_INCREASING (num, limb)
        { user code manipulating limb }
      MPZ_LOOP_OVER_LIMBS_DECREASING (num, limb)
        { user code manipulating limb }

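  One way the increasing variant might expand, assuming the classic layout
  with a size field (sign and limb count) and a d limb pointer, plus C99
  loop declarations; illustrative only, not a settled interface:

      /* Walk limb over num->d[0] .. num->d[n-1], least significant limb
         first.  The (..., 1) comma expression keeps the loop condition
         true while loading the next limb. */
      #define MPZ_LOOP_OVER_LIMBS_INCREASING(num, limb)                \
        for (mp_size __i = 0,                                          \
               __n = (num)->size >= 0 ? (num)->size : -(num)->size;    \
             __i < __n && ((limb) = (num)->d[__i], 1);                 \
             __i++)
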
Brian Beuning proposes:
  1. An array of small primes.
  2. A function to factor an mpz_t.  [How do we return the factors?  Maybe
     we just return one arbitrary factor?  In the latter case, we have to
     use a data structure that records the state of the factoring routine.]
  3. A routine to look for "small" divisors of an mpz_t (see the sketch
     below).
  4. A 'multiply mod n' routine based on Montgomery's algorithm.

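  A sketch of item 3, combining it with item 1's prime table; it assumes
  the mpz_tdiv_ui remainder call and abbreviates the table for
  illustration:

      #include <gmp.h>

      static const unsigned long small_primes[] =
        { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 /* ... */ };

      /* Return the first tabulated prime dividing z, or 0 if none does. */
      unsigned long
      small_divisor (mpz_t z)
      {
        size_t i;
        for (i = 0; i < sizeof small_primes / sizeof small_primes[0]; i++)
          if (mpz_tdiv_ui (z, small_primes[i]) == 0)
            return small_primes[i];
        return 0;
      }
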
Doug Lea proposes:
  1. A way to find out if an integer fits into a signed int, and if so, a
     way to convert it out.
  2. Similarly for double precision float conversion.
  3. A function to convert the ratio of two integers to a double.  This
     can be useful for mixed mode operations with integers, rationals, and
     doubles.

Elliptic curve method: description in the chapter `Algorithms in Number
Theory' in the Handbook of Theoretical Computer Science, Elsevier,
Amsterdam, 1990.  Also in Carl Pomerance's lecture notes on Cryptology and
Computational Number Theory, 1990.

* Harald Kirsh suggests:
    mpq_set_str (MP_RAT *r, char *numerator, char *denominator).

* New function: mpq_get_ifstr (int_str, frac_str, base,
  precision_in_some_way, rational_number).  Convert RATIONAL_NUMBER to a
  string in BASE and put the integer part in INT_STR and the fraction part
  in FRAC_STR.  (This function would do a division of the numerator and
  the denominator.)  A rough sketch follows.

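  Roughly what such a function might do internally, sketched with the
  mpq_numref/mpq_denref accessors; it assumes a non-negative rational, a
  digit count as the precision parameter, and caller-provided buffers
  (note it drops leading zeros of the fraction, which a real version
  would have to keep):

      #include <gmp.h>

      void
      mpq_get_ifstr (char *int_str, char *frac_str, int base,
                     int frac_digits, mpq_t q)
      {
        mpz_t ipart, rem, scaled;
        mpz_init (ipart); mpz_init (rem); mpz_init (scaled);
        /* Integer part and remainder: num = ipart * den + rem. */
        mpz_tdiv_qr (ipart, rem, mpq_numref (q), mpq_denref (q));
        mpz_get_str (int_str, base, ipart);
        /* Fraction part: floor (rem * base^frac_digits / den). */
        mpz_ui_pow_ui (scaled, base, frac_digits);
        mpz_mul (scaled, scaled, rem);
        mpz_tdiv_q (scaled, scaled, mpq_denref (q));
        mpz_get_str (frac_str, base, scaled);
        mpz_clear (ipart); mpz_clear (rem); mpz_clear (scaled);
      }
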
* Should mpz_powm* handle negative exponents?

* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
  little effect on the quotient.  Maybe we can assume it is 0, and
  compensate at a later stage?

* Better sqrt: First calculate the reciprocal square root, then multiply by
  the operand to get the square root.  The reciprocal square root can be
  obtained through Newton-Raphson without division.  To compute 1/sqrt(A),
  the iteration is

      x[i+1] = x[i] * (3 - A * x[i]^2) / 2.

  The final result can then be computed without division using

      sqrt(A) = A * x[n].

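  A toy double-precision illustration of that iteration (just to show the
  convergence; a real implementation would work limb-wise):

      #include <stdio.h>

      /* Newton-Raphson for 1/sqrt(A); no division in the loop (the /2.0
         is a one-bit shift in binary).  Quadratic convergence: the number
         of correct digits doubles every pass. */
      int
      main (void)
      {
        double A = 2.0;
        double x = 0.7;         /* rough seed for 1/sqrt(2) ~ 0.7071 */
        int i;

        for (i = 0; i < 6; i++)
          x = x * (3.0 - A * x * x) / 2.0;

        printf ("sqrt(%g) ~ %.15f\n", A, A * x);
        return 0;
      }
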
* Newton-Raphson using multiplication: We get twice as many correct digits
  in each iteration.  So if we square x(k) as part of the iteration, the
  result will have the leading digits in common with the entire result from
  iteration k-1.  A _mpn_mul_lowpart could help us take advantage of this.

* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
  a*b modulo p and the long long type is unavailable, then I can write

      typedef signed long slong;
      typedef unsigned long ulong;
      slong a, b, p, quot, rem;

      quot = (slong) (0.5 + (double) a * (double) b / (double) p);
      rem = (slong) ((ulong) a * (ulong) b - (ulong) p * (ulong) quot);
      if (rem < 0) { rem += p; quot--; }

  (Since a, b < p < 2^31, the double-precision value of a*b/p is accurate
  to much better than 1/2, so quot is the true quotient or one too large;
  the unsigned arithmetic is exact modulo the word size, and the final
  test repairs an overestimate.)

* Speed modulo arithmetic, using Montgomery's method or my pre-inversion
  method.  In either case, special arithmetic calls would be needed,
  mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
  functions.  Better yet: Write a new mpr layer.  (A one-limb sketch of
  the Montgomery reduction step follows.)

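  The core Montgomery reduction (REDC) step such calls would build on, as
  a single-limb sketch in modern fixed-width types; pinv = -p^-1 mod 2^32
  is assumed precomputed by the initialization function:

      #include <stdint.h>

      /* Reduce t < p * 2^32 to t * 2^-32 mod p, for odd p < 2^31
         (the bound keeps t + m*p from overflowing 64 bits). */
      static uint32_t
      redc (uint64_t t, uint32_t p, uint32_t pinv)
      {
        uint32_t m = (uint32_t) t * pinv;           /* m = -t/p mod 2^32 */
        uint64_t u = (t + (uint64_t) m * p) >> 32;  /* divisible by 2^32 */
        return u >= p ? (uint32_t) (u - p) : (uint32_t) u;
      }
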
* mpz_powm* should not use division to reduce the result in the loop, but
  instead pre-compute the reciprocal of the MOD argument and do reduced_val
  = val - val*reciprocal(MOD)*MOD, or use Montgomery's method.

* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s.

* It would be quite an important feature never to allocate more memory
  than is really necessary for a result.  Sometimes we can achieve this
  cheaply, by deferring reallocation until the result size is known.

* New macro in longlong.h: shift_rhl that extracts a word by shifting two
  words as a unit.  (Supported by i386, i860, HP-PA, POWER, 29k.)  Useful
  for shifting multiple precision numbers.  (A portable fallback is
  sketched below.)

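  A portable C fallback such a macro could default to when no machine
  instruction (e.g. SHRD on i386) is available; illustrative, and only
  valid for 0 < cnt < bits-per-word:

      /* res gets the word at bit offset cnt in the two-word
         quantity (high, low). */
      #define shift_rhl(res, high, low, cnt)                           \
        ((res) = ((low) >> (cnt))                                      \
                 | ((high) << (8 * sizeof (unsigned long) - (cnt))))
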
* The installation procedure should make a test run of multiplication to
  decide the threshold values for algorithm switching between the available
  methods.

* Fast output conversion of x to base B:
    1. Find n such that B^n > x.
    2. Set y to (x*2^m)/(B^n), where m is large enough to make 2^m ~~ B^n.
    3. Multiply the low half of y by B^(n/2), and recursively convert the
       result.  Truncate the low half of y and convert that recursively.
  Complexity: O(M(n)log(n))+O(D(n))!  (A simpler variant of the recursion
  is sketched below.)

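  The same divide-and-conquer shape, sketched with plain mpz division
  instead of the scaled-reciprocal trick above (hypothetical helper; a
  top-level caller would pass digits = mpz_sizeinbase (x, B) and strip a
  possible leading zero, since that count can be one too high):

      #include <stdio.h>
      #include <string.h>
      #include <gmp.h>

      /* Print non-negative x as exactly `digits' base-b digits,
         zero-padded, splitting around b^(digits/2) and recursing. */
      static void
      out_base (mpz_t x, int b, long digits)
      {
        if (digits <= 16)             /* small base case */
          {
            char buf[32];
            long i, len;
            mpz_get_str (buf, b, x);
            len = (long) strlen (buf);
            for (i = len; i < digits; i++)
              putchar ('0');
            fputs (buf, stdout);
            return;
          }
        long low = digits / 2;
        mpz_t q, r, pow;
        mpz_init (q); mpz_init (r); mpz_init (pow);
        mpz_ui_pow_ui (pow, b, low);  /* pow = b^low */
        mpz_tdiv_qr (q, r, x, pow);   /* x = q * b^low + r */
        out_base (q, b, digits - low);
        out_base (r, b, low);
        mpz_clear (q); mpz_clear (r); mpz_clear (pow);
      }
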
* Improve division using Newton-Raphson.  Check out "Newton Iteration and
  Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
  Morgan Kaufmann, 1993 ("beware of some errors"...).  The division-free
  iteration for the reciprocal is illustrated below.

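  For reference, the Newton iteration for a reciprocal 1/d uses only
  multiplication and subtraction, x[k+1] = x[k] * (2 - d * x[k]); a toy
  double-precision demo:

      #include <stdio.h>

      /* Each pass doubles the number of correct digits of 1/d. */
      int
      main (void)
      {
        double d = 7.0;
        double x = 0.1;         /* rough seed for 1/7 ~ 0.142857 */
        int k;

        for (k = 0; k < 6; k++)
          x = x * (2.0 - d * x);

        printf ("1/%g ~ %.15f\n", d, x);
        return 0;
      }
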
* Improve implementation of Karatsuba's algorithm.  For most operand sizes,
  we can reduce the number of operations by splitting differently.

* Faster multiplication: The best approach is to first implement Toom-Cook.
  People report that it beats Karatsuba's algorithm already at about 100
  limbs.  FFT would probably never beat a well-written Toom-Cook (not even
  for millions of bits).

FFT:
{
  * Multiplication could be done with Montgomery's method combined with
    the "three primes" method described in Lipson.  Maybe this would be
    faster than Nussbaumer's method with 3 (simple) moduli?

  * Maybe the modular tricks below are not needed: We are using very
    special numbers, Fermat numbers with a small base and a large exponent,
    and maybe it's possible to just subtract and add?

  * Modify Nussbaumer's convolution algorithm to use 3 words for each
    coefficient, calculating in 3 relatively prime moduli (e.g.
    0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer).  Both the
    arithmetic and the CRR would be very fast with such numbers.

  * Optimize the Schoenhage-Strassen multiplication algorithm.  Take
    advantage of the real-valued input to save half of the operations and
    half of the memory.  Use recursive FFT with large base cases, since
    recursive FFT has better memory locality.  A normal FFT gets 100%
    cache misses for large enough operands.

  * In the 3-prime convolution method, it might sometimes be a win to use
    2, 3, or 5 primes.  Imagine that using 3 primes would require a
    transform length of 2^n.  But 2 primes might still sometimes give us
    correct results with that same transform length, or 5 primes might
    allow us to decrease the transform size to 2^(n-1).

  To optimize floating-point based complex FFT we have to think of:

  1. The normal implementation accesses all input exactly once for each of
     the log(n) passes.  This means that we will get 0% cache hit when n >
     our cache.  Remedy: Reorganize computation to compute partial passes,
     maybe similar to a standard recursive FFT implementation.  Use a large
     `base case' to make any extra overhead of this organization negligible.

  2. Use base-4, base-8 and base-16 FFT instead of just radix-2.  This can
     reduce the number of operations by 2x.

  3. Inputs are real-valued.  According to Knuth's "Seminumerical
     Algorithms", exercise 4.6.4-14, we can save half the memory and half
     the operations if we take advantage of that.

  4. Maybe make it possible to write the innermost loop in assembly, since
     that could win us another 2x speedup.  (If we write our FFT to avoid
     cache misses (see #1 above) it might be logical to write the `base
     case' in assembly.)

  5. Avoid multiplication by 1, i, -1, -i.  Similarly, optimize
     multiplication by the odd 8th roots of unity, (+-sqrt(2)/2 +-
     i*sqrt(2)/2).

  6. Put as many bits as possible in each double (but don't waste time if
     that doesn't make the transform size become smaller).

  7. For n > some large number, we will get accuracy problems because of
     the limited precision of our floating point arithmetic.  This can
     easily be solved by using the Karatsuba trick a few times until our
     operands become small enough.

  8. Precompute the roots of unity and store them in a vector (see the
     sketch after this block).
}

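  A small illustration of item 8, computing the twiddle factors once up
  front (sketch only; a real implementation would fold in items 2 and 5):

      #include <math.h>
      #include <stdlib.h>

      /* Precompute w^k = e^(-2*pi*i*k/n) for k = 0 .. n/2-1, stored as
         interleaved (re, im) pairs; n/2 twiddles suffice for radix-2. */
      static double *
      make_twiddles (long n)
      {
        const double pi = 3.14159265358979323846;
        double *w = malloc (n * sizeof (double));   /* n/2 complex values */
        long k;

        for (k = 0; k < n / 2; k++)
          {
            w[2 * k]     = cos (2 * pi * k / n);
            w[2 * k + 1] = -sin (2 * pi * k / n);
          }
        return w;
      }
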
* When a division result is going to be just one limb (i.e. nsize-dsize is
  small), normalization could be done in the division loop.

* Never allocate temporary space for a source param that overlaps with a
  destination param needing reallocation.  Instead malloc a new block for
  the destination (and free the source before returning to the caller).

* Parallel addition.  Since each processor only has to tell the next
  processor that it is ready, we can use simplified synchronization and
  actually write it in C.  Each processor (apart from the least
  significant) does this:

      while (*svar != my_number)
        ;
      *svar = my_number + 1;

  The least significant processor does this:

      *svar = my_number + 1;   /* i.e., *svar = 1 */

  Before starting the addition, one processor has to store 0 in *svar.

  Other things to think about for parallel addition: To avoid false
  (cache-line) sharing, allocate blocks on cache-line boundaries.

Local Variables:
mode: text
fill-column: 77
fill-prefix: " "
version-control: never
End: