BEFORE RELEASE 2.0

* Never use 64 bits sizes for allocation and lengths, 32 is enough for some
  time to come.  Potential problem: We might want 64 bit exponenents for
  mpf_t.  Take care.

* Implement lazy allocation.  mpz->d == 0 means no allocation made yet.

* Maybe store one-limb numbers according to Per Bothner's idea:

    struct {
      mp_ptr d;
      union {
	mp_limb val;    /* if (data == NULL).  */
        mp_size size;   /* Length of data array, if (data != NULL).  */
      } u;
    };

  Problem: We can't normalize to that format unless we free the space
  pointed to by d.

* mpz_inp_str should handle 0x...  Should input routines skip white space?
  I think so.

* What should mpn_rshift return, most significant limb or the bits shifted
  out?  Right now, mpn_rshift and mpn_rshiftci work differently...!

* Enrich mpn layer: mpn_gcd, mpn_gcdext, mpn_get_str, mpn_set_str.

* Document complexity of all functions.

* `make check' is supposed to test the package.

* To make __MP_SMALL__ safe, make some symbol defined/referenced depending
  on the value of that flag.  That way we'll get linking errors if an
  attempt is made to link with inconsistency.

* No mpq input routines should normalize.  A new mpq_normalize routine
  should handle that, and it has to be called explicitly by the user.

* Users' request: mpq_set_zz (mpq, mpz_1, mpz_2).

* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
  mpz_fits_signedint_p, etc.

* mpzq_floor (mpz, mpq), mpzq_trunc (mpz, mpq), mpzq_round (mpz, mpq).

THINGS TO WORK ON

* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
  If the mpz_xinit routine remembers the address of the variable we could
  walk-and-mark the list of remembered variables, and free the space
  occupied by the remembered variables that didn't get marked.  Fairly
  standard.

* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
  etc, to call __umul_ppmm, __udiv_qrnnd.  A typical definition for
  umul_ppmm would be
  #define umul_ppmm(ph,pl,m0,m1) \
    {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}
  In order to maintain just one version of longlong.h (gmp and gcc), this
  has to be done outside of longlong.h.

Henrik Johansson proposes mpz_gcd_ui.

Bennet Yee at CMU proposes:
* mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
* A function mpfatal that is called for exceptions.  The user may override
  the default definition.

* mout should group in 10-digit groups.
* ASCII dependence?
* Error reporting from I/O functions (linkoping)?

* Make all computation mpz_* functions return a signed int indicating if the
  result was zero, positive, or negative?

* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
  mpz_ldb, various bit string operations like mpz_cntbits.  Also mpz_@_si
  for most @??

Brian Beuning proposes:
   1. An array of small primes
   3. A function to factor a mpz_t.  [How do we return the factors?  Maybe
      we just return one arbitrary factor?  In the latter case, we have to
      use a data structure that records the state of the factoring routine.]
   4. A routine to look for "small" divisors of an mpz_t
   5. A 'multiply mod n' routine based on Montgomery's algorithm.

Dough Lea proposes:
   1. A way to find out if an integer fits into a signed int, and if so, a
      way to convert it out.
   2. Similarly for double precision float conversion.
   3. A function to convert the ratio of two integers to a double.  This
      can be useful for mixed mode operations with integers, rationals, and
      doubles.
   5. Bit-setting, clearing, and testing operations, as in
	   mpz_setbit(MP_INT* dest, MP_INT* src, unsigned long bit_number),
       and used, for example in
	   mpz_setbit(x, x, 123)
       to directly set the 123rd bit of x.
       If these are supported, you don't first have to set up
       an otherwise unnecessary mpz holding a shifted value, then
       do an "or" operation.

Elliptic curve method description in the Chapter `Algorithms in Number
Theory' in the Handbook of Theoretical Computer Science, Elsevier,
Amsterdam, 1990.  Also in Carl Pomerance's lecture notes on Cryptology and
Computational Number Theory, 1990.

* New function: mpq_get_ifstr (int_str, frac_str, base,
  precision_in_som_way, rational_number).  Convert RATIONAL_NUMBER to a
  string in BASE and put the integer part in INT_STR and the fraction part
  in FRAC_STR.  (This function would do a division of the numerator and the
  denominator.)

* Should mpz_powm* handle negative exponents?

* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
  little effect on the quotient.  Maybe we can assume it is 0, and
  compensate at a later stage?

* Better sqrt: First calculate the reciprocal square root, then multiply by
  the operand to get the square root.  The reciprocal square root can be
  obtained through Newton-Raphson without division.  The iteration is x :=
  x*(3-a*x^2)/2, where a is the operand.

* Newton-Raphson using multiplication: We get twice as many correct digits
  in each iteration.  So if we square x(k) as part of the iteration, the
  result will have the leading digits in common with the entire result from
  iteration k-1.  A _mpn_mul_lowpart could implement this.

* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
  a*b modulo p and the long long type is unavailable, then I can write

	  typedef   signed long slong;	
	  typedef unsigned long ulong;
	  slong a, b, p, quot, rem;	

	  quot = (slong) (0.5 + (double)a * (double)b / (double)p);
	  rem =  (slong)((ulong)a * (ulong)b - (ulong)p * (ulong)quot);
	  if (rem < 0} {rem += p; quot--;}

FFT:
{
  * Multiplication could be done with Montgomery's method combined with
    the "three primes" method described in Lipson.  Maybe this would be
    faster than to Nussbaumer's method with 3 (simple) moduli?

  * Maybe the modular tricks below are not needed: We are using very
    special numbers, Fermat numbers with a small base and a large exponent,
    and maybe it's possible to just subtract and add?

  * Modify Nussbaumer's convolution algorithm, to use 3 words for each
    coefficient, calculating in 3 relatively prime moduli (e.g.
    0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer).  Both all
    operations and CRR would be very fast with such numbers.

  * Optimize the Schoenhage-Stassen multiplication algorithm.  Take
    advantage of the real valued input to save half of the operations and
    half of the memory.  Try recursive variants with large, optimized base
    cases.  Use recursive FFT with large base cases, since recursive FFT
    has better memory locality.  A normal FFT get 100% cache miss.

  To optimize floating-point based complex FFT we have to think of:

  1. The normal implementation accesses all input exactly once for each of
     the log(n) passes.  This means that we will get 0% cache hit when n >
     our cache.  Remedy: Reorganize computation to compute partial passes,
     maybe similar to a standard recursive FFT implementation.  Use a large
     `base case' to make any extra overhead of this organization negligible.

  2. Use base-4, base-8 and base-16 FFT instead of just radix-2.  This can
     reduce the number of operations by 2x.

  3. Inputs are real-valued.  According to Knuth's "Seminumerical
     Algorithms", exercise 4.6.4-14, we can save half the memory and half
     the operations if we take advantage of that.

  4. Maybe make it possible to write the innermost loop in assembler, since
     that could win us another 2x speedup.  (If we write our FFT to avoid
     cache-miss (see #1 above) it might be logical to write the `base case'
     in assembler.)

  5. Avoid multiplication by 1, i, -1, -i.  Similarly, optimize
     multiplication by (+-\/2 +- i\/2).

  6. Put as many bits as possible in each double (but don't waste time if
     that doesn't make the transform size become smaller).

  7. For n > some large number, we will get accuracy problems because of the
     limited precision of our floating point arithmetic.  This can easily be
     solved by using the Karatsuba trick a few times until our operands
     become small enough.

  8. Precompute the roots-of-unity and store them in a vector.
}

* Speed modulo arithmetic, using Montgomery's method or my pre-inversion
  method.  In either case, special arithmetic calls would be needed,
  mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
  functions.  Better yet: New a new mpr layer.

* mpz_powm* should not use division to reduce the result in the loop, but
  instead pre-compute the reciprocal of the MOD argument and do reduced_val
  = val-val*reciprocal(MOD)*MOD, or use Montgomery's method.

* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s

* It would be a quite important feature never to allocate more memory than
  really necessary for a result.  Sometimes we can achieve this cheaply, by
  deferring reallocation until the result size is known.

* New macro in longlong.h: shift_rhl that extracts a word by shifting two
  words as a unit.  (Supported by i386, i860, HP-PA, RS6000, 29k.)  Useful
  for shifting multiple precision numbers.

* The installation procedure should make a test run of multiplication to
  decide the threshold values for algorithm switching between the available
  methods.

* The gcd algorithm could probably be improved with a divide-and-conquer
  (DAC) approach.  At least the bulk of the operations should be done with
  single precision.

* Fast output conversion of x to base B:
    1. Find n, such that (B^n > x).
    2. Set y to (x*2^m)/(B^n), where m large enough to make 2^n ~~ B^n
    3. Multiply the low half of y by B^(n/2), and recursively convert the
       result.  Truncate the low half of y and convert that recursively.
  Complexity: O(M(n)log(n))+O(D(n))!

* Extensions for floating-point arithmetic.

* Improve special cases for division.

  1. When the divisor is just one word, normalization is not needed for
  most CPUs, and can be done in the division loop for CPUs that need
  normalization.

  2. Even when the result is going to be very small, (i.e. nsize-dsize is
  small) normalization should also be done in the division loop.

  To fix this, a new routine mpn_div_unnormalized is needed.

* Never allocate temporary space for a source param that overlaps with a
  destination param needing reallocation.  Instead malloc a new block for
  the destination (and free the source before returning to the caller).

* When any of the source operands overlap with the destination, mult (and
  other routines) slow down.  This is so because the need of temporary
  allocation (with alloca) and copying.  If a new destination were malloc'ed
  instead (and the overlapping source free'd before return) no copying would
  be needed.  Is GNU malloc quick enough to make this faster even for
  reasonably small operands?

* Parallel addition.  Since each processors have to tell it is ready to the
  next processor, we can use simplified synchronization, and actually write
  it in C:  For each processor (apart from the least significant):

	while (*svar != my_number)
	  ;
	*svar = my_number + 1;

  The least significant processor does this:

	*svar = my_number + 1;  /* i.e., *svar = 1 */

  Before starting the addition, one processor has to store 0 in *svar.

  Other things to think about for parallel addition: To avoid false
  (cache-line) sharing, allocate blocks on cache-line boundaries.


Local Variables:
mode: text
fill-column: 76
fill-prefix: "  "
version-control: never
End:
