<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file current as of 29 Jan 2014.  An up-to-date version is available at
  <a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
either already been taken care of, or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
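     A minimal sketch of how a test might force that, assuming the result
     operand <code>res</code> has already been used and so holds more than
     one limb (<code>_mpz_realloc</code> is the documented reallocation
     call):
<pre>
  /* shrink the result operand back to 1 limb so the operation under test
     has to go through reallocation when it grows again */
  _mpz_realloc (res, 1);
  mpz_add (res, res, op);    /* the call being exercised */
</pre>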
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann with, for instance, <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code>.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test if e1+e2 goes out of bounds.
</ul>



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size&lt;=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise only small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations,
     instead <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code>.
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
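     A minimal sketch of the suggested test (<code>count_leading_zeros</code>
     and <code>GMP_NUMB_HIGHBIT</code> are the existing longlong.h and
     gmp-impl.h macros; nails are ignored here):
<pre>
  /* ahigh is the most significant limb of the operand */
  if (ahigh &amp; GMP_NUMB_HIGHBIT)
    cnt = 0;                           /* already normalized */
  else
    count_leading_zeros (cnt, ahigh);  /* only hit the slow path when needed */
</pre>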
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> &lt;= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     doing what currently appears to be an <code>MPN_COPY_INCR</code> to
     reduce prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high&lt;divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)
  ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ?
     __gmp_tmp_increase (ROUND_UP (n)) : 0),
     current-&gt;point += ROUND_UP (n),
     current-&gt;point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, which could let both live in lshift.asm and omit rshift.asm on
     finding <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has other code?  Also,
     is it worth using for <code>BSWAP_LIMB</code> too, or would that mean a
     store and re-fetch?  Presumably that's what comes out in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
  <pre>
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	   ... </pre>
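     Completed for a 64-bit limb, the binary search might look like the
     following sketch (assuming x is a non-zero <code>mp_limb_t</code> and
     cnt counts from the most significant end):
  <pre>
	   cnt = 0;
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	   if ((x &gt;&gt; 56) == 0) { x &lt;&lt;=  8; cnt +=  8; }
	   if ((x &gt;&gt; 60) == 0) { x &lt;&lt;=  4; cnt +=  4; }
	   if ((x &gt;&gt; 62) == 0) { x &lt;&lt;=  2; cnt +=  2; }
	   if ((x &gt;&gt; 63) == 0) { cnt += 1; }
  </pre>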
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
     (in <code>&lt;intrinsics.h&gt;</code> like Cray), but it didn't seem to
     exist on some IRIX 6.5 systems tried.  If it does actually exist
     somewhere it would very likely be an improvement over a function call to
     umul.asm.
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base&lt;256.
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
     structure like <code>div_t</code> to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     <code>mp_limb_t</code>s in registers.
     <br>
     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
     they're returned in registers, except that draft SVR4 had said memory, so
     it'd be prudent to check which is done.  We can jam the compiler into the
     right mode if we know how, since all this is purely internal to libgmp.
     (gcc has an option, though of course gcc doesn't matter since we use
     inline asm there.)
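     As a sketch of the shape being suggested (the pair type and the function
     name here are hypothetical; <code>umul_ppmm</code> is the existing
     longlong.h macro):
<pre>
typedef struct { mp_limb_t hi, lo; } mp_limb_pair_t;

mp_limb_pair_t
mpn_umul_ppmm_pair (mp_limb_t u, mp_limb_t v)
{
  mp_limb_pair_t  r;
  umul_ppmm (r.hi, r.lo, u, v);   /* high and low limbs of u*v */
  return r;                       /* in registers on a suitable ABI */
}
</pre>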
</ul>

<h4>New Functionality</h4>
<ul>
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
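     For two coprime moduli the reconstruction is already straightforward
     with the documented <code>mpz</code> interface; a minimal sketch (the
     function name and interface are hypothetical):
<pre>
/* Find r with r = a1 (mod m1) and r = a2 (mod m2), gcd(m1,m2) = 1. */
void
crt2 (mpz_t r, const mpz_t a1, const mpz_t m1, const mpz_t a2, const mpz_t m2)
{
  mpz_t  inv, t;
  mpz_init (inv);
  mpz_init (t);
  mpz_invert (inv, m1, m2);     /* inv = m1^-1 mod m2 */
  mpz_sub (t, a2, a1);
  mpz_mul (t, t, inv);
  mpz_mod (t, t, m2);           /* t = (a2-a1)*inv mod m2 */
  mpz_mul (t, t, m1);
  mpz_add (r, a1, t);           /* r = a1 + m1*t */
  mpz_clear (inv);
  mpz_clear (t);
}
</pre>
     A built-in version would presumably handle an arbitrary number of
     moduli and could share intermediate products between calls.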
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals.
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
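     Purely as an illustration of the fixed-initializer idea, an application
     might then be able to write something like the following (the field
     layout and the shared dummy limb are assumptions of this sketch, not the
     current gmp.h):
<pre>
static mp_limb_t  dummy_limb = 0;
#define MPZ_INITIALIZER  {{ 0 /* alloc */, 0 /* size */, &amp;dummy_limb }}

mpz_t  global_counter = MPZ_INITIALIZER;   /* usable without mpz_init */
</pre>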
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.  A format which MPFR can use too
     would be good.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
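     For instance an application might then be able to write the following
     (a documented <code>GMP_LONG_LONG_LIMB</code> is the proposal here;
     gmp.h currently sets <code>_LONG_LONG_LIMB</code> internally, and the
     C99 <code>%llu</code> is used in this sketch):
<pre>
  mp_limb_t  limb;
  /* ... limb set somehow ... */
#ifdef GMP_LONG_LONG_LIMB
  printf ("%llu\n", (unsigned long long) limb);
#else
  printf ("%lu\n", (unsigned long) limb);
#endif
</pre>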
<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
     &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
     can easily enough be left to applications.
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
     It'd be nice if it worked for plain <code>int</code> etc too, not just
     <code>mpz_t</code> etc.
<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
     for both integer and float conversions.  A base either in the format
     string or as a parameter with <code>*</code> should be allowed.  Maybe
     <code>&amp;13b</code> (b for base) or something like that.
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
     by Pedro Gimeno.
<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
     searching towards the low end of an integer might match
     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
     Suggested by Roberto Bagnara.
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
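     At the <code>mpz</code> level the basic test is already easy to express;
     a minimal sketch using only documented calls (the function name is
     hypothetical, and a built-in version could avoid the temporary and
     report proper versus non-proper subsets):
<pre>
/* Nonzero if every set bit of a is also set in b. */
int
mpz_bit_subset_p (const mpz_t a, const mpz_t b)
{
  mpz_t  t;
  int    subset;
  mpz_init (t);
  mpz_and (t, a, b);                 /* t = a &amp; b */
  subset = (mpz_cmp (t, a) == 0);    /* subset iff (a &amp; b) == a */
  mpz_clear (t);
  return subset;
}
</pre>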
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
     Christensen.  Other <code>long double</code> routines might be desirable
     too, but <code>mpf</code> would be a start.
     <br>
     <code>long double</code> is an ANSI-ism, so everything involving it would
     need to be suppressed on a K&amp;R compiler.
     <br>
     There'd be some work to be done by <code>configure</code> to recognise
     the format in use; MPFR has a start on this.  Often <code>long
     double</code> is the same as <code>double</code>, which is easy but
     pretty pointless.  A single float format detector macro could look at
     <code>double</code> then <code>long double</code>.
     <br>
     Sometimes there's a compiler option for the size of a <code>long
     double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
     probably simplest to regard this as a compiler compatibility issue, and
     leave it to users or sysadmins to ensure application and library code is
     built the same.
<li> <code>mpz_sqrt_if_perfect_square</code>: When
     <code>mpz_perfect_square_p</code> does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
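     What such a function might look like, as a hedged sketch built on the
     existing <code>mpz_sqrtrem</code> (the name follows the suggestion;
     assumes n &gt;= 0):
<pre>
/* Return nonzero and set root = sqrt(n) when n is a perfect square. */
int
mpz_sqrt_if_perfect_square (mpz_t root, const mpz_t n)
{
  mpz_t  rem;
  int    is_square;
  mpz_init (rem);
  mpz_sqrtrem (root, rem, n);        /* root = floor(sqrt(n)), rem = n - root^2 */
  is_square = (mpz_sgn (rem) == 0);
  mpz_clear (rem);
  return is_square;
}
</pre>
     The point of a built-in version would of course be to reuse the work
     <code>mpz_perfect_square_p</code> already does rather than recompute it.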
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
     <code>long long</code>.  These would aid interoperability, though a
     mixture of GMP and <code>long long</code> would probably not be too
     common.  Since <code>long long</code> is not always available (it's in
     C99 and GCC though), disadvantages of using <code>long long</code> in
     libgmp.a would be
     <ul>
     <li> Library contents vary according to the build compiler.
     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
          application compiler could take the <code>long long</code>
          prototypes.
     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     </ul>
     It'd be possible to defer the need for <code>long long</code> to
     application compile time, by having something like
     <code>mpz_set_2ui</code> called with two halves of a <code>long
     long</code>.  Disadvantages of this would be,
     <ul>
     <li> Bigger code in the application, though perhaps not if a <code>long
          long</code> is normally passed as two halves anyway.
     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
          to be two function calls.
     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
          treatment of <code>-0x10..00</code> into applications (see
          <code>mpz_get_si</code> correctness above).
     <li> Although having libgmp.a independent of the build compiler is nice,
          it sort of sacrifices the capabilities of a good compiler to
          uniformity with inferior ones.
     </ul>
     Plain use of <code>long long</code> is probably the lesser evil, if only
     because it makes best use of gcc.  In fact perhaps it would suffice to
     guarantee <code>long long</code> conversions only when using GCC for both
     application and library.  That would cover free software, and we can
     worry about selected vendor compilers later.
     <br>
     In C++ the situation is probably clearer: we demand fairly recent C++, so
     <code>long long</code> should be available always.  We'd probably prefer
     to have the C and C++ the same in respect of <code>long long</code>
     support, but it would be possible to have it unconditionally in gmpxx.h,
     by some means or another.
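     As an illustration of the "two halves" flavour, unsigned conversions of
     this sort can already be built on the documented interface; a hedged
     sketch (assumes <code>unsigned long long</code> is at least 64 bits and
     an <code>unsigned long</code> can hold 32 bits of it at a time):
<pre>
void
mpz_set_ull (mpz_t z, unsigned long long v)
{
  mpz_set_ui (z, (unsigned long) (v &gt;&gt; 32));            /* high half */
  mpz_mul_2exp (z, z, 32);
  mpz_add_ui (z, z, (unsigned long) (v &amp; 0xffffffffUL));   /* low half */
}

unsigned long long
mpz_get_ull (const mpz_t z)       /* assumes 0 &lt;= z &lt; 2^64 */
{
  mpz_t               t;
  unsigned long long  r;
  mpz_init (t);
  mpz_tdiv_q_2exp (t, z, 32);
  r = (unsigned long long) mpz_get_ui (t) &lt;&lt; 32;        /* high half */
  mpz_tdiv_r_2exp (t, z, 32);
  r |= mpz_get_ui (t);                                       /* low half */
  mpz_clear (t);
  return r;
}
</pre>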
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
     Suggested by Alexander Kruppa.
</ul>


<h4>Configuration</h4>

<ul>
<li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
     will be "3-1307" in the current switch, but need to verify that.  (On
     OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do
     it ourselves for other systems.)
<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
     "pic always" and ends up running gcc twice with the same options.  This
     is wasteful, but harmless.  Perhaps a newer libtool will be better.
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
     which could be driven by an exact hppa cpu type.
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess appends "el" to indicate endianness, but
     <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
     feature test, examining the compiler output perhaps.  It might also be
     nice to merge the aix.m4 files into powerpc-defs.m4.
<li> config.m4 is generated only by the configure script, it won't be
     regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
     would work, but it might upset "make" to have things like <code>L$</code>
     get into the Makefiles through <code>AC_SUBST</code>.
     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
     careful m4 quoting the <code>changequote</code> calls might not be
     needed, which might free up the order in which things had to be output.
<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
     scheme.  Though we probably wouldn't be using its assembler support we
     could try to use those variables in compatible ways.
<li> <code>GMP_LDFLAGS</code> could probably be done with plain
     <code>LDFLAGS</code> already used by automake for all linking.  But with
     a bit of luck the next libtool will pass pretty much all
     <code>CFLAGS</code> through to the compiler when linking, making
     <code>GMP_LDFLAGS</code> unnecessary.
<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
     .S and .asm rules, but apparently that isn't completely portable (there's
     an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
     had problems, but perhaps the rules could be rewritten to use "foo.s" as
     the temporary, or to do a suitable "mv" of the result.  The only danger
     from using foo.s would be if a compile failed and the temporary foo.s
     then looked like the primary source.  Hopefully if the
     <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
     wouldn't happen.  Might need to check.
</ul>


<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>


<h4>C++</h4>
<ul>
<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
     identify whitespace.
     <br>
     <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
     rather than the C one.
     <br>
     Consider making these variant <code>mpz_set_str</code> etc forms
     available for <code>mpz_t</code> too, not just <code>mpz_class</code>
     etc.
<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
     <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
     Candidates for this would be,
     <ul>
     <li> <code>mpz_class(const char *)</code>, etc: since they're normally
          not fast anyway, and we can hide the exception <code>throw</code>.
     <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
          needed to get to the C conversion function.
     <li> <code>mpz_class string, char*</code> etc constructors: likewise to
          hide the throws and conversions.
     <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
          to <code>string</code> conversion and free.  Perhaps
          <code>mpz_get_str</code> can write directly into a
          <code>string</code>, to avoid copying.
          <br>
          Consider making such <code>string</code> returning variants
          available for use with plain <code>mpz_t</code> etc too.
     </ul>
</ul>

<h4>Miscellaneous</h4>
<ul>
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
     two solutions under just those restrictions.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
     time would be better still.)
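     The per-prime step would then be a straightforward test-and-remove, as
     in this sketch (<code>factor_found</code> stands in for whatever the
     demo does with a factor):
<pre>
  while (mpz_divisible_ui_p (n, p))
    {
      mpz_divexact_ui (n, n, p);   /* remove one factor of p */
      factor_found (p);
    }
</pre>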
74951c586b8Smrg<li> The various test programs use quite a bit of the main
75051c586b8Smrg     <code>libgmp</code>.  This establishes good cross-checks, but it might be
75151c586b8Smrg     better to use simple reference routines where possible.  Where it's not
75251c586b8Smrg     possible some attention could be paid to the order of the tests, so a
75351c586b8Smrg     <code>libgmp</code> routine is only used for tests once it seems to be
75451c586b8Smrg     good.
75551c586b8Smrg<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
75651c586b8Smrg     return to a previous k at certain sizes.  This arises basically due to
75751c586b8Smrg     the step effect caused by size multiples effectively used for each k.
75851c586b8Smrg     Looking at a graph makes it fairly clear.
75951c586b8Smrg<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
76051c586b8Smrg     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
76151c586b8Smrg     of <code>mpf_get_str</code> could be made which would better suit.
76251c586b8Smrg</ul>
76351c586b8Smrg
76451c586b8Smrg
76551c586b8Smrg<h4>Aids to Development</h4>
76651c586b8Smrg<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else, consistency checks like size&lt;=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space
     would be possible.  A more sophisticated scheme could track
     <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
     scheme probably wouldn't be reentrant, not without some help from the
     system.
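     <br>
     A minimal version of those consistency checks might look like the
     following sketch.  The macro name is made up, the field names are those
     of <code>__mpz_struct</code> in gmp.h, and <code>ASSERT</code> and
     <code>ABS</code> are the gmp-impl.h macros.
<pre>
/* Hypothetical validity-check macro; sketch only. */
#define ASSERT_MPZ_VALID(z)                                        \
  do {                                                             \
    ASSERT ((z)->_mp_d != NULL);                                   \
    ASSERT (ABS ((z)->_mp_size) &lt;= (z)->_mp_alloc);                \
    /* most significant limb of a non-zero value is non-zero */    \
    ASSERT ((z)->_mp_size == 0                                     \
            || (z)->_mp_d[ABS ((z)->_mp_size) - 1] != 0);          \
  } while (0)
</pre>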
<li> tune/time.c could try to determine at runtime whether
     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     <code>getrusage</code> doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
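     <br>
     A standalone sketch of the kind of runtime cross-check meant here (not
     tune/time.c code): spin for a while and compare the CPU time reported by
     <code>getrusage</code> against the elapsed time from
     <code>gettimeofday</code>; on an otherwise idle machine the two should
     roughly agree.
<pre>
#include &lt;stdio.h>
#include &lt;sys/time.h>
#include &lt;sys/resource.h>

static double
elapsed (void)
{
  struct timeval tv;
  gettimeofday (&amp;tv, NULL);
  return tv.tv_sec + tv.tv_usec / 1e6;
}

static double
cpu_time (void)
{
  struct rusage ru;
  getrusage (RUSAGE_SELF, &amp;ru);
  return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
}

int
main (void)
{
  double t0 = elapsed (), c0 = cpu_time ();
  volatile unsigned long x = 0;
  unsigned long i;

  for (i = 0; i &lt; 100000000; i++)   /* busy loop */
    x += i;

  printf ("gettimeofday %.3fs  getrusage %.3fs\n",
          elapsed () - t0, cpu_time () - c0);
  return 0;
}
</pre>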
<li> tune/time.c could choose the default precision based on the
     <code>speed_unittime</code> determined, independent of the method in use.
<li> Cray vector systems: CPU frequency could be determined from
     <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
     based.  Is this true for all Cray systems?  Some documentation or other
     confirmation would be welcome.
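     <br>
     Reading the value is trivial; whether the tick really is the cycle time
     is the open question.
<pre>
#include &lt;stdio.h>
#include &lt;unistd.h>

int
main (void)
{
  /* only equal to the CPU frequency if the clock tick is the cycle time */
  printf ("sysconf(_SC_CLK_TCK) = %ld ticks/second\n",
          sysconf (_SC_CLK_TCK));
  return 0;
}
</pre>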
</ul>


<h4>Documentation</h4>
<ul>
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
     produces.  It'd probably be possible to say at most one leading zero,
     which is what both it and <code>mpz_get_str</code> currently do.  But we
     want to be careful not to bind ourselves to something that might not suit
     another implementation.
<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
     etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
     It'd be good to show how to do this, but we'd either need to document
     <code>mpz_ptr</code> and friends, or perhaps fall back on something
     slightly nasty with <code>void*</code>.
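     <br>
     A sketch of the sort of example the manual could show follows.
     <code>mpz_ptr</code> is the pointer typedef from gmp.h (currently
     undocumented), and the function name here is made up.
<pre>
#include &lt;stdarg.h>
#include &lt;gmp.h>

/* Sum a NULL-terminated list of mpz_t arguments into r.  Sketch only. */
static void
mpz_va_sum (mpz_ptr r, ...)
{
  va_list ap;
  mpz_ptr p;

  mpz_set_ui (r, 0);
  va_start (ap, r);
  while ((p = va_arg (ap, mpz_ptr)) != NULL)   /* pointer type, not mpz_t */
    mpz_add (r, r, p);
  va_end (ap);
}

int
main (void)
{
  mpz_t a, b, r;
  mpz_init_set_ui (a, 3);
  mpz_init_set_ui (b, 4);
  mpz_init (r);
  mpz_va_sum (r, a, b, (mpz_ptr) NULL);   /* mpz_t arguments decay to pointers */
  gmp_printf ("%Zd\n", r);                /* prints 7 */
  mpz_clears (a, b, r, NULL);
  return 0;
}
</pre>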
</ul>


<h4>Bright Ideas</h4>

<p> The following may or may not be feasible, and aren't likely to get done in the
near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that the operations can be inlined even when
     using the system compiler.  Where the compiler has no suitable feature,
     make calls to stub functions.  Write such stub functions for as many
     machines as possible.
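     <br>
     A portable stub for one such operation, the double-limb multiply, might
     look like the following sketch (assuming no nail bits; the function name
     is made up, <code>mp_limb_t</code> and <code>GMP_LIMB_BITS</code> are
     from gmp.h).
<pre>
#include &lt;gmp.h>

#define HALF_BITS (GMP_LIMB_BITS / 2)
#define LOW_MASK  ((((mp_limb_t) 1) &lt;&lt; HALF_BITS) - 1)

/* hi:lo = a * b, using four half-limb products */
void
stub_umul_ppmm (mp_limb_t *hi, mp_limb_t *lo, mp_limb_t a, mp_limb_t b)
{
  mp_limb_t al = a &amp; LOW_MASK, ah = a >> HALF_BITS;
  mp_limb_t bl = b &amp; LOW_MASK, bh = b >> HALF_BITS;

  mp_limb_t p0 = al * bl;
  mp_limb_t p1 = al * bh;
  mp_limb_t p2 = ah * bl;
  mp_limb_t p3 = ah * bh;

  /* this sum cannot overflow a limb */
  mp_limb_t mid = p1 + (p0 >> HALF_BITS) + (p2 &amp; LOW_MASK);

  *lo = (mid &lt;&lt; HALF_BITS) | (p0 &amp; LOW_MASK);
  *hi = p3 + (mid >> HALF_BITS) + (p2 >> HALF_BITS);
}
</pre>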
<li> longlong.h could declare when it's using, or would like to use,
     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
<li> longlong.h could be built at configure time by concatenating or
     #including fragments from each directory in the mpn path.  This would
     select CPU specific macros the same way as CPU specific assembler code.
     Code used would no longer depend on cpp predefines, and the current
     nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> <code>mpq</code> functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers (denominator 1)
     or denominator-only values (numerator 1) might be expected to occur
     reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow, and it's probably worth
     handling the frequently occurring cases of 0 to 2 trailing zeros
     specially.
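     <br>
     In the internal-source style, where gmp-impl.h and longlong.h supply
     <code>ASSERT</code> and <code>count_trailing_zeros</code>, the
     special-casing might look like this sketch:
<pre>
#include "gmp.h"
#include "gmp-impl.h"
#include "longlong.h"

/* Trailing zeros of a non-zero limb, with the common small counts
   handled before the full count_trailing_zeros.  Sketch only. */
static int
ctz_limb (mp_limb_t n)
{
  int c;
  ASSERT (n != 0);
  if (n &amp; 1)
    return 0;                    /* about 1/2 of uniform inputs */
  if (n &amp; 2)
    return 1;                    /* about 1/4 */
  if (n &amp; 4)
    return 2;                    /* about 1/8 */
  count_trailing_zeros (c, n);   /* the general, possibly slow, case */
  return c;
}
</pre>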
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible
     to do size==0 tests before paying attention to the exponent.  The
     advantage is not needing to set exp in the various places a zero result
     can arise, which avoids some tedium but is otherwise perhaps not too
     important.  Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code>
     depend on exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
     ((malloc))</code> on this, though it's not clear it'd help much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so this would need a new autoconf test.  It can
     wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
     routines were opened up a bit, maybe that code could be shared.  When a
     copy needs to be done there's no carry limb to append for the add, and if
     the copy is non-empty there's no high zero limb for the sub.
</ul>


<h4>Old and Obsolete Stuff</h4>

<p> The following tasks apply to chips or systems that are old and/or obsolete.
It's unlikely anything will be done about them unless someone is actively using
them.

<ul>
<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
     <code>configure --nfp</code> but that option is gone now that autoconf is
     used.  The file could go somewhere suitable in the mpn search if any
     chips might benefit from it, though it's possible we don't currently
     differentiate enough exact cpu types to do this properly.
<li> VAX D and G format <code>double</code> floats are straightforward and
     could perhaps be handled directly in <code>__gmp_extract_double</code>
     and maybe in <code>mpn_get_d</code>, rather than falling back on the
     generic code.  (Both formats are detected by <code>configure</code>.)
</ul>


<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->