1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2<html>
3<head>
4  <title>GMP Itemized Development Tasks</title>
5  <link rel="shortcut icon" href="favicon.ico">
6  <link rel="stylesheet" href="gmp.css">
7  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
8</head>
9
10<center>
11  <h1>
12    GMP Itemized Development Tasks
13  </h1>
14</center>
15
16<font size=-1>
17<pre>
18Copyright 2000, 2001, 2002, 2003, 2004, 2006, 2008, 2009 Free Software
19Foundation, Inc.
20
21This file is part of the GNU MP Library.
22
23The GNU MP Library is free software; you can redistribute it and/or modify
24it under the terms of the GNU Lesser General Public License as published
25by the Free Software Foundation; either version 3 of the License, or (at
26your option) any later version.
27
28The GNU MP Library is distributed in the hope that it will be useful, but
29WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
30or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
31License for more details.
32
33You should have received a copy of the GNU Lesser General Public License
34along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
35</pre>
36</font>
37
38<hr>
39<!-- NB. timestamp updated automatically by emacs -->
40  This file current as of 28 Dec 2009.  An up-to-date version is available at
41  <a href="http://gmplib.org/tasks.html">http://gmplib.org/tasks.html</a>.
42  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.
43
44<p> These are itemized GMP development tasks.  Not all the tasks
45    listed here are suitable for volunteers, but many of them are.
46    Please see the <a href="projects.html">projects file</a> for more
47    sizeable projects.
48
49<p> CAUTION: This file needs updating.  Many of the tasks here have
50either already been taken care of, or have become irrelevant.
51
52<h4>Correctness and Completeness</h4>
53<ul>
54<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
55     by Patrick Pelissier.
56     <br>
57     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
58     need to be careful about changing it.  It used to be a define
59     applications had to set for long long limb systems, but that in
60     particular is no longer relevant now that it's established automatically.
61<li> The various reuse.c tests need to force reallocation by calling
62     <code>_mpz_realloc</code> with a small (1 limb) size.
63<li> One reuse case is missing from mpX/tests/reuse.c:
64     <code>mpz_XXX(a,a,a)</code>.
65<li> When printing <code>mpf_t</code> numbers with exponents &gt;2^53 on
66     machines with 64-bit <code>mp_exp_t</code>, the precision of
67     <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
68     <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
69     think seriously about using some sort of fixed-point integer value.
70     Avoiding unnecessary floating point is probably a good thing in general,
71     and it might be faster on some CPUs.
72<li> Make the string reading functions allow the `0x' prefix when the base is
73     explicitly 16.  They currently only allow that prefix when the base is
74     unspecified (zero).
75<li> <code>mpf_eq</code> is not always correct, when one operand is
76     1000000000... and the other operand is 0111111111..., i.e., extremely
77     close.  There is a special case in <code>mpf_sub</code> for this
78     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
79<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
80     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage like 123.456eX789X is accepted (and an exponent of 0 used), and
     overflow of a <code>long</code> is not detected.
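     A minimal sketch of the missing check, for a decimal exponent and using
     <code>strtol</code> (the helper name and calling convention here are just
     for illustration):
<pre>
#include &lt;errno.h&gt;
#include &lt;stdlib.h&gt;

/* Return 0 and store the exponent if the digits at expstart form a valid
   long with no trailing garbage, otherwise return -1. */
static int
parse_exponent (const char *expstart, long *exp)
{
  char *end;
  errno = 0;
  *exp = strtol (expstart, &amp;end, 10);
  if (end == expstart || *end != '\0' || errno == ERANGE)
    return -1;
  return 0;
}
</pre>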
84<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
85     the inputs, and in that respect doesn't implement the "infinite precision
86     followed by truncate" specified in the manual.
87<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
88     variables with pointers to <code>mpz_add</code> etc, which doesn't work
89     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange, perhaps
91     to a set of calls to a test function rather than iterating over an array.
92<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
93     a <code>size_t</code> can represent and raise some suitable exception,
94     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
95     somehow succeeds then an <code>abort</code>.  Various size overflows of
96     this kind are not handled gracefully, probably resulting in segvs.
97     <br>
98     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
99     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
100     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann with, for instance, <code>mpz_ui_pow_ui (x,
102     4UL, 1431655766UL)</code>.
103<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
104     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
105     how it's been in the past, but maybe dropping one bit would make it
106     easier to test if e1+e2 goes out of bounds.
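     For the overflow test itself, something like the following would do,
     assuming <code>mp_exp_t</code> is signed and that limits such as
     <code>MP_EXP_T_MAX</code> and <code>MP_EXP_T_MIN</code> were made
     available (they are not currently defined):
<pre>
/* Nonzero if e1 + e2 would overflow mp_exp_t. */
static int
exp_add_overflows (mp_exp_t e1, mp_exp_t e2)
{
  if (e2 &gt; 0 &amp;&amp; e1 &gt; MP_EXP_T_MAX - e2)
    return 1;
  if (e2 &lt; 0 &amp;&amp; e1 &lt; MP_EXP_T_MIN - e2)
    return 1;
  return 0;
}
</pre>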
107</ul>
108
109
110
111<h4>Machine Independent Optimization</h4>
112<ul>
113<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
114     limbs until the high limbs fail to give an ordering.  Reduce code size by
115     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
116     the end of one operand is reached (and then looking for a non-zero in the
117     rest of the other).
118<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
119     <code>mpn_lshift</code> for any size&lt;=prec means repeated
120     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
121     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise small.
123     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
124     to not increase the size.
125<li> <code>mpn_dc_sqrtrem</code>: Don't use <code>mpn_addmul_1</code> with
126     multiplier==2, instead either <code>mpn_addlsh1_n</code> when available,
127     or <code>mpn_lshift</code>+<code>mpn_add_n</code> if not.
128<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
129     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations,
130     instead <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code>.
131<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
132     <code>rp[0]</code> calculations, so the compiler needn't worry about
133     aliasing between <code>sp</code> and <code>rp</code>.
134<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
135     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The division "op1.den / gcd"
137     is done twice, where of course only once is necessary.  Reported by Larry
138     Lambe.
139<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
140     with high probability and in this case <code>modlimb_invert</code> could
141     be used to calculate the inverse just once for the two exact divisions
142     "op1.den / gcd" and "op2.den / gcd", rather than letting
143     <code>mpn_divexact_1</code> do it each time.  This would require a new
144     <code>mpn_preinv_divexact_1</code> interface.  Not sure if it'd be worth
145     the trouble.
146<li> <code>mpq_add</code>, <code>mpq_sub</code>: The use of
147     <code>mpz_mul(x,y,x)</code> causes temp allocation or copying in
148     <code>mpz_mul</code> which can probably be avoided.  A rewrite using
149     <code>mpn</code> might be best.
150<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
151     zero, instead check the high bit of the operand and avoid invoking
152     <code>count_leading_zeros</code>.  This is an optimization on all
153     machines, and significant on machines with slow
154     <code>count_leading_zeros</code>, though it's possible an already
155     normalized operand might not be encountered very often.
156<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> &lt;= 52 bits).
158     (Peter Montgomery has some ideas on this subject.)
159<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
160     products with fewer operations.
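     For reference, the usual half-limb decomposition is along the following
     lines (a portable sketch, not the actual longlong.h macro, and assuming
     no nail bits):
<pre>
static void
ref_umul_ppmm (mp_limb_t *hp, mp_limb_t *lp, mp_limb_t u, mp_limb_t v)
{
  int half = GMP_LIMB_BITS / 2;
  mp_limb_t lomask = ((mp_limb_t) 1 &lt;&lt; half) - 1;
  mp_limb_t ul = u &amp; lomask, uh = u &gt;&gt; half;
  mp_limb_t vl = v &amp; lomask, vh = v &gt;&gt; half;
  mp_limb_t x0 = ul * vl;
  mp_limb_t x1 = ul * vh;
  mp_limb_t x2 = uh * vl;
  mp_limb_t x3 = uh * vh;

  x1 += x0 &gt;&gt; half;   /* no overflow possible here */
  x1 += x2;                /* a carry from here belongs in the high limb */
  if (x1 &lt; x2)
    x3 += (mp_limb_t) 1 &lt;&lt; half;

  *hp = x3 + (x1 &gt;&gt; half);
  *lp = (x1 &lt;&lt; half) + (x0 &amp; lomask);
}
</pre>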
161<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
162     fast, especially for compile-time constants, but would make application
163     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
164     preventing the "lazy" allocation scheme below.
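     A sketch of such an inline, assuming at least one limb is always
     allocated and ignoring nails and long-long-limb complications:
<pre>
#define MPZ_SET_UI_INLINE(z, n)         \
  do {                                  \
    (z)-&gt;_mp_d[0] = (n);             \
    (z)-&gt;_mp_size = ((n) != 0);      \
  } while (0)
</pre>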
165<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
166     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
167     would be needed for the divide by zero test, unless that could be left to
168     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
169     provoke the right exception there if using mul-by-inverse).
170<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
171     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
172     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
173     since there's no apparent way to get <code>SHRT_MAX</code> with an
174     expression (since <code>short</code> and <code>unsigned short</code> can
175     be different sizes).
176<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
177     fast on one or two limb moduli, due to a lot of function call
178     overheads.  These could perhaps be handled as special cases.
179<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
180     algorithm selection, and the latter should use REDC.  Both could
181     change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
182<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
183     using the division method when they're small, since the REDC form of a
184     small multiplier is normally a full size product.  Probably would need a
185     new tuned parameter to say what size multiplier is "small", as a function
186     of the size of the modulus.
187<li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
188     this would mean for m=n*2^k doing mod n using REDC and an auxiliary
189     calculation mod 2^k, then putting them together at the end.
190<li> <code>mpn_gcd</code> might be able to be sped up on small to
191     moderate sizes by improving <code>find_a</code>, possibly just by
192     providing an alternate implementation for CPUs with slowish
193     <code>count_leading_zeros</code>.
194<li> Toom3 could use a low to high cache localized evaluate and interpolate.
195     The necessary <code>mpn_divexact_by3c</code> exists.
196<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
197     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
198     stripped to save work in later operations.
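     The stripping itself would be cheap, something like the following sketch
     (<code>p</code> and <code>size</code> stand for the result limbs and
     size; for <code>mpf</code> the exponent counts from the high end, so it's
     unaffected):
<pre>
mp_size_t zeros = 0;
while (zeros &lt; size &amp;&amp; p[zeros] == 0)
  zeros++;
if (zeros != 0)
  {
    MPN_COPY_INCR (p, p + zeros, size - zeros);
    size -= zeros;
  }
</pre>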
199<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
200     use <code>mpn_and_n</code> etc for the benefit of the small number of
201     targets with native versions of those routines.  Need to be careful not to
202     pass size==0.  Is some code sharing possible between the <code>mpz</code>
203     routines?
204<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
205     unless it's really necessary (currently only sizes are tested, not
206     whether r really is u or v).
207<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case; currently it
     looks like an <code>MPN_COPY_INCR</code> will be done to
210     reduce prec+1 limbs to prec.
211<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
212     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
213<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
214     there's no overlap and the division can be called on those operands.
215     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
216     and not prec+1 limbs, then move the available limbs up to prec+1 and do
217     an in-place there.
218<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
219     determined by testing the dividend for high&lt;divisor.  When non-zero, the
220     division can be done on prec dividend limbs instead of prec+1.  The result
221     size is also known before the division, so that can be a tail call (once
222     the <code>TMP_ALLOC</code> is eliminated).
223<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
224     shift the dividend on-the-fly, since this should cost nothing on
225     superscalar processors and avoid the need for temporary copying in
226     <code>mpn_tdiv_qr</code>.
227<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
228     zeros, then there's no need for the tp temporary.
229<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
230     <code>num2*den1</code> products limb-by-limb from high to low and look at
231     each step for values differing by more than the possible carry bit from
232     the uncalculated portion.
233<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
234     and compare.  The benefits of karatsuba and higher multiplication
235     algorithms are lost, but if it's assumed only a few high limbs will be
236     needed to determine an order then that's fine.
237<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
238     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
239     instead of the functions, so they get inlined on all compilers, not just
240     gcc and others with <code>inline</code> recognised in gmp.h.
241     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
242     inline <code>mpn_add_1</code> etc and if they don't come out good for
243     internal uses then special forms can be introduced, for instance many
244     internal uses are in-place.  Sometimes a block of code is executed based
245     on the carry-out, rather than using it arithmetically, and those places
246     might want to do their own loops entirely.
247<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
248     bitfield for the mantissa extraction, not two, when endianness permits.
249     Might depend on the compiler allowing <code>long long</code> bit fields
250     when that's the only actual 64-bit type.
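     The idea, sketched for a 64-bit little-endian IEEE system with a compiler
     accepting 64-bit bit-fields:
<pre>
union double_extract64
{
  double d;
  struct {
    unsigned long long manl : 52;   /* whole mantissa in one field */
    unsigned long long exp  : 11;
    unsigned long long sig  : 1;
  } s;
};
</pre>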
251<li> tal-notreent.c could keep a block of memory permanently allocated.
252     Currently the last nested <code>TMP_FREE</code> releases all memory, so
253     there's an allocate and free every time a top-level function using
254     <code>TMP</code> is called.  Would need
255     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
256     any cached memory when changing allocation functions though.
257<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
258     inlined.  If the current chunk has enough room then a couple of pointers
259     can be updated.  Only if more space is required then a call to some sort
260     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
261     <code>TMP_ALLOC</code> is an expression might make the implementation a
262     bit ugly and/or a bit sub-optimal.
263<pre>
#define TMP_ALLOC(n)                                        \
  ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ?    \
     __gmp_tmp_increase (ROUND_UP (n)) : 0),                \
     current-&gt;point += ROUND_UP (n),                     \
     current-&gt;point - ROUND_UP (n))
269</pre>
270<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
271     never used.  Perhaps the table should just go up to base 16, and have
272     code to generate data above that, if and when required.  Naturally this
273     assumes the code would be smaller than the data saved.
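     Generating the integer fields at run time would be easy enough, for
     instance as in the sketch below (names illustrative); the delicate part
     is <code>chars_per_bit_exactly</code>.
<pre>
/* Largest power of `base' fitting in one limb, and how many digits
   that power represents. */
static void
generate_big_base (int base, mp_limb_t *big_base, int *chars_per_limb)
{
  mp_limb_t b = base;
  int n = 1;
  while (b &lt;= GMP_NUMB_MAX / base)
    {
      b *= base;
      n++;
    }
  *big_base = b;
  *chars_per_limb = n;
}
</pre>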
274<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
275     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
276     otherwise, to save space.
277<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
278     are of course fast, it seems a little silly to make a second pass over
279     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
280     that with the bit extractions.
281<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
282     A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
283     deriving S from that.  Perhaps it'd be possible to arrange to get S in
284     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
285     might only be an advantage if A and B are about the same size.
286<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
287     powers of 2, but it's perhaps a bit too complicated for what it gains.
288     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
289     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
290     supporting <code>mpn_mul_2</code>.)
291     <br>
292     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
293     confined to single limb operands for simplicity and since that's where
294     the greatest gain would be.
295     <br>
296     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
297     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
298     <code>mpz_t</code> is that its callers leave it to make a good estimate
299     of the result size.  Callers of <code>mpn_pow_1</code> already know the
300     size by separate means (<code>mp_bases</code>).
301<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
302</ul>
303
304
305<h4>Machine Dependent Optimization</h4>
306<ul>
307<li> <code>invert_limb</code> on various processors might benefit from the
308     little Newton iteration done for alpha and ia64.
309<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
310     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
311     a separate <code>lshift</code> and <code>add_n</code> at
312     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
313     code could beat both.
314<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
315     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
316<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
317     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
318     multiplies and floating-point multiplies.  For the floating-point
319     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better into four 16-bit chunks.  Probably possible
321     to reach 9 cycles/limb.
322<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
323     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
324     the corresponding CIX <code>ct</code> instructions, and
325     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
326     information about scheduling etc than the <code>asm</code> blocks
327     currently used in longlong.h and gmp-impl.h.
328<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
329     making <code>configure</code> choose the slower
330     <code>malloc-reentrant</code> allocation method.  Is there a better way?
331     Maybe variable-length arrays per notes below.
332<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
333     with garbage.  Does the code get the intended slotting required for the
334     claimed speeds?  <code>.align</code> at the start of a function would
335     presumably be safe no matter how it pads.
336<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
337     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
338     since then gcc knows it's "predicable".
339<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
340     used instead of an <code>asm</code> block.  The builtin should give gcc
341     more opportunities for scheduling, bundling and predication.
342     <code>__builtin_ctz</code> similarly (it just uses popcount as per
343     current longlong.h).
344<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
345     for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
346     much this can improve the speed, though, since the symmetry that we rely
347     on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
348     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
349<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
350     <code>mpn_addmul_1</code>.
351<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
352     "<code>mulx</code>"s either with an asm block or via the generic C code is
353     about 90 cycles.  Try using fp operations, and also try using karatsuba
354     for just three "<code>mulx</code>"s.
355<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
356     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
357<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
358<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
359     possible (see commented out code in longlong.h).  This is unlikely to
360     save more than a couple of cycles, so perhaps isn't worth bothering with.
361<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
362     or anything to indicate V9 support when -mcpu=v9 is selected.  See
363     gcc/config/sol2-sld-64.h.  Will need to pass something through from
364     ./configure to select the right code in longlong.h.  (Currently nothing
365     is lost because <code>mulx</code> for multiplying is commented out.)
366<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
367     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
368     64-bits at a time from the dividend, as per the 32-bit divisor case in
369     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
370     64-bit registers (<code>%gN</code>) are not available from C.
371<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
372     using <code>mulx</code>, in assembler.  This would be the same as for
373     sparc64.
374<li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
375     masking down to just the useful bits at each point in the calculation,
376     since <code>mulx</code> speed depends on the highest bit set.  Either
377     explicit masks or small types like <code>short</code> and
378     <code>int</code> ought to work.
379<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
380     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
381     it as <code>sparchalr1</code> or something in configure / config.sub /
382     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
383     commented out code).  <code>count_trailing_zeros</code> could use it too.
384<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
385     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
386     should be possible to saturate the cache, which will happen at 8
387     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
388     it should be possible to make them run at about 5 cycles/limb.
389<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
390     the powerpc32 is used, but only because it's the default for
391     <code>powerpc*</code>.
392<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
393     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
394     possibly two floating-point and one integer limb per loop.  Split operands
395     into four 16-bit chunks for fast fp operations.  Should easily reach 9
396     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
397     (using one int + two fp).
398<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
399     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
400     share source code, or with a register to control the loop direction
401     perhaps even share object code.
402<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
403     for important machines.  Helping the generic sqr_basecase.c with an
404     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
405<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
406     Will bring time from 1.75 to 1.25 cycles/limb.
407<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
408     Pentium code.)
409<li> X86: Good authority has it that in the past an inline <code>rep
410     movs</code> would upset GCC register allocation for the whole function.
411     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
412     <code>__builtin_memcpy</code>.  Examine the code for some simple and
413     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable; it'd be both smaller and faster.
415<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
416     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
417     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
418<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
419     might benefit from some destination prefetching.
420<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
421     mul-by-inverse, hoping for maybe 30 c/l.
422<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
423     do something branch-free for unaligned startups, and shaving one insn
424     from the loop with alternative indexing might save a cycle.
425<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
426     The pipeline is now extremely deep, perhaps unnecessarily deep.
427<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
428<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
429     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method", to avoid carry propagation, splitting one of the operands into
431     11-bit chunks.
432<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
433     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
434     special lshift by 1.  This would be best as a jump across to the other
435     routine, could let both live in lshift.asm and omit rshift.asm on finding
436     <code>mpn_rshift</code> already provided.
437<li> Cray T3E: Experiment with optimization options.  In particular,
438     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
439<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
441<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
442     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
443     and could be inlined to avoid function calls.
444<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
445     scheme.  Not sure why, maybe they merely give the compiler more
446     information about aliasing (or the lack thereof).  Would like to modify
447     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
448     blocks wanted unconditionally are easy enough, those wanted only
449     sometimes are a problem.  Perhaps a special size calculation to ask for a
450     dummy length 1 when unwanted, or perhaps an inlined subroutine
451     duplicating code under each conditional.  Don't really want to turn
452     everything into a dog's dinner just because Cray don't offer an
453     <code>alloca</code>.
454<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
455     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
456     limb might prevent it.  Perhaps special cases for binary, octal and hex
457     would be worthwhile (very possibly for all processors too).
458<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
459     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has
461     other code?  Also, is it worth using for <code>BSWAP_LIMB</code> too, or
462     would that mean a store and re-fetch?  Presumably that's what comes out
463     in glibc.
464<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
465  <pre>
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
468	   ... </pre>
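     Spelled out in full, a plain C sketch (assuming x is nonzero) would be:
  <pre>
static int
clz64 (mp_limb_t x)
{
  int cnt = 0;
  if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
  if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
  if ((x &gt;&gt; 56) == 0) { x &lt;&lt;= 8;  cnt += 8; }
  if ((x &gt;&gt; 60) == 0) { x &lt;&lt;= 4;  cnt += 4; }
  if ((x &gt;&gt; 62) == 0) { x &lt;&lt;= 2;  cnt += 2; }
  if ((x &gt;&gt; 63) == 0) { cnt += 1; }
  return cnt;
} </pre>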
469<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
470     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
471     to identify suitable versions of that compiler?
472<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
473     (in <code>&lt;intrinsics.h&gt;</code> like Cray), but it didn't seem to
474     exist on some IRIX 6.5 systems tried.  If it does actually exist
475     somewhere it would very likely be an improvement over a function call to
476     umul.asm.
477<li> <code>mpn_get_str</code> final divisions by the base with
478     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
479     on suitable machines.  This ends up happening for decimal by presenting
480     the compiler with a run-time constant, but the same for other bases would
481     be good.  Perhaps use could be made of the fact base&lt;256.
482<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
483     structure like <code>div_t</code> to avoid going through memory, in
484     particular helping RISCs that don't do store-to-load forwarding.  Clearly
485     this is only possible if the ABI returns a structure of two
486     <code>mp_limb_t</code>s in registers.
487     <br>
488     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
489     they're returned in registers, except that draft SVR4 had said memory, so
490     it'd be prudent to check which is done.  We can jam the compiler into the
491     right mode if we know how, since all this is purely internal to libgmp.
492     (gcc has an option, though of course gcc doesn't matter since we use
493     inline asm there.)
494</ul>
495
496<h4>New Functionality</h4>
497<ul>
498<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
499<li> Let `0b' and `0B' mean binary input everywhere.
500<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
501     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
502     <code>_mpz_realloc</code> do the initial alloc.  Set
503     <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
505     this.
506     <br>
507     The advantages of the lazy scheme would be:
508     <ul>
509     <li> Initial allocate would be the size required for the first value
510          stored, rather than getting 1 limb in <code>mpz_init</code> and then
511          more or less immediately reallocating.
512     <li> <code>mpz_init</code> would only store magic values in the
513          <code>mpz_t</code> fields, and could be inlined.
514     <li> A fixed initializer could even be used by applications, like
515          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
516          for globals.
517     </ul>
518     The advantages of the current scheme are:
519     <ul>
520     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
521          size allocated and can just store unconditionally.
522     <li> <code>mpz_set_ui</code> and perhaps others like
523          <code>mpz_tdiv_r_ui</code> and a prospective
524          <code>mpz_set_ull</code> could be inlined.
525     </ul>
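     Under the lazy scheme the fixed initializer might look something like the
     following (purely illustrative; the dummy limb is an assumption, and the
     field order shown is the current <code>mpz_t</code> layout):
<pre>
extern mp_limb_t __gmp_dummy_limb;   /* hypothetical, always readable */

#define MPZ_INITIALIZER  {{ 0 /* alloc */, 0 /* size */, &amp;__gmp_dummy_limb }}

mpz_t z = MPZ_INITIALIZER;           /* usable without mpz_init */
</pre>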
526<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
527     format is portable between 32-bit and 64-bit machines, and between
528     little-endian and big-endian machines.  A format which MPFR can use too
529     would be good.
530<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
531     logops and copys available in gmp.h, either as library functions or
532     inlines, with the availability of library functions instantiated in the
533     generated gmp.h at build time.
534<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
535     null-terminators.
536<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
537     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
538     if they could share code with the current such functions (which should be
539     possible).
540<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
542<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
543     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
544     decimal in this case, with a further 0x, 0b etc allowed there.
545     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
546     probably something like "0.123" ought not mean octal.
547<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
548     feature of gmp.h, so applications could know whether to
549     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
550<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
551     &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
552     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
553     can easily enough be left to applications.
554<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
555     It'd be nice if it worked for plain <code>int</code> etc too, not just
556     <code>mpz_t</code> etc.
557<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
558     for both integer and float conversions.  A base either in the format
559     string or as a parameter with <code>*</code> should be allowed.  Maybe
560     <code>&amp;13b</code> (b for base) or something like that.
561<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
562     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
563     convenience, but still might be useful.  Rounding would be the same as
564     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
565     actually documented).  Alternately, perhaps a separate
566     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
567     by Pedro Gimeno.
568<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
569     searching towards the low end of an integer might match
570     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
571     Suggested by Roberto Bagnara.
572<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
573     bitwise subset of another might be of use.  Some sort of return value
574     indicating whether it's a proper or non-proper subset would be good and
575     wouldn't cost anything in the implementation.  Suggested by Roberto
576     Bagnara.
577<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
578     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
579     Christensen.  Other <code>long double</code> routines might be desirable
580     too, but <code>mpf</code> would be a start.
581     <br>
582     <code>long double</code> is an ANSI-ism, so everything involving it would
583     need to be suppressed on a K&amp;R compiler.
584     <br>
585     There'd be some work to be done by <code>configure</code> to recognise
586     the format in use, MPFR has a start on this.  Often <code>long
587     double</code> is the same as <code>double</code>, which is easy but
588     pretty pointless.  A single float format detector macro could look at
     <code>double</code> and then <code>long double</code>.
590     <br>
591     Sometimes there's a compiler option for the size of a <code>long
592     double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
593     probably simplest to regard this as a compiler compatibility issue, and
594     leave it to users or sysadmins to ensure application and library code is
595     built the same.
596<li> <code>mpz_sqrt_if_perfect_square</code>: When
597     <code>mpz_perfect_square_p</code> does its tests it calculates a square
598     root and then discards it.  For some applications it might be useful to
599     return that root.  Suggested by Jason Moxham.
600<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
602     <code>long long</code>.  These would aid interoperability, though a
603     mixture of GMP and <code>long long</code> would probably not be too
604     common.  Since <code>long long</code> is not always available (it's in
605     C99 and GCC though), disadvantages of using <code>long long</code> in
606     libgmp.a would be
607     <ul>
608     <li> Library contents vary according to the build compiler.
609     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
610          application compiler could take the <code>long long</code>
611          prototypes.
612     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
613          indicate whether the functions are available.  (Applications using
614          autoconf could probe the library too.)
615     </ul>
616     It'd be possible to defer the need for <code>long long</code> to
617     application compile time, by having something like
618     <code>mpz_set_2ui</code> called with two halves of a <code>long
619     long</code>.  Disadvantages of this would be,
620     <ul>
621     <li> Bigger code in the application, though perhaps not if a <code>long
622          long</code> is normally passed as two halves anyway.
623     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
624          to be two function calls.
625     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
626          treatment of <code>-0x10..00</code> into applications (see
627          <code>mpz_get_si</code> correctness above).
628     <li> Although having libgmp.a independent of the build compiler is nice,
629          it sort of sacrifices the capabilities of a good compiler to
630          uniformity with inferior ones.
631     </ul>
632     Plain use of <code>long long</code> is probably the lesser evil, if only
633     because it makes best use of gcc.  In fact perhaps it would suffice to
634     guarantee <code>long long</code> conversions only when using GCC for both
635     application and library.  That would cover free software, and we can
636     worry about selected vendor compilers later.
637     <br>
638     In C++ the situation is probably clearer, we demand fairly recent C++ so
639     <code>long long</code> should be available always.  We'd probably prefer
640     to have the C and C++ the same in respect of <code>long long</code>
641     support, but it would be possible to have it unconditionally in gmpxx.h,
642     by some means or another.
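     For reference, an application can already get the effect from existing
     calls; a prospective <code>mpz_set_2ui</code> would essentially package
     this up (sketch, assuming a 64-bit <code>long long</code>):
<pre>
static void
set_ull (mpz_t z, unsigned long long v)
{
  mpz_set_ui (z, (unsigned long) (v &gt;&gt; 32));
  mpz_mul_2exp (z, z, 32);
  mpz_add_ui (z, z, (unsigned long) (v &amp; 0xffffffff));
}
</pre>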
643<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
644     Suggested by Alexander Kruppa.
645</ul>
646
647
648<h4>Configuration</h4>
649
650<ul>
651<li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
652     will be "3-1307" in the current switch, but need to verify that.  (On
653     OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do
654     it ourselves for other systems.)
655<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
656     "pic always" and ends up running gcc twice with the same options.  This
657     is wasteful, but harmless.  Perhaps a newer libtool will be better.
658<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
659     but is that available only for M series chips or some such?  Perhaps it
660     should be configured in some way.
661<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
662<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
663     which could be driven by an exact hppa cpu type.
664<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
665     "hinv -c processor" gives lots of information on Irix.  Standard
666     config.guess appends "el" to indicate endianness, but
667     <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
668<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
669     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
670     feature test, examining the compiler output perhaps.  It might also be
671     nice to merge the aix.m4 files into powerpc-defs.m4.
672<li> config.m4 is generated only by the configure script, it won't be
673     regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
674     would work, but it might upset "make" to have things like <code>L$</code>
675     get into the Makefiles through <code>AC_SUBST</code>.
676     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
677     careful m4 quoting the <code>changequote</code> calls might not be
678     needed, which might free up the order in which things had to be output.
679<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
680     scheme.  Though we probably wouldn't be using its assembler support we
681     could try to use those variables in compatible ways.
682<li> <code>GMP_LDFLAGS</code> could probably be done with plain
683     <code>LDFLAGS</code> already used by automake for all linking.  But with
684     a bit of luck the next libtool will pass pretty much all
685     <code>CFLAGS</code> through to the compiler when linking, making
686     <code>GMP_LDFLAGS</code> unnecessary.
687<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
688     .S and .asm rules, but apparently that isn't completely portable (there's
689     an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
690     had problems, but perhaps the rules could be rewritten to use "foo.s" as
691     the temporary, or to do a suitable "mv" of the result.  The only danger
692     from using foo.s would be if a compile failed and the temporary foo.s
693     then looked like the primary source.  Hopefully if the
694     <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
695     wouldn't happen.  Might need to check.
696</ul>
697
698
699<h4>Random Numbers</h4>
700<ul>
701<li> <code>_gmp_rand</code> is not particularly fast on the linear
702     congruential algorithm and could stand various improvements.
703     <ul>
704     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
705          <code>_mp_algdata</code> rather) to save some copying.
706     <li> Make a special case for a single limb <code>2exp</code> modulus, to
707          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
708     <li> Inline the <code>lc</code> code, to avoid a function call and
709          <code>TMP_ALLOC</code> for every chunk.
710     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
711          for clarity (if the general case is retained).
712     </ul>
713<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
714     <code>gmp_randstate_ptr</code> the same as other types.
715<li> Some of the empirical randomness tests could be included in a "make
716     check".  They ought to work everywhere, for a given seed at least.
717</ul>
718
719
720<h4>C++</h4>
721<ul>
722<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
723     identify whitespace.
724     <br>
725     <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
726     rather than the C one.
727     <br>
728     Consider making these variant <code>mpz_set_str</code> etc forms
729     available for <code>mpz_t</code> too, not just <code>mpz_class</code>
730     etc.
<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
732     <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
733<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
734     Candidates for this would be,
735     <ul>
736     <li> <code>mpz_class(const char *)</code>, etc: since they're normally
737          not fast anyway, and we can hide the exception <code>throw</code>.
738     <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
739          needed to get to the C conversion function.
740     <li> <code>mpz_class string, char*</code> etc constructors: likewise to
741          hide the throws and conversions.
742     <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
743          to <code>string</code> conversion and free.  Perhaps
744          <code>mpz_get_str</code> can write directly into a
745          <code>string</code>, to avoid copying.
746          <br>
747          Consider making such <code>string</code> returning variants
748          available for use with plain <code>mpz_t</code> etc too.
749     </ul>
750</ul>
751
752<h4>Miscellaneous</h4>
753<ul>
754<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
755     what range of values the generated cofactors can take, and preferably
756     ensure the definition uniquely specifies the cofactors for given inputs.
757     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
759     two solutions under just those restrictions.
760<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
761     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
762     time would be better still.)
763<li> The various test programs use quite a bit of the main
764     <code>libgmp</code>.  This establishes good cross-checks, but it might be
765     better to use simple reference routines where possible.  Where it's not
766     possible some attention could be paid to the order of the tests, so a
767     <code>libgmp</code> routine is only used for tests once it seems to be
768     good.
769<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
770     return to a previous k at certain sizes.  This arises basically due to
771     the step effect caused by size multiples effectively used for each k.
772     Looking at a graph makes it fairly clear.
773<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
774     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
775     of <code>mpf_get_str</code> could be made which would better suit.
776</ul>
777
778
779<h4>Aids to Development</h4>
780<ul>
781<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
782     function to check the validity of each <code>mp?_t</code> parameter, in
783     particular to check they've been <code>mp?_init</code>ed.  This might
784     catch elementary mistakes in user programs.  Care would need to be taken
785     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
786     else then consistency checks like size&lt;=alloc, ptr not
787     <code>NULL</code> and ptr+size not wrapping around the address space,
788     would be possible.  A more sophisticated scheme could track
789     <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
790     scheme probably wouldn't be reentrant, not without some help from the
791     system.
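     The cheap consistency checks might amount to something like the
     following, using the gmp-impl.h <code>ASSERT</code> (the macro name here
     is made up):
<pre>
#define MPZ_CHECK_VALID(z)                                   \
  do {                                                       \
    ASSERT ((z)-&gt;_mp_alloc &gt;= 1);                      \
    ASSERT (ABS ((z)-&gt;_mp_size) &lt;= (z)-&gt;_mp_alloc);  \
    ASSERT ((z)-&gt;_mp_d != NULL);                          \
  } while (0)
</pre>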
792<li> tune/time.c could try to determine at runtime whether
793     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
794     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
795     <code>getrusage</code> doesn't exist.  If a test might take a long time
796     to run then perhaps cache the result in a file somewhere.
797<li> tune/time.c could choose the default precision based on the
798     <code>speed_unittime</code> determined, independent of the method in use.
799<li> Cray vector systems: CPU frequency could be determined from
800     <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
801     based.  Is this true for all Cray systems?  Would like some documentation
802     or something to confirm.
803</ul>
804
805
806<h4>Documentation</h4>
807<ul>
808<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
809<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
810     produces.  It'd probably be possible to say at most one leading zero,
811     which is what both it and <code>mpz_get_str</code> currently do.  But
812     want to be careful not to bind ourselves to something that might not suit
813     another implementation.
814<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
815     etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
816     It'd be good to show how to do this, but we'd either need to document
817     <code>mpz_ptr</code> and friends, or perhaps fallback on something
818     slightly nasty with <code>void*</code>.
819</ul>
820
821
822<h4>Bright Ideas</h4>
823
824<p> The following may or may not be feasible, and aren't likely to get done in the
825near future, but are at least worth thinking about.
826
827<ul>
828<li> Reorganize longlong.h so that we can inline the operations even for the
829     system compiler.  When there is no such compiler feature, make calls to
830     stub functions.  Write such stub functions for as many machines as
831     possible.
832<li> longlong.h could declare when it's using, or would like to use,
833     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
834     included in libgmp only in that case, the same as is effectively done for
835     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
836     would only be a very small space saving, so perhaps not worth the
837     complexity.
838<li> longlong.h could be built at configure time by concatenating or
839     #including fragments from each directory in the mpn path.  This would
840     select CPU specific macros the same way as CPU specific assembler code.
841     Code used would no longer depend on cpp predefines, and the current
842     nested conditionals could be flattened out.
843<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
844     sort of supposed to return the low 31 (or 63) bits.  But this is
845     undocumented, and perhaps not too important.
846<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
847     say an extra 16 limbs over what's needed, so as to reduce the chance of
848     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
849     This could only be an option, since it'd badly bloat memory usage in
850     applications using many small values.
851<li> <code>mpq</code> functions could perhaps check for numerator or
852     denominator equal to 1, on the assumption that integers or
853     denominator-only values might be expected to occur reasonably often.
854<li> <code>count_trailing_zeros</code> is used on more or less uniformly
855     distributed numbers in a couple of places.  For some CPUs
856     <code>count_trailing_zeros</code> is slow and it's probably worth handling
857     the frequently occurring 0 to 2 trailing zeros cases specially.
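     The special casing could be as simple as the following sketch, leaving
     <code>count_trailing_zeros</code> (the longlong.h macro) for the rare
     deeper cases:
<pre>
if (n &amp; 1)
  count = 0;
else if (n &amp; 2)
  count = 1;
else if (n &amp; 4)
  count = 2;
else
  count_trailing_zeros (count, n);
</pre>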
858<li> <code>mpf_t</code> might like to let the exponent be undefined when
859     size==0, instead of requiring it 0 as now.  It should be possible to do
860     size==0 tests before paying attention to the exponent.  The advantage is
861     not needing to set exp in the various places a zero result can arise,
862     which avoids some tedium but is otherwise perhaps not too important.
863     Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
864     exp==0, maybe elsewhere too.
865<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
866     ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
867     allows that attribute on functions, but not function pointers (see info
868     node "Attribute Syntax"), so would need a new autoconf test.  This can
869     wait until there's a GCC that supports it.
870<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
871     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
872     routines were opened up a bit maybe that code could be shared.  When a
873     copy needs to be done there's no carry to append for the add, and if the
874     copy is non-empty no high zero for the sub.
875</ul>
876
877
878<h4>Old and Obsolete Stuff</h4>
879
880<p> The following tasks apply to chips or systems that are old and/or obsolete.
881It's unlikely anything will be done about them unless anyone is actively using
882them.
883
884<ul>
885<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
886     <code>configure --nfp</code> but that option is gone now that autoconf is
887     used.  The file could go somewhere suitable in the mpn search if any
888     chips might benefit from it, though it's possible we don't currently
889     differentiate enough exact cpu types to do this properly.
890<li> VAX D and G format <code>double</code> floats are straightforward and
891     could perhaps be handled directly in <code>__gmp_extract_double</code>
892     and maybe in <code>mpn_get_d</code>, rather than falling back on the
893     generic code.  (Both formats are detected by <code>configure</code>.)
894</ul>
895
896
897<hr>
898
899</body>
900</html>
901
902<!--
903Local variables:
904eval: (add-hook 'write-file-hooks 'time-stamp)
905time-stamp-start: "This file current as of "
906time-stamp-format: "%:d %3b %:y"
907time-stamp-end: "\\."
908time-stamp-line-limit: 50
909End:
910-->
911