<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000, 2001, 2002, 2003, 2004, 2006, 2008, 2009 Free Software
Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation; either version 3 of the License, or (at
your option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file current as of 28 Dec 2009.  An up-to-date version is available at
  <a href="http://gmplib.org/tasks.html">http://gmplib.org/tasks.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
  listed here are suitable for volunteers, but many of them are.
  Please see the <a href="projects.html">projects file</a> for more
  sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
either already been taken care of, or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> When printing <code>mpf_t</code> numbers with exponents >2^53 on
     machines with 64-bit <code>mp_exp_t</code>, the precision of
     <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
     <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
     think seriously about using some sort of fixed-point integer value.
     Avoiding unnecessary floating point is probably a good thing in general,
     and it might be faster on some CPUs.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann in for instance <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code>.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test if e1+e2 goes out of bounds.
</ul>



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size<=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise only small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>: Don't use <code>mpn_addmul_1</code> with
     multiplier==2, instead either <code>mpn_addlsh1_n</code> when available,
     or <code>mpn_lshift</code>+<code>mpn_add_n</code> if not.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations,
     instead <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code>.
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The division "op1.den / gcd"
     is done twice, where of course only once is necessary.  Reported by Larry
     Lambe.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>modlimb_invert</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_divexact_1</code> do it each time.  This would require a new
     <code>mpn_preinv_divexact_1</code> interface.  Not sure if it'd be worth
     the trouble.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The use of
     <code>mpz_mul(x,y,x)</code> causes temp allocation or copying in
     <code>mpz_mul</code> which can probably be avoided.  A rewrite using
     <code>mpn</code> might be best.
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> <= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
     fast on one or two limb moduli, due to a lot of function call
     overheads.  These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
     algorithm selection, and the latter should use REDC.  Both could
     change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
     this would mean for m=n*2^k doing mod n using REDC and an auxiliary
     calculation mod 2^k, then putting them together at the end.
<li> <code>mpn_gcd</code> might be able to be sped up on small to
     moderate sizes by improving <code>find_a</code>, possibly just by
     providing an alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> Toom3 could use a low to high cache localized evaluate and interpolate.
     The necessary <code>mpn_divexact_by3c</code> exists.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     currently it looks like an <code>MPN_COPY_INCR</code> will be done to
     reduce prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high<divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.
The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)
  ((ROUND_UP(n) > current->end - current->point ?
     __gmp_tmp_increase (ROUND_UP (n)) : 0),
    current->point += ROUND_UP (n),
    current->point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A<B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 < 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.
The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 < 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     wants a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has other code?  Also,
     is it worth using for <code>BSWAP_LIMB</code> too, or would that mean a
     store and re-fetch?  Presumably that's what comes out in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
   <pre>
   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
   ... </pre>
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
     (in <code><intrinsics.h></code> like Cray), but it didn't seem to
     exist on some IRIX 6.5 systems tried.
If it does actually exist
     somewhere it would very likely be an improvement over a function call to
     umul.asm.
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base<256.
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
     structure like <code>div_t</code> to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     <code>mp_limb_t</code>s in registers.
     <br>
     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
     they're returned in registers, except that draft SVR4 had said memory, so
     it'd be prudent to check which is done.  We can jam the compiler into the
     right mode if we know how, since all this is purely internal to libgmp.
     (gcc has an option, though of course gcc doesn't matter since we use
     inline asm there.)
</ul>

<h4>New Functionality</h4>
<ul>
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z->_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels M&ouml;ller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals.
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.  A format which MPFR can use too
     would be good.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels M&ouml;ller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.
Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
     &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
     can easily enough be left to applications.
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
     It'd be nice if it worked for plain <code>int</code> etc too, not just
     <code>mpz_t</code> etc.
<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
     for both integer and float conversions.  A base either in the format
     string or as a parameter with <code>*</code> should be allowed.  Maybe
     <code>&amp;13b</code> (b for base) or something like that.
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
     by Pedro Gimeno.
<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
     searching towards the low end of an integer might match
     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
     Suggested by Roberto Bagnara.
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
     Christensen.  Other <code>long double</code> routines might be desirable
     too, but <code>mpf</code> would be a start.
     <br>
     <code>long double</code> is an ANSI-ism, so everything involving it would
     need to be suppressed on a K&amp;R compiler.
     <br>
     There'd be some work to be done by <code>configure</code> to recognise
     the format in use; MPFR has a start on this.  Often <code>long
     double</code> is the same as <code>double</code>, which is easy but
     pretty pointless.  A single float format detector macro could look at
     <code>double</code> then <code>long double</code>.
     <br>
     Sometimes there's a compiler option for the size of a <code>long
     double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
     probably simplest to regard this as a compiler compatibility issue, and
     leave it to users or sysadmins to ensure application and library code is
     built the same.
<li> <code>mpz_sqrt_if_perfect_square</code>: When
     <code>mpz_perfect_square_p</code> does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
     <code>long long</code>.  These would aid interoperability, though a
     mixture of GMP and <code>long long</code> would probably not be too
     common.
Since <code>long long</code> is not always available (it's in
     C99 and GCC though), disadvantages of using <code>long long</code> in
     libgmp.a would be:
     <ul>
     <li> Library contents vary according to the build compiler.
     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
          application compiler could take the <code>long long</code>
          prototypes.
     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     </ul>
     It'd be possible to defer the need for <code>long long</code> to
     application compile time, by having something like
     <code>mpz_set_2ui</code> called with two halves of a <code>long
     long</code>.  Disadvantages of this would be:
     <ul>
     <li> Bigger code in the application, though perhaps not if a <code>long
          long</code> is normally passed as two halves anyway.
     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
          to be two function calls.
     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
          treatment of <code>-0x10..00</code> into applications (see
          <code>mpz_get_si</code> correctness above).
     <li> Although having libgmp.a independent of the build compiler is nice,
          it sort of sacrifices the capabilities of a good compiler to
          uniformity with inferior ones.
     </ul>
     Plain use of <code>long long</code> is probably the lesser evil, if only
     because it makes best use of gcc.  In fact perhaps it would suffice to
     guarantee <code>long long</code> conversions only when using GCC for both
     application and library.  That would cover free software, and we can
     worry about selected vendor compilers later.
     <br>
     In C++ the situation is probably clearer: we demand fairly recent C++,
     so <code>long long</code> should be available always.
We'd probably prefer
     to have the C and C++ the same in respect of <code>long long</code>
     support, but it would be possible to have it unconditionally in gmpxx.h,
     by some means or another.
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
     Suggested by Alexander Kruppa.
</ul>


<h4>Configuration</h4>

<ul>
<li> Alpha ev7, ev79: Add code to config.guess to detect these.  We believe
     ev7 will be "3-1307" in the current switch, but need to verify that.  (On
     OSF, the current configfsf.guess identifies ev7 using psrinfo; we need to
     do it ourselves for other systems.)
<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
     "pic always" and ends up running gcc twice with the same options.  This
     is wasteful, but harmless.  Perhaps a newer libtool will be better.
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
     which could be driven by an exact hppa cpu type.
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess appends "el" to indicate endianness, but
     <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
     feature test, examining the compiler output perhaps.  It might also be
     nice to merge the aix.m4 files into powerpc-defs.m4.
<li> config.m4 is generated only by the configure script; it won't be
     regenerated by config.status.
Creating it as an <code>AC_OUTPUT</code>
     would work, but it might upset "make" to have things like <code>L$</code>
     get into the Makefiles through <code>AC_SUBST</code>.
     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
     careful m4 quoting the <code>changequote</code> calls might not be
     needed, which might free up the order in which things had to be output.
<li> Automake: The latest automake has a <code>CCAS</code>,
     <code>CCASFLAGS</code> scheme.  Though we probably wouldn't be using its
     assembler support we could try to use those variables in compatible ways.
<li> <code>GMP_LDFLAGS</code> could probably be done with the plain
     <code>LDFLAGS</code> already used by automake for all linking.  But with
     a bit of luck the next libtool will pass pretty much all
     <code>CFLAGS</code> through to the compiler when linking, making
     <code>GMP_LDFLAGS</code> unnecessary.
<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
     .S and .asm rules, but apparently that isn't completely portable (there's
     an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
     had problems, but perhaps the rules could be rewritten to use "foo.s" as
     the temporary, or to do a suitable "mv" of the result.  The only danger
     from using foo.s would be if a compile failed and the temporary foo.s
     then looked like the primary source.  Hopefully if the
     <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
     wouldn't happen.  Might need to check.
</ul>


<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code>, the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>


<h4>C++</h4>
<ul>
<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
     identify whitespace.
     <br>
     <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
     rather than the C one.
     <br>
     Consider making these variant <code>mpz_set_str</code> etc forms
     available for <code>mpz_t</code> too, not just <code>mpz_class</code>
     etc.
<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
     <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
     Candidates for this would be:
     <ul>
     <li> <code>mpz_class(const char *)</code>, etc: since they're normally
          not fast anyway, and we can hide the exception <code>throw</code>.
     <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
          needed to get to the C conversion function.
     <li> <code>mpz_class string, char*</code> etc constructors: likewise to
          hide the throws and conversions.
     <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
          to <code>string</code> conversion and free.  Perhaps
          <code>mpz_get_str</code> can write directly into a
          <code>string</code>, to avoid copying.
          <br>
          Consider making such <code>string</code> returning variants
          available for use with plain <code>mpz_t</code> etc too.
     </ul>
</ul>

<h4>Miscellaneous</h4>
<ul>
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
     two solutions under just those restrictions.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
     time would be better still.)
<li> The various test programs use quite a bit of the main
     <code>libgmp</code>.  This establishes good cross-checks, but it might be
     better to use simple reference routines where possible.  Where that's not
     possible, some attention could be paid to the order of the tests, so a
     <code>libgmp</code> routine is only used for tests once it seems to be
     good.
<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
     return to a previous k at certain sizes.  This arises basically due to
     the step effect caused by the size multiples effectively used for each k.
     Looking at a graph makes it fairly clear.
<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
     of <code>mpf_get_str</code> could be made which would better suit.
</ul>


<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.
Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else then consistency checks like size&lt;=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space,
     would be possible.  A more sophisticated scheme could track
     <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
     scheme probably wouldn't be reentrant, not without some help from the
     system.
<li> tune/time.c could try to determine at runtime whether
     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     <code>getrusage</code> doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
<li> tune/time.c could choose the default precision based on the
     <code>speed_unittime</code> determined, independent of the method in use.
<li> Cray vector systems: CPU frequency could be determined from
     <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
     based.  Is this true for all Cray systems?  Would like some documentation
     or something to confirm.
</ul>


<h4>Documentation</h4>
<ul>
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
     produces.  It'd probably be possible to say at most one leading zero,
     which is what both it and <code>mpz_get_str</code> currently do.  But we
     want to be careful not to bind ourselves to something that might not suit
     another implementation.
<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
     etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
     It'd be good to show how to do this, but we'd either need to document
     <code>mpz_ptr</code> and friends, or perhaps fall back on something
     slightly nasty with <code>void*</code>.
</ul>


<h4>Bright Ideas</h4>

<p> The following may or may not be feasible, and aren't likely to get done in
the near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> longlong.h could declare when it's using, or would like to use,
     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
<li> longlong.h could be built at configure time by concatenating or
     #including fragments from each directory in the mpn path.  This would
     select CPU specific macros the same way as CPU specific assembler code.
     Code used would no longer depend on cpp predefines, and the current
     nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocation if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> <code>mpq</code> functions could perhaps check for the numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow, and it's probably worth
     handling the frequently occurring 0 to 2 trailing zeros cases specially.
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible
     to do size==0 tests before paying attention to the exponent.  The
     advantage is not needing to set exp in the various places a zero result
     can arise, which avoids some tedium but is otherwise perhaps not too
     important.  Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code>
     depend on exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
     ((malloc))</code> on this, though we don't know if it'd do much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so this would need a new autoconf test.  This
     can wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
     routines were opened up a bit, maybe that code could be shared.  When a
     copy needs to be done there's no carry to append for the add, and if the
     copy is non-empty no high zero for the sub.
</ul>


<h4>Old and Obsolete Stuff</h4>

<p> The following tasks apply to chips or systems that are old and/or
obsolete.  It's unlikely anything will be done about them unless anyone is
actively using them.
<ul>
<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
     <code>configure --nfp</code>, but that option is gone now that autoconf
     is used.  The file could go somewhere suitable in the mpn search if any
     chips might benefit from it, though it's possible we don't currently
     differentiate enough exact cpu types to do this properly.
<li> VAX D and G format <code>double</code> floats are straightforward and
     could perhaps be handled directly in <code>__gmp_extract_double</code>
     and maybe in <code>mpn_get_d</code>, rather than falling back on the
     generic code.  (Both formats are detected by <code>configure</code>.)
</ul>


<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->