<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file current as of 29 Jan 2014.  An up-to-date version is available at
  <a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks listed here
    are suitable for volunteers, but many of them are.  Please see the
    <a href="projects.html">projects file</a> for more sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have either
    already been taken care of, or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
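     <br>
     As background, the define mostly matters to application code that handles
     a raw <code>mp_limb_t</code>, for example when printing one.  A minimal
     sketch of that sort of use (the helper name here is illustrative only,
     not a GMP function):
     <pre>
#include &lt;stdio.h&gt;
#include &lt;gmp.h&gt;

/* Print one limb, whichever integer type gmp.h chose for mp_limb_t. */
static void
print_limb (mp_limb_t x)
{
#ifdef _LONG_LONG_LIMB
  printf ("%llu\n", (unsigned long long) x);
#else
  printf ("%lu\n", (unsigned long) x);
#endif
}
     </pre>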
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann in for instance <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code>.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test if e1+e2 goes out of bounds.
</ul>



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
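     <br>
     A sketch of the suggested single loop, on two equal-exponent mantissas
     stored least significant limb first with a non-zero high limb
     (illustrative only, not the actual <code>mpf_cmp</code> code):
     <pre>
#include &lt;gmp.h&gt;

/* Sign of u - v for mantissas {up,usize} and {vp,vsize}.  High limbs are
   compared first; low zero limbs are only examined when necessary. */
static int
cmp_mantissa (mp_srcptr up, mp_size_t usize, mp_srcptr vp, mp_size_t vsize)
{
  mp_size_t i = usize - 1, j = vsize - 1;

  for ( ; i >= 0 && j >= 0; i--, j--)   /* common high limbs */
    if (up[i] != vp[j])
      return up[i] > vp[j] ? 1 : -1;

  for ( ; i >= 0; i--)                  /* u longer: any non-zero low limb wins */
    if (up[i] != 0)
      return 1;
  for ( ; j >= 0; j--)                  /* v longer */
    if (vp[j] != 0)
      return -1;
  return 0;
}
     </pre>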
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size<=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise only small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations,
     instead use <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code>.
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> <= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add
     partial products with fewer operations.
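     <br>
     For reference, the generic code is essentially the classical
     four-partial-product scheme sketched below (a portable illustration
     assuming an even limb size and no nail bits); an improved version would
     have to beat this operation count:
     <pre>
#include &lt;gmp.h&gt;

#define HALF_BITS     (GMP_LIMB_BITS / 2)
#define LOW_HALF(x)   ((x) & (((mp_limb_t) 1 << HALF_BITS) - 1))
#define HIGH_HALF(x)  ((x) >> HALF_BITS)

/* hi:lo = u * v, built from four half-limb partial products. */
static void
my_umul_ppmm (mp_limb_t *hi, mp_limb_t *lo, mp_limb_t u, mp_limb_t v)
{
  mp_limb_t ul = LOW_HALF (u), uh = HIGH_HALF (u);
  mp_limb_t vl = LOW_HALF (v), vh = HIGH_HALF (v);
  mp_limb_t x0 = ul * vl;
  mp_limb_t x1 = ul * vh;
  mp_limb_t x2 = uh * vl;
  mp_limb_t x3 = uh * vh;

  x1 += HIGH_HALF (x0);              /* cannot overflow */
  x1 += x2;                          /* but this can ... */
  if (x1 < x2)                       /* ... so push the carry into the top */
    x3 += (mp_limb_t) 1 << HALF_BITS;

  *hi = x3 + HIGH_HALF (x1);
  *lo = (x1 << HALF_BITS) + LOW_HALF (x0);
}
     </pre>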
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     the current behaviour, where it looks like an <code>MPN_COPY_INCR</code>
     will be done to reduce prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place division there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high &lt; divisor.  When non-zero,
     the division can be done on prec dividend limbs instead of prec+1.
     The result size is also known before the division, so that can be a tail
     call (once the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required would a call to some sort
     of <code>__gmp_tmp_increase</code> be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)
  ((ROUND_UP(n) > current->end - current->point ?
     __gmp_tmp_increase (ROUND_UP (n)) : 0),
    current->point += ROUND_UP (n),
    current->point - ROUND_UP (n))
</pre>
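     <br>
     A standalone model of that scheme, with the cheap pointer-bump fast path
     and a rarely taken refill function (sizes and names here are illustrative;
     the real tal-notreent.c also links chunks so <code>TMP_FREE</code> can
     release them, and error handling is omitted):
     <pre>
#include &lt;stdlib.h&gt;

#define ROUND_UP(n)  (((n) + 15) & ~(size_t) 15)

static char initial_chunk[4096];
static struct { char *point, *end; } current_chunk =
  { initial_chunk, initial_chunk + sizeof (initial_chunk) };

/* Slow path: switch to a bigger chunk. */
static void
tmp_increase (size_t n)
{
  size_t size = n > 65536 ? n : 65536;
  current_chunk.point = malloc (size);
  current_chunk.end = current_chunk.point + size;
}

/* Fast path: just bump a pointer, as the macro above would do inline. */
static void *
tmp_alloc (size_t n)
{
  n = ROUND_UP (n);
  if ((size_t) (current_chunk.end - current_chunk.point) < n)
    tmp_increase (n);
  current_chunk.point += n;
  return current_chunk.point - n;
}
     </pre>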
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A&lt;B, then the code ends up generating the cofactor T (of B)
     and deriving S from that.  Perhaps it'd be possible to arrange to get S
     in the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 cycles/limb is a touch
     faster than a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
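     <br>
     For reference, the operation in question is {rp,n} = {up,n} + 2*{vp,n}
     with a returned carry limb, i.e. the plain C loop below (a sketch
     assuming no nail bits); the <code>mpn_addmul_1</code> trick and any
     dedicated code are just faster ways of computing the same thing:
     <pre>
#include &lt;gmp.h&gt;

/* {rp,n} = {up,n} + 2*{vp,n}; returns the high (carry) limb, 0..2. */
static mp_limb_t
ref_addlsh1_n (mp_ptr rp, mp_srcptr up, mp_srcptr vp, mp_size_t n)
{
  mp_limb_t cy = 0;     /* carry from the additions so far */
  mp_limb_t vbit = 0;   /* bit shifted out of the previous limb of v */
  mp_size_t i;

  for (i = 0; i < n; i++)
    {
      mp_limb_t v2 = (vp[i] << 1) | vbit;   /* limb i of 2*v */
      mp_limb_t s = up[i] + cy;
      cy = s < cy;                          /* carry from adding cy */
      s += v2;
      cy += s < v2;                         /* carry from adding 2*v */
      vbit = vp[i] >> (GMP_LIMB_BITS - 1);
      rp[i] = s;
    }
  return cy + vbit;
}
     </pre>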
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better into four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
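     <br>
     Wrappers over these builtins, in longlong.h style, might look like the
     following sketch (assuming <code>mp_limb_t</code> is <code>unsigned
     long</code>; note the count builtins are undefined for a zero argument):
     <pre>
#if defined (__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
#define count_leading_zeros(count, x)   ((count) = __builtin_clzl (x))
#define count_trailing_zeros(count, x)  ((count) = __builtin_ctzl (x))
#define popc_limb(result, x)            ((result) = __builtin_popcountl (x))
#endif
     </pre>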
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 < 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64 bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64 bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 < 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
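     <br>
     What such a routine computes can be written as the following reference
     loop over documented mpn calls (un >= vn >= 1, destination not
     overlapping the inputs); a native <code>mpn_mul_basecase</code> simply
     has to produce the same result faster:
     <pre>
#include &lt;gmp.h&gt;

/* {rp, un+vn} = {up,un} * {vp,vn}, schoolbook style. */
static void
ref_mul_basecase (mp_ptr rp, mp_srcptr up, mp_size_t un,
                  mp_srcptr vp, mp_size_t vn)
{
  mp_size_t i;

  rp[un] = mpn_mul_1 (rp, up, un, vp[0]);
  for (i = 1; i < vn; i++)
    rp[un + i] = mpn_addmul_1 (rp + i, up, un, vp[i]);
}
     </pre>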
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable; it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method", to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, which could let both live in lshift.asm and omit rshift.asm on
     finding <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).
     Would like to modify <code>TMP_ALLOC</code> to use them, or introduce a
     new scheme.  Memory blocks wanted unconditionally are easy enough, those
     wanted only sometimes are a problem.  Perhaps a special size calculation
     to ask for a dummy length 1 when unwanted, or perhaps an inlined
     subroutine duplicating code under each conditional.  Don't really want to
     turn everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has other code?  Also, is
     it worth using for <code>BSWAP_LIMB</code> too, or would that mean a
     store and re-fetch?  Presumably that's what comes out in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
     <pre>
     if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
     if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
     ... </pre>
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
     (in <code>&lt;intrinsics.h&gt;</code> like Cray), but it didn't seem to
     exist on some IRIX 6.5 systems tried.  If it does actually exist
     somewhere it would very likely be an improvement over a function call to
     umul.asm.
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base<256.
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
     structure like <code>div_t</code> to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     <code>mp_limb_t</code>s in registers.
     <br>
     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
     they're returned in registers, except that draft SVR4 had said memory, so
     it'd be prudent to check which is done.  We can jam the compiler into the
     right mode if we know how, since all this is purely internal to libgmp.
     (gcc has an option, though of course gcc doesn't matter since we use
     inline asm there.)
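     <br>
     In outline, the idea is something like the sketch below (the struct type
     and function name are hypothetical, and <code>umul_ppmm</code> is the
     existing internal longlong.h macro, so this is not self-contained
     application code):
     <pre>
typedef struct { mp_limb_t hi, lo; } limb_pair_t;          /* hypothetical */

static limb_pair_t
umul_ppmm_struct (mp_limb_t u, mp_limb_t v)                 /* hypothetical */
{
  limb_pair_t r;
  umul_ppmm (r.hi, r.lo, u, v);   /* existing macro: (w1, w0, u, v) */
  return r;
}
     </pre>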
</ul>

<h4>New Functionality</h4>
<ul>
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z->_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals.
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure the
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.  A format which MPFR can use too
     would be good.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copies available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
     &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
     can easily enough be left to applications.
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
     It'd be nice if it worked for plain <code>int</code> etc too, not just
     <code>mpz_t</code> etc.
<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
     for both integer and float conversions.  A base either in the format
     string or as a parameter with <code>*</code> should be allowed.  Maybe
     <code>&13b</code> (b for base) or something like that.
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
     by Pedro Gimeno.
<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
     searching towards the low end of an integer might match
     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
     Suggested by Roberto Bagnara.
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
     Christensen.  Other <code>long double</code> routines might be desirable
     too, but <code>mpf</code> would be a start.
     <br>
     <code>long double</code> is an ANSI-ism, so everything involving it would
     need to be suppressed on a K&R compiler.
     <br>
     There'd be some work to be done by <code>configure</code> to recognise
     the format in use; MPFR has a start on this.  Often <code>long
     double</code> is the same as <code>double</code>, which is easy but
     pretty pointless.  A single float format detector macro could look at
     <code>double</code> then <code>long double</code>.
     <br>
     Sometimes there's a compiler option for the size of a <code>long
     double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.
     It's probably simplest to regard this as a compiler compatibility issue,
     and leave it to users or sysadmins to ensure application and library code
     is built the same.
<li> <code>mpz_sqrt_if_perfect_square</code>: When
     <code>mpz_perfect_square_p</code> does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
     <code>long long</code>.  These would aid interoperability, though a
     mixture of GMP and <code>long long</code> would probably not be too
     common.  Since <code>long long</code> is not always available (it's in
     C99 and GCC though), disadvantages of using <code>long long</code> in
     libgmp.a would be
     <ul>
     <li> Library contents vary according to the build compiler.
     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
          application compiler could take the <code>long long</code>
          prototypes.
     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     </ul>
     It'd be possible to defer the need for <code>long long</code> to
     application compile time, by having something like
     <code>mpz_set_2ui</code> called with two halves of a <code>long
     long</code>.  Disadvantages of this would be,
     <ul>
     <li> Bigger code in the application, though perhaps not if a <code>long
          long</code> is normally passed as two halves anyway.
     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
          to be two function calls.
     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
          treatment of <code>-0x10..00</code> into applications (see
          <code>mpz_get_si</code> correctness above).
     <li> Although having libgmp.a independent of the build compiler is nice,
          it sort of sacrifices the capabilities of a good compiler to
          uniformity with inferior ones.
     </ul>
     Plain use of <code>long long</code> is probably the lesser evil, if only
     because it makes best use of gcc.  In fact perhaps it would suffice to
     guarantee <code>long long</code> conversions only when using GCC for both
     application and library.  That would cover free software, and we can
     worry about selected vendor compilers later.
     <br>
     In C++ the situation is probably clearer: we demand fairly recent C++ so
     <code>long long</code> should be available always.  We'd probably prefer
     to have the C and C++ the same in respect of <code>long long</code>
     support, but it would be possible to have it unconditionally in gmpxx.h,
     by some means or another.
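     <br>
     For the unsigned case an application can already get the effect with
     documented calls, at the cost of a couple of extra operations; a sketch
     (hypothetical helper name, assuming a 64-bit <code>long long</code>):
     <pre>
#include &lt;gmp.h&gt;

static void
my_mpz_set_ull (mpz_ptr z, unsigned long long v)
{
  mpz_set_ui (z, (unsigned long) (v >> 32));
  mpz_mul_2exp (z, z, 32);
  mpz_add_ui (z, z, (unsigned long) (v & 0xffffffff));
}
     </pre>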
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
     Suggested by Alexander Kruppa.
</ul>


<h4>Configuration</h4>

<ul>
<li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
     will be "3-1307" in the current switch, but need to verify that.  (On
     OSF, current configfsf.guess identifies ev7 using psrinfo; we need to do
     it ourselves for other systems.)
<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
     "pic always" and ends up running gcc twice with the same options.  This
     is wasteful, but harmless.  Perhaps a newer libtool will be better.
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
     which could be driven by an exact hppa cpu type.
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess appends "el" to indicate endianness, but
     <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
     feature test, examining the compiler output perhaps.  It might also be
     nice to merge the aix.m4 files into powerpc-defs.m4.
<li> config.m4 is generated only by the configure script; it won't be
     regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
     would work, but it might upset "make" to have things like <code>L$</code>
     get into the Makefiles through <code>AC_SUBST</code>.
     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
     careful m4 quoting the <code>changequote</code> calls might not be
     needed, which might free up the order in which things had to be output.
<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
     scheme.  Though we probably wouldn't be using its assembler support we
     could try to use those variables in compatible ways.
<li> <code>GMP_LDFLAGS</code> could probably be done with plain
     <code>LDFLAGS</code> already used by automake for all linking.  But with
     a bit of luck the next libtool will pass pretty much all
     <code>CFLAGS</code> through to the compiler when linking, making
     <code>GMP_LDFLAGS</code> unnecessary.
<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
     .S and .asm rules, but apparently that isn't completely portable (there's
     an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
     had problems, but perhaps the rules could be rewritten to use "foo.s" as
     the temporary, or to do a suitable "mv" of the result.
     The only danger from using foo.s would be if a compile failed and the
     temporary foo.s then looked like the primary source.  Hopefully if the
     <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
     wouldn't happen.  Might need to check.
</ul>


<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>


<h4>C++</h4>
<ul>
<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
     identify whitespace.
     <br>
     <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
     rather than the C one.
     <br>
     Consider making these variant <code>mpz_set_str</code> etc forms
     available for <code>mpz_t</code> too, not just <code>mpz_class</code>
     etc.
<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
     <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
     Candidates for this would be,
     <ul>
     <li> <code>mpz_class(const char *)</code>, etc: since they're normally
          not fast anyway, and we can hide the exception <code>throw</code>.
     <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
          needed to get to the C conversion function.
     <li> <code>mpz_class string, char*</code> etc constructors: likewise to
          hide the throws and conversions.
     <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
          to <code>string</code> conversion and free.  Perhaps
          <code>mpz_get_str</code> can write directly into a
          <code>string</code>, to avoid copying.
          <br>
          Consider making such <code>string</code> returning variants
          available for use with plain <code>mpz_t</code> etc too.
     </ul>
</ul>

<h4>Miscellaneous</h4>
<ul>
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|<|b| and |y|<|a| or something like that, but there are probably
     two solutions under just those restrictions.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
     time would be better still.)
<li> The various test programs use quite a bit of the main
     <code>libgmp</code>.  This establishes good cross-checks, but it might be
     better to use simple reference routines where possible.  Where it's not
     possible some attention could be paid to the order of the tests, so a
     <code>libgmp</code> routine is only used for tests once it seems to be
     good.
<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
     return to a previous k at certain sizes.  This arises basically due to
     the step effect caused by size multiples effectively used for each k.
     Looking at a graph makes it fairly clear.
<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
     of <code>mpf_get_str</code> could be made which would better suit.
</ul>


<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else then consistency checks like size<=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space
     would be possible.  A more sophisticated scheme could track
     <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
     scheme probably wouldn't be reentrant, not without some help from the
     system.
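     <br>
     The simple consistency part of this could look something like the sketch
     below, using the existing gmp-impl.h accessors (<code>PTR</code>,
     <code>SIZ</code>, <code>ALLOC</code>, <code>ABS</code>,
     <code>ASSERT</code>); the pointer-tracking idea would be considerably
     more involved:
     <pre>
/* Basic sanity checks on an mpz_t, gmp-impl.h style (sketch only). */
#define ASSERT_MPZ_SANE(z)                        \
  do {                                            \
    ASSERT (PTR (z) != NULL);                     \
    ASSERT (ALLOC (z) >= 1);                      \
    ASSERT (ABS (SIZ (z)) <= ALLOC (z));          \
  } while (0)
     </pre>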
<li> tune/time.c could try to determine at runtime whether
     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     <code>getrusage</code> doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
<li> tune/time.c could choose the default precision based on the
     <code>speed_unittime</code> determined, independent of the method in use.
<li> Cray vector systems: CPU frequency could be determined from
     <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
     based.  Is this true for all Cray systems?  Would like some documentation
     or something to confirm.
</ul>


<h4>Documentation</h4>
<ul>
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
     produces.  It'd probably be possible to say at most one leading zero,
     which is what both it and <code>mpz_get_str</code> currently do.  But we
     want to be careful not to bind ourselves to something that might not suit
     another implementation.
<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
     etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
     It'd be good to show how to do this, but we'd either need to document
     <code>mpz_ptr</code> and friends, or perhaps fall back on something
     slightly nasty with <code>void*</code>.
</ul>


<h4>Bright Ideas</h4>

<p> The following may or may not be feasible, and aren't likely to get done in
the near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> longlong.h could declare when it's using, or would like to use,
     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
<li> longlong.h could be built at configure time by concatenating or
     #including fragments from each directory in the mpn path.  This would
     select CPU specific macros the same way as CPU specific assembler code.
     Code used would no longer depend on cpp predefines, and the current
     nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> <code>mpq</code> functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow and it's probably worth handling
     the frequently occurring 0 to 2 trailing zeros cases specially.
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible to
     do size==0 tests before paying attention to the exponent.  The advantage
     is not needing to set exp in the various places a zero result can arise,
     which avoids some tedium but is otherwise perhaps not too important.
     Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
     exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
     ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so would need a new autoconf test.  This can
     wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
     routines were opened up a bit maybe that code could be shared.  When a
     copy needs to be done there's no carry to append for the add, and if the
     copy is non-empty no high zero for the sub.
</ul>


<h4>Old and Obsolete Stuff</h4>

<p> The following tasks apply to chips or systems that are old and/or obsolete.
It's unlikely anything will be done about them unless anyone is actively using
them.

<ul>
<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
     <code>configure --nfp</code> but that option is gone now that autoconf is
     used.  The file could go somewhere suitable in the mpn search if any
     chips might benefit from it, though it's possible we don't currently
     differentiate enough exact cpu types to do this properly.
<li> VAX D and G format <code>double</code> floats are straightforward and
     could perhaps be handled directly in <code>__gmp_extract_double</code>
     and maybe in <code>mpn_get_d</code>, rather than falling back on the
     generic code.  (Both formats are detected by <code>configure</code>.)
</ul>


<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->