<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Development Projects</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<body>
<center>
  <h1>
    GMP Development Projects
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009 Free Software
Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation; either version 3 of the License, or (at
your option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file current as of 15 Nov 2009.  An up-to-date version is available at
  <a href="http://gmplib.org/projects.html">http://gmplib.org/projects.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> This file lists projects suitable for volunteers.  Please see the
    <a href="tasks.html">tasks file</a> for smaller tasks.

<p> If you want to work on any of the projects below, please let
    gmp-devel<font>@</font>gmplib.org know.  If you want to help with a
    project that somebody else is already working on, you can get in touch
    through gmp-devel<font>@</font>gmplib.org.  (No email addresses of
    volunteers are listed below, due to spamming problems.)

<ul>
<li> <strong>Faster multiplication</strong>

  <p> The current multiplication code uses Karatsuba, 3-way and 4-way Toom,
      and Fermat FFT.  Several new developments are desirable:

  <ol>

    <li> Write more toom multiply functions for unbalanced operands.  We now
         have toom22, toom32, toom42, toom62, toom33, toom53, and toom44.
         Most desirable is toom43, which will require a new
         toom_interpolate_6pts function.  Writing toom52 will then be
         straightforward.  See also
         <a href="http://bodrato.it/software/toom.html">Marco Bodrato's
         site</a>.

    <li> Perhaps consider N-way Toom, N > 4.  See Knuth's Seminumerical
         Algorithms for details on the method, as well as Bodrato's site.
         Code implementing it exists.  This is asymptotically inferior to
         FFTs, but is finer grained.

    <li> The mpn_mul call now (from GMP 4.3) uses toom22, toom32, and toom42
         for unbalanced operands.  We don't currently use any of the other
         new toom functions.  Write new clever code for choosing the best
         toom function for an m-limb and an n-limb operand.

    <li> Implement an FFT variant computing the coefficients mod m different
         limb-sized primes of the form l*2^k+1, i.e., compute m separate
         FFTs.  The wanted coefficients will at the end be found by lifting
         with CRT (Chinese Remainder Theorem); a small recombination sketch
         appears after this list.
         If we let m = 3, i.e., use 3 primes, we can split the operands into
         coefficients at limb boundaries, and if our machine uses b-bit
         limbs, we can multiply numbers with close to 2^b limbs without
         coefficient overflow.  For smaller multiplication, we might perhaps
         let m = 1, and instead of splitting our operands at limb
         boundaries, split them in much smaller pieces.  We might also use
         4 or more primes, and split operands into bigger than b-bit chunks.
         Using more primes gives shorter transforms but more of them, which
         is still a slight net saving on the FFT side, but the CRT then
         becomes more expensive.  <br><br>

    <p> [We now have two implementations of this algorithm, one by Tommy
        Färnqvist and one by Niels Möller.]

    <li> Add support for short products, either a given number of low limbs,
         a given number of high limbs, or perhaps the middle limbs of the
         result.  A high short product can be used by <code>mpf_mul</code>,
         by left-to-right Newton approximations, and for quotient
         approximation.  A low half short product can be of use in
         sub-quadratic REDC and for right-to-left Newton approximations.
         On small sizes a short product will be faster simply through fewer
         cross-products, similar to the way squaring is faster.  But work by
         Thom Mulders shows that for Karatsuba and higher order algorithms
         the advantage is progressively lost, so for large sizes short
         products turn out to be no faster.

  </ol>

  <p> Another possibility would be an optimized cube.  In the basecase that
      should definitely be able to save cross-products in a similar fashion
      to squaring, but some investigation might be needed for how best to
      adapt the higher-order algorithms.  Not sure whether cubing or further
      small powers have any particularly important uses though.
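  <p> As a small illustration of the CRT recombination step mentioned in
      item 4 above, here is a word-sized sketch in C.  It is illustrative
      only: the function names are made up, the modular inverse is found by
      brute force, and real code would use limb-sized primes of the form
      l*2^k+1 together with mpn arithmetic.

<pre>
#include &lt;stdint.h&gt;

/* modular inverse by brute force -- fine for a small illustration */
static uint64_t
inv_mod (uint64_t a, uint64_t m)
{
  uint64_t x;
  for (x = 1; x < m; x++)
    if (a * x % m == 1)
      return x;
  return 0;   /* not reached when gcd(a,m) = 1 */
}

/* reconstruct c, 0 <= c < p1*p2, from r1 = c mod p1 and r2 = c mod p2,
   for two small distinct primes p1 and p2 */
uint64_t
crt2 (uint64_t r1, uint64_t p1, uint64_t r2, uint64_t p2)
{
  uint64_t t = (r2 + p2 - r1 % p2) % p2;     /* (r2 - r1) mod p2        */
  t = t * inv_mod (p1 % p2, p2) % p2;        /* ... * p1^-1 mod p2      */
  return r1 + p1 * t;                        /* == r1 mod p1, r2 mod p2 */
}
</pre>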
<li> <strong>Assembly routines</strong>

  <p> Write new and improve existing assembly routines.  The tests/devel
      programs and the tune/speed.c and tune/many.pl programs are useful for
      testing and timing the routines you write.  See the README files in
      those directories for more information.

  <p> Please make sure your new routines are fast for these three situations:
    <ol>
      <li> Operands that fit into the cache.
      <li> Small operands of less than, say, 10 limbs.
      <li> Huge operands that do not fit into the cache.
    </ol>

  <p> The most important routines are mpn_addmul_1, mpn_mul_basecase and
      mpn_sqr_basecase.  The latter two don't exist for all machines, while
      mpn_addmul_1 exists for almost all machines.

  <p> Standard techniques for these routines are unrolling, software
      pipelining, and specialization for common operand values.  For
      machines with poor integer multiplication, it is sometimes possible to
      remedy the situation using floating-point operations or SIMD
      operations such as MMX (x86), SSE (x86), VMX (PowerPC), or VIS
      (Sparc).

  <p> Using floating-point operations is interesting but somewhat tricky.
      Since an IEEE double has 53 bits of mantissa, one has to split the
      operands in small pieces, so that no intermediates are greater than
      2^53.  For 32-bit computers, splitting one operand into 16-bit pieces
      works.  For 64-bit machines, one operand can be split into 21-bit
      pieces and the other into 32-bit pieces.  (A 64-bit operand can be
      split into just three 21-bit pieces if one allows the split operands
      to be negative!)
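  <p> As a rough illustration of the 32-bit case (splitting one operand into
      16-bit pieces so that every partial product fits in the 53-bit
      mantissa), here is a C sketch.  The function name is made up, and real
      code would of course do this unrolled across whole limb vectors.

<pre>
#include &lt;stdint.h&gt;

/* multiply two 32-bit limbs via doubles, giving a 64-bit product */
uint64_t
mul_via_double (uint32_t u, uint32_t v)
{
  double du  = (double) u;
  double vlo = (double) (v & 0xffff);     /* low 16 bits of v  */
  double vhi = (double) (v >> 16);        /* high 16 bits of v */

  /* both products are below 2^48, hence exact in a double */
  uint64_t lo = (uint64_t) (du * vlo);
  uint64_t hi = (uint64_t) (du * vhi);

  return lo + (hi << 16);                 /* recombine the halves */
}
</pre>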
<li> <strong>Math functions for the mpf layer</strong>

  <p> Implement the functions of math.h for the GMP mpf layer!  Check the
      book "Pi and the AGM" by Borwein and Borwein for ideas on how to do
      this.  These functions are desirable: acos, acosh, asin, asinh, atan,
      atanh, atan2, cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.

  <p> Note that the <a href="http://mpfr.org">mpfr</a> functions already
      provide these functions, and that we usually recommend that new
      programs use mpfr instead of mpf.


<li> <strong>Faster sqrt</strong>

  <p> The current code uses divisions, which are reasonably fast, but it'd
      be possible to use only multiplications by computing 1/sqrt(A) using
      this iteration:
  <pre>
                          2
        x    = x  (3 - A x  )/2
         i+1    i         i
  </pre>
      The square root can then be computed like this:
  <pre>
        sqrt(A) = A x
                     n
  </pre>
  <p> That final multiply might be the full size of the input (though it
      might only need the high half of that), so there may or may not be any
      speedup overall.

  <p> We should probably allow a special exponent-like parameter, to speed
      computations of a precise square root of a small number in mpf and
      mpfr.
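  <p> A toy version of the iteration in double precision, just to show the
      shape of the computation; a real implementation would work in mpf/mpn
      arithmetic and double the working precision at each step.  The
      function name is made up.

<pre>
#include &lt;stdio.h&gt;

/* compute sqrt(a) for a >= 1 using only multiplications, apart from
   one initial division for the starting guess */
double
sqrt_newton (double a)
{
  double x = 1.0 / a;                     /* rough guess for 1/sqrt(a) */
  int i;
  for (i = 0; i < 60; i++)                /* ample for a toy demo      */
    x = x * (3.0 - a * x * x) / 2.0;      /* x -> 1/sqrt(a)            */
  return a * x;                           /* sqrt(a) = A x_n           */
}

int
main (void)
{
  printf ("%.15f\n", sqrt_newton (2.0));  /* sqrt(2) = 1.41421356... */
  return 0;
}
</pre>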
<li> <strong>Nth root</strong>

  <p> Improve mpn_rootrem.  The current code is not too bad, but its average
      time complexity is a function of the input, while it is possible to
      make it a function of the output.


<li> <strong>Exceptions</strong>

  <p> Some sort of scheme for exception handling would be desirable.
      Presently the only thing documented is that divide by zero in GMP
      functions provokes a deliberate machine divide by zero (on those
      systems where such a thing exists at least).  The global
      <code>gmp_errno</code> is not actually documented, except for the old
      <code>gmp_randinit</code> function.  Being currently just a plain
      global means it's not thread-safe.

  <p> The basic choices for exceptions are returning an error code or having
      a handler function to be called.  The disadvantage of error returns is
      that they have to be checked, leading to tedious and rarely executed
      code, and strictly speaking such a scheme wouldn't be source or binary
      compatible.  The disadvantage of a handler function is that a
      <code>longjmp</code> or similar recovery from it may be difficult.  A
      combination would be possible, for instance by allowing the handler to
      return an error code.

  <p> Divide-by-zero, sqrt-of-negative, and similar operand range errors can
      normally be detected at the start of functions, so exception handling
      would have a clean state.  What's worth considering though is that the
      GMP function detecting the exception may have been called via some
      third party library or self contained application module, and hence
      have various bits of state to be cleaned up above it.  It'd be highly
      desirable for an exceptions scheme to allow for such cleanups.

  <p> The C++ destructor mechanism could help with cleanups both internally
      and externally, but being a plain C library we don't want to depend on
      that.

  <p> A C++ <code>throw</code> might be a good optional extra exceptions
      mechanism, perhaps under a build option.  With GCC,
      <code>-fexceptions</code> will add the necessary frame information to
      plain C code, or GMP could be compiled as C++.

  <p> Out-of-memory exceptions are expected to be handled by the
      <code>mp_set_memory_functions</code> routines, rather than being a
      prospective part of divide-by-zero etc.  Some similar considerations
      apply, but what differs is that out-of-memory can arise deep within
      GMP internals.  Even fundamental routines like <code>mpn_add_n</code>
      and <code>mpn_addmul_1</code> can use temporary memory (for instance
      on Cray vector systems).  Allowing for an error code return would
      require an awful lot of checking internally.  Perhaps it'd still be
      worthwhile, but it'd be a lot of changes and the extra code would
      probably be rather rarely executed in normal usage.

  <p> A <code>longjmp</code> recovery for out-of-memory will currently, in
      general, lead to memory leaks and may leave GMP variables operated on
      in inconsistent states.  Maybe it'd be possible to record recovery
      information for use by the relevant allocate or reallocate function,
      but that too would be a lot of changes.

  <p> One scheme for out-of-memory would be to note that all GMP allocations
      go through the <code>mp_set_memory_functions</code> routines.  So if
      the application has an intended <code>setjmp</code> recovery point it
      can record memory activity by GMP and abandon space allocated and
      variables initialized after that point.  This might be as simple as
      directing the allocation functions to a separate pool, but in general
      would have the disadvantage of needing application-level bookkeeping
      on top of the normal system <code>malloc</code>.  An advantage however
      is that it needs nothing from GMP itself and on that basis doesn't
      burden applications not needing recovery.  Note that there are
      probably some details to be worked out here about reallocs of existing
      variables, and perhaps about copying or swapping between "permanent"
      and "temporary" variables.
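  <p> A minimal sketch of that bookkeeping, from the application side.
      <code>mp_set_memory_functions</code> and its three function-pointer
      arguments are the real GMP interface; the fixed-size table, the names,
      and the rollback policy are purely illustrative (no overflow handling
      or compaction, it's a sketch).

<pre>
#include &lt;stdlib.h&gt;
#include &lt;gmp.h&gt;

#define POOL_MAX  10000
static void   *pool[POOL_MAX];    /* blocks GMP has asked for */
static size_t  pool_n;

static void *
tracked_alloc (size_t n)
{
  void *p = malloc (n);
  if (p == NULL) abort ();        /* or longjmp to the recovery point */
  pool[pool_n++] = p;
  return p;
}

static void *
tracked_realloc (void *old, size_t old_size, size_t new_size)
{
  size_t i;
  void *p = realloc (old, new_size);
  if (p == NULL) abort ();
  for (i = 0; i < pool_n; i++)    /* keep the record up to date */
    if (pool[i] == old)
      pool[i] = p;
  return p;
}

static void
tracked_free (void *p, size_t n)
{
  size_t i;
  for (i = 0; i < pool_n; i++)
    if (pool[i] == p)
      pool[i] = NULL;
  free (p);
}

/* after a longjmp, abandon everything allocated past a checkpoint */
static void
pool_rollback (size_t checkpoint)
{
  while (pool_n > checkpoint)
    free (pool[--pool_n]);        /* free (NULL) is harmless */
}

/* at startup:       mp_set_memory_functions (tracked_alloc,
                                              tracked_realloc, tracked_free);
   before risky work: remember checkpoint = pool_n, then setjmp (...)  */
</pre>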
  <p> Applications desiring fine-grained error control, for instance a
      language interpreter, would very possibly not be well served by a
      scheme requiring <code>longjmp</code>.  Wrapping every GMP function
      call with a <code>setjmp</code> would be very inconvenient.

  <p> Another option would be to let <code>mpz_t</code> etc hold a sort of
      NaN, a special value indicating an out-of-memory or other failure.
      This would be similar to NaNs in mpfr.  Unfortunately such a scheme
      could only be used by programs prepared to handle such special values,
      since for instance a program waiting for some condition to be
      satisfied could become an infinite loop if it wasn't also watching for
      NaNs.  The work to implement this would be significant too, lots of
      checking of inputs and intermediate results.  And if <code>mpn</code>
      routines were to participate in this (which they would have to
      internally) a lot of new return values would need to be added, since
      of course there's no <code>mpz_t</code> etc structure for them to
      indicate failure in.

  <p> Stack overflow is another possible exception, but perhaps not one that
      can be easily detected in general.  On i386 GNU/Linux for instance GCC
      normally doesn't generate stack probes for an <code>alloca</code>, but
      merely adjusts <code>%esp</code>.  A big enough <code>alloca</code>
      can miss the stack redzone and hit arbitrary data.  GMP stack usage is
      normally a function of operand size, which might be enough for some
      applications to know they'll be safe.  Otherwise a fixed maximum usage
      can probably be obtained by building with
      <code>--enable-alloca=malloc-reentrant</code> (or
      <code>notreentrant</code>).  Arranging the default to be
      <code>alloca</code> only on blocks up to a certain size and
      <code>malloc</code> thereafter might be a better approach and would
      have the advantage of not having calculations limited by available
      stack.

  <p> Actually recovering from stack overflow is of course another problem.
      It might be possible to catch a <code>SIGSEGV</code> in the stack
      redzone and do something in a <code>sigaltstack</code>, on systems
      which have that, but recovery might otherwise not be possible.  This
      is worth bearing in mind because there's no point worrying about tight
      and careful out-of-memory recovery if an out-of-stack is fatal.

  <p> Operand overflow is another exception to be addressed.  It's easy for
      instance to ask <code>mpz_pow_ui</code> for a result bigger than an
      <code>mpz_t</code> can possibly represent.  Currently overflows in
      limb or byte count calculations will go undetected.  Often they'll
      still end up asking the memory functions for blocks bigger than
      available memory, but that's by no means certain and results are
      unpredictable in general.  It'd be desirable to tighten up such size
      calculations.  Probably only selected routines would need checks, if
      it's assumed, say, that no input will be more than half of all memory,
      and hence that size additions in for instance <code>mpz_mul</code>
      won't overflow.


<li> <strong>Performance Tool</strong>

  <p> It'd be nice to have some sort of tool for getting an overview of
      performance.  Clearly a great many things could be done, but some
      primary uses would be:

  <ol>
    <li> Checking speed variations between compilers.
    <li> Checking relative performance between systems or CPUs.
  </ol>

  <p> A combination of measuring some fundamental routines and some
      representative application routines might satisfy these.

  <p> The tune/time.c routines would be the easiest way to get good accurate
      measurements on lots of different systems.  The high level
      <code>speed_measure</code> may or may not suit, but the basic
      <code>speed_starttime</code> and <code>speed_endtime</code> would
      cover lots of portability and accuracy questions.


<li> <strong>Using <code>restrict</code></strong>

  <p> There might be some value in judicious use of C99 style
      <code>restrict</code> on various pointers, but this would need some
      careful thought about what it implies for the various operand overlaps
      permitted in GMP.

  <p> Rumour has it some pre-C99 compilers had <code>restrict</code>, but
      expressing tighter (or perhaps looser) requirements.  Might be worth
      investigating that before using <code>restrict</code> unconditionally.

  <p> Loops are presumably where the greatest benefit would be had, by
      allowing the compiler to advance reads ahead of writes, perhaps as
      part of loop unrolling.  However, critical loops are generally coded
      in assembler, so there might not be very much to gain.  And on Cray
      systems the explicit use of <code>_Pragma</code> gives an equivalent
      effect.

  <p> One thing to note is that Microsoft C headers (on ia64 at least)
      contain <code>__declspec(restrict)</code>, so a <code>#define</code>
      of <code>restrict</code> should be avoided.  It might be wisest to set
      up a <code>gmp_restrict</code>.
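  <p> A possible shape for that, as a sketch only: GMP does not currently
      define such a macro, the compiler tests here are simplistic, and the
      example function is made up.

<pre>
/* gmp_restrict: the C99 keyword where available, the GCC extension
   otherwise, and nothing at all elsewhere */
#if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define gmp_restrict  restrict
#elif defined (__GNUC__)
#define gmp_restrict  __restrict__
#else
#define gmp_restrict
#endif

/* promising the compiler that destination and sources don't overlap */
void
add_vectors (unsigned long *gmp_restrict rp,
             const unsigned long *gmp_restrict up,
             const unsigned long *gmp_restrict vp, long n)
{
  long i;
  for (i = 0; i < n; i++)
    rp[i] = up[i] + vp[i];
}
</pre>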
<li> <strong>Nx1 Division</strong>

  <p> The limb-by-limb dependencies in the existing Nx1 division (and
      remainder) code mean that chips with multiple execution units or
      pipelined multipliers are not fully utilized.

  <p> One possibility is to follow the current preinv method but taking two
      limbs at a time.  That means a 2x2->4 and a 2x1->2 multiply for each
      two limbs processed, and because the 2x2 and 2x1 can each be done in
      parallel the latency will be not much more than 2 multiplies for two
      limbs, whereas the single limb method has a 2 multiply latency for
      just one limb.  A version of <code>mpn_divrem_1</code> doing this has
      been written in C, but not yet tested on likely chips.  Clearly this
      scheme would extend to 3x3->9 and 3x1->3 etc, though with diminishing
      returns.

  <p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
      scheme.  For a limb R=2^<code>bits_per_mp_limb</code>, pre-calculate
      values R mod N, R^2 mod N, R^3 mod N, R^4 mod N.  Then take dividend
      limbs and multiply them by those values, thereby reducing them (moving
      them down) by the corresponding factor.  The products can be added to
      produce an intermediate remainder of 2 or 3 limbs to be similarly
      included in the next step.  The point is that such multiplies can be
      done in parallel, meaning as little as 1 multiply worth of latency for
      4 limbs.  If the modulus N is less than R/4 (or is it R/5?) the summed
      products will fit in 2 limbs, otherwise 3 will be required, but with
      the high limb only being small.  Clearly this extends to as many
      factors of R as a chip can efficiently apply.

  <p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod
      N" for i up to k, the size of the dividend.  This could then be
      applied at multiplier throughput speed like an inner product.  If the
      powers took roughly k divide steps to calculate then there'd be an
      advantage any time the same N was used three or more times.  Suggested
      by Victor Shoup in connection with Chinese Remainder style
      decompositions, but perhaps with other uses.

  <p> <code>mpn_modexact_1_odd</code> calculates an x in the range
      0&lt;=x&lt;d satisfying a = q*d + x*b^n, where b=2^bits_per_limb.  The
      factor b^n needed to get the true remainder r could be calculated by a
      powering algorithm, allowing <code>mpn_modexact_1_odd</code> to be
      pressed into service for an <code>mpn_mod_1</code>.
      <code>modexact_1</code> is simpler and on some chips can run
      noticeably faster than plain <code>mod_1</code>, on Athlon for
      instance 11 cycles/limb instead of 17.  Such a difference could soon
      overcome the time to calculate b^n.  The requirement for an odd
      divisor in <code>modexact</code> can be handled by some shifting
      on-the-fly, or perhaps by an extra partial-limb step at the end.


<li> <strong>Factorial</strong>

  <p> The removal of twos in the current code could be extended to factors
      of 3 or 5.  Taking this to its logical conclusion would be a complete
      decomposition into powers of primes.  The power for a prime p is of
      course floor(n/p)+floor(n/p^2)+...  Conrad Curry found this is quite
      fast (using simultaneous powering as per Handbook of Applied
      Cryptography algorithm 14.88).
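  <p> That prime-power count is cheap to compute; a small C sketch of the
      formula (the function name is illustrative only):

<pre>
/* exponent of the prime p in n!, i.e. floor(n/p) + floor(n/p^2) + ... */
unsigned long
fac_prime_power (unsigned long n, unsigned long p)
{
  unsigned long e = 0;
  while (n >= p)
    {
      n /= p;     /* n/p, n/p^2, n/p^3, ... */
      e += n;
    }
  return e;
}
</pre>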
  <p> A difficulty with using all primes is that quite large n can be
      calculated on a system with enough memory, larger than we'd probably
      want for a table of primes, so some sort of sieving would be wanted.
      Perhaps just taking out the factors of 3 and 5 would give most of the
      speedup that a prime decomposition can offer.


<li> <strong>Binomial Coefficients</strong>

  <p> An obvious improvement to the current code would be to strip factors
      of 2 from each multiplier and divisor and count them separately, to be
      applied with a bit shift at the end.  Factors of 3 and perhaps 5 could
      even be handled similarly.

  <p> Conrad Curry reports a big speedup for binomial coefficients using a
      prime powering scheme, at least for k near n/2.  Of course this is
      only practical for moderate size n since again it requires primes up
      to n.

  <p> When k is small the current (n-k+1)...n/1...k will be fastest.  Some
      sort of rule would be needed for when to use this or when to use prime
      powering.  Such a rule will be a function of both n and k.  Some
      investigation is needed to see what sort of shape the crossover line
      will have; the usual parameter tuning can of course find machine
      dependent constants to fill in where necessary.

  <p> An easier possibility also reported by Conrad Curry is that it may be
      faster not to divide out the denominator (1...k) one limb at a time,
      but to do one big division at the end.  Is this because a big divisor
      in <code>mpn_bdivmod</code> trades the latency of
      <code>mpn_divexact_1</code> for the throughput of
      <code>mpn_submul_1</code>?  Overheads must hurt though.

  <p> Another reason a big divisor might help is that
      <code>mpn_divexact_1</code> won't be getting a full limb in
      <code>mpz_bin_uiui</code>.  It's called when the n accumulator is full
      but the k may be far from full.  Perhaps the two could be decoupled so
      k is applied when full.  It'd be necessary to delay consideration of k
      terms until the corresponding n terms had been applied though, since
      otherwise the division won't be exact.


<li> <strong>Perfect Power Testing</strong>

  <p> <code>mpz_perfect_power_p</code> could be improved in a number of
      ways, for instance using p-adic arithmetic to find possible roots.

  <p> Non-powers can be quickly identified by checking for Nth power
      residues modulo small primes, like <code>mpn_perfect_square_p</code>
      does for squares.  The residues to each power N for a given remainder
      could be grouped into a bit mask, and the masks for the remainders to
      each divisor would then be "and"ed together to hopefully leave only a
      few candidate powers.  Need to think about how wide to make such
      masks, i.e. how many powers to examine in this way.  (A small sketch
      of the mask idea appears at the end of this item.)

  <p> Any zero remainders found in residue testing reveal factors which can
      be divided out, with the multiplicity restricting the powers that need
      to be considered, as per the current code.  Further prime dividing
      should be grouped into limbs like <code>PP</code>.  Need to think
      about how much dividing to do like that, probably more for bigger
      inputs, less for smaller inputs.

  <p> <code>mpn_gcd_1</code> would probably be better than the current
      private GCD routine.  The use it's put to isn't time-critical, and it
      might help ensure correctness to just use the main GCD routine.

  <p> [There is work-in-progress with a very fast function.]
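  <p> The residue-mask idea sketched in C for one small prime; purely
      illustrative, with a made-up name and a simplistic powering loop.  The
      masks for several primes q, indexed by N mod q and "and"ed together,
      leave only a few exponents to test by actual root extraction.

<pre>
#define MAXPOW 31              /* examine exponents 2..MAXPOW */

/* mask[r] gets bit n set when r is an nth power residue mod q */
void
build_power_masks (unsigned q, unsigned long *mask)   /* mask[0..q-1] */
{
  unsigned n, r;

  for (r = 0; r < q; r++)
    mask[r] = 0;

  for (n = 2; n <= MAXPOW; n++)
    for (r = 0; r < q; r++)
      {
        unsigned long p = 1;
        unsigned i;
        for (i = 0; i < n; i++)
          p = p * r % q;                 /* r^n mod q */
        mask[p] |= 1UL << n;
      }
}

/* a clear bit n in the "and" of mask[N mod q] over several primes q
   proves N is not an nth power */
</pre>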
<li> <strong>Prime Testing</strong>

  <p> GMP is not really a number theory library and probably shouldn't have
      large amounts of code dedicated to sophisticated prime testing
      algorithms, but basic things well-implemented would suit.  Tests
      offering certainty are probably all too big or too slow (or both!) to
      justify inclusion in the main library.  Demo programs showing some
      possibilities would be good though.

  <p> The present "repetitions" argument to <code>mpz_probab_prime_p</code>
      is rather specific to the Miller-Rabin tests of the current
      implementation.  Better would be some sort of parameter asking perhaps
      for a maximum chance 1/2^x of a probable prime in fact being
      composite.  If applications follow the advice that the present reps
      gives a 1/4^reps chance then perhaps such a change is unnecessary, but
      an explicitly described 1/2^x would allow for changes in the
      implementation or even for new proofs about the theory.

  <p> <code>mpz_probab_prime_p</code> always initializes a new
      <code>gmp_randstate_t</code> for randomized tests, which unfortunately
      means it's not really very random and in particular always runs the
      same tests for a given input.  Perhaps a new interface could accept an
      rstate to use, so successive tests could increase confidence in the
      result.

  <p> <code>mpn_mod_34lsub1</code> is an obvious and easy improvement to the
      trial divisions.  And since the various prime factors are constants,
      the remainder can be tested with something like
<pre>
#define MP_LIMB_DIVISIBLE_7_P(n) \
  ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
</pre>
      This would help compilers that don't know how to optimize divisions by
      constants, and is even an improvement on current gcc 3.2 code.  The
      technique works for any modulus; see Granlund and Montgomery "Division
      by Invariant Integers" section 9.
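  <p> A self-contained 32-bit illustration of the same test.  The constants
      here assume 32-bit limbs and are spelled out rather than taken from
      GMP's headers; the function and macro names are made up.

<pre>
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define INVERSE_7   UINT32_C (0xB6DB6DB7)  /* 7 * 0xB6DB6DB7 == 1 (mod 2^32) */
#define MAX_OVER_7  (UINT32_MAX / 7)

/* n*INVERSE_7 wraps to a value <= floor(2^32-1 / 7) exactly when 7 | n */
static int
divisible_7_p (uint32_t n)
{
  return (uint32_t) (n * INVERSE_7) <= MAX_OVER_7;
}

int
main (void)
{
  uint32_t n;
  for (n = 0; n < 30; n++)
    if (divisible_7_p (n))
      printf ("%u is divisible by 7\n", n);
  return 0;
}
</pre>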
  <p> The trial divisions are done with primes generated and grouped at
      runtime.  This could instead be a table of data, with pre-calculated
      inverses too.  Storing deltas, i.e. amounts to add, rather than actual
      primes would save space.  <code>udiv_qrnnd_preinv</code> style
      inverses can be made to exist by adding dummy factors of 2 if
      necessary.  Some thought needs to be given as to how big such a table
      should be, based on how much dividing would be profitable for what
      sort of size inputs.  The data could be shared by the perfect power
      testing.

  <p> Jason Moxham points out that if a sqrt(-1) mod N exists then any
      factor of N must be == 1 mod 4, saving half the work in trial
      dividing.  (If x^2==-1 mod N then for a prime factor p we have
      x^2==-1 mod p and so the jacobi symbol (-1/p)=1.  But also
      (-1/p)=(-1)^((p-1)/2), hence must have p==1 mod 4.)  But knowing
      whether sqrt(-1) mod N exists is not too easy.  A strong pseudoprime
      test can reveal one, so perhaps such a test could be inserted part way
      through the dividing.

  <p> Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes
      a quadratic pseudoprime test taking about 3x longer than a plain test,
      but with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin
      tests would offer only (1/4)^3 == 1/64).  Such a test needs completely
      random parameters to satisfy the theory, though single-limb values
      would run faster.  It's probably best to do at least one plain
      Miller-Rabin before any quadratic tests, since that can identify
      composites in less total time.

  <p> Some thought needs to be given to which tests (trial division,
      Miller-Rabin, quadratic) are done and how many, based on what sort of
      inputs we expect, with a view to minimizing average time.

  <p> It might be a good idea to break out subroutines for the various
      tests, so that an application can combine them in ways it prefers, if
      sensible defaults in <code>mpz_probab_prime_p</code> don't suit.  In
      particular this would let applications skip tests they know would be
      unprofitable, like trial dividing when an input is already known to
      have no small factors.

  <p> For small inputs, combinations of theory and explicit search make it
      relatively easy to offer certainty.  For instance numbers up to 2^32
      could be handled with a strong pseudoprime test and table lookup.  But
      it's rather doubtful whether a smallnum prime test belongs in a bignum
      library.  Perhaps if it had other internal uses.

  <p> An <code>mpz_nthprime</code> might be cute, but is almost certainly
      impractical for anything but small n.


<li> <strong>Intra-Library Calls</strong>

  <p> On various systems, calls within libgmp still go through the PLT, TOC
      or other mechanism, which makes the code bigger and slower than it
      needs to be.

  <p> The theory would be to have all GMP intra-library calls resolved
      directly to the routines in the library.  An application wouldn't be
      able to replace a routine, the way it can normally, but there seems no
      good reason to do that, in normal circumstances.

  <p> The <code>visibility</code> attribute in recent gcc is good for this,
      because it lets gcc omit unnecessary GOT pointer setups or whatever if
      it finds all calls are local and there are no global data references.
      Documented entrypoints would be <code>protected</code>, and purely
      internal things not wanted by test programs or anything can be
      <code>internal</code>.

  <p> Unfortunately, on i386 it seems <code>protected</code> ends up causing
      text segment relocations within libgmp.so, meaning the library code
      can't be shared between processes, defeating the purpose of a shared
      library.  Perhaps this is just a gremlin in binutils (debian packaged
      2.13.90.0.16-1).

  <p> The linker can be told directly (with a link script, or options) to do
      the same sort of thing.  This doesn't change the code emitted by gcc
      of course, but it does mean calls are resolved directly to their
      targets, avoiding a PLT entry.

  <p> Keeping symbols private to libgmp.so is probably a good thing in
      general too, to stop anyone even attempting to access them.  But some
      undocumented things will need or want to be kept visible, for use by
      mpfr, or the test and tune programs.  Libtool has a standard option
      for selecting public symbols (used now for libmp).


</ul>
<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->