1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2<html>
3<head>
4  <title>GMP Development Projects</title>
5  <link rel="shortcut icon" href="favicon.ico">
6  <link rel="stylesheet" href="gmp.css">
7  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
8</head>
9
10<center>
11  <h1>
12    GMP Development Projects
13  </h1>
14</center>
15
16<font size=-1>
17<pre>
18Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009 Free Software
19Foundation, Inc.
20
21This file is part of the GNU MP Library.
22
23The GNU MP Library is free software; you can redistribute it and/or modify
24it under the terms of the GNU Lesser General Public License as published
25by the Free Software Foundation; either version 3 of the License, or (at
26your option) any later version.
27
28The GNU MP Library is distributed in the hope that it will be useful, but
29WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
30or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
31License for more details.
32
33You should have received a copy of the GNU Lesser General Public License
34along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
35</pre>
36</font>
37
38<hr>
39<!-- NB. timestamp updated automatically by emacs -->
40  This file current as of 15 Nov 2009.  An up-to-date version is available at
41  <a href="http://gmplib.org/projects.html">http://gmplib.org/projects.html</a>.
42  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.
43
44<p> This file lists projects suitable for volunteers.  Please see the
45    <a href="tasks.html">tasks file</a> for smaller tasks.
46
<p> If you want to work on any of the projects below, please let
    gmp-devel<font>@</font>gmplib.org know.  If you want to help with a project
    that somebody else is already working on, you can get in touch through
    gmp-devel<font>@</font>gmplib.org.  (No volunteer email addresses are
    listed below, due to spamming problems.)
52
53<ul>
54<li> <strong>Faster multiplication</strong>
55
56  <p> The current multiplication code uses Karatsuba, 3-way and 4-way Toom, and
57      Fermat FFT.  Several new developments are desirable:
58
59  <ol>
60
61    <li> Write more toom multiply functions for unbalanced operands.  We now have
62	 toom22, toom32, toom42, toom62, toom33, toom53, and toom44.  Most
63	 desirable is toom43, which will require a new toom_interpolate_6pts
64	 function.  Writing toom52 will then be straightforward.  See also
65	 <a href="http://bodrato.it/software/toom.html">Marco Bodrato's
	 site</a>.
67
68    <li> Perhaps consider N-way Toom, N > 4.  See Knuth's Seminumerical
69	 Algorithms for details on the method, as well as Bodrato's site.  Code
70	 implementing it exists.  This is asymptotically inferior to FFTs, but
71	 is finer grained.
72
    <li> The mpn_mul call now (from GMP 4.3) uses toom22, toom32, and toom42
	 for unbalanced operands.  We currently don't use any of the other new
	 toom functions.  Write clever code for choosing the best toom function
	 given an m-limb and an n-limb operand (see the sketch after this
	 list).
77
78    <li> Implement an FFT variant computing the coefficients mod m different
	 limb-sized primes of the form l*2^k+1, i.e., compute m separate FFTs.
	 The wanted coefficients are recovered at the end by lifting with CRT
81	 (Chinese Remainder Theorem).  If we let m = 3, i.e., use 3 primes, we
82	 can split the operands into coefficients at limb boundaries, and if
83	 our machine uses b-bit limbs, we can multiply numbers with close to
84	 2^b limbs without coefficient overflow.  For smaller multiplication,
85	 we might perhaps let m = 1, and instead of splitting our operands at
86	 limb boundaries, split them in much smaller pieces.  We might also use
	 4 or more primes, and split operands into chunks of more than b bits.
	 Using more primes gives shorter transforms but more of them, which is
	 a slight net saving; the real extra cost is that the CRT
	 reconstruction becomes more expensive. <br><br>
91
	 <p> [We now have two implementations of this algorithm, one by Tommy
	 Färnqvist and one by Niels Möller.]
94
95    <li> Add support for short products, either a given number of low limbs, a
96	 given number of high limbs, or perhaps the middle limbs of the result.
97	 High short product can be used by <code>mpf_mul</code>, by
98	 left-to-right Newton approximations, and for quotient approximation.
99	 Low half short product can be of use in sub-quadratic REDC and for
100	 right-to-left Newton approximations.  On small sizes a short product
101	 will be faster simply through fewer cross-products, similar to the way
102	 squaring is faster.  But work by Thom Mulders shows that for Karatsuba
103	 and higher order algorithms the advantage is progressively lost, so
	 for large sizes short products turn out to be no faster.
105
106  </ol>
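
  <p> As a starting point for the operand-balance question in item 3 above,
      the choice might key off the ratio of the limb counts.  The following
      is only an illustrative sketch: the cutoff ratios are guesses, and real
      code would also have to respect the usual tuned size thresholds before
      choosing any toom variant at all.

<pre>
/* Illustrative sketch only: pick an unbalanced toom variant from the limb
   counts m &gt;= n, keying off the ratio m/n that each toomXY split (X pieces
   by Y pieces) is aimed at.  The cutoffs are guesses, not tuned values.  */
const char *
choose_toom (long m, long n)
{
  if (4 * m &lt; 5 * n)            /* m/n &lt; 1.25: essentially balanced      */
    return "toom22, toom33 or toom44";
  else if (4 * m &lt; 7 * n)       /* 1.25 &lt;= m/n &lt; 1.75                    */
    return "toom32 (or toom53; toom43 once written)";
  else if (2 * m &lt; 5 * n)       /* 1.75 &lt;= m/n &lt; 2.5                     */
    return "toom42";
  else                          /* very lopsided                         */
    return "toom62, or split the larger operand into n-limb chunks";
}
</pre>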
107
108  <p> Another possibility would be an optimized cube.  In the basecase that
109      should definitely be able to save cross-products in a similar fashion to
110      squaring, but some investigation might be needed for how best to adapt
111      the higher-order algorithms.  Not sure whether cubing or further small
112      powers have any particularly important uses though.
113
114
115<li> <strong>Assembly routines</strong>
116
117  <p> Write new and improve existing assembly routines.  The tests/devel
118      programs and the tune/speed.c and tune/many.pl programs are useful for
119      testing and timing the routines you write.  See the README files in those
120      directories for more information.
121
122  <p> Please make sure your new routines are fast for these three situations:
123      <ol>
124	<li> Operands that fit into the cache.
125	<li> Small operands of less than, say, 10 limbs.
	<li> Huge operands that do not fit into the cache.
127      </ol>
128
129  <p> The most important routines are mpn_addmul_1, mpn_mul_basecase and
130      mpn_sqr_basecase.  The latter two don't exist for all machines, while
131      mpn_addmul_1 exists for almost all machines.
132
133  <p> Standard techniques for these routines are unrolling, software
134      pipelining, and specialization for common operand values.  For machines
135      with poor integer multiplication, it is sometimes possible to remedy the
      situation using floating-point operations or SIMD operations such as
      MMX (x86), SSE (x86), VMX (PowerPC), or VIS (Sparc).
138
139  <p> Using floating-point operations is interesting but somewhat tricky.
      Since an IEEE double has 53 bits of mantissa, one has to split the
      operands into small pieces, so that no intermediate is greater than
      2^53.  For
142      32-bit computers, splitting one operand into 16-bit pieces works.  For
143      64-bit machines, one operand can be split into 21-bit pieces and the
144      other into 32-bit pieces.  (A 64-bit operand can be split into just three
145      21-bit pieces if one allows the split operands to be negative!)
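
  <p> A minimal illustration of the 32-bit case in plain C (a toy, not GMP
      code): splitting one operand into 16-bit pieces keeps each partial
      product at 16+32 = 48 bits, safely below the 53-bit mantissa, so both
      double-precision multiplies are exact.

<pre>
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Toy 32x32-&gt;64 bit multiply done with IEEE doubles.  */
uint64_t
mul32_via_double (uint32_t a, uint32_t b)
{
  double a_lo = (double) (a &amp; 0xffff);      /* low 16 bits of a  */
  double a_hi = (double) (a &gt;&gt; 16);         /* high 16 bits of a */
  double d_b  = (double) b;

  uint64_t p_lo = (uint64_t) (a_lo * d_b);  /* exact, at most 48 bits */
  uint64_t p_hi = (uint64_t) (a_hi * d_b);  /* exact, at most 48 bits */

  return (p_hi &lt;&lt; 16) + p_lo;               /* recombine with integer adds */
}

int
main (void)
{
  uint32_t a = 0xdeadbeef, b = 0xcafebabe;
  assert (mul32_via_double (a, b) == (uint64_t) a * b);
  printf ("%llx\n", (unsigned long long) mul32_via_double (a, b));
  return 0;
}
</pre>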
146
147
148<li> <strong>Math functions for the mpf layer</strong>
149
  <p> Implement the functions of math.h for the GMP mpf layer!  Check the book
      "Pi and the AGM" by Borwein and Borwein for ideas on how to do this.
      These functions are desirable: acos, acosh, asin, asinh, atan, atanh,
      atan2, cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
154
  <p> Note that <a href="http://mpfr.org">mpfr</a> already provides these
      functions, and that we usually recommend that new programs use mpfr
      instead of mpf.
158
159
160<li> <strong>Faster sqrt</strong>
161
162  <p> The current code uses divisions, which are reasonably fast, but it'd be
163      possible to use only multiplications by computing 1/sqrt(A) using this
164      iteration:
165      <pre>
166				    2
167		   x   = x  (3 &minus; A x )/2
168		    i+1	  i	    i  </pre>
169      The square root can then be computed like this:
170      <pre>
171		     sqrt(A) = A x
172				  n  </pre>
173  <p> That final multiply might be the full size of the input (though it might
174      only need the high half of that), so there may or may not be any speedup
175      overall.
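
  <p> A toy double-precision illustration of the iteration (not how an mpn
      version would be organized, but it shows the multiplication-only
      character and the quadratic convergence):

<pre>
#include &lt;stdio.h&gt;

int
main (void)
{
  double A = 2.0;
  double x = 0.7;                   /* rough initial guess for 1/sqrt(2) */
  for (int i = 0; i &lt; 5; i++)
    x = x * (3.0 - A * x * x) / 2.0;
  printf ("sqrt(%g) ~= %.17g\n", A, A * x);   /* 1.4142135623730951 */
  return 0;
}
</pre>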
176
177  <p> We should probably allow a special exponent-like parameter, to speed
178      computations of a precise square root of a small number in mpf and mpfr.
179
180
181<li> <strong>Nth root</strong>
182
183  <p> Improve mpn_rootrem.  The current code is not too bad, but its average
184      time complexity is a function of the input, while it is possible to
185      make it a function of the output.
186
187
188<li> <strong>Exceptions</strong>
189
  <p> Some sort of scheme for exception handling would be desirable.
191      Presently the only thing documented is that divide by zero in GMP
192      functions provokes a deliberate machine divide by zero (on those systems
193      where such a thing exists at least).  The global <code>gmp_errno</code>
194      is not actually documented, except for the old <code>gmp_randinit</code>
195      function.  Being currently just a plain global means it's not
196      thread-safe.
197
198  <p> The basic choices for exceptions are returning an error code or having a
199      handler function to be called.  The disadvantage of error returns is they
200      have to be checked, leading to tedious and rarely executed code, and
201      strictly speaking such a scheme wouldn't be source or binary compatible.
202      The disadvantage of a handler function is that a <code>longjmp</code> or
203      similar recovery from it may be difficult.  A combination would be
204      possible, for instance by allowing the handler to return an error code.
205
206  <p> Divide-by-zero, sqrt-of-negative, and similar operand range errors can
207      normally be detected at the start of functions, so exception handling
208      would have a clean state.  What's worth considering though is that the
209      GMP function detecting the exception may have been called via some third
      party library or self-contained application module, and hence have
211      various bits of state to be cleaned up above it.  It'd be highly
212      desirable for an exceptions scheme to allow for such cleanups.
213
214  <p> The C++ destructor mechanism could help with cleanups both internally and
215      externally, but being a plain C library we don't want to depend on that.
216
217  <p> A C++ <code>throw</code> might be a good optional extra exceptions
218      mechanism, perhaps under a build option.  For
219      GCC <code>-fexceptions</code> will add the necessary frame information to
220      plain C code, or GMP could be compiled as C++.
221
222  <p> Out-of-memory exceptions are expected to be handled by the
223      <code>mp_set_memory_functions</code> routines, rather than being a
224      prospective part of divide-by-zero etc.  Some similar considerations
225      apply but what differs is that out-of-memory can arise deep within GMP
226      internals.  Even fundamental routines like <code>mpn_add_n</code> and
227      <code>mpn_addmul_1</code> can use temporary memory (for instance on Cray
228      vector systems).  Allowing for an error code return would require an
229      awful lot of checking internally.  Perhaps it'd still be worthwhile, but
230      it'd be a lot of changes and the extra code would probably be rather
231      rarely executed in normal usages.
232
233  <p> A <code>longjmp</code> recovery for out-of-memory will currently, in
234      general, lead to memory leaks and may leave GMP variables operated on in
235      inconsistent states.  Maybe it'd be possible to record recovery
236      information for use by the relevant allocate or reallocate function, but
237      that too would be a lot of changes.
238
239  <p> One scheme for out-of-memory would be to note that all GMP allocations go
240      through the <code>mp_set_memory_functions</code> routines.  So if the
241      application has an intended <code>setjmp</code> recovery point it can
242      record memory activity by GMP and abandon space allocated and variables
243      initialized after that point.  This might be as simple as directing the
244      allocation functions to a separate pool, but in general would have the
245      disadvantage of needing application-level bookkeeping on top of the
246      normal system <code>malloc</code>.  An advantage however is that it needs
247      nothing from GMP itself and on that basis doesn't burden applications not
248      needing recovery.  Note that there's probably some details to be worked
      needing recovery.  Note that there are probably some details to be worked
250      or swapping between "permanent" and "temporary" variables.
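
  <p> A minimal sketch of that bookkeeping idea, using only the documented
      <code>mp_set_memory_functions</code> interface.  It merely counts the
      bytes GMP currently holds, which a <code>setjmp</code> recovery point
      could consult; abandoning individual blocks, reallocs of pre-existing
      variables and thread safety are all ignored here.

<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;gmp.h&gt;

static size_t gmp_bytes_live = 0;   /* rough count of bytes GMP holds */

static void *
count_alloc (size_t n)
{
  gmp_bytes_live += n;
  return malloc (n);
}

static void *
count_realloc (void *p, size_t old_size, size_t new_size)
{
  gmp_bytes_live += new_size - old_size;
  return realloc (p, new_size);
}

static void
count_free (void *p, size_t n)
{
  gmp_bytes_live -= n;
  free (p);
}

int
main (void)
{
  mpz_t x;

  /* Must be installed before any GMP variable is initialized.  */
  mp_set_memory_functions (count_alloc, count_realloc, count_free);

  mpz_init_set_ui (x, 1);
  mpz_mul_2exp (x, x, 100000);                  /* force some allocation */
  printf ("live: %lu bytes\n", (unsigned long) gmp_bytes_live);
  mpz_clear (x);
  printf ("after clear: %lu bytes\n", (unsigned long) gmp_bytes_live);
  return 0;
}
</pre>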
251
252  <p> Applications desiring a fine-grained error control, for instance a
253      language interpreter, would very possibly not be well served by a scheme
254      requiring <code>longjmp</code>.  Wrapping every GMP function call with a
255      <code>setjmp</code> would be very inconvenient.
256
257  <p> Another option would be to let <code>mpz_t</code> etc hold a sort of NaN,
258      a special value indicating an out-of-memory or other failure.  This would
259      be similar to NaNs in mpfr.  Unfortunately such a scheme could only be
260      used by programs prepared to handle such special values, since for
261      instance a program waiting for some condition to be satisfied could
262      become an infinite loop if it wasn't also watching for NaNs.  The work to
263      implement this would be significant too, lots of checking of inputs and
264      intermediate results.  And if <code>mpn</code> routines were to
265      participate in this (which they would have to internally) a lot of new
266      return values would need to be added, since of course there's no
267      <code>mpz_t</code> etc structure for them to indicate failure in.
268
269  <p> Stack overflow is another possible exception, but perhaps not one that
270      can be easily detected in general.  On i386 GNU/Linux for instance GCC
271      normally doesn't generate stack probes for an <code>alloca</code>, but
272      merely adjusts <code>%esp</code>.  A big enough <code>alloca</code> can
273      miss the stack redzone and hit arbitrary data.  GMP stack usage is
274      normally a function of operand size, which might be enough for some
275      applications to know they'll be safe.  Otherwise a fixed maximum usage
276      can probably be obtained by building with
277      <code>--enable-alloca=malloc-reentrant</code> (or
278      <code>notreentrant</code>).  Arranging the default to be
279      <code>alloca</code> only on blocks up to a certain size and
280      <code>malloc</code> thereafter might be a better approach and would have
281      the advantage of not having calculations limited by available stack.
282
283  <p> Actually recovering from stack overflow is of course another problem.  It
284      might be possible to catch a <code>SIGSEGV</code> in the stack redzone
285      and do something in a <code>sigaltstack</code>, on systems which have
286      that, but recovery might otherwise not be possible.  This is worth
287      bearing in mind because there's no point worrying about tight and careful
288      out-of-memory recovery if an out-of-stack is fatal.
289
290  <p> Operand overflow is another exception to be addressed.  It's easy for
291      instance to ask <code>mpz_pow_ui</code> for a result bigger than an
292      <code>mpz_t</code> can possibly represent.  Currently overflows in limb
293      or byte count calculations will go undetected.  Often they'll still end
294      up asking the memory functions for blocks bigger than available memory,
295      but that's by no means certain and results are unpredictable in general.
296      It'd be desirable to tighten up such size calculations.  Probably only
297      selected routines would need checks, if it's assumed say that no input
298      will be more than half of all memory and hence size additions like say
299      <code>mpz_mul</code> won't overflow.
300
301
302<li> <strong>Performance Tool</strong>
303
304  <p> It'd be nice to have some sort of tool for getting an overview of
305      performance.  Clearly a great many things could be done, but some primary
306      uses would be,
307
308      <ol>
309	<li> Checking speed variations between compilers.
310	<li> Checking relative performance between systems or CPUs.
311      </ol>
312
313  <p> A combination of measuring some fundamental routines and some
314      representative application routines might satisfy these.
315
316  <p> The tune/time.c routines would be the easiest way to get good accurate
317      measurements on lots of different systems.  The high level
318      <code>speed_measure</code> may or may not suit, but the basic
319      <code>speed_starttime</code> and <code>speed_endtime</code> would cover
320      lots of portability and accuracy questions.
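
  <p> A rough sketch of what timing one routine might look like, assuming
      GMP's internal tune/speed.h and tune/time.c (which are not installed
      headers, so such a tool would have to live in the GMP tree).
      Initializing the timing method and repeating the measurement to get a
      stable minimum are omitted.

<pre>
#include "speed.h"

double
time_one_mpn_add_n (mp_ptr wp, mp_srcptr xp, mp_srcptr yp, mp_size_t n)
{
  speed_starttime ();
  mpn_add_n (wp, xp, yp, n);
  return speed_endtime ();      /* elapsed time in seconds */
}
</pre>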
321
322
323<li> <strong>Using <code>restrict</code></strong>
324
325  <p> There might be some value in judicious use of C99 style
326      <code>restrict</code> on various pointers, but this would need some
327      careful thought about what it implies for the various operand overlaps
328      permitted in GMP.
329
330  <p> Rumour has it some pre-C99 compilers had <code>restrict</code>, but
331      expressing tighter (or perhaps looser) requirements.  Might be worth
332      investigating that before using <code>restrict</code> unconditionally.
333
334  <p> Loops are presumably where the greatest benefit would be had, by allowing
335      the compiler to advance reads ahead of writes, perhaps as part of loop
      unrolling.  However, critical loops are generally coded in assembler, so
337      there might not be very much to gain.  And on Cray systems the explicit
338      use of <code>_Pragma</code> gives an equivalent effect.
339
340  <p> One thing to note is that Microsoft C headers (on ia64 at least) contain
341      <code>__declspec(restrict)</code>, so a <code>#define</code> of
      <code>restrict</code> should be avoided.  It might be wisest to set up a
343      <code>gmp_restrict</code>.
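
  <p> A sketch of what such a <code>gmp_restrict</code> setup might look
      like; the C99 test here is only indicative, a real version would come
      from configure.

<pre>
#include &lt;gmp.h&gt;

#if defined (__STDC_VERSION__) &amp;&amp; __STDC_VERSION__ &gt;= 199901L
#define gmp_restrict restrict
#else
#define gmp_restrict
#endif

/* Example use: a copy loop whose caller guarantees the regions are
   disjoint, which gmp_restrict lets the compiler exploit.  */
void
copy_limbs (mp_limb_t *gmp_restrict dst,
            const mp_limb_t *gmp_restrict src, long n)
{
  long i;
  for (i = 0; i &lt; n; i++)
    dst[i] = src[i];
}
</pre>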
344
345
346<li> <strong>Nx1 Division</strong>
347
  <p> The limb-by-limb dependencies in the existing Nx1 division (and
      remainder) code mean that chips with multiple execution units or
      pipelined multipliers are not fully utilized.
351
352  <p> One possibility is to follow the current preinv method but taking two
353      limbs at a time.  That means a 2x2-&gt;4 and a 2x1-&gt;2 multiply for
      each two limbs processed, and because the 2x2 and 2x1 can be done in
      parallel the latency will not be much more than 2 multiplies for two
356      limbs, whereas the single limb method has a 2 multiply latency for just
357      one limb.  A version of <code>mpn_divrem_1</code> doing this has been
358      written in C, but not yet tested on likely chips.  Clearly this scheme
359      would extend to 3x3-&gt;9 and 3x1-&gt;3 etc, though with diminishing
360      returns.
361
362  <p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
363      scheme.  For a limb R=2^<code>bits_per_mp_limb</code>, pre-calculate
364      values R mod N, R^2 mod N, R^3 mod N, R^4 mod N.  Then take dividend
365      limbs and multiply them by those values, thereby reducing them (moving
366      them down) by the corresponding factor.  The products can be added to
367      produce an intermediate remainder of 2 or 3 limbs to be similarly
368      included in the next step.  The point is that such multiplies can be done
369      in parallel, meaning as little as 1 multiply worth of latency for 4
370      limbs.  If the modulus N is less than R/4 (or is it R/5?) the summed
371      products will fit in 2 limbs, otherwise 3 will be required, but with the
372      high only being small.  Clearly this extends to as many factors of R as a
373      chip can efficiently apply.
374
375  <p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N"
376      for i up to k, the size of the dividend.  This could then be applied at
377      multiplier throughput speed like an inner product.  If the powers took
378      roughly k divide steps to calculate then there'd be an advantage any time
379      the same N was used three or more times.  Suggested by Victor Shoup in
380      connection with chinese-remainder style decompositions, but perhaps with
381      other uses.
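
  <p> The following toy model (32-bit "limbs" in plain C, not mpn code)
      shows why the approach works: with p[i] = R^i mod N the remainder is
      just an inner product reduced mod N, and the limb-by-limb multiplies
      are independent.  A real implementation would precompute the p[i]
      table once per N and issue those multiplies in parallel.

<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

uint32_t
mod_by_powers (const uint32_t *a, int k, uint32_t N)
{
  uint64_t R_mod_N = ((uint64_t) 1 &lt;&lt; 32) % N;
  uint64_t p = 1 % N;                       /* R^0 mod N */
  uint64_t acc = 0;

  for (int i = 0; i &lt; k; i++)               /* least significant limb first */
    {
      acc = (acc + (uint64_t) a[i] * p) % N;
      p = (p * R_mod_N) % N;
    }
  return (uint32_t) acc;
}

int
main (void)
{
  uint64_t x = 0x123456789abcdef0;
  uint32_t a[2] = { (uint32_t) x, (uint32_t) (x &gt;&gt; 32) };
  printf ("%u %u\n", mod_by_powers (a, 2, 1000003), (uint32_t) (x % 1000003));
  return 0;
}
</pre>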
382
383  <p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0&lt;=x&lt;d
384      satisfying a = q*d + x*b^n, where b=2^bits_per_limb.  The factor b^n
385      needed to get the true remainder r could be calculated by a powering
386      algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
387      service for an <code>mpn_mod_1</code>.  <code>modexact_1</code> is
388      simpler and on some chips can run noticeably faster than plain
389      <code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
390      Such a difference could soon overcome the time to calculate b^n.  The
391      requirement for an odd divisor in <code>modexact</code> can be handled by
392      some shifting on-the-fly, or perhaps by an extra partial-limb step at the
393      end.
394
395
396<li> <strong>Factorial</strong>
397
398  <p> The removal of twos in the current code could be extended to factors of 3
399      or 5.  Taking this to its logical conclusion would be a complete
400      decomposition into powers of primes.  The power for a prime p is of
401      course floor(n/p)+floor(n/p^2)+...  Conrad Curry found this is quite fast
402      (using simultaneous powering as per Handbook of Applied Cryptography
403      algorithm 14.88).
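
  <p> The per-prime exponent is cheap to compute; a helper like the
      following plain C sketch is all the decomposition step needs for each
      prime.

<pre>
/* Exponent of the prime p in n!: floor(n/p) + floor(n/p^2) + ...  */
unsigned long
prime_exponent_in_factorial (unsigned long n, unsigned long p)
{
  unsigned long e = 0;
  while (n &gt;= p)
    {
      n /= p;
      e += n;
    }
  return e;
}
</pre>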
404
  <p> A difficulty with using all primes is that on a system with enough
      memory n can be quite large, larger than we'd probably want to cover
      with a table of primes, so some sort of sieving would be wanted.
408      just taking out the factors of 3 and 5 would give most of the speedup
409      that a prime decomposition can offer.
410
411
412<li> <strong>Binomial Coefficients</strong>
413
414  <p> An obvious improvement to the current code would be to strip factors of 2
415      from each multiplier and divisor and count them separately, to be applied
416      with a bit shift at the end.  Factors of 3 and perhaps 5 could even be
417      handled similarly.
418
419  <p> Conrad Curry reports a big speedup for binomial coefficients using a
420      prime powering scheme, at least for k near n/2.  Of course this is only
421      practical for moderate size n since again it requires primes up to n.
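
  <p> For reference, the exponents such a powering scheme needs come
      straight from Legendre's formula, using the same per-prime factorial
      exponent as in the factorial item above (plain C sketch):

<pre>
static unsigned long
fact_exp (unsigned long m, unsigned long p)   /* exponent of p in m! */
{
  unsigned long e = 0;
  while (m &gt;= p)
    {
      m /= p;
      e += m;
    }
  return e;
}

/* Exponent of the prime p in binomial(n,k), for 0 &lt;= k &lt;= n.  */
unsigned long
binom_exp (unsigned long n, unsigned long k, unsigned long p)
{
  return fact_exp (n, p) - fact_exp (k, p) - fact_exp (n - k, p);
}
</pre>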
422
423  <p> When k is small the current (n-k+1)...n/1...k will be fastest.  Some sort
424      of rule would be needed for when to use this or when to use prime
425      powering.  Such a rule will be a function of both n and k.  Some
426      investigation is needed to see what sort of shape the crossover line will
      have; the usual parameter tuning can of course find machine-dependent
428      constants to fill in where necessary.
429
430  <p> An easier possibility also reported by Conrad Curry is that it may be
      faster not to divide out the denominator (1...k) one limb at a time, but
432      do one big division at the end.  Is this because a big divisor in
433      <code>mpn_bdivmod</code> trades the latency of
434      <code>mpn_divexact_1</code> for the throughput of
435      <code>mpn_submul_1</code>?  Overheads must hurt though.
436
437  <p> Another reason a big divisor might help is that
438      <code>mpn_divexact_1</code> won't be getting a full limb in
439      <code>mpz_bin_uiui</code>.  It's called when the n accumulator is full
440      but the k may be far from full.  Perhaps the two could be decoupled so k
441      is applied when full.  It'd be necessary to delay consideration of k
442      terms until the corresponding n terms had been applied though, since
443      otherwise the division won't be exact.
444
445
446<li> <strong>Perfect Power Testing</strong>
447
448  <p> <code>mpz_perfect_power_p</code> could be improved in a number of ways,
449      for instance p-adic arithmetic to find possible roots.
450
451  <p> Non-powers can be quickly identified by checking for Nth power residues
452      modulo small primes, like <code>mpn_perfect_square_p</code> does for
453      squares.  The residues to each power N for a given remainder could be
      grouped into a bit mask; the masks for the remainders to each divisor
455      would then be "and"ed together to hopefully leave only a few candidate
456      powers.  Need to think about how wide to make such masks, ie. how many
457      powers to examine in this way.
458
459  <p> Any zero remainders found in residue testing reveal factors which can be
460      divided out, with the multiplicity restricting the powers that need to be
461      considered, as per the current code.  Further prime dividing should be
462      grouped into limbs like <code>PP</code>.  Need to think about how much
463      dividing to do like that, probably more for bigger inputs, less for
464      smaller inputs.
465
466  <p> <code>mpn_gcd_1</code> would probably be better than the current private
467      GCD routine.  The use it's put to isn't time-critical, and it might help
468      ensure correctness to just use the main GCD routine.
469
470  <p> [There is work-in-progress with a very fast function.]
471
472
473<li> <strong>Prime Testing</strong>
474
475  <p> GMP is not really a number theory library and probably shouldn't have
476      large amounts of code dedicated to sophisticated prime testing
477      algorithms, but basic things well-implemented would suit.  Tests offering
478      certainty are probably all too big or too slow (or both!) to justify
479      inclusion in the main library.  Demo programs showing some possibilities
480      would be good though.
481
482  <p> The present "repetitions" argument to <code>mpz_probab_prime_p</code> is
483      rather specific to the Miller-Rabin tests of the current implementation.
484      Better would be some sort of parameter asking perhaps for a maximum
485      chance 1/2^x of a probable prime in fact being composite.  If
486      applications follow the advice that the present reps gives 1/4^reps
487      chance then perhaps such a change is unnecessary, but an explicitly
488      described 1/2^x would allow for changes in the implementation or even for
489      new proofs about the theory.
490
491  <p> <code>mpz_probab_prime_p</code> always initializes a new
492      <code>gmp_randstate_t</code> for randomized tests, which unfortunately
493      means it's not really very random and in particular always runs the same
494      tests for a given input.  Perhaps a new interface could accept an rstate
495      to use, so successive tests could increase confidence in the result.
496
497  <p> <code>mpn_mod_34lsub1</code> is an obvious and easy improvement to the
498      trial divisions.  And since the various prime factors are constants, the
499      remainder can be tested with something like
500<pre>
501#define MP_LIMB_DIVISIBLE_7_P(n) \
502  ((n) * MODLIMB_INVERSE_7 &lt;= MP_LIMB_T_MAX/7)
503</pre>
      This would help compilers that don't know how to optimize divisions by
505      constants, and is even an improvement on current gcc 3.2 code.  This
506      technique works for any modulus, see Granlund and Montgomery "Division by
507      Invariant Integers" section 9.
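
  <p> For concreteness, here is the same trick written out with 64-bit
      constants (so it assumes 64-bit limbs): 0x6DB6DB6DB6DB6DB7 is the
      inverse of 7 mod 2^64, and n is divisible by 7 exactly when n times
      that inverse, reduced mod 2^64, lands at or below floor((2^64-1)/7).

<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define INV_7  UINT64_C (0x6DB6DB6DB6DB6DB7)   /* 7^-1 mod 2^64 */

int
divisible_by_7 (uint64_t n)
{
  return n * INV_7 &lt;= UINT64_MAX / 7;
}

int
main (void)
{
  for (uint64_t n = 40; n &lt;= 45; n++)
    printf ("%llu: %d\n", (unsigned long long) n, divisible_by_7 (n));
  return 0;
}
</pre>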
508
509  <p> The trial divisions are done with primes generated and grouped at
510      runtime.  This could instead be a table of data, with pre-calculated
511      inverses too.  Storing deltas, ie. amounts to add, rather than actual
512      primes would save space.  <code>udiv_qrnnd_preinv</code> style inverses
513      can be made to exist by adding dummy factors of 2 if necessary.  Some
514      thought needs to be given as to how big such a table should be, based on
515      how much dividing would be profitable for what sort of size inputs.  The
516      data could be shared by the perfect power testing.
517
518  <p> Jason Moxham points out that if a sqrt(-1) mod N exists then any factor
519      of N must be == 1 mod 4, saving half the work in trial dividing.  (If
520      x^2==-1 mod N then for a prime factor p we have x^2==-1 mod p and so the
521      jacobi symbol (-1/p)=1.  But also (-1/p)=(-1)^((p-1)/2), hence must have
522      p==1 mod 4.)  But knowing whether sqrt(-1) mod N exists is not too easy.
523      A strong pseudoprime test can reveal one, so perhaps such a test could be
      inserted part way through the dividing.
525
526  <p> Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes a
527      quadratic pseudoprime test taking about 3x longer than a plain test, but
528      with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin tests
529      would offer only (1/4)^3 == 1/64).  Such a test needs completely random
530      parameters to satisfy the theory, though single-limb values would run
531      faster.  It's probably best to do at least one plain Miller-Rabin before
532      any quadratic tests, since that can identify composites in less total
533      time.
534
535  <p> Some thought needs to be given to the structure of which tests (trial
536      division, Miller-Rabin, quadratic) and how many are done, based on what
537      sort of inputs we expect, with a view to minimizing average time.
538
539  <p> It might be a good idea to break out subroutines for the various tests,
540      so that an application can combine them in ways it prefers, if sensible
541      defaults in <code>mpz_probab_prime_p</code> don't suit.  In particular
      this would let an application skip tests it knows would be unprofitable,
543      like trial dividing when an input is already known to have no small
544      factors.
545
546  <p> For small inputs, combinations of theory and explicit search make it
547      relatively easy to offer certainty.  For instance numbers up to 2^32
548      could be handled with a strong pseudoprime test and table lookup.  But
549      it's rather doubtful whether a smallnum prime test belongs in a bignum
550      library.  Perhaps if it had other internal uses.
551
552  <p> An <code>mpz_nthprime</code> might be cute, but is almost certainly
553      impractical for anything but small n.
554
555
556<li> <strong>Intra-Library Calls</strong>
557
558  <p> On various systems, calls within libgmp still go through the PLT, TOC or
559      other mechanism, which makes the code bigger and slower than it needs to
560      be.
561
562  <p> The theory would be to have all GMP intra-library calls resolved directly
563      to the routines in the library.  An application wouldn't be able to
564      replace a routine, the way it can normally, but there seems no good
565      reason to do that, in normal circumstances.
566
567  <p> The <code>visibility</code> attribute in recent gcc is good for this,
568      because it lets gcc omit unnecessary GOT pointer setups or whatever if it
      finds all calls are local and there are no global data references.
570      Documented entrypoints would be <code>protected</code>, and purely
571      internal things not wanted by test programs or anything can be
572      <code>internal</code>.
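
  <p> A sketch of how that might look in a header; the macro and function
      names here are invented, and the GCC version test is only indicative.

<pre>
#if defined (__GNUC__) &amp;&amp; __GNUC__ &gt;= 4
#define GMP_VIS_PROTECTED  __attribute__ ((visibility ("protected")))
#define GMP_VIS_INTERNAL   __attribute__ ((visibility ("internal")))
#else
#define GMP_VIS_PROTECTED
#define GMP_VIS_INTERNAL
#endif

/* A documented entrypoint keeps a (protected) exported symbol ...  */
GMP_VIS_PROTECTED void mpz_some_documented_function (void);

/* ... while a purely internal helper need not be visible outside the
   shared library at all.  */
GMP_VIS_INTERNAL  void gmpn_some_internal_helper (void);
</pre>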
573
574  <p> Unfortunately, on i386 it seems <code>protected</code> ends up causing
575      text segment relocations within libgmp.so, meaning the library code can't
576      be shared between processes, defeating the purpose of a shared library.
577      Perhaps this is just a gremlin in binutils (debian packaged
578      2.13.90.0.16-1).
579
580  <p> The linker can be told directly (with a link script, or options) to do
581      the same sort of thing.  This doesn't change the code emitted by gcc of
582      course, but it does mean calls are resolved directly to their targets,
583      avoiding a PLT entry.
584
585  <p> Keeping symbols private to libgmp.so is probably a good thing in general
586      too, to stop anyone even attempting to access them.  But some
587      undocumented things will need or want to be kept visible, for use by
588      mpfr, or the test and tune programs.  Libtool has a standard option for
589      selecting public symbols (used now for libmp).
590
591
592</ul>
593<hr>
594
595</body>
596</html>
597
598<!--
599Local variables:
600eval: (add-hook 'write-file-hooks 'time-stamp)
601time-stamp-start: "This file current as of "
602time-stamp-format: "%:d %3b %:y"
603time-stamp-end: "\\."
604time-stamp-line-limit: 50
605End:
606-->
607