1Copyright 2000-2002, 2004 Free Software Foundation, Inc. 2 3This file is part of the GNU MP Library. 4 5The GNU MP Library is free software; you can redistribute it and/or modify 6it under the terms of either: 7 8 * the GNU Lesser General Public License as published by the Free 9 Software Foundation; either version 3 of the License, or (at your 10 option) any later version. 11 12or 13 14 * the GNU General Public License as published by the Free Software 15 Foundation; either version 2 of the License, or (at your option) any 16 later version. 17 18or both in parallel, as here. 19 20The GNU MP Library is distributed in the hope that it will be useful, but 21WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 22or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License 23for more details. 24 25You should have received copies of the GNU General Public License and the 26GNU Lesser General Public License along with the GNU MP Library. If not, 27see https://www.gnu.org/licenses/. 28 29 30 31 32 33 GMP SPEED MEASURING AND PARAMETER TUNING 34 35 36The programs in this directory are for knowledgeable users who want to 37measure GMP routines on their machine, and perhaps tweak some settings or 38identify things that can be improved. 39 40The programs here are tools, not ready to run solutions. Nothing is built 41in a normal "make all", but various Makefile targets described below exist. 42 43Relatively few systems and CPUs have been tested, so be sure to verify that 44results are sensible before relying on them. 45 46 47 48 49MISCELLANEOUS NOTES 50 51--enable-assert 52 53 Don't configure with --enable-assert, since the extra code added by 54 assertion checking may influence measurements. 55 56Direct mapped caches 57 58 Some effort has been made to accommodate CPUs with direct mapped caches, 59 by putting data blocks more or less contiguously on the stack. But this 60 will depend on TMP_ALLOC using alloca, and even then it may or may not 61 be enough. 62 63FreeBSD 4.2 i486 getrusage 64 65 This getrusage seems to be a bit doubtful, it looks like it's 66 microsecond accurate, but sometimes ru_utime remains unchanged after a 67 time of many microseconds has elapsed. It'd be good to detect this in 68 the time.c initializations, but for now the suggestion is to pretend it 69 doesn't exist. 70 71 ./configure ac_cv_func_getrusage=no 72 73NetBSD 1.4.1 m68k macintosh time base 74 75 On this system it's been found getrusage often goes backwards, making it 76 unusable (time.c getrusage_backwards_p detects this). gettimeofday 77 sometimes doesn't update atomically when it crosses a 1 second boundary. 78 Not sure what to do about this. Expect possible intermittent failures. 79 80SCO OpenUNIX 8 /etc/hw 81 82 /etc/hw takes about a second to return the cpu frequency, which suggests 83 perhaps it's measuring each time it runs. If this is annoying when 84 running the speed program repeatedly then set a GMP_CPU_FREQUENCY 85 environment variable (see TIME BASE section below). 86 87Timing on GNU/Linux 88 89 On Linux, timing currently uses the cycle counter. This is unreliable, 90 since the counter is not saved and restored at context switches (unlike 91 FreeBSD and Solaris where the cycle counter is "virtualized"). 92 93 Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or 94 CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime 95 with glibc, one has to link with -lrt (which also drags in the pthreads 96 threading library). configure.in must be hacked to detect this and 97 arrange proper linking. Something like 98 99 old_LIBS="$LIBS" 100 AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)]) 101 TUNE_LIBS="$LIBS" 102 LIBS="$old_LIBS" 103 104 AC_SUBST(TUNE_LIBS) 105 106 might work. 107 108Low resolution timebase 109 110 Parameter tuning can be very time consuming if the only timebase 111 available is a 10 millisecond clock tick, to the point of being 112 unusable. This is currently the case on VAX and ARM systems. 113 114 115 116 117PARAMETER TUNING 118 119The "tuneup" program runs some tests designed to find the best settings for 120various thresholds, like MUL_TOOM22_THRESHOLD. Its output can be put 121into gmp-mparam.h. The program is built and run with 122 123 make tune 124 125If the thresholds indicated are grossly different from the values in the 126selected gmp-mparam.h then there may be a performance boost in applicable 127size ranges by changing gmp-mparam.h accordingly. 128 129Be sure to do a full reconfigure and rebuild to get any newly set thresholds 130to take effect. A partial rebuild is enough sometimes, but a fresh 131configure and make is certain to be correct. 132 133If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of 134the mpn subdirectories then the values from "make tune" should be similar. 135But check that the configured CPU is right and there are no machine specific 136effects causing a difference. 137 138It's hoped the compiler and options used won't have too much effect on 139thresholds, since for most CPUs they ultimately come down to comparisons 140between assembler subroutines. Missing out on the longlong.h macros by not 141using gcc will probably have an effect. 142 143Some thresholds produced by the tune program are merely single values chosen 144from what's a range of sizes where two algorithms are pretty much the same 145speed. When this happens the program is likely to give somewhat different 146values on successive runs. This is noticeable on the toom3 thresholds for 147instance. 148 149 150 151 152SPEED PROGRAM 153 154The "speed" program can be used for measuring and comparing various 155routines, and producing tables of data or gnuplot graphs. Compile it with 156 157 make speed 158 159(Or on DOS systems "make speed.exe".) 160 161Here are some examples of how to use it. Check the code for all the 162options. 163 164Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05 165(whichever is greater). 166 167 ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n 168 gnuplot foo.gnuplot 169 170Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and 171showing under mpn_lshift the difference between it and mpn_add_n. 172 173 ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 174 175Using option -c for times in cycles is interesting but normally only 176necessary when looking carefully at assembler subroutines. You might think 177it would always give an integer value, but this doesn't happen in practice, 178probably due to overheads in the time measurements. 179 180In the free-form output the "#" symbol against a measurement means the 181corresponding routine is fastest at that size. This is a convenient visual 182cue when comparing different routines. The graph data files <name>.data 183don't get this since it would upset gnuplot or other data viewers. 184 185 186 187 188TIME BASE 189 190The time measuring method is determined in time.c, based on what the 191configured host has available. A cycle counter is preferred, possibly 192supplemented by another method if the counter has a limited range. A 193microsecond accurate getrusage() or gettimeofday() will work quite well too. 194 195The cycle counters (except possibly on alpha) and gettimeofday() will depend 196on the machine being otherwise idle, or rather on other jobs not stealing 197CPU time from the measuring program. Short routines (those that complete 198within a timeslice) should work even on a busy machine. 199 200Some trouble is taken by speed_measure() in common.c to avoid ill effects 201from sporadic interrupts, or other intermittent things (like cron waking up 202every minute). But generally an idle machine will be necessary to be 203certain of consistent results. 204 205The CPU frequency is needed to convert between cycles and seconds, or for 206when a cycle counter is supplemented by getrusage() etc. The speed program 207will convert as necessary according to the output format requested. The 208tune program will work with either cycles or seconds. 209 210freq.c knows how to get the frequency on some systems, or can measure a 211cycle counter against gettimeofday() or getrusage(), but when that fails, or 212needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be 213used (in Hertz). For example in "bash" on a 650 MHz machine, 214 215 export GMP_CPU_FREQUENCY=650e6 216 217A high precision time base makes it possible to get accurate measurements in 218a shorter time. 219 220 221 222 223EXAMPLE COMPARISONS - VARIOUS 224 225Here are some ideas for things that can be done with the speed program. 226 227There's always going to be a certain amount of overhead in the time 228measurements, due to reading the time base, and in the loop that runs a 229routine enough times to get a reading of the desired precision. Noop 230functions taking various arguments are available to measure this. The 231"overhead" printed by the speed program each time in its intro is the "noop" 232routine, but note that this is just for information, it isn't deducted from 233the times printed or anything. 234 235 ./speed -s 1 noop noop_wxs noop_wxys 236 237To see how many cycles per limb a routine is taking, look at the time 238increase when the size increments, using option -D. This avoids fixed 239overheads in the measuring. Also, remember many of the assembler routines 240have unrolled loops, so it might be necessary to compare times at, say, 16, 24132, 48, 64 etc to see what the unrolled part is taking, as opposed to any 242finishing off. 243 244 ./speed -s 16-64 -t 16 -C -D mpn_add_n 245 246The -C option on its own gives cycles per limb, but is really only useful at 247big sizes where fixed overheads are small compared to the code doing the 248real work. Remember of course memory caching and/or page swapping will 249affect results at large sizes. 250 251 ./speed -s 500000 -C mpn_add_n 252 253Once a calculation stops fitting in the CPU data cache, it's going to start 254taking longer. Exactly where this happens depends on the cache priming in 255the measuring routines, and on what sort of "least recently used" the 256hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and 25732-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000 258limbs. 259 260 ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n 261 gnuplot foo.gnuplot 262 263When a routine has an unrolled loop for, say, multiples of 8 limbs and then 264an ordinary loop for the remainder, it can happen that it's actually faster 265to do an operation on, say, 8 limbs than it is on 7 limbs. The following 266draws a graph of mpn_sub_n, to see whether times smoothly increase with 267size. 268 269 ./speed -s 1-100 -c -P foo mpn_sub_n 270 gnuplot foo.gnuplot 271 272If mpn_lshift and mpn_rshift have special case code for shifts by 1, it 273ought to be faster (or at least not slower) than shifting by, say, 2 bits. 274 275 ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 276 277An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and 278if the lshift isn't faster there's an obvious improvement that's possible. 279 280 ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self 281 282On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the 283destination is one of the sources is faster than a separate destination. 284Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and 285mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL. 286 287 ./speed -s 1-200 -c mpn_add_n mpn_add_n.1 288 289The gmp manual points out that divisions by powers of two should be done 290using a right shift because it'll be significantly faster than an actual 291division. The following shows by what factor mpn_rshift is faster than 292mpn_divrem_1, using division by 32 as an example. 293 294 ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 295 296 297 298 299EXAMPLE COMPARISONS - MULTIPLICATION 300 301mul_basecase takes a ".<r>" parameter. If positive, it gives the second 302(smaller) operand size. For example to show speeds for 3x3 up to 20x3 in 303cycles, 304 305 ./speed -s 3-20 -c mpn_mul_basecase.3 306 307A negative ".<-r>" parameter fixes the size of the product to the absolute 308value r. For example to show speeds for 10x10 up to 19x1 in cycles, 309 310 ./speed -s 10-19 -c mpn_mul_basecase.-20 311 312mul_basecase with no parameter does an NxN multiply, so for example to show 313speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles, 314 315 ./speed -s 1-20 -c mpn_mul_basecase 316 317sqr_basecase is implemented by a "triangular" method on most CPUs, making it 318up to twice as fast as mul_basecase. In practice loop overheads and the 319products on the diagonal mean it falls short of this. Here's an example 320running the two and showing by what factor an NxN mul_basecase is slower 321than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes 322below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.) 323 324 ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase 325 326The technique described above with -CD for showing the time difference in 327cycles per limb between two size operations can be done on an NxN 328mul_basecase using -E to change the basis for the size increment to N*N. 329For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16 330doing 256 limbs. The following therefore shows the per crossproduct speed 331of mul_basecase and sqr_basecase at around 20x20 limbs. 332 333 ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase 334 335Of course sqr_basecase isn't really doing NxN crossproducts, but it can be 336interesting to compare it to mul_basecase as if it was. For sqr_basecase 337the -F option can be used to base the deltas on N*(N+1)/2 operations, which 338is the triangular products sqr_basecase does. For example, 339 340 ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase 341 342Both -E and -F are preliminary and might change. A consistent approach to 343using them when claiming certain per crossproduct or per triangularproduct 344speeds hasn't really been established, but the increment between speeds in 345the range karatsuba will call seems sensible, that being k to k/2. For 346instance, if the karatsuba threshold was 20 for the multiply and 30 for the 347square, 348 349 ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase 350 ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase 351 352 353 354EXAMPLE COMPARISONS - MALLOC 355 356The gmp manual recommends application programs avoid excessive initializing 357and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new 358variable will at a minimum go through an init, a realloc for its first 359store, and finally a clear. Quite how long that takes depends on the C 360library. The following compares an mpz_init/realloc/clear to a 10 limb 361mpz_add. Don't be surprised if the mallocing is quite slow. 362 363 ./speed -s 10 -c mpz_init_realloc_clear mpz_add 364 365On some systems malloc and free are much slower when dynamic linked. The 366speed-dynamic program can be used to see this. For example the following 367measures malloc/free, first static then dynamic. 368 369 ./speed -s 10 -c malloc_free 370 ./speed-dynamic -s 10 -c malloc_free 371 372Of course a real world program has big problems if it's doing so many 373mallocs and frees that it gets slowed down by a dynamic linked malloc. 374 375 376 377 378 379EXAMPLE COMPARISONS - STRING CONVERSIONS 380 381mpn_get_str does a binary to string conversion. The base is specified with 382a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster 383than general bases. The following compares decimal and hex for instance. 384 385 ./speed -s 1-20 -c mpn_get_str mpn_get_str.16 386 387Smaller bases need more divisions to split a given size number, and so are 388slower. The following compares base 3 and base 9. On small operands 9 will 389be nearly twice as fast, though at bigger sizes this reduces since in the 390current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit 391limbs) and those divisions come to dominate. 392 393 ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9 394 395mpn_set_str does a string to binary conversion. The base is specified with 396a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than 397general bases on large conversions. 398 399 ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10 400 401mpn_set_str also has some special case code for decimal which is a bit 402faster than the general case, basically by giving the compiler a chance to 403optimize some multiplications by 10. 404 405 ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11 406 407 408 409 410EXAMPLE COMPARISONS - GCDs 411 412mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both 413x and y are single limbs. This isn't tuned currently, but a value can be 414established by a measurement like 415 416 ./speed -s 10-32 mpn_gcd_1.10 417 418This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div 419threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd 420is done by nibbling away at the 32-bit operands bit-by-bit. When the 421threshold is small, say 1 bit, then an initial x%y is done to reduce it to a 42210x10 bit operation. 423 424The threshold in mpn/generic/gcd_1.c or the various assembler 425implementations can be tweaked up or down until there's no more speedups on 426interesting combinations of sizes. Note that this affects only a 1x1 limb 427operation and so isn't very important. (An Nx1 limb operation always does 428an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.) 429 430 431 432 433SPEED PROGRAM EXTENSIONS 434 435Potentially lots of things could be made available in the program, but it's 436been left at only the things that have actually been wanted and are likely 437to be reasonably useful in the future. 438 439Extensions should be fairly easy to make though. speed-ext.c is an example, 440in a style that should suit one-off tests, or new code fragments under 441development. 442 443many.pl is a script for generating a new speed program supplemented with 444alternate versions of the standard routines. It can be used for measuring 445experimental code, or for comparing different implementations that exist 446within a CPU family. 447 448 449 450 451THRESHOLD EXAMINING 452 453The speed program can be used to examine the speeds of different algorithms 454to check the tune program has done the right thing. For example to examine 455the karatsuba multiply threshold, 456 457 ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n 458 459When examining the toom3 threshold, remember it depends on the karatsuba 460threshold, so the right karatsuba threshold needs to be compiled into the 461library first. The tune program uses specially recompiled versions of 462mpn/mul_n.c etc for this reason, but the speed program simply uses the 463normal libgmp.la. 464 465Note further that the various routines may recurse into themselves on sizes 466far enough above applicable thresholds. For example, mpn_kara_mul_n will 467recurse into itself on sizes greater than twice the compiled-in 468MUL_TOOM22_THRESHOLD. 469 470When doing the above comparison between mul_basecase and kara_mul_n what's 471probably of interest is mul_basecase versus a kara_mul_n that does one level 472of Karatsuba then calls to mul_basecase, but this only happens on sizes less 473than twice the compiled MUL_TOOM22_THRESHOLD. A larger value for that 474setting can be compiled-in to avoid the problem if necessary. The same 475applies to toom3 and DC, though in a trickier fashion. 476 477There are some upper limits on some of the thresholds, arising from arrays 478dimensioned according to a threshold (mpn_mul_n), or asm code with certain 479sized displacements (some x86 versions of sqr_basecase). So putting huge 480values for the thresholds, even just for testing, may fail. 481 482 483 484 485FUTURE 486 487Make a program to check the time base is working properly, for small and 488large measurements. Make it able to test each available method, including 489perhaps the apparent resolution of each. 490 491Make a general mechanism for specifying operand overlap, and a syntax like 492maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this 493sort of thing with the "r" parameter currently. 494 495 496 497---------------- 498Local variables: 499mode: text 500fill-column: 76 501End: 502