1Copyright 2000, 2001, 2002, 2004 Free Software Foundation, Inc. 2 3This file is part of the GNU MP Library. 4 5The GNU MP Library is free software; you can redistribute it and/or modify 6it under the terms of the GNU Lesser General Public License as published by 7the Free Software Foundation; either version 3 of the License, or (at your 8option) any later version. 9 10The GNU MP Library is distributed in the hope that it will be useful, but 11WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 12or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public 13License for more details. 14 15You should have received a copy of the GNU Lesser General Public License 16along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. 17 18 19 20 21 22 GMP SPEED MEASURING AND PARAMETER TUNING 23 24 25The programs in this directory are for knowledgeable users who want to 26measure GMP routines on their machine, and perhaps tweak some settings or 27identify things that can be improved. 28 29The programs here are tools, not ready to run solutions. Nothing is built 30in a normal "make all", but various Makefile targets described below exist. 31 32Relatively few systems and CPUs have been tested, so be sure to verify that 33results are sensible before relying on them. 34 35 36 37 38MISCELLANEOUS NOTES 39 40--enable-assert 41 42 Don't configure with --enable-assert, since the extra code added by 43 assertion checking may influence measurements. 44 45Direct mapped caches 46 47 Some effort has been made to accommodate CPUs with direct mapped caches, 48 by putting data blocks more or less contiguously on the stack. But this 49 will depend on TMP_ALLOC using alloca, and even then it may or may not 50 be enough. 51 52FreeBSD 4.2 i486 getrusage 53 54 This getrusage seems to be a bit doubtful, it looks like it's 55 microsecond accurate, but sometimes ru_utime remains unchanged after a 56 time of many microseconds has elapsed. It'd be good to detect this in 57 the time.c initializations, but for now the suggestion is to pretend it 58 doesn't exist. 59 60 ./configure ac_cv_func_getrusage=no 61 62NetBSD 1.4.1 m68k macintosh time base 63 64 On this system it's been found getrusage often goes backwards, making it 65 unusable (time.c getrusage_backwards_p detects this). gettimeofday 66 sometimes doesn't update atomically when it crosses a 1 second boundary. 67 Not sure what to do about this. Expect possible intermittent failures. 68 69SCO OpenUNIX 8 /etc/hw 70 71 /etc/hw takes about a second to return the cpu frequency, which suggests 72 perhaps it's measuring each time it runs. If this is annoying when 73 running the speed program repeatedly then set a GMP_CPU_FREQUENCY 74 environment variable (see TIME BASE section below). 75 76Timing on GNU/Linux 77 78 On Linux, timing currently uses the cycle counter. This is unreliable, 79 since the counter is not saved and restored at context switches (unlike 80 FreeBSD and Solaris where the cycle counter is "virtualized"). 81 82 Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or 83 CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime 84 with glibc, one has to link with -lrt (which also drags in the pthreads 85 threading library). configure.in must be hacked to detect this and 86 arrange proper linking. Something like 87 88 old_LIBS="$LIBS" 89 AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)]) 90 TUNE_LIBS="$LIBS" 91 LIBS="$old_LIBS" 92 93 AC_SUBST(TUNE_LIBS) 94 95 might work. 96 97Low resolution timebase 98 99 Parameter tuning can be very time consuming if the only timebase 100 available is a 10 millisecond clock tick, to the point of being 101 unusable. This is currently the case on VAX and ARM systems. 102 103 104 105 106PARAMETER TUNING 107 108The "tuneup" program runs some tests designed to find the best settings for 109various thresholds, like MUL_TOOM22_THRESHOLD. Its output can be put 110into gmp-mparam.h. The program is built and run with 111 112 make tune 113 114If the thresholds indicated are grossly different from the values in the 115selected gmp-mparam.h then there may be a performance boost in applicable 116size ranges by changing gmp-mparam.h accordingly. 117 118Be sure to do a full reconfigure and rebuild to get any newly set thresholds 119to take effect. A partial rebuild is enough sometimes, but a fresh 120configure and make is certain to be correct. 121 122If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of 123the mpn subdirectories then the values from "make tune" should be similar. 124But check that the configured CPU is right and there are no machine specific 125effects causing a difference. 126 127It's hoped the compiler and options used won't have too much effect on 128thresholds, since for most CPUs they ultimately come down to comparisons 129between assembler subroutines. Missing out on the longlong.h macros by not 130using gcc will probably have an effect. 131 132Some thresholds produced by the tune program are merely single values chosen 133from what's a range of sizes where two algorithms are pretty much the same 134speed. When this happens the program is likely to give somewhat different 135values on successive runs. This is noticeable on the toom3 thresholds for 136instance. 137 138 139 140 141SPEED PROGRAM 142 143The "speed" program can be used for measuring and comparing various 144routines, and producing tables of data or gnuplot graphs. Compile it with 145 146 make speed 147 148(Or on DOS systems "make speed.exe".) 149 150Here are some examples of how to use it. Check the code for all the 151options. 152 153Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05 154(whichever is greater). 155 156 ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n 157 gnuplot foo.gnuplot 158 159Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and 160showing under mpn_lshift the difference between it and mpn_add_n. 161 162 ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 163 164Using option -c for times in cycles is interesting but normally only 165necessary when looking carefully at assembler subroutines. You might think 166it would always give an integer value, but this doesn't happen in practice, 167probably due to overheads in the time measurements. 168 169In the free-form output the "#" symbol against a measurement means the 170corresponding routine is fastest at that size. This is a convenient visual 171cue when comparing different routines. The graph data files <name>.data 172don't get this since it would upset gnuplot or other data viewers. 173 174 175 176 177TIME BASE 178 179The time measuring method is determined in time.c, based on what the 180configured host has available. A cycle counter is preferred, possibly 181supplemented by another method if the counter has a limited range. A 182microsecond accurate getrusage() or gettimeofday() will work quite well too. 183 184The cycle counters (except possibly on alpha) and gettimeofday() will depend 185on the machine being otherwise idle, or rather on other jobs not stealing 186CPU time from the measuring program. Short routines (those that complete 187within a timeslice) should work even on a busy machine. 188 189Some trouble is taken by speed_measure() in common.c to avoid ill effects 190from sporadic interrupts, or other intermittent things (like cron waking up 191every minute). But generally an idle machine will be necessary to be 192certain of consistent results. 193 194The CPU frequency is needed to convert between cycles and seconds, or for 195when a cycle counter is supplemented by getrusage() etc. The speed program 196will convert as necessary according to the output format requested. The 197tune program will work with either cycles or seconds. 198 199freq.c knows how to get the frequency on some systems, or can measure a 200cycle counter against gettimeofday() or getrusage(), but when that fails, or 201needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be 202used (in Hertz). For example in "bash" on a 650 MHz machine, 203 204 export GMP_CPU_FREQUENCY=650e6 205 206A high precision time base makes it possible to get accurate measurements in 207a shorter time. 208 209 210 211 212EXAMPLE COMPARISONS - VARIOUS 213 214Here are some ideas for things that can be done with the speed program. 215 216There's always going to be a certain amount of overhead in the time 217measurements, due to reading the time base, and in the loop that runs a 218routine enough times to get a reading of the desired precision. Noop 219functions taking various arguments are available to measure this. The 220"overhead" printed by the speed program each time in its intro is the "noop" 221routine, but note that this is just for information, it isn't deducted from 222the times printed or anything. 223 224 ./speed -s 1 noop noop_wxs noop_wxys 225 226To see how many cycles per limb a routine is taking, look at the time 227increase when the size increments, using option -D. This avoids fixed 228overheads in the measuring. Also, remember many of the assembler routines 229have unrolled loops, so it might be necessary to compare times at, say, 16, 23032, 48, 64 etc to see what the unrolled part is taking, as opposed to any 231finishing off. 232 233 ./speed -s 16-64 -t 16 -C -D mpn_add_n 234 235The -C option on its own gives cycles per limb, but is really only useful at 236big sizes where fixed overheads are small compared to the code doing the 237real work. Remember of course memory caching and/or page swapping will 238affect results at large sizes. 239 240 ./speed -s 500000 -C mpn_add_n 241 242Once a calculation stops fitting in the CPU data cache, it's going to start 243taking longer. Exactly where this happens depends on the cache priming in 244the measuring routines, and on what sort of "least recently used" the 245hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and 24632-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000 247limbs. 248 249 ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n 250 gnuplot foo.gnuplot 251 252When a routine has an unrolled loop for, say, multiples of 8 limbs and then 253an ordinary loop for the remainder, it can happen that it's actually faster 254to do an operation on, say, 8 limbs than it is on 7 limbs. The following 255draws a graph of mpn_sub_n, to see whether times smoothly increase with 256size. 257 258 ./speed -s 1-100 -c -P foo mpn_sub_n 259 gnuplot foo.gnuplot 260 261If mpn_lshift and mpn_rshift have special case code for shifts by 1, it 262ought to be faster (or at least not slower) than shifting by, say, 2 bits. 263 264 ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 265 266An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and 267if the lshift isn't faster there's an obvious improvement that's possible. 268 269 ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self 270 271On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the 272destination is one of the sources is faster than a separate destination. 273Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and 274mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL. 275 276 ./speed -s 1-200 -c mpn_add_n mpn_add_n.1 277 278The gmp manual points out that divisions by powers of two should be done 279using a right shift because it'll be significantly faster than an actual 280division. The following shows by what factor mpn_rshift is faster than 281mpn_divrem_1, using division by 32 as an example. 282 283 ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 284 285 286 287 288EXAMPLE COMPARISONS - MULTIPLICATION 289 290mul_basecase takes a ".<r>" parameter. If positivie, it gives the second 291(smaller) operand size. For example to show speeds for 3x3 up to 20x3 in 292cycles, 293 294 ./speed -s 3-20 -c mpn_mul_basecase.3 295 296A negative ".<-r>" parameter fixes the size of the product to the absolute 297value r. For example to show speeds for 10x10 up to 19x1 in cycles, 298 299 ./speed -s 10-19 -c mpn_mul_basecase.-20 300 301mul_basecase with no parameter does an NxN multiply, so for example to show 302speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles, 303 304 ./speed -s 1-20 -c mpn_mul_basecase 305 306sqr_basecase is implemented by a "triangular" method on most CPUs, making it 307up to twice as fast as mul_basecase. In practice loop overheads and the 308products on the diagonal mean it falls short of this. Here's an example 309running the two and showing by what factor an NxN mul_basecase is slower 310than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes 311below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.) 312 313 ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase 314 315The technique described above with -CD for showing the time difference in 316cycles per limb between two size operations can be done on an NxN 317mul_basecase using -E to change the basis for the size increment to N*N. 318For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16 319doing 256 limbs. The following therefore shows the per crossproduct speed 320of mul_basecase and sqr_basecase at around 20x20 limbs. 321 322 ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase 323 324Of course sqr_basecase isn't really doing NxN crossproducts, but it can be 325interesting to compare it to mul_basecase as if it was. For sqr_basecase 326the -F option can be used to base the deltas on N*(N+1)/2 operations, which 327is the triangular products sqr_basecase does. For example, 328 329 ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase 330 331Both -E and -F are preliminary and might change. A consistent approach to 332using them when claiming certain per crossproduct or per triangularproduct 333speeds hasn't really been established, but the increment between speeds in 334the range karatsuba will call seems sensible, that being k to k/2. For 335instance, if the karatsuba threshold was 20 for the multiply and 30 for the 336square, 337 338 ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase 339 ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase 340 341 342 343EXAMPLE COMPARISONS - MALLOC 344 345The gmp manual recommends application programs avoid excessive initializing 346and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new 347variable will at a minimum go through an init, a realloc for its first 348store, and finally a clear. Quite how long that takes depends on the C 349library. The following compares an mpz_init/realloc/clear to a 10 limb 350mpz_add. Don't be surprised if the mallocing is quite slow. 351 352 ./speed -s 10 -c mpz_init_realloc_clear mpz_add 353 354On some systems malloc and free are much slower when dynamic linked. The 355speed-dynamic program can be used to see this. For example the following 356measures malloc/free, first static then dynamic. 357 358 ./speed -s 10 -c malloc_free 359 ./speed-dynamic -s 10 -c malloc_free 360 361Of course a real world program has big problems if it's doing so many 362mallocs and frees that it gets slowed down by a dynamic linked malloc. 363 364 365 366 367 368EXAMPLE COMPARISONS - STRING CONVERSIONS 369 370mpn_get_str does a binary to string conversion. The base is specified with 371a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster 372than general bases. The following compares decimal and hex for instance. 373 374 ./speed -s 1-20 -c mpn_get_str mpn_get_str.16 375 376Smaller bases need more divisions to split a given size number, and so are 377slower. The following compares base 3 and base 9. On small operands 9 will 378be nearly twice as fast, though at bigger sizes this reduces since in the 379current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit 380limbs) and those divisions come to dominate. 381 382 ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9 383 384mpn_set_str does a string to binary conversion. The base is specified with 385a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than 386general bases on large conversions. 387 388 ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10 389 390mpn_set_str also has some special case code for decimal which is a bit 391faster than the general case, basically by giving the compiler a chance to 392optimize some multiplications by 10. 393 394 ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11 395 396 397 398 399EXAMPLE COMPARISONS - GCDs 400 401mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both 402x and y are single limbs. This isn't tuned currently, but a value can be 403established by a measurement like 404 405 ./speed -s 10-32 mpn_gcd_1.10 406 407This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div 408threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd 409is done by nibbling away at the 32-bit operands bit-by-bit. When the 410threshold is small, say 1 bit, then an initial x%y is done to reduce it to a 41110x10 bit operation. 412 413The threshold in mpn/generic/gcd_1.c or the various assembler 414implementations can be tweaked up or down until there's no more speedups on 415interesting combinations of sizes. Note that this affects only a 1x1 limb 416operation and so isn't very important. (An Nx1 limb operation always does 417an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.) 418 419 420 421 422SPEED PROGRAM EXTENSIONS 423 424Potentially lots of things could be made available in the program, but it's 425been left at only the things that have actually been wanted and are likely 426to be reasonably useful in the future. 427 428Extensions should be fairly easy to make though. speed-ext.c is an example, 429in a style that should suit one-off tests, or new code fragments under 430development. 431 432many.pl is a script for generating a new speed program supplemented with 433alternate versions of the standard routines. It can be used for measuring 434experimental code, or for comparing different implementations that exist 435within a CPU family. 436 437 438 439 440THRESHOLD EXAMINING 441 442The speed program can be used to examine the speeds of different algorithms 443to check the tune program has done the right thing. For example to examine 444the karatsuba multiply threshold, 445 446 ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n 447 448When examining the toom3 threshold, remember it depends on the karatsuba 449threshold, so the right karatsuba threshold needs to be compiled into the 450library first. The tune program uses specially recompiled versions of 451mpn/mul_n.c etc for this reason, but the speed program simply uses the 452normal libgmp.la. 453 454Note further that the various routines may recurse into themselves on sizes 455far enough above applicable thresholds. For example, mpn_kara_mul_n will 456recurse into itself on sizes greater than twice the compiled-in 457MUL_TOOM22_THRESHOLD. 458 459When doing the above comparison between mul_basecase and kara_mul_n what's 460probably of interest is mul_basecase versus a kara_mul_n that does one level 461of Karatsuba then calls to mul_basecase, but this only happens on sizes less 462than twice the compiled MUL_TOOM22_THRESHOLD. A larger value for that 463setting can be compiled-in to avoid the problem if necessary. The same 464applies to toom3 and DC, though in a trickier fashion. 465 466There are some upper limits on some of the thresholds, arising from arrays 467dimensioned according to a threshold (mpn_mul_n), or asm code with certain 468sized displacements (some x86 versions of sqr_basecase). So putting huge 469values for the thresholds, even just for testing, may fail. 470 471 472 473 474FUTURE 475 476Make a program to check the time base is working properly, for small and 477large measurements. Make it able to test each available method, including 478perhaps the apparent resolution of each. 479 480Make a general mechanism for specifying operand overlap, and a syntax like 481maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this 482sort of thing with the "r" parameter currently. 483 484 485 486---------------- 487Local variables: 488mode: text 489fill-column: 76 490End: 491