Copyright 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.





                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

        x86               i386, generic
        x86/i486          i486
        x86/pentium       Intel Pentium (P5, P54)
        x86/pentium/mmx   Intel Pentium with MMX (P55)
        x86/p6            Intel Pentium Pro
        x86/p6/mmx        Intel Pentium II, III
        x86/p6/p3mmx      Intel Pentium III
        x86/k6            \ AMD K6
        x86/k6/mmx        /
        x86/k6/k62mmx     AMD K6-2
        x86/k7            \ AMD Athlon
        x86/k7/mmx        /
        x86/pentium4      \
        x86/pentium4/mmx  | Intel Pentium 4
        x86/pentium4/sse2 /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.



STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.



ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing.
The generic mpn/asm-defs.m4 is used, together with mpn/x86/x86-defs.m4.
See comments in those files.

The code is meant for use with GNU "gas" or a system "as". There's no
support for assemblers that demand Intel style code.



STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too. For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly. The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct. At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
          ...
        deflit(`FRAME',4)
          ...
        deflit(`FRAME',8)
          ...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME. If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code. This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer.
For example,

        defframe(SAVE_ESI,   -4)
        defframe(SAVE_EDI,   -8)
        defframe(VAR_COUNTER,-12)

        deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.

Definitions for pushed registers are only put in when they're going to be
used. If registers are just saved and restored with pushes and pops then
definitions aren't made.



ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available, certainly
that's all the Solaris 8 "as" seems to accept. If expressions are wanted
then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.



ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead. For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available. By convention "dnl" is also
used for comments about m4 macros.



TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as". Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance.
A separate counter for each macro makes it possible to nest them, for
instance movl_text_address() can be used within an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too. "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature? In any case this problem doesn't currently arise.



ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement. These are
either for computed jumps or to get desirable code alignment. Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx). The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it. In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.



SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers. gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary. See the comments
with those macros for usage.



IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

        imul    $12, %eax
        imul    $12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler. GMP follows GCC and uses only the latter form.

(This applies only to immediate operands, the three operand form is only
valid with an immediate.)



DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected. Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear. This happens only in a few places.



POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code. This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses. Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions. The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp. Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with
    a plain PC-relative call instruction, which is faster and smaller than
    going through the PLT, and data references can be similarly PC-relative,
    saving a GOT entry and fetch from there. Unfortunately the normal linker
    behaviour doesn't allow us to do this.

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library. This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so. This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time. It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded. Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area. Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries. -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use.
    It makes gcc generate plain PC-relative calls to indicated functions, and
    directs the linker to resolve references to the given function within
    the link module.

    Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations, references
    are not resolved at link time. If the gcc description is to be believed
    this is not how it should work. If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code. Perhaps in the future the visibility attribute will work as
    we'd like.




GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix. SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one. Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
            addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works.
But the simplest workaround is to follow gcc and omit the +[.-L1] since it
does nothing,

            addl  $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

            addl  $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax". The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm. GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

            leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.




SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster. Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes. To this end various
routines choose between a simple loop and an unrolled loop according to
operand size. The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code. The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC. Something like the following is typical.

        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold. Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined. Alternatively, just adjust the threshold up or down until
there are no more speedups.



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop. Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop. When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1. The displacements are
those in the addressing mode accessing each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero. For example,

                addl    $bar-foo, %eax
        foo:
                nop
        bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

        foo:
                addl    $bar-foo, %eax
                nop
        bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal. For example,

        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand". When only one label is used it's ok, and the label can be a
forward reference too, as for example,

                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:

These problems only affect PIC computed jump calculations.
The workarounds are just to do an leal without a displacement and then an
addl, and to make sure the code is placed so that there's at most one
forward reference in the addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669. Available on-line,

        ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X. These have details of calling
conventions and ELF shared library PIC coding. Versions of both available
on-line,

        http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2. (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: