Copyright 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.





                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

        x86               i386, generic
        x86/i486          i486
        x86/pentium       Intel Pentium (P5, P54)
        x86/pentium/mmx   Intel Pentium with MMX (P55)
        x86/p6            Intel Pentium Pro
        x86/p6/mmx        Intel Pentium II, III
        x86/p6/p3mmx      Intel Pentium III
        x86/k6            \ AMD K6
        x86/k6/mmx        /
        x86/k6/k62mmx     AMD K6-2
        x86/k7            \ AMD Athlon
        x86/k7/mmx        /
        x86/pentium4      \
        x86/pentium4/mmx  | Intel Pentium 4
        x86/pentium4/sse2 /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.



STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.



ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing. The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4. See comments in those files.

The code is meant for use with GNU "gas" or a system "as". There's no
support for assemblers that demand Intel style code.



STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too. For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly. The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct. At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
                ...
        deflit(`FRAME',4)
                ...
        deflit(`FRAME',8)
                ...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME. If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code. This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer. For
example,

        defframe(SAVE_ESI,   -4)
        defframe(SAVE_EDI,   -8)
        defframe(VAR_COUNTER,-12)

        deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.

Definitions for pushed registers are only put in when they're going to be
used.
If registers are just saved and restored with pushes and pops then
definitions aren't made.



ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available, certainly
that's all the Solaris 8 "as" seems to accept. If expressions are wanted
then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.



ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead. For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available. By convention "dnl" is also
used for comments about m4 macros.



TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as". Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance. A separate counter for each macro makes it
possible to nest them, for instance movl_text_address() can be used within
an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too. "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature? In any case this problem doesn't currently arise.
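
On the "%=" question, current GCC documentation does list it among the
special format strings of extended asm, as a number unique to each instance
of the asm statement. The following is only a sketch of how a label could
be built from it, not code from GMP. The names skip_demo, skip_demo_twice
and skip_demo_check are hypothetical, and it assumes a GCC-compatible
compiler, an x86 target and the ELF ".L" local label prefix. On other
targets the asm is compiled out.

```c
#include <assert.h>

/* Hypothetical helper, not part of GMP.  The asm jumps to a label placed
   at the very next instruction, so it changes nothing; it is here only to
   show the label syntax.  "%=" expands to a number that is unique to each
   instance of the asm statement, so inlined copies get distinct labels. */
static inline int
skip_demo (int x)
{
#if defined (__GNUC__) && (defined (__i386__) || defined (__x86_64__))
  __asm__ volatile ("jmp .Lskip_demo%=\n"
                    ".Lskip_demo%=:"
                    : "+r" (x));
#endif
  return x;
}

/* Two uses in one function.  With a fixed label name instead of "%=",
   inlining both calls would define the same label twice. */
static int
skip_demo_twice (int a, int b)
{
  return skip_demo (a) + skip_demo (b);
}

/* Minimal self-check: the asm must leave the values untouched. */
static void
skip_demo_check (void)
{
  assert (skip_demo_twice (1, 2) == 3);
}
```

"%=" is only expanded in extended asm, meaning an asm statement with
operands or clobbers, which is why the sketch passes x as an operand.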



ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement. These are
either for computed jumps or to get desirable code alignment. Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx). The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it. In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.



SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers. gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary. See the comments
with those macros for usage.



IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

        imul    $12, %eax
        imul    $12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler. GMP follows GCC and uses only the latter form.

(This applies only to immediate operands, the three operand form is only
valid with an immediate.)



DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected. Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear. This happens only in a few places.



POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code. This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses. Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions. The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp. Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with a
    plain PC-relative call instruction, which is faster and smaller than going
    through the PLT, and data references can be similarly PC-relative, saving a
    GOT entry and fetch from there. Unfortunately the normal linker behaviour
    doesn't allow us to do this.

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library. This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so. This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time. It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded. Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area. Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries. -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use. It makes
    gcc generate plain PC-relative calls to indicated functions, and directs
    the linker to resolve references to the given function within the link
    module.

    Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations, references
    are not resolved at link time.
    If the gcc description is to be believed
    this is not how it should work. If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code. Perhaps in the future the visibility attribute will work as
    we'd like.




GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix. SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one. Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
                addl    $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works. But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

                addl    $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

                addl    $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax". The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm. GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

                leal    _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.




SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster. Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes. To this end various
routines choose between a simple loop and an unrolled loop according to
operand size. The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code. The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC. Something like the following is typical.

        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold. Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined. Alternatively, just adjust the threshold up or down until
there are no more speedups.



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop.
Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop. When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1. The displacements are
those in the addressing mode used to access each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero. For example,

                addl    $bar-foo, %eax
        foo:
                nop
        bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

        foo:
                addl    $bar-foo, %eax
                nop
        bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal. For example,

        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand". When only one label is used it's ok, and the label can be a
forward reference too, as for example,

                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:

These problems only affect PIC computed jump calculations. The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669. Available on-line,

        ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X.
These have details of calling
conventions and ELF shared library PIC coding. Versions of both are
available on-line,

        http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2. (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: