Copyright 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

        x86                  i386, generic
        x86/i486             i486
        x86/pentium          Intel Pentium (P5, P54)
        x86/pentium/mmx      Intel Pentium with MMX (P55)
        x86/p6               Intel Pentium Pro
        x86/p6/mmx           Intel Pentium II, III
        x86/p6/p3mmx         Intel Pentium III
        x86/k6               \ AMD K6
        x86/k6/mmx           /
        x86/k6/k62mmx        AMD K6-2
        x86/k7               \ AMD Athlon
        x86/k7/mmx           /
        x86/pentium4         \
        x86/pentium4/mmx     | Intel Pentium 4
        x86/pentium4/sse2    /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.



STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.



ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing.  The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4.  See comments in those files.

The code is meant for use with GNU "gas" or a system "as".  There's no
support for assemblers that demand Intel style code.



STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too.  For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct.  At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
        ...
        deflit(`FRAME',4)
        ...
        deflit(`FRAME',8)
        ...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME.  If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code.  This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.
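
The FRAME bookkeeping above can be modelled in a few lines of C.  This is
only a sketch for illustration, not GMP code: struct frame_model,
model_pushl(), model_popl() and esp_disp() are hypothetical names, and the
enum merely repeats the mpn_mul_1() offsets shown above.

```c
#include <assert.h>

/* Hypothetical model, not GMP code: "frame" plays the role of FRAME, the
   number of bytes pushed on the stack since routine entry.  */
struct frame_model {
    int frame;
};

/* Effect of a pushl on FRAME, as FRAME_pushl() would record it.  */
static void model_pushl(struct frame_model *f)
{
    f->frame += 4;
}

/* Effect of a popl on FRAME, as FRAME_popl() would record it.  */
static void model_popl(struct frame_model *f)
{
    assert(f->frame >= 4);
    f->frame -= 4;
}

/* The n in "n(%esp)" for a slot declared as defframe(NAME, offset).  */
static int esp_disp(const struct frame_model *f, int offset)
{
    return f->frame + offset;
}

/* Offsets copied from the mpn_mul_1() defframe() lines.  */
enum { PARAM_DST = 4, PARAM_SRC = 8, PARAM_SIZE = 12, PARAM_MULTIPLIER = 16 };
```

With FRAME at zero PARAM_MULTIPLIER is 16(%esp), and after two pushes it is
24(%esp), which is why a stale FRAME only matters when a parameter macro is
actually used afterwards.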

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer.  For
example,

        defframe(SAVE_ESI,   -4)
        defframe(SAVE_EDI,   -8)
        defframe(VAR_COUNTER,-12)

        deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.
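
As a sketch of how the negative offsets resolve once FRAME has been set to
STACK_SPACE, the arithmetic can be written out in C.  frame_disp() is a
hypothetical helper for illustration; the offsets are those from the
example above, plus PARAM_DST = 4 assumed as a typical first parameter.

```c
#include <assert.h>

/* Offsets from the defframe() example; PARAM_DST is an assumed parameter.  */
enum {
    SAVE_ESI = -4,
    SAVE_EDI = -8,
    VAR_COUNTER = -12,
    PARAM_DST = 4,
    STACK_SPACE = 12
};

/* The n in "n(%esp)" for a defframe() offset, when "frame" bytes lie
   between the entry stack pointer and the current %esp.  */
static int frame_disp(int frame, int offset)
{
    assert(frame >= 0);
    /* A negative result would be a slot below %esp, not yet allocated.  */
    assert(frame + offset >= 0);
    return frame + offset;
}
```

After the subl, SAVE_ESI is 8(%esp), SAVE_EDI is 4(%esp), VAR_COUNTER is
0(%esp), and the first parameter has moved out to 16(%esp).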

Definitions for pushed registers are only put in when they're going to be
used.  If registers are just saved and restored with pushes and pops then
definitions aren't made.



ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available; certainly
that's all the Solaris 8 "as" seems to accept.  If anything more is wanted
then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.



ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available.  By convention "dnl" is also
used for comments about m4 macros.



TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance.  A separate counter for each macro makes it
possible to nest them, for instance movl_text_address() can be used within
an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature?  In any case this problem doesn't currently arise.



ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement.  These are
either for computed jumps or to get desirable code alignment.  Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it.  In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.
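
To make the one byte difference concrete, the following C sketch builds the
two encodings of a 32-bit load by hand.  It is an illustration only and not
how Zdisp() is implemented; encode_movl_load() and the REG_ names are
hypothetical, and only the standard opcode 0x8B (movl r/m32 to r32) and
ModRM layout are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard x86 register numbers as used in the ModRM byte.  */
enum { REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3,
       REG_ESI = 6, REG_EDI = 7 };

/* Encode "movl (base), dst", or "movl 0(base), dst" with an explicit
   byte-sized zero displacement when force_disp8 is nonzero.  Returns the
   number of bytes written to out.  %esp and %ebp bases are excluded since
   they need a SIB byte or have a special meaning with mod=00.  */
static size_t encode_movl_load(uint8_t *out, int base, int dst,
                               int force_disp8)
{
    size_t n = 0;

    assert(base >= 0 && base < 8 && base != 4 && base != 5);
    assert(dst >= 0 && dst < 8);

    out[n++] = 0x8B;                             /* movl r/m32, r32 */
    out[n++] = (uint8_t) ((force_disp8 ? 0x40 : 0x00)  /* mod 01 or 00 */
                          | (dst << 3) | base);
    if (force_disp8)
        out[n++] = 0x00;                         /* the zero disp8 */
    return n;
}
```

"movl (%ebx), %eax" comes out as 8B 03 and "movl 0(%ebx), %eax" as
8B 43 00, so an assembler that drops the zero shortens the instruction by
one byte and upsets any code size calculation depending on it.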



SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
with those macros for usage.



IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

        imul    $12, %eax
        imul    $12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.

(This applies only to immediate operands; the three operand form is only
valid with an immediate.)



DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected.  Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear.  This happens only in a few places.



POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code.  This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses.  Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions.  The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp.  Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with a
    plain PC-relative call instruction, which is faster and smaller than going
    through the PLT, and data references can be similarly PC-relative, saving a
    GOT entry and fetch from there.  Unfortunately the normal linker behaviour
    doesn't allow us to do this.

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library.  This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so.  This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time.  It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded.  Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area.  Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries.  -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use.  It makes
    gcc generate plain PC-relative calls to indicated functions, and directs
    the linker to resolve references to the given function within the link
    module.

    Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations, references
    are not resolved at link time.  If the gcc description is to be believed
    this is not how it should work.  If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).
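
    For reference, this is roughly how the attribute is spelled in C.  It
    is a sketch only: example_internal_add() and example_caller() are
    hypothetical names standing in for libgmp functions, and whether the
    call ends up free of PLT and text relocations depends on the gcc and
    binutils versions, as described above.

```c
#include <assert.h>

/* Hypothetical function standing in for a libgmp routine.  "protected"
   keeps the symbol exported but says references from within the same
   module may not be overridden by another module.  */
__attribute__ ((visibility ("protected")))
int example_internal_add (int a, int b)
{
    return a + b;
}

/* A caller in the same module.  With the attribute honoured by the
   toolchain, this call is a candidate for a plain PC-relative call.  */
int example_caller (int x)
{
    assert (x >= 0);
    return example_internal_add (x, 1);
}
```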

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code.  Perhaps in the future the visibility attribute will work as
    we'd like.




GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix.  SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one.  Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
                addl    $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works.  But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

                addl    $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

                addl    $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm.  GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

                leal    _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.
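
The displacements quoted above are just the number of instruction bytes
ahead of the 32-bit field being relocated.  The C sketch below spells that
out, assuming the assembler picks the short 0x05 encoding for %eax and the
0x81 /0 encoding for other registers.  The function names and REG_ values
are hypothetical, for illustration only.

```c
#include <assert.h>

/* Standard x86 register numbers.  */
enum { REG_EAX = 0, REG_EBX = 3, REG_ESP = 4, REG_EDI = 7 };

/* Bytes preceding the imm32 field in "addl $imm32, %reg".  %eax has a
   one byte short form (05 imm32), other registers take an opcode plus
   mod/rm byte (81 /0 imm32).  */
static int addl_imm32_field_offset(int reg)
{
    assert(reg >= 0 && reg < 8);
    return reg == REG_EAX ? 1 : 2;
}

/* Bytes preceding the disp32 field in "leal disp32(%base), %dst", which
   is opcode 8D plus a mod/rm byte.  An %esp base would need a SIB byte
   too, so it's excluded here.  */
static int leal_disp32_field_offset(int base)
{
    assert(base >= 0 && base < 8 && base != REG_ESP);
    return 2;
}
```

So a correct R_386_GOTPC needs 1 for the %eax form of addl and 2 for the
%ebx form and for the leal, whereas gas 2.10 as described produces 2 and 0
for the first and last of these.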




SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster.  Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes.  To this end various
routines choose between a simple loop and an unrolled loop according to
operand size.  The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code.  The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC.  Something like the following is typical.

        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold.  Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined.  Alternatively, just adjust the threshold up or down
until there are no more speedups.
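
The shape of such a routine can be sketched in C.  This is not the GMP
code, which is in assembler; example_mul_1(), mul_1_simple(),
mul_1_unrolled() and mul_step() are hypothetical names, the limb is assumed
to be 32 bits, and the threshold of 8 is simply the non-PIC value from the
deflit above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL_THRESHOLD 8      /* assumed value, to be tuned by measurement */

/* One limb of multiply-and-carry: stores the low word, returns the high.  */
static inline uint32_t mul_step(uint32_t *d, uint32_t s, uint32_t m,
                                uint32_t carry)
{
    uint64_t p = (uint64_t) s * m + carry;
    *d = (uint32_t) p;
    return (uint32_t) (p >> 32);
}

/* Simple loop, minimal setup, used for small sizes.  */
static uint32_t mul_1_simple(uint32_t *dst, const uint32_t *src,
                             size_t size, uint32_t m)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < size; i++)
        carry = mul_step(&dst[i], src[i], m, carry);
    return carry;
}

/* Loop unrolled by 4, with a tail loop standing in for the setup cost.  */
static uint32_t mul_1_unrolled(uint32_t *dst, const uint32_t *src,
                               size_t size, uint32_t m)
{
    uint32_t carry = 0;
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        carry = mul_step(&dst[i],     src[i],     m, carry);
        carry = mul_step(&dst[i + 1], src[i + 1], m, carry);
        carry = mul_step(&dst[i + 2], src[i + 2], m, carry);
        carry = mul_step(&dst[i + 3], src[i + 3], m, carry);
    }
    for (; i < size; i++)
        carry = mul_step(&dst[i], src[i], m, carry);
    return carry;
}

/* One conditional jump chooses the path, as UNROLL_THRESHOLD does.  */
static uint32_t example_mul_1(uint32_t *dst, const uint32_t *src,
                              size_t size, uint32_t m)
{
    assert(size >= 1);
    if (size < UNROLL_THRESHOLD)
        return mul_1_simple(dst, src, size, m);
    return mul_1_unrolled(dst, src, size, m);
}
```

Forcing UNROLL_THRESHOLD to 1 and then to a large value gives the two
timing curves mentioned above, from which the crossover can be read off.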



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop.  Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop.  When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.
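
The displacement arithmetic can be checked with a small C sketch.
pointer_bias() and limb_disp() are hypothetical helpers for illustration,
assuming 4-byte limbs and a bias applied only when the loop body spans more
than 128 bytes.

```c
#include <assert.h>

enum { BYTES_PER_LIMB = 4 };

/* Bytes added to the pointers before entering the loop: 128 when the
   unrolled body covers more than 128 bytes, otherwise nothing.  */
static int pointer_bias(int unroll_count)
{
    assert(unroll_count >= 1 && unroll_count <= 64);
    return unroll_count * BYTES_PER_LIMB > 128 ? 128 : 0;
}

/* Displacement used for limb i (counting from the top of the loop body).
   The second assert checks it fits a signed byte.  */
static int limb_disp(int unroll_count, int i)
{
    int disp;

    assert(i >= 0 && i < unroll_count);
    disp = i * BYTES_PER_LIMB - pointer_bias(unroll_count);
    assert(disp >= -128 && disp <= 127);
    return disp;
}
```

At 64 limbs/loop the displacements run from -128 to +124, and at 32
limbs/loop from 0 to +124, so every access fits a byte displacement.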



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1.  The displacements are for
the addressing modes accessing each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)
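
The least-significant-first calculations can be modelled in C with a
switch jumping into the middle of an unrolled loop, in the manner of Duff's
device.  This is a sketch and not GMP code: inc_limbs_unrolled() is a
hypothetical routine, N is fixed at 4, and the pointer adjustment is
expressed as a signed index in limbs rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    UNROLL_LOG2 = 2,                      /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,      /* N */
    UNROLL_MASK = UNROLL_COUNT - 1        /* M */
};

/* Hypothetical example: dst[i] = src[i] + 1 for 0 <= i < size.  */
static void inc_limbs_unrolled(uint32_t *dst, const uint32_t *src,
                               size_t size)
{
    size_t skip;
    ptrdiff_t base, counter;

    assert(size >= 1);

    skip = (0 - size) & UNROLL_MASK;      /* (-S & M), limbs to skip */
    base = -(ptrdiff_t) skip;             /* "subtract from pointers" */
    counter = (ptrdiff_t) ((size - 1) >> UNROLL_LOG2);

    /* The switch is the computed jump, entering the loop body "skip"
       limbs in, so the first pass handles only N - skip limbs.  */
    switch (skip) {
    case 0: do { dst[base + 0] = src[base + 0] + 1;
    case 1:      dst[base + 1] = src[base + 1] + 1;
    case 2:      dst[base + 2] = src[base + 2] + 1;
    case 3:      dst[base + 3] = src[base + 3] + 1;
                 base += UNROLL_COUNT;
            } while (--counter >= 0);     /* stops when counter hits -1 */
    }
}
```

For S=5 the skip is 3 and the counter starts at 1, so the first pass does
one limb and the second pass four.  For S=4 the skip is 0 and the counter
starts at 0, giving a single full pass.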



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero.  For example,

                addl    $bar-foo, %eax
        foo:
                nop
        bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

        foo:
                addl    $bar-foo, %eax
                nop
        bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal.  For example,

        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand".  When only one label is used it's ok, and the label can be a
forward reference too, as for example,

                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:

These problems only affect PIC computed jump calculations.  The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669.  Available on-line,

        ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
conventions and ELF shared library PIC coding.  Versions of both available
on-line,

        http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End:
