Copyright 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

	x86               i386, generic
	x86/i486          i486
	x86/pentium       Intel Pentium (P5, P54)
	x86/pentium/mmx   Intel Pentium with MMX (P55)
	x86/p6            Intel Pentium Pro
	x86/p6/mmx        Intel Pentium II, III
	x86/p6/p3mmx      Intel Pentium III
	x86/k6            \ AMD K6
	x86/k6/mmx        /
	x86/k6/k62mmx     AMD K6-2
	x86/k7            \ AMD Athlon
	x86/k7/mmx        /
	x86/pentium4      \
	x86/pentium4/mmx  | Intel Pentium 4
	x86/pentium4/sse2 /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.



STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.



ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing.  The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4.  See comments in those files.

The code is meant for use with GNU "gas" or a system "as".  There's no
support for assemblers that demand Intel style code.



STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too.  For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct.  At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
	...
	deflit(`FRAME',4)
	...
	deflit(`FRAME',8)
	...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME.  If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code.  This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer.  For
example,

	defframe(SAVE_ESI,   -4)
	defframe(SAVE_EDI,   -8)
	defframe(VAR_COUNTER,-12)

	deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.
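The FRAME bookkeeping can be modelled outside m4 to check offsets by hand.
The following Python sketch is not part of GMP; the class FrameModel and its
method names are made up for illustration, and it only mimics what the
FRAME macros are described as doing above.

```python
class FrameModel:
    """Toy model of FRAME: bytes pushed on the stack since function entry."""

    def __init__(self):
        self.frame = 0                    # like deflit(`FRAME',0) at entry

    def pushl(self):
        self.frame += 4                   # effect of a pushl on FRAME

    def popl(self):
        self.frame -= 4                   # effect of a popl on FRAME

    def subl_esp(self, nbytes):
        self.frame += nbytes              # effect of "subl $n, %esp"

    def addl_esp(self, nbytes):
        self.frame -= nbytes              # effect of "addl $n, %esp"

    def operand(self, offset):
        """Operand that a defframe(NAME, offset) name stands for right now."""
        return "%d(%%esp)" % (self.frame + offset)
```

With a STACK_SPACE of 12 as in the example, after the subl this model puts
PARAM_DST (offset 4) at 16(%esp), SAVE_ESI (offset -4) at 8(%esp) and
VAR_COUNTER (offset -12) at 0(%esp).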

Definitions for pushed registers are only put in when they're going to be
used.  If registers are just saved and restored with pushes and pops then
definitions aren't made.



ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available; certainly
that's all the Solaris 8 "as" seems to accept.  If other expressions are
wanted then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

	addl	$32/2, %eax           <-- wrong

	addl	$eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.



ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available.  By convention "dnl" is also
used for comments about m4 macros.



TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance.  A separate counter for each macro makes it
possible to nest them, for instance movl_text_address() can be used within
an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature?  In any case this problem doesn't currently arise.



ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement.  These are
either for computed jumps or to get desirable code alignment.  Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it.  In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.
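To see why the two forms differ in length, the encodings can be worked out
from the standard IA-32 mod/rm layout.  The Python sketch below is only an
illustration and not how Zdisp() is implemented; the function name
movl_load and its parameters are hypothetical.

```python
# Register numbers used in the mod/rm byte.  %esp and %ebp are left out
# because as a base they need a SIB byte or a displacement respectively.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}

def movl_load(base, dest, zero_disp8=False):
    """Encode "movl (base), dest" or, with zero_disp8, "movl 0(base), dest"."""
    mod = 0b01 if zero_disp8 else 0b00    # 01 = a byte displacement follows
    modrm = (mod << 6) | (REG[dest] << 3) | REG[base]
    code = bytes([0x8B, modrm])           # 8B /r loads r32 from r/m32
    if zero_disp8:
        code += b"\x00"                   # the explicit zero displacement
    return code
```

Here "movl (%ebx), %eax" comes out as the two bytes 8B 03, and the
zero-displacement form as the three bytes 8B 43 00.  That extra byte is what
a computed jump or an alignment scheme would be relying on.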



SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
with those macros for usage.



IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

	imul	$12, %eax
	imul	$12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.

(This applies only to immediate operands; the three operand form is only
valid with an immediate.)



DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected.  Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear.  This happens only in a few places.



POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code.  This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses.  Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions.  The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp.  Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with a
    plain PC-relative call instruction, which is faster and smaller than going
    through the PLT, and data references can be similarly PC-relative, saving a
    GOT entry and fetch from there.  Unfortunately the normal linker behaviour
    doesn't allow us to do this.

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library.  This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so.  This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time.  It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded.  Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area.  Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries.  -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use.  It makes
    gcc generate plain PC-relative calls to indicated functions, and directs
    the linker to resolve references to the given function within the link
    module.

    Unfortunately, as of Debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations; references
    are not resolved at link time.  If the gcc description is to be believed
    this is not how it should work.  If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code.  Perhaps in the future the visibility attribute will work as
    we'd like.




GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix.  SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one.  Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
            addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works.  But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

            addl  $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

            addl  $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm.  GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

            leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.
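The "1" and "2" quoted above are the number of bytes that precede the 32-bit
field within each instruction.  The Python sketch below works that out from
the standard IA-32 encodings, on the assumption that this byte count is what
the relocation displacement is meant to equal.  The function names are
hypothetical and nothing here comes from gas itself.

```python
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def addl_imm32(reg):
    """Bytes preceding the immediate in "addl $imm32, %reg", and their count."""
    if reg == "eax":
        head = bytes([0x05])                    # short one byte form for %eax
    else:
        head = bytes([0x81, 0xC0 | REG[reg]])   # 81 /0 with a register mod/rm
    return head, len(head)

def leal_disp32(base, dest):
    """Bytes preceding the displacement in "leal disp32(%base), %dest"."""
    if base == "esp":
        raise ValueError("%esp as base needs a SIB byte, not modelled here")
    modrm = (0b10 << 6) | (REG[dest] << 3) | REG[base]   # mod 10 = disp32
    head = bytes([0x8D, modrm])
    return head, len(head)
```

For %eax the immediate starts 1 byte into the addl, for %ebx it starts 2
bytes in (81 C3), and in "leal sym(%edi), %ebx" the displacement starts 2
bytes in (8D 9F).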




SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster.  Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes.  To this end various
routines choose between a simple loop and an unrolled loop according to
operand size.  The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code.  The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC.  Something like the following is typical.

	deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold.  Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined.  Alternatively, just adjust the threshold up or down until
there are no more speedups.
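Once the two sets of measurements are in hand, reading off the crossover is
mechanical.  The Python sketch below is one way to do it and is not a GMP
tool; the function name unroll_threshold is hypothetical, and its inputs are
assumed to be tables mapping operand size to measured cycles.

```python
def unroll_threshold(simple, unrolled):
    """Smallest size from which the unrolled loop is never slower.

    simple and unrolled map operand size in limbs to measured cycles.
    Returns None if the unrolled loop loses at the largest size measured.
    """
    sizes = sorted(set(simple) & set(unrolled))
    threshold = None
    for size in reversed(sizes):          # walk down from the biggest size
        if unrolled[size] <= simple[size]:
            threshold = size
        else:
            break                         # the simple loop wins from here down
    return threshold
```

As an invented example, a simple loop costing 10+9*n cycles against an
unrolled loop costing 40+5*n cycles gives a threshold of 8 limbs.  Real
measurements are noisy, so the result is best treated as a starting point
for the up-and-down adjustment described above.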



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop.  Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop.  When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.
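The displacement ranges can be tabulated to confirm they fit in a signed
byte.  This Python sketch is illustrative only; the function name
limb_displacements is hypothetical, and the rule of biasing by 128 only when
more than 128 bytes are covered is taken from the description above.

```python
def limb_displacements(limbs_per_loop, bytes_per_limb=4):
    """Return (pointer bias, displacement of each limb in one loop pass)."""
    span = limbs_per_loop * bytes_per_limb
    if span > 256:
        raise ValueError("more than 256 bytes needs dword displacements")
    bias = 128 if span > 128 else 0       # added to pointers before the loop
    disps = [k * bytes_per_limb - bias for k in range(limbs_per_loop)]
    return bias, disps
```

For 64 limbs/loop this gives a bias of 128 and displacements -128 to +124,
and for 32 limbs/loop no bias and displacements 0 to 124.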



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

	S = operand size in limbs
	N = number of limbs per loop (UNROLL_COUNT)
	L = log2 of unrolling (UNROLL_LOG2)
	M = mask for unrolling (UNROLL_MASK)
	C = code bytes per limb in the loop
	B = bytes per limb (4 for x86)

	computed jump            (-S & M) * C + entrypoint
	subtract from pointers   (-S & M) * B
	initial loop counter     (S-1) >> L
	displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1.  The displacements are for
the addressing modes accessing each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

	add to pointers          (-S & M) * B
	displacements            0 to -B*(N-1)
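The formulas can be checked by simulating which limbs an unrolled loop would
touch.  The Python sketch below is a model for checking the arithmetic, not
GMP code; the function names are hypothetical, and it assumes M = N-1 with N
a power of two and that entering at the computed jump skips the first
(-S & M) limb slots of the first pass.

```python
def jump_offset(size, unroll_log2, code_bytes_per_limb):
    """Byte offset from the loop entrypoint for the computed jump."""
    mask = (1 << unroll_log2) - 1
    return ((-size) & mask) * code_bytes_per_limb

def limbs_accessed(size, unroll_log2, bytes_per_limb=4, descending=False):
    """Limb indices touched, in order, by the modelled unrolled loop."""
    n = 1 << unroll_log2                       # N, limbs per loop
    skip = (-size) & (n - 1)                   # slots jumped over at entry
    counter = (size - 1) >> unroll_log2        # initial loop counter
    step = -bytes_per_limb if descending else bytes_per_limb
    first = (size - 1) * bytes_per_limb if descending else 0
    pointer = first - skip * step              # subtract (or add) skip * B
    accessed = []
    slot = skip
    while counter >= 0:                        # stops once counter hits -1
        for k in range(slot, n):
            accessed.append((pointer + k * step) // bytes_per_limb)
        slot = 0                               # later passes run in full
        pointer += n * step
        counter -= 1
    return accessed
```

For example with S=5 and N=4 the jump skips 3 slots, the counter starts at
1, and the two passes touch limbs 0 then 1 to 4.  In the descending case the
same numbers touch limb 4 then 3 down to 0.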



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero.  For example,

		addl	$bar-foo, %eax
	foo:
		nop
	bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

	foo:
		addl	$bar-foo, %eax
		nop
	bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal.  For example,

	foo:
		nop
	bar:
		leal	bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand".  When only one label is used it's ok, and the label can be a
forward reference too, as for example,

		leal	foo(%eax,%ebx,8), %ecx
		nop
	foo:

These problems only affect PIC computed jump calculations.  The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669.  Available on-line,

	ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
conventions and ELF shared library PIC coding.  Versions of both available
on-line,

	http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: