Copyright 1997, 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a separate
  bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1   iany    iany    mem     iany    iany    mem     iany    iany    mem
insn2   iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
insn3   mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
insn4   fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa

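The issue restrictions and the bundle table above can be modeled as a small
checker.  The following is only a sketch in C, not part of GMP: the enum, the
IADDLOG_CC tag (an add/sub/logical instruction that sets the condition codes)
and the function name bundle_ok are hypothetical, and the advice about taken
branches in slot 4 is deliberately not modeled because it is a recommendation
rather than a hard limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Instruction classes in the README's notation.  IANY is modeled as
   "SHIFT or IADDLOG"; IADDLOG_CC is a hypothetical tag for an IADDLOG
   instruction that sets the condition codes. */
enum insn_class { SHIFT, IADDLOG, IADDLOG_CC, MEM, FA, FM, BRANCH };

static bool is_integer(enum insn_class c)
{
    return c == SHIFT || c == IADDLOG || c == IADDLOG_CC;
}

/* Return true if the n instructions in b[], given in program order,
   satisfy the issue restrictions listed above. */
bool bundle_ok(const enum insn_class *b, size_t n)
{
    int nint = 0, nmem = 0, nfa = 0, nfm = 0, nbranch = 0;
    bool first_int_sets_cc = false;

    assert(b != NULL || n == 0);
    if (n > 4)
        return false;

    for (size_t i = 0; i < n; i++) {
        enum insn_class c = b[i];

        /* Integer and memory instructions must be insn 1, 2, or 3. */
        if ((is_integer(c) || c == MEM) && i == 3)
            return false;

        if (is_integer(c)) {
            if (nint == 0) {
                first_int_sets_cc = (c == IADDLOG_CC);
            } else if (nint == 1) {
                /* A shift must be the first integer instruction, and a
                   condition-code setter must be the second one. */
                if (c == SHIFT || first_int_sets_cc)
                    return false;
            }
            if (++nint > 2)
                return false;
        } else {
            switch (c) {
            case MEM:    if (++nmem > 1) return false;    break;
            case FA:     if (++nfa > 1) return false;     break;
            case FM:     if (++nfm > 1) return false;     break;
            case BRANCH: if (++nbranch > 1) return false; break;
            default:     break;
            }
        }
    }
    return true;
}
```

Under this model every column of the table is accepted, while a bundle such
as iaddlog, shift, mem, fa is rejected because the shift does not come first.
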
The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.

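To illustrate how a limb product can be formed with floating-point multiplies
only, here is a minimal C sketch.  It is an assumption-laden model, not the
assembly in this directory: it splits one operand into 16-bit pieces and the
other into 32-bit halves so that every partial product is below 2^48 and is
therefore exact in a double's 53-bit mantissa.  The names umul_fp and
acc_shifted are hypothetical, and the real code may split and sum the operands
differently.

```c
#include <assert.h>
#include <stdint.h>

/* Add p * 2^shift into the 128-bit accumulator hi:lo.  The caller must
   ensure that the shifted value fits in 128 bits. */
static void acc_shifted(uint64_t *hi, uint64_t *lo, uint64_t p, unsigned shift)
{
    uint64_t plo, phi;

    if (shift == 0) {
        plo = p;
        phi = 0;
    } else if (shift < 64) {
        plo = p << shift;
        phi = p >> (64 - shift);
    } else {
        plo = 0;
        phi = p << (shift - 64);
    }
    *lo += plo;
    *hi += phi + (*lo < plo);   /* propagate the carry out of the low word */
}

/* 64x64 -> 128-bit product using only double-precision multiplies. */
void umul_fp(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    double vpart[2] = { (double)(v & 0xffffffffu), (double)(v >> 32) };

    *hi = 0;
    *lo = 0;
    for (unsigned i = 0; i < 4; i++) {
        double upart = (double)((u >> (16 * i)) & 0xffff);

        for (unsigned j = 0; j < 2; j++) {
            double p = upart * vpart[j];        /* exact: below 2^48 */

            assert(p < 281474976710656.0);      /* 2^48 */
            acc_shifted(hi, lo, (uint64_t)p, 16 * i + 32 * j);
        }
    }
}
```

The eight multiplies are independent of each other, which is what lets a fully
pipelined multiplier overlap them, unlike a blocking integer multiply.
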
Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and has a
much lower latency.  It is still not pipelined, though, and thus useless for
our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 cycles/limb on UltraSPARC-3.  For UltraSPARC-1/2, the
  IEU0 functional unit is saturated with shifts.

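As a reference for what the shift routines compute, here is a plain C sketch
of a left shift over a limb vector.  The name ref_lshift is hypothetical and
the code is a model of the operation, not the assembly.  Each result limb
needs one left shift and one right shift, so with at most one shift issued per
cycle (see the restrictions above) two cycles per limb looks like the floor,
which is consistent with the 2.0 cycles/limb figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift the n-limb number at up[] left by cnt bits and store it at rp[].
   Returns the bits shifted out of the most significant limb.
   Assumes n >= 1 and 1 <= cnt <= 63. */
uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);

    unsigned tnc = 64 - cnt;
    uint64_t high = up[n - 1];
    uint64_t out = high >> tnc;

    /* Work from the most significant limb downwards. */
    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = up[i - 1];

        rp[i] = (high << cnt) | (low >> tnc);   /* two shifts per limb */
        high = low;
    }
    rp[0] = high << cnt;
    return out;
}
```
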
* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.

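The recurrence mentioned for mpn_add_n is the chain that carries from one
limb into the next.  The C sketch below shows that dependency in a hedged,
portable form.  The name ref_add_n is hypothetical, and the exact four
instructions on the critical path of the assembly are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[] = up[] + vp[] over n limbs; returns the carry out (0 or 1). */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                   size_t n)
{
    uint64_t cy = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i], v = vp[i];
        uint64_t s = u + v;         /* independent of the previous limb */
        uint64_t r = s + cy;        /* depends on the incoming carry */

        /* The new carry depends on r, and so on the old carry.  This
           loop-carried chain bounds the speed, however wide the issue. */
        cy = (s < u) | (r < s);
        rp[i] = r;
    }
    return cy;
}
```
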
* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not scheduled far enough apart for good L2
  cache performance.  The UltraSPARC-1/2 L1 cache is direct mapped, and since
  we use temporary stack slots that will conflict with the u and r operands,
  we miss to L2 very often.  The load-use of the std/ldx pairs via the stack
  is perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled.  (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simple-minded code that just calls mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.

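The two-pass composition described for mpn_submul_1 can be sketched in C as
follows.  This is an illustrative model only: the ref_ names and the
caller-supplied scratch area tp are hypothetical, and the integer multiply
helper stands in for whatever the real mpn_mul_1 does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable 64x64 -> 128-bit multiply built from 32-bit halves. */
static void umul(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;
    uint64_t v0 = v & 0xffffffffu, v1 = v >> 32;
    uint64_t p00 = u0 * v0, p01 = u0 * v1, p10 = u1 * v0, p11 = u1 * v1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (p00 & 0xffffffffu) | (mid << 32);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* rp[] = up[] * v; returns the high limb carried out. */
static uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n,
                          uint64_t v)
{
    uint64_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t hi, lo;

        umul(up[i], v, &hi, &lo);
        lo += cy;
        cy = hi + (lo < cy);    /* cannot overflow: hi <= 2^64 - 2 */
        rp[i] = lo;
    }
    return cy;
}

/* rp[] = up[] - vp[]; returns the borrow out (0 or 1). */
static uint64_t ref_sub_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    uint64_t bw = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i], v = vp[i];
        uint64_t d = u - v;

        rp[i] = d - bw;
        bw = (u < v) | (d < bw);
    }
    return bw;
}

/* rp[] -= up[] * v, done as two passes through the n-limb scratch tp[].
   Returns the limb to subtract from the next higher position. */
uint64_t ref_submul_1(uint64_t *rp, const uint64_t *up, size_t n,
                      uint64_t v, uint64_t *tp)
{
    assert(n >= 1);
    uint64_t cy = ref_mul_1(tp, up, n, v);

    return cy + ref_sub_n(rp, rp, tp, n);
}
```

Because every limb passes through both loops, the cost per limb is roughly
the sum of the two routines, which is why a fused loop could do better.
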
* US1/US2 cache conflict resolving.  The direct mapped L1 data cache of US1/US2
  is a problem for mul_1, addmul_1 (and a prospective submul_1).  We should
  allocate a larger cache area, and put the stack temp area in a place that
  doesn't cause cache conflicts.

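To make the conflict condition concrete, the following C sketch tests whether
two addresses compete for the same line of a direct-mapped cache.  The
function names are hypothetical, and the cache geometry is passed in as
parameters because this README does not state it.  The 16 KB cache with
32-byte lines used in the note below is an assumed example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the cache line that addr maps to in a direct-mapped cache. */
static uintptr_t dm_index(uintptr_t addr, uintptr_t cache_bytes,
                          uintptr_t line_bytes)
{
    assert(line_bytes > 0 && cache_bytes % line_bytes == 0);
    return (addr % cache_bytes) / line_bytes;
}

/* True if a and b lie in different memory lines that map to the same
   cache line, so that touching one evicts the other. */
bool dm_conflict(uintptr_t a, uintptr_t b, uintptr_t cache_bytes,
                 uintptr_t line_bytes)
{
    return dm_index(a, cache_bytes, line_bytes)
               == dm_index(b, cache_bytes, line_bytes)
           && a / line_bytes != b / line_bytes;
}
```

With the assumed geometry, dm_conflict(0x10000, 0x14000, 16384, 32) is true
because the addresses are exactly one cache size apart, while two addresses
32 bytes apart do not conflict.  A stack temp area could be placed by checking
candidate offsets against the operand addresses in this way.
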