Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.



This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.
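
The issue restrictions listed above can be sketched as a small checker.  This
is only an illustration written for this note, not code from the library: the
names insn_class, struct insn and bundle_ok are hypothetical, IANY is modelled
as "SHIFT or IADDLOG", and the advice about taken branches in slot 4 is not
checked because it is a recommendation rather than a hard limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes; IANY is SHIFT or IADDLOG. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

struct insn {
    enum insn_class cls;
    bool sets_cc;            /* true if the insn writes the condition codes */
};

/* Return true if the n instructions in b[] satisfy the listed
   restrictions for a single issue bundle.  Slot numbers are i + 1. */
static bool bundle_ok(const struct insn *b, size_t n)
{
    int nint = 0, nmem = 0, nfa = 0, nfm = 0, nbr = 0;
    bool cc_seen = false;

    assert(b != NULL || n == 0);
    if (n > 4)
        return false;                    /* at most four insns per cycle */

    for (size_t i = 0; i < n; i++) {
        switch (b[i].cls) {
        case SHIFT:
        case IADDLOG:
            if (i == 3)
                return false;            /* integer insns only in slots 1-3 */
            if (b[i].cls == SHIFT && nint > 0)
                return false;            /* one shift, and it must come first */
            if (cc_seen)
                return false;            /* cc-setting insn must be the second */
            if (b[i].sets_cc)
                cc_seen = true;
            if (++nint > 2)
                return false;            /* at most two integer insns */
            break;
        case MEM:
            if (i == 3 || ++nmem > 1)
                return false;            /* one MEM, in slots 1-3 */
            break;
        case FA:
            if (++nfa > 1) return false; /* one FA */
            break;
        case FM:
            if (++nfm > 1) return false; /* one FM */
            break;
        case BRANCH:
            if (++nbr > 1) return false; /* one branch */
            break;
        }
    }
    return true;
}
```

For instance, a bundle of shift, cc-setting add, load, fadd passes, while the
same bundle with the add placed before the shift is rejected.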

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1  iany     iany     mem      iany  iany     mem   iany  iany     mem
insn2  iaddlog  mem      iany     mem   iaddlog  iany  mem   iaddlog  iany
insn3  mem      iaddlog  iaddlog  fa    fa       fa    fm    fm       fm
insn4  fa/fm    fa/fm    fa/fm    fm    fm       fm    fa    fa       fa

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.

Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and has a
much lower latency.  It is still not pipelined, however, and thus useless for
our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 cycles/limb on UltraSPARC-3.  For UltraSPARC-1/2,
  the IEU0 functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.
  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not scheduled far enough apart for good L2
  cache performance.  The UltraSPARC-1/2 L1 cache is direct mapped, and since
  we use temporary stack slots that will conflict with the u and r operands,
  we miss L1 and go to L2 very often.  The load-use of the std/ldx pairs via
  the stack is perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled.  (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simple-minded code just calling mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.

* US1/US2 cache conflict resolution: The direct mapped L1 data cache of
  US1/US2 is a problem for mul_1, addmul_1 (and a prospective submul_1).  We
  should allocate a larger cache area, and put the stack temp area in a place
  that doesn't cause cache conflicts.
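
As an illustration of using floating-point operations in place of mulx, the
following C sketch forms a 64x64->128 product from double-precision multiplies
only.  The decomposition is an assumption made for this example (u split into
two 32-bit halves, v into four 16-bit pieces), chosen because each partial
product is below 2^48 and therefore exact in a double, and because adding two
such products gives the kind of 49-bit intermediate mentioned under
mpn_addmul_1.  The names acc128 and umul_fp are hypothetical, and the assembly
code in this directory may arrange the work differently.

```c
#include <stdint.h>

/* Add t * 2^s to the 128-bit accumulator (hi:lo).  Requires s < 128 and
   that t * 2^s fits in 128 bits. */
static void acc128(uint64_t *hi, uint64_t *lo, uint64_t t, unsigned s)
{
    uint64_t add_lo, add_hi;

    if (s == 0) {
        add_lo = t;
        add_hi = 0;
    } else if (s < 64) {
        add_lo = t << s;
        add_hi = t >> (64 - s);
    } else {
        add_lo = 0;
        add_hi = t << (s - 64);
    }
    *lo += add_lo;
    *hi += add_hi + (*lo < add_lo);   /* propagate carry out of the low word */
}

/* Compute u * v as a 128-bit result using only fp multiplies for the
   partial products.  Each product is 32 bits x 16 bits, i.e. < 2^48,
   so it is represented exactly in the 53-bit double mantissa. */
static void umul_fp(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    double u0 = (double)(uint32_t)u;           /* low 32 bits of u */
    double u1 = (double)(uint32_t)(u >> 32);   /* high 32 bits of u */
    uint64_t p0[4], p1[4];

    for (int j = 0; j < 4; j++) {
        double vj = (double)(uint16_t)(v >> (16 * j));  /* 16-bit piece of v */
        p0[j] = (uint64_t)(u0 * vj);
        p1[j] = (uint64_t)(u1 * vj);
    }

    *hi = 0;
    *lo = 0;
    acc128(hi, lo, p0[0], 0);
    acc128(hi, lo, p0[1], 16);
    acc128(hi, lo, p0[2] + p1[0], 32);   /* sum of two 48-bit values: 49 bits */
    acc128(hi, lo, p0[3] + p1[1], 48);   /* likewise 49 bits */
    acc128(hi, lo, p1[2], 64);
    acc128(hi, lo, p1[3], 80);
}
```

Because the eight multiplies are independent, they can be issued back to back
into a pipelined fp multiplier; the integer work that remains is the summation
of the partial products.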