Copyright 1996, 1997, 1999-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by the unwinder.  It also says whether $27 is really needed
for computing the gp.  The `.mask M' pseudo op says which registers are saved
on the stack, and at what offset in the frame.

Cray T3 code is very very different...

"$6" / "$f6" etc. is the usual syntax for registers, but Unicos requires "r6"
/ "f6" instead.  We use the "r6" / "f6" forms, and have m4 defines expand them
to "$6" or "$f6" where necessary.

"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
required.  The X() macro accommodates this difference.

"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
necessary.

"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
in the Unicos assembler.  The full "ornot" must be used.

"unop" is not available in Unicos.  We make an m4 define to the usual "ldq_u
r31,0(r30)", and in fact use that define on all systems since it comes out the
same.

"!literal!123" etc. explicit relocations as per Tru64 4.0 are apparently not
available in older alpha assemblers (including gas prior to 2.12), according
to the GCC manual, so the assembler macro forms must be used (e.g. ldgp).



RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to the
   off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   cycles are pipelined.  Thus, multiply instructions can be issued at a rate
   of one every 21 cycles.

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but two cycles after a
   store, no load can issue.

2. mulq has a latency of 12 cycles and an issue rate of one every 8 cycles.
   umulh has a latency of 14 cycles and an issue rate of one every 10 cycles.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, of which 12
   are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

              ldq  ldq
                \  /\
         (or)   addq |
          |\   /   \ |
          | addq  cmpult
           \  |     |
            cmpult  |
                \  /
                 or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
   might waste a cycle on EV4.  The total depth remains unaffected, since cmov
   has a latency of 2 cycles.

               addq
              /    \
           addq  cmpult
             |      \
           cmpult -> cmovne

   Montgomery has a slightly different way of computing carry that requires one
   fewer instruction, but has depth 4 (instead of the current 3).  Since the
   code is currently instruction issue bound, Montgomery's idea should save us
   1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
   cycles/limb.  Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every other multiply and
   integer operations for the remaining multiplies.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One integer
multiply instruction can issue each cycle.  To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.
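
The carry propagation dependency can be illustrated with a minimal C sketch of
an mpn_add_n style loop.  This is not the assembly in this directory nor GMP's
actual C code; the names limb_t and sketch_add_n are hypothetical, and a 64-bit
limb is assumed.  The comments map each statement to the Alpha operation it
would roughly correspond to, and mark which operations sit between carry-in
and carry-out.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a 64-bit mp_limb_t */

/* Hypothetical helper (not GMP's mpn_add_n): add {up,n} and {vp,n}, store
   the n-limb sum at rp, and return the carry-out (0 or 1).  */
limb_t
sketch_add_n (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      limb_t u = up[i];     /* ldq */
      limb_t v = vp[i];     /* ldq */
      limb_t s = u + v;     /* addq, does not depend on carry-in */
      limb_t c1 = s < v;    /* cmpult, does not depend on carry-in */
      limb_t r = s + cy;    /* addq, 1st operation after carry-in */
      limb_t c2 = r < cy;   /* cmpult, 2nd operation after carry-in */
      cy = c1 | c2;         /* or, 3rd operation, produces carry-out */
      rp[i] = r;            /* stq */
    }
  return cy;
}
```

Only the last three operations of each iteration depend on the incoming carry,
so the loads and the first addq/cmpult pair of later limbs can be computed
ahead of time; the serial part is the addq, cmpult, or chain.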

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable.  Avoiding serialization
due to carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in to a vector point until we can generate carry-out.

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
splits the mapping process (see pg 2-26 in cmpwrgd.pdf), which suggests that a
CMOV should always be placed as the last instruction of an aligned
4-instruction block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for
consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.



REFERENCES

"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
EC-QD2KC-TE.

"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
order number EC-QP99C-TE.

"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
Compaq, September 2000, order number DS-0028B-TE.

"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
EC-RJ66A-TE.

All of the above are available online from

  http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
  ftp://ftp.compaq.com/pub/products/alphaCPUdocs

"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
number AA-PS31D-TE.

"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
March 1996, part number AA-PY8AC-TE.

The above are available online,

  http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM

(Dunno what h30097 means in this URL, but if it moves try searching for "tru64
online documentation" from the main www.hp.com page.)



----------------
Local variables:
mode: text
fill-column: 79
End: