Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

        mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions
to do integer operations, while the Itanium 2 allows all 6 to be integer
operations.
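
The issue limits above can be turned into a rough lower bound on the cycles
needed for one pass through a loop body.  The following C fragment is only a
sketch of that arithmetic, not part of GMP: the function names issue_bound
and loop_bound are made up for illustration, and the model assumes nothing
beyond the 6 issue slots per cycle and the per-chip integer limit stated
above.  It ignores latencies, bundle templates, memory and fp port limits,
and any overlap between iterations.

```c
#include <assert.h>

/* Whole cycles needed to issue `count' instructions when at most
   `per_cycle' of them can issue each cycle: ceil (count / per_cycle).
   Hypothetical helper, for illustration only.  */
unsigned long
issue_bound (unsigned long count, unsigned long per_cycle)
{
  assert (per_cycle > 0);
  return (count + per_cycle - 1) / per_cycle;
}

/* Lower bound, in whole cycles, for one pass through a loop body of
   `total' instructions, `intops' of which are integer operations.
   Assumes 6 issue slots per cycle (two bundles of three) and
   `int_per_cycle' integer operations per cycle: 4 for Itanium 1,
   6 for Itanium 2, as described in the text.  */
unsigned long
loop_bound (unsigned long total, unsigned long intops,
            unsigned long int_per_cycle)
{
  unsigned long by_slots = issue_bound (total, 6);
  unsigned long by_intops = issue_bound (intops, int_per_cycle);
  return by_slots > by_intops ? by_slots : by_intops;
}
```

For a hypothetical loop body of 12 instructions containing 10 integer
operations, loop_bound (12, 10, 4) gives 3 cycles under the Itanium 1 limit
and loop_bound (12, 10, 6) gives 2 cycles under the Itanium 2 limit.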

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 when the padding might be executed across.  That macro
suppresses any .align if the problem is detected by configure.  Lack of
alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

        ldf8 [up]
        xma.l
        xma.hu
        stf8  [wrp]

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

        ld8  [rp]
        ldf8 [up]
        xma.l
        xma.hu
        getf.sig
        add+add+cmp+cmp
        st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l (10
instructions at 6 per cycle), and with the 4 cycle latency of xma, this
means we need at least 3 blocks.  Using ldfp8 we could approach 1.583 c/l.

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.
For the Itanium 1, we need to take about 8 limbs at a time for full speed.
For the Itanium 2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use on the other codes is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.

/* First load the 8 values from v */
        ldfp8           v0, v1 = [r35], 16;;
        ldfp8           v2, v3 = [r35], 16;;
        ldfp8           v4, v5 = [r35], 16;;
        ldfp8           v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
        mov             lc = un
Loop:   ldf8            u0 = [r33], 8
        ld8             r0 = [r32]
        xma.l           lp0 = v0, u0, hp0
        xma.hu          hp0 = v0, u0, hp0
        xma.l           lp1 = v1, u0, hp1
        xma.hu          hp1 = v1, u0, hp1
        xma.l           lp2 = v2, u0, hp2
        xma.hu          hp2 = v2, u0, hp2
        xma.l           lp3 = v3, u0, hp3
        xma.hu          hp3 = v3, u0, hp3
        xma.l           lp4 = v4, u0, hp4
        xma.hu          hp4 = v4, u0, hp4
        xma.l           lp5 = v5, u0, hp5
        xma.hu          hp5 = v5, u0, hp5
        xma.l           lp6 = v6, u0, hp6
        xma.hu          hp6 = v6, u0, hp6
        xma.l           lp7 = v7, u0, hp7
        xma.hu          hp7 = v7, u0, hp7
        getf.sig        l0 = lp0
        getf.sig        l1 = lp1
        getf.sig        l2 = lp2
        getf.sig        l3 = lp3
        getf.sig        l4 = lp4
        getf.sig        l5 = lp5
        getf.sig        l6 = lp6
        add+cmp+add     xx, l0, r0
        add+cmp+add     acc0, acc1, l1
        add+cmp+add     acc1, acc2, l2
        add+cmp+add     acc2, acc3, l3
        add+cmp+add     acc3, acc4, l4
        add+cmp+add     acc4, acc5, l5
        add+cmp+add     acc5, acc6, l6
        getf.sig        acc6 = lp7
        st8             [r32] = xx, 8
        br.cloop Loop

        49 insn at max 6 insn/cycle:            8.167 cycles/limb8
        11 memops at max 2 memops/cycle:        5.5 cycles/limb8
        16 fpops at max 2 fpops/cycle:          8 cycles/limb8
        21 intops at max 4 intops/cycle:        5.25 cycles/limb8
        11+21 memops+intops at max 4/cycle:     8 cycles/limb8

================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on
Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).

================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2.  But that holds only for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm