//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* lmw/stmw pass a la the ARM load/store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

This code:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

should compile to something like:

        addc r3,r3,r4
        addze r3,r3

Instead, we get:

        add r3, r4, r3
        cmplw cr7, r3, r4
        mfcr r4         ; 1
        rlwinm r4, r4, 29, 31, 31
        add r3, r3, r4

Ick.

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:                ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:                ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz), but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE PIC object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants, obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize the address of .CPI_X into a register once,
then load each constant at an immediate offset from that register, avoiding
the repeated lis instructions.  This is even more important in PIC mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

It produces a basic block like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd.  This avoids repeated PIC accesses and also
reduces the size of the GOT on targets that have one.

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
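
As a source-level sketch of the idea (illustrative only; the pass would work on
LLVM IR, and these names are made up):

/* Before: each global needs its own PIC/GOT access. */
static int x;
static int y;
static int z;

/* After: one base address that can be materialized once and CSE'd; the members
   are then reached with small immediate offsets from that base. */
static struct { int x, y, z; } merged_globals;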

===-------------------------------------------------------------------------===

Fold add and sub with a constant into non-extern, non-weak addresses, so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer: a branch-free abs using the identity
abs(x) == (x ^ (x >> 31)) - (x >> 31).

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

PR5945: This:

define i32 @clamp0g(i32 %a) {
entry:
  %cmp = icmp slt i32 %a, 0
  %sel = select i1 %cmp, i32 0, i32 %a
  ret i32 %sel
}

is compiled to this with the PowerPC (32-bit) backend:

_clamp0g:
        cmpwi cr0, r3, 0
        li r2, 0
        blt cr0, LBB1_2
; %bb.1:        ; %entry
        mr r2, r3
LBB1_2:         ; %entry
        mr r3, r2
        blr

This could be reduced to the much simpler branch-free form
(computing x & ~(x >> 31)):

_clamp0g:
        srawi r2, r3, 31
        andc r3, r3, r2
        blr

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to GCC.

We could also strength-reduce the rem and the div, as sketched below:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
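
A source-level sketch of that strength reduction (illustrative only, and
assuming X > 0): the div and rem become a counter that wraps at X.

int foo(int N, int ***W, int **TK, int X) {
  int t, i;
  int q = 0, r = 0;            /* maintains q == t / X and r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) {            /* wrap the remainder, bump the quotient */
      r = 0;
      ++q;
    }
  }

  return 5;
}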

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw) code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw) code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw) code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
  %tmp0 = bitcast %struct.B* %b to i32*
  %tmp1 = load i32, i32* %tmp0
  %tmp3 = bitcast %struct.B* %b to i32*
  %tmp4 = load i32, i32* %tmp3
  %tmp8 = bitcast %struct.B* %b to i32*
  %tmp9 = load i32, i32* %tmp8
  %tmp4.mask17 = shl i32 %tmp4, 1
  %tmp1415 = and i32 %tmp4.mask17, 2147483648
  %tmp.masked = and i32 %tmp1, 2147483648
  %tmp11 = or i32 %tmp1415, %tmp.masked
  %tmp12 = and i32 %tmp9, 2147483647
  %tmp13 = or i32 %tmp12, %tmp11
  store i32 %tmp13, i32* %tmp8
  ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr
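
For reference, the whole function reduces to one line of C (an equivalent
rewrite for clarity, not part of the original report): OR the bit just below
the sign bit into the sign bit.

#include <stdint.h>

void bar(uint32_t *b) {
  /* (w & 0x7fffffff) | ((w << 1) & 0x80000000) | (w & 0x80000000)
     simplifies to w | ((w << 1) & 0x80000000). */
  *b |= (*b << 1) & 0x80000000u;
}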

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR
register.  This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instructions): spilling it to a GPR is quite
enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the 'no stack frame save and restore' is
a generalization of the LR optimization to 'callee-save regs'.

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked 'unspillable'.  When the register allocator came across an
unspillable reg, it would ask 'what is the best class to copy this into that I
*can* spill?'.  If it gets a class back, which it will in this case (the GPRs),
it grabs a free register of that class.  If it is then later necessary to spill
that reg, so be it."

===-------------------------------------------------------------------------===

We compile this:

int test(_Bool X) {
  return X ? 524288 : 0;
}

to:

_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:

_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We compile:

define i32 @bar(i32 %x) nounwind readnone ssp {
entry:
  %0 = icmp eq i32 %x, 0
  %neg = sext i1 %0 to i32
  ret i32 %neg
}

to:

_bar:
        cntlzw r2, r3
        slwi r2, r2, 26
        srawi r3, r2, 31
        blr

It would be better to produce:

_bar:
        addic r3,r3,-1
        subfe r3,r3,r3
        blr

===-------------------------------------------------------------------------===

We generate horrible PPC code for this:

#define N  2000000
double a[N], c[N];
void simpleloop() {
  int j;
  for (j = 0; j < N; j++)
    c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1          ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30        ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or something better), not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr
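
As an aside (an equivalent formulation, not from the original entry), the same
check can be written with the add-with-overflow builtin that GCC and Clang
provide, which maps directly onto the carry bit:

#include <stdbool.h>

bool no_overflow2(unsigned a, unsigned b) {
  unsigned sum;
  /* True when a + b does not wrap around. */
  return !__builtin_add_overflow(a, b, &sum);
}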

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:

#include <math.h>
int test(double x, double y)  { return islessequal(x, y);}
int test2(double x, double y) { return islessgreater(x, y);}
int test3(double x, double y) { return !islessequal(x, y);}

All three compile to similar code (only the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocrf.  See PR642 for some more context.

//===---------------------------------------------------------------------===//

void foo(float *data, float d) {
  long i;
  for (i = 0; i < 8000; i++)
    data[i] = d;
}
void foo2(float *data, float d) {
  long i;
  data--;
  for (i = 0; i < 8000; i++) {
    data[1] = d;
    data++;
  }
}

These compile to:

_foo:
        li r2, 0
LBB1_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB1_1 ; bb
        blr
_foo2:
        li r2, 0
LBB2_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB2_1 ; bb
        blr

The 'mr' could be eliminated by folding the add into the cmp better.

//===---------------------------------------------------------------------===//

Codegen for the following (low-probability) case deteriorated considerably
when the correctness fixes for unordered comparisons went in (PR642, r58871).
It should be possible to recover the code quality described in the comments.

; RUN: llvm-as < %s | llc -march=ppc32 | grep or | count 3
; This should produce one 'or' or 'cror' instruction per function.

; RUN: llvm-as < %s | llc -march=ppc32 | grep mfcr | count 3
; PR2964

define i32 @test(double %x, double %y) nounwind {
entry:
  %tmp3 = fcmp ole double %x, %y
  %tmp345 = zext i1 %tmp3 to i32
  ret i32 %tmp345
}

define i32 @test2(double %x, double %y) nounwind {
entry:
  %tmp3 = fcmp one double %x, %y
  %tmp345 = zext i1 %tmp3 to i32
  ret i32 %tmp345
}

define i32 @test3(double %x, double %y) nounwind {
entry:
  %tmp3 = fcmp ugt double %x, %y
  %tmp34 = zext i1 %tmp3 to i32
  ret i32 %tmp34
}

//===---------------------------------------------------------------------===//

For the following code:

void foo (float *__restrict__ a, int *__restrict__ b, int n) {
  a[n] = b[n] * 2.321;
}

we load b[n] into a GPR, then move it to a VSX register and convert it to
floating point.  We should instead use a VSX scalar integer load, avoiding the
direct moves.

//===---------------------------------------------------------------------===//

; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg

; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
; should not be generated except with -enable-finite-only-fp-math or the like).
; With the correctness fixes for PR642 (r58871) LowerSELECT_CC would need to
; recognize a more elaborate tree than a simple SETxx.

define double @test_FNEG_sel(double %A, double %B, double %C) {
  %D = fsub double -0.000000e+00, %A
  %Cond = fcmp ugt double %D, -0.000000e+00
  %E = select i1 %Cond, double %B, double %C
  ret double %E
}

//===---------------------------------------------------------------------===//

The save/restore sequence for CR in prolog/epilog is terrible:
- Each CR subreg is saved individually, rather than doing one save as a unit.
- On Darwin, the save is done after the decrement of SP, which means the offset
  from SP of the save slot can be too big for a store instruction, which means
  we need an additional register (currently hacked in r96015+r96020; the
  solution there is correct, but poor).
- On SVR4 the same thing can happen, and I don't think saving before the SP
  decrement is safe on that target, as there is no red zone.  This is currently
  broken AFAIK, although it's not a target I can exercise.

The following demonstrates the problem:

extern void bar(char *p);
void foo() {
  char x[100000];
  bar(x);
  __asm__("" ::: "cr2");
}

//===---------------------------------------------------------------------===//

The naming convention for instruction formats is very haphazard.
We have agreed on a naming scheme as follows:

<INST_form>{_<OP_type><OP_len>}+

Where:
INST_form is the instruction format (X-form, etc.)
OP_type   is the operand type - one of OPC (opcode),
                                       RD  (register destination),
                                       RS  (register source),
                                       RDp (destination register pair),
                                       RSp (source register pair),
                                       IM  (immediate),
                                       XO  (extended opcode)
OP_len    is the length of the operand in bits

For example, under this scheme an X-form instruction with two 5-bit register
operands and a 10-bit extended opcode might be named XForm_RD5_RS5_XO10 (a
hypothetical name, for illustration only).

VSX register operands would be of length 6 (split across two fields), and
condition register fields of length 3.
We would not need to denote reserved fields in the names of instruction
formats.

//===---------------------------------------------------------------------===//

Instruction fusion was introduced in ISA 2.06, and more opportunities were
added in ISA 2.07.  LLVM needs infrastructure to recognize fusion
opportunities and force the fused instruction pairs to be scheduled together.

//===---------------------------------------------------------------------===//

More general handling of any_extend and zero_extend:

See https://reviews.llvm.org/D24924#555306