1================================================== 2How to manually use the Individual pieces of Polly 3================================================== 4 5Execute the individual Polly passes manually 6============================================ 7 8.. sectionauthor:: Singapuram Sanjay Srivallabh 9 10This example presents the individual passes that are involved when optimizing 11code with Polly. We show how to execute them individually and explain for 12each which analysis is performed or what transformation is applied. In this 13example the polyhedral transformation is user-provided to show how much 14performance improvement can be expected by an optimal automatic optimizer. 15 161. **Create LLVM-IR from the C code** 17------------------------------------- 18 Polly works on LLVM-IR. Hence it is necessary to translate the source 19 files into LLVM-IR. If more than one file should be optimized the 20 files can be combined into a single file with llvm-link. 21 22 .. code-block:: console 23 24 clang -S -emit-llvm matmul.c -Xclang -disable-O0-optnone -o matmul.ll 25 26 272. **Prepare the LLVM-IR for Polly** 28------------------------------------ 29 30 Polly is only able to work with code that matches a canonical form. 31 To translate the LLVM-IR into this form we use a set of 32 canonicalization passes. They are scheduled by using 33 '-polly-canonicalize'. 34 35 .. code-block:: console 36 37 opt -S -polly-canonicalize matmul.ll -o matmul.preopt.ll 38 393. **Show the SCoPs detected by Polly (optional)** 40-------------------------------------------------- 41 42 To understand if Polly was able to detect SCoPs, we print the structure 43 of the detected SCoPs. In our example two SCoPs are detected. One in 44 'init_array' the other in 'main'. 45 46 .. code-block:: console 47 48 $ opt -basic-aa -polly-ast -analyze matmul.preopt.ll -polly-process-unprofitable -polly-use-llvm-names 49 50 .. code-block:: text 51 52 :: isl ast :: init_array :: %for.cond1.preheader---%for.end19 53 54 if (1) 55 56 for (int c0 = 0; c0 <= 1535; c0 += 1) 57 for (int c1 = 0; c1 <= 1535; c1 += 1) 58 Stmt_for_body3(c0, c1); 59 60 else 61 { /* original code */ } 62 63 :: isl ast :: main :: %for.cond1.preheader---%for.end30 64 65 if (1) 66 67 for (int c0 = 0; c0 <= 1535; c0 += 1) 68 for (int c1 = 0; c1 <= 1535; c1 += 1) { 69 Stmt_for_body3(c0, c1); 70 for (int c2 = 0; c2 <= 1535; c2 += 1) 71 Stmt_for_body8(c0, c1, c2); 72 } 73 74 else 75 { /* original code */ } 76 774. **Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)** 78---------------------------------------------------------------------------------------- 79 80 Polly can use graphviz to graphically show a CFG in which the detected 81 SCoPs are highlighted. It can also create '.dot' files that can be 82 translated by the 'dot' utility into various graphic formats. 83 84 85 .. code-block:: console 86 87 $ opt -polly-use-llvm-names -basic-aa -view-scops -disable-output matmul.preopt.ll 88 $ opt -polly-use-llvm-names -basic-aa -view-scops-only -disable-output matmul.preopt.ll 89 90 The output for the different functions: 91 92 - view-scops : main_, init_array_, print_array_ 93 - view-scops-only : main-scopsonly_, init_array-scopsonly_, print_array-scopsonly_ 94 95.. _main: http://polly.llvm.org/experiments/matmul/scops.main.dot.png 96.. _init_array: http://polly.llvm.org/experiments/matmul/scops.init_array.dot.png 97.. _print_array: http://polly.llvm.org/experiments/matmul/scops.print_array.dot.png 98.. _main-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.main.dot.png 99.. _init_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.init_array.dot.png 100.. _print_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.print_array.dot.png 101 1025. **View the polyhedral representation of the SCoPs** 103------------------------------------------------------ 104 105 .. code-block:: console 106 107 $ opt -polly-use-llvm-names -basic-aa -polly-scops -analyze matmul.preopt.ll -polly-process-unprofitable 108 109 .. code-block:: text 110 111 [...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end19' in function 'init_array': 112 Function: init_array 113 Region: %for.cond1.preheader---%for.end19 114 Max Loop Depth: 2 115 Invariant Accesses: { 116 } 117 Context: 118 { : } 119 Assumed Context: 120 { : } 121 Invalid Context: 122 { : 1 = 0 } 123 Arrays { 124 float MemRef_A[*][1536]; // Element size 4 125 float MemRef_B[*][1536]; // Element size 4 126 } 127 Arrays (Bounds as pw_affs) { 128 float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4 129 float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4 130 } 131 Alias Groups (0): 132 n/a 133 Statements { 134 Stmt_for_body3 135 Domain := 136 { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 }; 137 Schedule := 138 { Stmt_for_body3[i0, i1] -> [i0, i1] }; 139 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] 140 { Stmt_for_body3[i0, i1] -> MemRef_A[i0, i1] }; 141 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] 142 { Stmt_for_body3[i0, i1] -> MemRef_B[i0, i1] }; 143 } 144 [...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end30' in function 'main': 145 Function: main 146 Region: %for.cond1.preheader---%for.end30 147 Max Loop Depth: 3 148 Invariant Accesses: { 149 } 150 Context: 151 { : } 152 Assumed Context: 153 { : } 154 Invalid Context: 155 { : 1 = 0 } 156 Arrays { 157 float MemRef_C[*][1536]; // Element size 4 158 float MemRef_A[*][1536]; // Element size 4 159 float MemRef_B[*][1536]; // Element size 4 160 } 161 Arrays (Bounds as pw_affs) { 162 float MemRef_C[*][ { [] -> [(1536)] } ]; // Element size 4 163 float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4 164 float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4 165 } 166 Alias Groups (0): 167 n/a 168 Statements { 169 Stmt_for_body3 170 Domain := 171 { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 }; 172 Schedule := 173 { Stmt_for_body3[i0, i1] -> [i0, i1, 0, 0] }; 174 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] 175 { Stmt_for_body3[i0, i1] -> MemRef_C[i0, i1] }; 176 Stmt_for_body8 177 Domain := 178 { Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1535 }; 179 Schedule := 180 { Stmt_for_body8[i0, i1, i2] -> [i0, i1, 1, i2] }; 181 ReadAccess := [Reduction Type: NONE] [Scalar: 0] 182 { Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] }; 183 ReadAccess := [Reduction Type: NONE] [Scalar: 0] 184 { Stmt_for_body8[i0, i1, i2] -> MemRef_A[i0, i2] }; 185 ReadAccess := [Reduction Type: NONE] [Scalar: 0] 186 { Stmt_for_body8[i0, i1, i2] -> MemRef_B[i2, i1] }; 187 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] 188 { Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] }; 189 } 190 191 1926. **Show the dependences for the SCoPs** 193----------------------------------------- 194 195 .. code-block:: console 196 197 $ opt -basic-aa -polly-use-llvm-names -polly-dependences -analyze matmul.preopt.ll -polly-process-unprofitable 198 199 .. code-block:: text 200 201 [...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end19' in function 'init_array': 202 RAW dependences: 203 { } 204 WAR dependences: 205 { } 206 WAW dependences: 207 { } 208 Reduction dependences: 209 n/a 210 Transitive closure of reduction dependences: 211 { } 212 [...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end30' in function 'main': 213 RAW dependences: 214 { Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 } 215 WAR dependences: 216 { } 217 WAW dependences: 218 { Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 } 219 Reduction dependences: 220 n/a 221 Transitive closure of reduction dependences: 222 { } 223 2247. **Export jscop files** 225------------------------- 226 227 .. code-block:: console 228 229 $ opt -basic-aa -polly-use-llvm-names -polly-export-jscop matmul.preopt.ll -polly-process-unprofitable 230 231 .. code-block:: text 232 233 [...]Writing JScop '%for.cond1.preheader---%for.end19' in function 'init_array' to './init_array___%for.cond1.preheader---%for.end19.jscop'. 234 235 Writing JScop '%for.cond1.preheader---%for.end30' in function 'main' to './main___%for.cond1.preheader---%for.end30.jscop'. 236 237 238 2398. **Import the changed jscop files and print the updated SCoP structure (optional)** 240------------------------------------------------------------------------------------- 241 242 Polly can reimport jscop files, in which the schedules of the statements 243 are changed. These changed schedules are used to describe 244 transformations. It is possible to import different jscop files by 245 providing the postfix of the jscop file that is imported. 246 247 We apply three different transformations on the SCoP in the main 248 function. The jscop files describing these transformations are 249 hand written (and available in docs/experiments/matmul). 250 251 **No Polly** 252 253 As a baseline we do not call any Polly code generation, but only apply the normal -O3 optimizations. 254 255 .. code-block:: console 256 257 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-ast -analyze -polly-process-unprofitable 258 259 .. code-block:: c 260 261 [...] 262 :: isl ast :: main :: %for.cond1.preheader---%for.end30 263 264 if (1) 265 266 for (int c0 = 0; c0 <= 1535; c0 += 1) 267 for (int c1 = 0; c1 <= 1535; c1 += 1) { 268 Stmt_for_body3(c0, c1); 269 for (int c3 = 0; c3 <= 1535; c3 += 1) 270 Stmt_for_body8(c0, c1, c3); 271 } 272 273 else 274 { /* original code */ } 275 [...] 276 277 **Loop Interchange (and Fission to allow the interchange)** 278 279 We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_for_body8. 280 281 .. Although I feel (and have created a .jscop) we can avoid splitting the loops. 282 283 .. code-block:: console 284 285 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-ast -analyze -polly-process-unprofitable 286 287 .. code-block:: c 288 289 [...] 290 :: isl ast :: main :: %for.cond1.preheader---%for.end30 291 292 if (1) 293 294 { 295 for (int c1 = 0; c1 <= 1535; c1 += 1) 296 for (int c2 = 0; c2 <= 1535; c2 += 1) 297 Stmt_for_body3(c1, c2); 298 for (int c1 = 0; c1 <= 1535; c1 += 1) 299 for (int c2 = 0; c2 <= 1535; c2 += 1) 300 for (int c3 = 0; c3 <= 1535; c3 += 1) 301 Stmt_for_body8(c1, c3, c2); 302 } 303 304 else 305 { /* original code */ } 306 [...] 307 308 **Interchange + Tiling** 309 310 In addition to the interchange we now tile the second loop nest. 311 312 .. code-block:: console 313 314 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable 315 316 .. code-block:: c 317 318 [...] 319 :: isl ast :: main :: %for.cond1.preheader---%for.end30 320 321 if (1) 322 323 { 324 for (int c1 = 0; c1 <= 1535; c1 += 1) 325 for (int c2 = 0; c2 <= 1535; c2 += 1) 326 Stmt_for_body3(c1, c2); 327 for (int c1 = 0; c1 <= 1535; c1 += 64) 328 for (int c2 = 0; c2 <= 1535; c2 += 64) 329 for (int c3 = 0; c3 <= 1535; c3 += 64) 330 for (int c4 = c1; c4 <= c1 + 63; c4 += 1) 331 for (int c5 = c3; c5 <= c3 + 63; c5 += 1) 332 for (int c6 = c2; c6 <= c2 + 63; c6 += 1) 333 Stmt_for_body8(c4, c6, c5); 334 } 335 336 else 337 { /* original code */ } 338 [...] 339 340 341 **Interchange + Tiling + Strip-mining to prepare vectorization** 342 343 To later allow vectorization we create a so called trivially 344 parallelizable loop. It is innermost, parallel and has only four 345 iterations. It can be replaced by 4-element SIMD instructions. 346 347 .. code-block:: console 348 349 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable 350 351 .. code-block:: c 352 353 [...] 354 :: isl ast :: main :: %for.cond1.preheader---%for.end30 355 356 if (1) 357 358 { 359 for (int c1 = 0; c1 <= 1535; c1 += 1) 360 for (int c2 = 0; c2 <= 1535; c2 += 1) 361 Stmt_for_body3(c1, c2); 362 for (int c1 = 0; c1 <= 1535; c1 += 64) 363 for (int c2 = 0; c2 <= 1535; c2 += 64) 364 for (int c3 = 0; c3 <= 1535; c3 += 64) 365 for (int c4 = c1; c4 <= c1 + 63; c4 += 1) 366 for (int c5 = c3; c5 <= c3 + 63; c5 += 1) 367 for (int c6 = c2; c6 <= c2 + 63; c6 += 4) 368 for (int c7 = c6; c7 <= c6 + 3; c7 += 1) 369 Stmt_for_body8(c4, c7, c5); 370 } 371 372 else 373 { /* original code */ } 374 [...] 375 3769. **Codegenerate the SCoPs** 377----------------------------- 378 379 This generates new code for the SCoPs detected by polly. If 380 -polly-import-jscop is present, transformations specified in the 381 imported jscop files will be applied. 382 383 384 .. code-block:: console 385 386 $ opt -S matmul.preopt.ll | opt -S -O3 -o matmul.normalopt.ll 387 388 .. code-block:: console 389 390 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-codegen -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged.ll 391 392 .. code-block:: text 393 394 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged'. 395 File could not be read: No such file or directory 396 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged'. 397 398 .. code-block:: console 399 400 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-codegen -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged+tiled.ll 401 402 .. code-block:: text 403 404 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled'. 405 File could not be read: No such file or directory 406 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled'. 407 408 .. code-block:: console 409 410 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged+tiled+vector.ll 411 412 .. code-block:: text 413 414 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'. 415 File could not be read: No such file or directory 416 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'. 417 418 .. code-block:: console 419 420 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-parallel -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged+tiled+openmp.ll 421 422 .. code-block:: text 423 424 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'. 425 File could not be read: No such file or directory 426 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'. 427 428 42910. **Create the executables** 430------------------------------ 431 432 .. code-block:: console 433 434 $ llc matmul.normalopt.ll -o matmul.normalopt.s -relocation-model=pic 435 $ gcc matmul.normalopt.s -o matmul.normalopt.exe 436 $ llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s -relocation-model=pic 437 $ gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe 438 $ llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s -relocation-model=pic 439 $ gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe 440 $ llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s -relocation-model=pic 441 $ gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe 442 $ llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s -relocation-model=pic 443 $ gcc matmul.polly.interchanged+tiled+vector+openmp.s -lgomp -o matmul.polly.interchanged+tiled+vector+openmp.exe 444 44511. **Compare the runtime of the executables** 446---------------------------------------------- 447 448 By comparing the runtimes of the different code snippets we see that a 449 simple loop interchange gives here the largest performance boost. 450 However in this case, adding vectorization and using OpenMP degrades 451 the performance. 452 453 .. code-block:: console 454 455 $ time ./matmul.normalopt.exe 456 457 real 0m11.295s 458 user 0m11.288s 459 sys 0m0.004s 460 $ time ./matmul.polly.interchanged.exe 461 462 real 0m0.988s 463 user 0m0.980s 464 sys 0m0.008s 465 $ time ./matmul.polly.interchanged+tiled.exe 466 467 real 0m0.830s 468 user 0m0.816s 469 sys 0m0.012s 470 $ time ./matmul.polly.interchanged+tiled+vector.exe 471 472 real 0m5.430s 473 user 0m5.424s 474 sys 0m0.004s 475 $ time ./matmul.polly.interchanged+tiled+vector+openmp.exe 476 477 real 0m3.184s 478 user 0m11.972s 479 sys 0m0.036s 480 481