================
The Architecture
================

Polly is a loop optimizer for LLVM. Starting from LLVM-IR it detects and
extracts interesting loop kernels. For each kernel a mathematical model is
derived which precisely describes the individual computations and memory
accesses in the kernel. Within Polly a variety of analyses and code
transformations are performed on this mathematical model. After all
optimizations have been derived and applied, optimized LLVM-IR is regenerated
and inserted into the LLVM-IR module.

.. image:: images/architecture.png
    :align: center
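
To make the notion of an "interesting loop kernel" concrete, the following
minimal C sketch shows the kind of code Polly targets (the function name, the
parameter ``N``, and the array shapes are chosen purely for illustration): the
loop bounds and the array subscripts are affine expressions of the induction
variables and parameters, which is exactly the structure the mathematical
model can describe precisely.

.. code-block:: c

  /* Illustrative kernel: bounds and subscripts are affine in i, j, k and N,
     so every computation and memory access can be modeled exactly. */
  void matmul(int N, double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          C[i][j] += A[i][k] * B[k][j];
  }

Loop nests of this shape are what Polly extracts as kernels; all further
analyses and transformations are then performed on their mathematical model
rather than directly on the LLVM-IR.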

Polly in the LLVM pass pipeline
-------------------------------

The standard LLVM pass pipeline as it is used in the -O1/-O2/-O3 modes of
clang/opt consists of a sequence of passes that can be grouped into different
conceptual phases. The first phase, which we call **Canonicalization** here, is
a scalar canonicalization phase that contains passes like -mem2reg,
-instcombine, -cfgsimplify, or early loop unrolling. Its goal is to remove and
simplify the given IR as much as possible, focusing mostly on scalar
optimizations. The second phase consists of three conceptual groups that are
executed in the so-called **Inliner cycle**: a set of **Scalar Simplification**
passes, a set of **Simple Loop Optimizations**, and the **Inliner** itself.
Even though these passes make up the majority of the LLVM pass pipeline, their
primary goal is still canonicalization without losing semantic information,
which would complicate later analysis. As part of the inliner cycle, the LLVM
inliner step-by-step tries to inline functions, runs canonicalization passes
to exploit newly exposed simplification opportunities, and then tries to
inline the further simplified functions. Some simple loop optimizations are
executed as part of the inliner cycle. Even though they perform some
optimizations, their primary goal is still the simplification of the program
code. Loop invariant code motion is one such optimization: besides being
beneficial for program performance, it allows us to move computation out of
loops and in the best case enables us to eliminate certain loops completely.
Only after the inliner cycle has finished is a final **Target Specialization**
phase run, in which IR complexity is deliberately increased to take advantage
of target-specific features that maximize the execution performance on the
device we target. One of the principal optimizations in this phase is
vectorization, but it also includes target-specific loop unrolling and loop
transformations (e.g., distribution) that expose more vectorization
opportunities.

.. image:: images/LLVM-Passes-only.png
    :align: center

Polly can conceptually be run at three different positions in the pass
pipeline: as an early optimizer before the standard LLVM pass pipeline, as a
later optimizer as part of the target specialization sequence, and
theoretically also together with the loop optimizations in the inliner cycle.
We only discuss the first two options, as running Polly inside the inliner
cycle is likely to disturb the inliner and is consequently not a good idea.

.. image:: images/LLVM-Passes-all.png
    :align: center

Running Polly early, before the standard pass pipeline, has the benefit that
the LLVM-IR processed by Polly is still very close to the original input code.
Hence, it is less likely that transformations applied by LLVM change the IR in
ways not easily understandable for the programmer. As a result, user feedback
is likely better and it is less likely that kernels that in C seem a perfect
fit for Polly have been transformed such that Polly can no longer handle them.
On the other hand, code that requires inlining to be optimized won't benefit
if Polly is scheduled at this position. The additional set of canonicalization
passes required will result in a small but general compile-time increase and
some random run-time performance changes due to slightly different IR being
passed through the optimizers. To force Polly to run early in the pass
pipeline use the option *-polly-position=early* (the default today).

.. image:: images/LLVM-Passes-early.png
    :align: center
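
For illustration, with a clang that has been built with Polly support, Polly
and the early position can typically be enabled by forwarding the
corresponding LLVM options via *-mllvm*; the source file name below is a
placeholder and the exact invocation may differ between LLVM versions.

.. code-block:: console

  $ clang -O3 -mllvm -polly -mllvm -polly-position=early file.c -o file

Because *early* is the current default position, spelling it out explicitly is
redundant and only shown here for clarity.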

Running Polly right before the vectorizer has the benefit that the full
inlining cycle has been run and, as a result, even heavily templated C++ code
could theoretically benefit from Polly (more work is necessary to make Polly
really effective at this position). As the IR that is passed to Polly has
already been canonicalized, there is also no need to run additional
canonicalization passes. General compile time is almost unaffected by Polly,
as the detection of loop kernels is generally very fast and the actual
optimization and cleanup passes are only run on functions which contain loop
kernels that are worth optimizing. However, due to the many optimizations that
LLVM runs before Polly, the IR that reaches Polly often has additional scalar
dependences that make Polly a lot less effective. To force Polly to run before
the vectorizer in the pass pipeline use the option
*-polly-position=before-vectorizer*. This position is not yet the default for
Polly, but work is under way to make Polly effective even in the presence of
such scalar dependences. After this work has been completed, Polly will likely
use this position by default.

.. image:: images/LLVM-Passes-late.png
    :align: center
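
Analogously, the late position can be selected with the option just mentioned,
again assuming a Polly-enabled clang and using a placeholder file name:

.. code-block:: console

  $ clang -O3 -mllvm -polly -mllvm -polly-position=before-vectorizer file.c -o file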