================
The Architecture
================

Polly is a loop optimizer for LLVM. Starting from LLVM-IR it detects and
extracts interesting loop kernels. For each kernel a mathematical model is
derived which precisely describes the individual computations and memory
accesses in the kernel. Within Polly a variety of analyses and code
transformations are performed on this mathematical model. After all
optimizations have been derived and applied, optimized LLVM-IR is regenerated
and inserted into the LLVM-IR module.

.. image:: images/architecture.png
    :align: center
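
To make the notion of an "interesting loop kernel" concrete, the following
minimal C sketch shows the kind of code Polly targets (the function name, the
parameter ``N``, and the array shapes are chosen purely for illustration): the
loop bounds and the array subscripts are affine expressions of the induction
variables and parameters, which is exactly the structure the mathematical
model can describe precisely.

.. code-block:: c

  /* Illustrative kernel: bounds and subscripts are affine in i, j, k and N,
     so every computation and memory access can be modeled exactly. */
  void matmul(int N, double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          C[i][j] += A[i][k] * B[k][j];
  }

Loop nests of this shape are what Polly extracts as kernels; all further
analyses and transformations are then performed on their mathematical model
rather than directly on the LLVM-IR.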

Polly in the LLVM pass pipeline
-------------------------------

The standard LLVM pass pipeline as it is used in the -O1/-O2/-O3 modes of
clang/opt consists of a sequence of passes that can be grouped into different
conceptual phases. The first phase, which we call **Canonicalization** here, is
a scalar canonicalization phase that contains passes like -mem2reg,
-instcombine, -cfgsimplify, or early loop unrolling. Its goal is to remove and
simplify the given IR as much as possible, focusing mostly on scalar
optimizations. The second phase consists of three conceptual groups that are
executed in the so-called **Inliner cycle**: a set of **Scalar Simplification**
passes, a set of **Simple Loop Optimizations**, and the **Inliner** itself.
Even though these passes make up the majority of the LLVM pass pipeline, their
primary goal is still canonicalization without losing semantic information,
which would complicate later analysis. As part of the inliner cycle, the LLVM
inliner step-by-step tries to inline functions, runs canonicalization passes
to exploit newly exposed simplification opportunities, and then tries to
inline the further simplified functions. Some simple loop optimizations are
executed as part of the inliner cycle. Even though they perform some
optimizations, their primary goal is still the simplification of the program
code. Loop invariant code motion is one such optimization: besides being
beneficial for program performance, it allows us to move computation out of
loops and in the best case enables us to eliminate certain loops completely.
Only after the inliner cycle has finished is a final **Target Specialization**
phase run, in which IR complexity is deliberately increased to take advantage
of target-specific features that maximize the execution performance on the
device we target. One of the principal optimizations in this phase is
vectorization, but it also includes target-specific loop unrolling and loop
transformations (e.g., distribution) that expose more vectorization
opportunities.

.. image:: images/LLVM-Passes-only.png
    :align: center

Polly can conceptually be run at three different positions in the pass
pipeline: as an early optimizer before the standard LLVM pass pipeline, as a
later optimizer as part of the target specialization sequence, and
theoretically also together with the loop optimizations in the inliner cycle.
We only discuss the first two options, as running Polly inside the inliner
cycle is likely to disturb the inliner and is consequently not a good idea.

.. image:: images/LLVM-Passes-all.png
    :align: center

Running Polly early, before the standard pass pipeline, has the benefit that
the LLVM-IR processed by Polly is still very close to the original input code.
Hence, it is less likely that transformations applied by LLVM change the IR in
ways not easily understandable for the programmer. As a result, user feedback
is likely better and it is less likely that kernels that in C seem a perfect
fit for Polly have been transformed such that Polly can no longer handle them.
On the other hand, code that requires inlining to be optimized won't benefit
if Polly is scheduled at this position. The additional set of canonicalization
passes required will result in a small but general compile-time increase and
some random run-time performance changes due to slightly different IR being
passed through the optimizers. To force Polly to run early in the pass
pipeline use the option *-polly-position=early* (the default today).

.. image:: images/LLVM-Passes-early.png
    :align: center
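
For illustration, with a clang that has been built with Polly support, Polly
and the early position can typically be enabled by forwarding the
corresponding LLVM options via *-mllvm*; the source file name below is a
placeholder and the exact invocation may differ between LLVM versions.

.. code-block:: console

  $ clang -O3 -mllvm -polly -mllvm -polly-position=early file.c -o file

Because *early* is the current default position, spelling it out explicitly is
redundant and only shown here for clarity.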

Running Polly right before the vectorizer has the benefit that the full
inlining cycle has been run and, as a result, even heavily templated C++ code
could theoretically benefit from Polly (more work is necessary to make Polly
really effective at this position). As the IR that is passed to Polly has
already been canonicalized, there is also no need to run additional
canonicalization passes. General compile time is almost unaffected by Polly,
as the detection of loop kernels is generally very fast and the actual
optimization and cleanup passes are only run on functions which contain loop
kernels that are worth optimizing. However, due to the many optimizations that
LLVM runs before Polly, the IR that reaches Polly often has additional scalar
dependences that make Polly a lot less effective. To force Polly to run before
the vectorizer in the pass pipeline use the option
*-polly-position=before-vectorizer*. This position is not yet the default for
Polly, but work is under way to make Polly effective even in the presence of
such scalar dependences. After this work has been completed, Polly will likely
use this position by default.

.. image:: images/LLVM-Passes-late.png
    :align: center
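
Analogously, the late position can be selected with the option just mentioned,
again assuming a Polly-enabled clang and using a placeholder file name:

.. code-block:: console

  $ clang -O3 -mllvm -polly -mllvm -polly-position=before-vectorizer file.c -o file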