xref: /llvm-project/polly/docs/Architecture.rst (revision 5aafc6d58f3405662902cee006be11e599801b88)
================
The Architecture
================

Polly is a loop optimizer for LLVM. Starting from LLVM-IR, it detects and
extracts interesting loop kernels. For each kernel, a mathematical model is
derived that precisely describes the individual computations and memory
accesses in the kernel. Within Polly, a variety of analyses and code
transformations are performed on this mathematical model. After all
optimizations have been derived and applied, optimized LLVM-IR is regenerated
and inserted into the LLVM-IR module.

.. image:: images/architecture.png
    :align: center

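The kernels Polly extracts are so-called static control parts (SCoPs): loop
nests whose bounds, conditions, and memory-access subscripts are affine
expressions of the loop induction variables and parameters, which is what makes
a precise mathematical model possible. A hypothetical sketch of such a kernel
(not taken from the Polly test suite):

```c
#define N 64

/* A classic SCoP: the loop bounds and the A[i][k], B[k][j], C[i][j]
 * subscripts are all affine in i, j, and k, so every computation and
 * memory access of this nest can be modeled exactly. */
static void matmul(float A[N][N], float B[N][N], float C[N][N]) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      C[i][j] = 0.0f;
      for (int k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[k][j];
    }
}

/* Small self-check: multiply the identity matrix by B and probe one
 * element; I * B == B, so the result is B[2][3] == 2 + 3 == 5. */
float matmul_check(void) {
  static float A[N][N], B[N][N], C[N][N];
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = (i == j) ? 1.0f : 0.0f; /* identity matrix */
      B[i][j] = (float)(i + j);
    }
  matmul(A, B, C);
  return C[2][3];
}
```

Compiling such a file with Polly enabled (e.g. ``clang -O3 -mllvm -polly
file.c``) lets Polly extract the nest, optimize the model, and regenerate the
IR.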
Polly in the LLVM pass pipeline
-------------------------------

The standard LLVM pass pipeline, as used in the -O1/-O2/-O3 modes of clang/opt,
consists of a sequence of passes that can be grouped into different conceptual
phases. The first phase, which we call **Canonicalization** here, is a scalar
canonicalization phase that contains passes like -mem2reg, -instcombine,
-cfgsimplify, or early loop unrolling. Its goal is to remove and simplify the
given IR as much as possible, focusing mostly on scalar optimizations. The
second phase consists of three conceptual groups that are executed in the
so-called **Inliner cycle**: a set of **Scalar Simplification** passes, a set
of **Simple Loop Optimizations**, and the **Inliner** itself. Even though these
passes make up the majority of the LLVM pass pipeline, their primary goal is
still canonicalization without losing semantic information that complicates
later analysis. As part of the inliner cycle, the LLVM inliner step-by-step
tries to inline functions, runs canonicalization passes to exploit newly
exposed simplification opportunities, and then tries to inline the further
simplified functions. Some simple loop optimizations are executed as part of
the inliner cycle. Even though they perform some optimizations, their primary
goal is still the simplification of the program code. Loop invariant code
motion is one such optimization that, besides being beneficial for program
performance, also allows us to move computation out of loops and in the best
case enables us to eliminate certain loops completely. Only after the inliner
cycle has finished is a last **Target Specialization** phase run, where IR
complexity is deliberately increased to take advantage of target-specific
features that maximize execution performance on the device we target. One of
the principal optimizations in this phase is vectorization, but also
target-specific loop unrolling, or some loop transformations (e.g.,
distribution) that expose more vectorization opportunities.
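
The effect of loop invariant code motion can be illustrated at the source
level. This is a hypothetical C sketch of the transformation; the pass itself
operates on LLVM-IR, not on C:

```c
/* Before LICM: the product n * n is recomputed on every iteration,
 * even though its value never changes inside the loop. */
int sum_scaled(const int *v, int n) {
  int s = 0;
  for (int i = 0; i < n; i++)
    s += v[i] * (n * n); /* loop-invariant subexpression */
  return s;
}

/* What LICM conceptually produces: the invariant product is hoisted
 * in front of the loop, shrinking the loop body. If hoisting empties
 * a loop body completely, later passes can delete the loop. */
int sum_scaled_licm(const int *v, int n) {
  int nn = n * n; /* hoisted out of the loop */
  int s = 0;
  for (int i = 0; i < n; i++)
    s += v[i] * nn;
  return s;
}
```

Both functions compute the same result; only the amount of work per iteration
differs.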

.. image:: images/LLVM-Passes-only.png
    :align: center

Polly can conceptually be run at three different positions in the pass
pipeline: as an early optimizer before the standard LLVM pass pipeline, as a
late optimizer as part of the target specialization sequence, and theoretically
also together with the loop optimizations in the inliner cycle. We only discuss
the first two options, as running Polly in the inliner cycle is likely to
disturb the inliner and is consequently not a good idea.

.. image:: images/LLVM-Passes-all.png
    :align: center

Running Polly early, before the standard pass pipeline, has the benefit that
the LLVM-IR processed by Polly is still very close to the original input code.
Hence, it is less likely that transformations applied by LLVM change the IR in
ways not easily understandable for the programmer. As a result, user feedback
is likely better, and it is less likely that kernels that seem a perfect fit
for Polly in C have been transformed such that Polly cannot handle them any
more. On the other hand, code that requires inlining to be optimized will not
benefit if Polly is scheduled at this position. The additional set of
canonicalization passes required will result in a small but general compile
time increase and some random run-time performance changes due to slightly
different IR being passed through the optimizers. To force Polly to run early
in the pass pipeline, use the option *-polly-position=early* (the default
today).

.. image:: images/LLVM-Passes-early.png
    :align: center

Running Polly right before the vectorizer has the benefit that the full
inlining cycle has been run, and as a result even heavily templated C++ code
could theoretically benefit from Polly (more work is necessary to make Polly
really effective here). As the IR that is passed to Polly has already been
canonicalized, there is also no need to run additional canonicalization passes.
General compile time is almost unaffected by Polly, as the detection of loop
kernels is generally very fast, and the actual optimization and cleanup passes
are only run on functions that contain loop kernels worth optimizing. However,
due to the many optimizations that LLVM runs before Polly, the IR that reaches
Polly often has additional scalar dependences that make Polly a lot less
effective. To force Polly to run before the vectorizer in the pass pipeline,
use the option *-polly-position=before-vectorizer*. This position is not yet
the default for Polly, but work is underway to make it effective even in the
presence of scalar dependences. After this work has been completed, Polly will
likely use this position by default.
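
A hypothetical source-level sketch of such a scalar dependence; in the IR the
dependence typically appears after passes like -mem2reg have promoted memory
to registers:

```c
/* The running sum s induces a loop-carried scalar dependence:
 * iteration i reads the value that iteration i - 1 wrote. Unlike an
 * affine array access, such a scalar chain serializes the iterations
 * and restricts the reorderings a polyhedral model may legally apply,
 * unless the optimizer recognizes it as a reduction. */
float dot(const float *a, const float *b, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}
```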

.. image:: images/LLVM-Passes-late.png
    :align: center