xref: /llvm-project/llvm/docs/VectorizationPlan.rst (revision 309a881dccb82bf1f101cf138bee3e7400968ed8)
1==================
2Vectorization Plan
3==================
4
5.. contents::
6   :local:
7
8Abstract
9========
10The vectorization transformation can be rather complicated, involving several
11potential alternatives, especially for outer-loops [1]_ but also possibly for
12innermost loops. These alternatives may have significant performance impact,
13both positive and negative. A cost model is therefore employed to identify the
14best alternative, including the alternative of avoiding any transformation
15altogether.
16
17The Vectorization Plan is an explicit model for describing vectorization
18candidates. It serves for both optimizing candidates including estimating their
19cost reliably, and for performing their final translation into IR. This
20facilitates dealing with multiple vectorization candidates.
21
22Current Status
23==============
24VPlan is currently used to drive code-generation in LoopVectorize. VPlans are
25constructed after all cost-based and most legality-related decisions have been
26taken. As def-use chains between recipes are now fully modeled in VPlan,
27VPlan-based analyses and transformations are used to simplify and modularize
28the vectorization process [10]_. Those include transformations to
29
301. Legalize the initial VPlan, e.g. by introducing specialized recipes for
31   reductions and interleave groups.
32
332. Optimize the legalized VPlan, e.g. by removing redundant recipes or
34   introducing active-lane-masks.
35
363. Apply unroll- and vectorization-factor specific optimizations, e.g. removing
37   the backedge to reiterate the vector loop based on VF & UF.
38
39Refer to :numref:`fig-vplan-transform-pipeline` for an overview of the current
40transformation pipeline.
41
42Note that some legality checks are already done in VPlan, including checking if
43all users of a fixed-order recurrence can be re-ordered. This is implemented as
44a VPlan-to-VPlan transformation that either applies a valid re-ordering or
45bails out marking the VPlan as invalid.
46
47.. _fig-vplan-transform-pipeline:
48.. figure:: ./vplan-transform-pipeline.png
49   :width: 800 px
50
51   VPlan Transformation Pipeline in 2024
52
53
54VPlan currently models the complete vector loop, as well as additional parts of
55the vectorization skeleton. Refer to :numref:`fig-vplan-scope` for an overview
56of the scope covered by VPlan.
57
58.. _fig-vplan-scope:
59.. figure:: ./vplan-scope.png
60   :width: 800 px
61
62   Scope modeled in VPlan in 2024
63
64
65High-level Design
66=================
67
68Vectorization Workflow
69----------------------
70VPlan-based vectorization involves three major steps, taking a "scenario-based
71approach" to vectorization planning:
72
731. Legal Step: check if a loop can be legally vectorized; encode constraints and
74   artifacts if so.
752. Plan Step:
76
77   a. Build initial VPlans following the constraints and decisions taken by
78      Legal Step 1, and compute their cost.
79   b. Apply optimizations to the VPlans, possibly forking additional VPlans.
80      Prune sub-optimal VPlans having relatively high cost.
813. Execute Step: materialize the best VPlan. Note that this is the only step
82   that modifies the IR.
83
84Design Guidelines
85-----------------
86In what follows, the term "input IR" refers to code that is fed into the
87vectorizer whereas the term "output IR" refers to code that is generated by the
88vectorizer. The output IR contains code that has been vectorized or "widened"
89according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
90according to an Unroll Factor (UF).
91The design of VPlan follows several high-level guidelines:
92
931. Analysis-like: building and manipulating VPlans must not modify the input IR.
94   In particular, if the best option is not to vectorize at all, the
95   vectorization process terminates before reaching Step 3, and compilation
96   should proceed as if VPlans had not been built.
97
982. Align Cost & Execute: each VPlan must support both estimating the cost and
99   generating the output IR code, such that the cost estimation evaluates the
100   to-be-generated code reliably.
101
1023. Support vectorizing additional constructs:
103
104   a. Outer-loop vectorization. In particular, VPlan must be able to model the
105      control-flow of the output IR which may include multiple basic-blocks and
106      nested loops.
107   b. SLP vectorization.
108   c. Combinations of the above, including nested vectorization: vectorizing
109      both an inner loop and an outer-loop at the same time (each with its own
110      VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
111      inside [4]_, (re)vectorizing input IR containing vector code.
112   d. Function vectorization [2]_.
113
1144. Support multiple candidates efficiently. In particular, similar candidates
115   related to a range of possible VF's and UF's must be represented efficiently.
116   Potential versioning needs to be supported efficiently.
117
1185. Support vectorizing idioms, such as interleaved groups of strided loads or
119   stores. This is achieved by modeling a sequence of output instructions using
120   a "Recipe", which is responsible for computing its cost and generating its
121   code.
122
1236. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
124   such regions may need to be, for example, predicated and linearized, or
125   replicated VF*UF times to handle scalarized and predicated instructions.
126   Innerloops are also modelled as SESE regions.
127
1287. Support instruction-level analysis and transformation, as part of Planning
129   Step 2.b: During vectorization instructions may need to be traversed, moved,
130   replaced by other instructions or be created. For example, vector idiom
131   detection and formation involves searching for and optimizing instruction
132   patterns.
133
134Definitions
135===========
136The low-level design of VPlan comprises of the following classes.
137
138:LoopVectorizationPlanner:
139  A LoopVectorizationPlanner is designed to handle the vectorization of a loop
140  or a loop nest. It can construct, optimize and discard one or more VPlans,
141  each VPlan modelling a distinct way to vectorize the loop or the loop nest.
142  Once the best VPlan is determined, including the best VF and UF, this VPlan
143  drives the generation of output IR.
144
145:VPlan:
146  A model of a vectorized candidate for a given input IR loop or loop nest. This
147  candidate is represented using a Hierarchical CFG. VPlan supports estimating
148  the cost and driving the generation of the output IR code it represents.
149
150:Hierarchical CFG:
151  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
152  Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
153  cross-Tile edges are lifted to connect Tiles instead of the original
154  basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
155  Region and Block are used rather than Tile [5]_ to avoid confusion with loop
156  tiling.
157
158:VPBlockBase:
159  The building block of the Hierarchical CFG. A pure-virtual base-class of
160  VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
161  control-flow relations with other VPBlocks. Note that in contrast to the IR
162  BasicBlock, a VPBlockBase models its control-flow successors and predecessors
163  directly, rather than through a Terminator branch or through predecessor
164  branches that "use" the VPBlockBase.
165
166:VPBasicBlock:
167  VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
168  Hierarchical CFG. It represents a sequence of output IR instructions that will
169  appear consecutively in an output IR basic-block. The instructions of this
170  basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
171  sequence of zero or more VPRecipes that model the cost and generation of the
172  output IR instructions.
173
174:VPRegionBlock:
175  VPRegionBlock is a subclass of VPBlockBase. It models a collection of
176  VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
177  CFG. A VPRegionBlock may indicate that its contents are to be replicated a
178  constant number of times when output IR is generated, effectively representing
179  a loop with constant trip-count that will be completely unrolled. This is used
180  to support scalarized and predicated instructions with a single model for
181  multiple candidate VF's and UF's.
182
183:VPRecipeBase:
184  A pure-virtual base class modeling a sequence of one or more output IR
185  instructions, possibly based on one or more input IR instructions. These
186  input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
187  may specify how its ingredients are to be transformed to produce the output IR
188  instructions; e.g., cloned once, replicated multiple times or widened
189  according to selected VF.
190
191:VPValue:
192  The base of VPlan's def-use relations class hierarchy. When instantiated, it
193  models a constant or a live-in Value in VPlan. It has users, which are of type
194  VPUser, but no operands.
195
196:VPUser:
197  A VPUser represents an entity that uses a number of VPValues as operands.
198  VPUser is similar in some aspects to LLVM's User class.
199
200:VPDef:
201  A VPDef represents an entity that defines zero, one or multiple VPValues.
202  It is used to model the fact that recipes in VPlan can define multiple
203  VPValues.
204
205:VPInstruction:
206  A VPInstruction is a recipe characterized by a single opcode and optional
207  flags, free of ingredients or other meta-data. VPInstructions also extend
208  LLVM IR's opcodes with idiomatic operations that enrich the Vectorizer's
209  semantics.
210
211:VPTransformState:
212  Stores information used for generating output IR, passed from
213  LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
214  additional information down to VPBlocks and VPRecipes.
215
216The Planning Process and VPlan Roadmap
217======================================
218
219Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
220VPlan was only used to record the final vectorization decisions, and to execute
221them: the Hierarchical CFG models the planned control-flow, and Recipes capture
222decisions taken inside basic-blocks. Currently, VPlan is used also as the basis
223for taking these decisions, effectively turning them into a series of
224VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
225itself including cost-based analyses for making these decisions, to fully
226support compositional and iterative decision making.
227
228Some decisions are local to an instruction in the loop, such as whether to widen
229it into a vector instruction or replicate it, keeping the generated instructions
230in place. Other decisions, however, involve moving instructions, replacing them
231with other instructions, and/or introducing new instructions. For example, a
232cast may sink past a later instruction and be widened to handle first-order
233recurrence; an interleave group of strided gathers or scatters may effectively
234move to one place where they are replaced with shuffles and a common wide vector
235load or store; new instructions may be introduced to compute masks, shuffle the
236elements of vectors, and pack scalar values into vectors or vice-versa.
237
238In order for VPlan to support making instruction-level decisions and analyses,
239it needs to model the relevant instructions along with their def/use relations.
240This too follows a staged approach: first, the new instructions that compute
241masks are modeled as VPInstructions, along with their induced def/use subgraph.
242This effectively models masks in VPlan, facilitating VPlan-based predication.
243Next, the logic embedded within each Recipe for generating its instructions at
244VPlan execution time, will instead take part in the planning process by modeling
245them as VPInstructions. Finally, only logic that applies to instructions as a
246group will remain in Recipes, such as interleave groups and potentially other
247idiom groups having synergistic cost.
248
249Related LLVM components
250-----------------------
2511. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
252   tree, where TSLP [3]_ adds Plan Step 2.b.
253
2542. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
255   Polly [7]_.
256
2573. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
258   the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.
259
260References
261----------
262.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
263    Nuzman and Ayal Zaks, PACT 2008.
264
265.. [2] "Proposal for function vectorization and loop vectorization with function
266    calls", Xinmin Tian, [`cfe-dev
267    <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
268    March 2, 2016.
269    See also `review <https://reviews.llvm.org/D22792>`_.
270
271.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
272    Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
273
274.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
275    overhead", Hao Zhou and Jingling Xue, CGO 2016.
276
277.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
278    Brian Koblenz, PLDI 1991
279
280.. [6] "Structural analysis: A new approach to flow analysis in optimizing
281    compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
282
283.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
284    thesis, 2011.
285
286.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
287    European LLVM Developers' Meeting 2017.
288
289.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
290    Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
291
292.. [10] "VPlan: Status Update and Future Roadmap", Ayal Zaks and Florian Hahn,
293    LLVM Developers' Meeting 2023, https://www.youtube.com/watch?v=SzGP4PgMuLE
294