1================== 2Vectorization Plan 3================== 4 5.. contents:: 6 :local: 7 8Abstract 9======== 10The vectorization transformation can be rather complicated, involving several 11potential alternatives, especially for outer-loops [1]_ but also possibly for 12innermost loops. These alternatives may have significant performance impact, 13both positive and negative. A cost model is therefore employed to identify the 14best alternative, including the alternative of avoiding any transformation 15altogether. 16 17The Vectorization Plan is an explicit model for describing vectorization 18candidates. It serves for both optimizing candidates including estimating their 19cost reliably, and for performing their final translation into IR. This 20facilitates dealing with multiple vectorization candidates. 21 22Current Status 23============== 24VPlan is currently used to drive code-generation in LoopVectorize. VPlans are 25constructed after all cost-based and most legality-related decisions have been 26taken. As def-use chains between recipes are now fully modeled in VPlan, 27VPlan-based analyses and transformations are used to simplify and modularize 28the vectorization process [10]_. Those include transformations to 29 301. Legalize the initial VPlan, e.g. by introducing specialized recipes for 31 reductions and interleave groups. 32 332. Optimize the legalized VPlan, e.g. by removing redundant recipes or 34 introducing active-lane-masks. 35 363. Apply unroll- and vectorization-factor specific optimizations, e.g. removing 37 the backedge to reiterate the vector loop based on VF & UF. 38 39Refer to :numref:`fig-vplan-transform-pipeline` for an overview of the current 40transformation pipeline. 41 42Note that some legality checks are already done in VPlan, including checking if 43all users of a fixed-order recurrence can be re-ordered. This is implemented as 44a VPlan-to-VPlan transformation that either applies a valid re-ordering or 45bails out marking the VPlan as invalid. 46 47.. _fig-vplan-transform-pipeline: 48.. figure:: ./vplan-transform-pipeline.png 49 :width: 800 px 50 51 VPlan Transformation Pipeline in 2024 52 53 54VPlan currently models the complete vector loop, as well as additional parts of 55the vectorization skeleton. Refer to :numref:`fig-vplan-scope` for an overview 56of the scope covered by VPlan. 57 58.. _fig-vplan-scope: 59.. figure:: ./vplan-scope.png 60 :width: 800 px 61 62 Scope modeled in VPlan in 2024 63 64 65High-level Design 66================= 67 68Vectorization Workflow 69---------------------- 70VPlan-based vectorization involves three major steps, taking a "scenario-based 71approach" to vectorization planning: 72 731. Legal Step: check if a loop can be legally vectorized; encode constraints and 74 artifacts if so. 752. Plan Step: 76 77 a. Build initial VPlans following the constraints and decisions taken by 78 Legal Step 1, and compute their cost. 79 b. Apply optimizations to the VPlans, possibly forking additional VPlans. 80 Prune sub-optimal VPlans having relatively high cost. 813. Execute Step: materialize the best VPlan. Note that this is the only step 82 that modifies the IR. 83 84Design Guidelines 85----------------- 86In what follows, the term "input IR" refers to code that is fed into the 87vectorizer whereas the term "output IR" refers to code that is generated by the 88vectorizer. The output IR contains code that has been vectorized or "widened" 89according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed 90according to an Unroll Factor (UF). 91The design of VPlan follows several high-level guidelines: 92 931. Analysis-like: building and manipulating VPlans must not modify the input IR. 94 In particular, if the best option is not to vectorize at all, the 95 vectorization process terminates before reaching Step 3, and compilation 96 should proceed as if VPlans had not been built. 97 982. Align Cost & Execute: each VPlan must support both estimating the cost and 99 generating the output IR code, such that the cost estimation evaluates the 100 to-be-generated code reliably. 101 1023. Support vectorizing additional constructs: 103 104 a. Outer-loop vectorization. In particular, VPlan must be able to model the 105 control-flow of the output IR which may include multiple basic-blocks and 106 nested loops. 107 b. SLP vectorization. 108 c. Combinations of the above, including nested vectorization: vectorizing 109 both an inner loop and an outer-loop at the same time (each with its own 110 VF and UF), mixed vectorization: vectorizing a loop with SLP patterns 111 inside [4]_, (re)vectorizing input IR containing vector code. 112 d. Function vectorization [2]_. 113 1144. Support multiple candidates efficiently. In particular, similar candidates 115 related to a range of possible VF's and UF's must be represented efficiently. 116 Potential versioning needs to be supported efficiently. 117 1185. Support vectorizing idioms, such as interleaved groups of strided loads or 119 stores. This is achieved by modeling a sequence of output instructions using 120 a "Recipe", which is responsible for computing its cost and generating its 121 code. 122 1236. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization 124 such regions may need to be, for example, predicated and linearized, or 125 replicated VF*UF times to handle scalarized and predicated instructions. 126 Innerloops are also modelled as SESE regions. 127 1287. Support instruction-level analysis and transformation, as part of Planning 129 Step 2.b: During vectorization instructions may need to be traversed, moved, 130 replaced by other instructions or be created. For example, vector idiom 131 detection and formation involves searching for and optimizing instruction 132 patterns. 133 134Definitions 135=========== 136The low-level design of VPlan comprises of the following classes. 137 138:LoopVectorizationPlanner: 139 A LoopVectorizationPlanner is designed to handle the vectorization of a loop 140 or a loop nest. It can construct, optimize and discard one or more VPlans, 141 each VPlan modelling a distinct way to vectorize the loop or the loop nest. 142 Once the best VPlan is determined, including the best VF and UF, this VPlan 143 drives the generation of output IR. 144 145:VPlan: 146 A model of a vectorized candidate for a given input IR loop or loop nest. This 147 candidate is represented using a Hierarchical CFG. VPlan supports estimating 148 the cost and driving the generation of the output IR code it represents. 149 150:Hierarchical CFG: 151 A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The 152 Hierarchical CFG data structure is similar to the Tile Tree [5]_, where 153 cross-Tile edges are lifted to connect Tiles instead of the original 154 basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms 155 Region and Block are used rather than Tile [5]_ to avoid confusion with loop 156 tiling. 157 158:VPBlockBase: 159 The building block of the Hierarchical CFG. A pure-virtual base-class of 160 VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical 161 control-flow relations with other VPBlocks. Note that in contrast to the IR 162 BasicBlock, a VPBlockBase models its control-flow successors and predecessors 163 directly, rather than through a Terminator branch or through predecessor 164 branches that "use" the VPBlockBase. 165 166:VPBasicBlock: 167 VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the 168 Hierarchical CFG. It represents a sequence of output IR instructions that will 169 appear consecutively in an output IR basic-block. The instructions of this 170 basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a 171 sequence of zero or more VPRecipes that model the cost and generation of the 172 output IR instructions. 173 174:VPRegionBlock: 175 VPRegionBlock is a subclass of VPBlockBase. It models a collection of 176 VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR 177 CFG. A VPRegionBlock may indicate that its contents are to be replicated a 178 constant number of times when output IR is generated, effectively representing 179 a loop with constant trip-count that will be completely unrolled. This is used 180 to support scalarized and predicated instructions with a single model for 181 multiple candidate VF's and UF's. 182 183:VPRecipeBase: 184 A pure-virtual base class modeling a sequence of one or more output IR 185 instructions, possibly based on one or more input IR instructions. These 186 input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe 187 may specify how its ingredients are to be transformed to produce the output IR 188 instructions; e.g., cloned once, replicated multiple times or widened 189 according to selected VF. 190 191:VPValue: 192 The base of VPlan's def-use relations class hierarchy. When instantiated, it 193 models a constant or a live-in Value in VPlan. It has users, which are of type 194 VPUser, but no operands. 195 196:VPUser: 197 A VPUser represents an entity that uses a number of VPValues as operands. 198 VPUser is similar in some aspects to LLVM's User class. 199 200:VPDef: 201 A VPDef represents an entity that defines zero, one or multiple VPValues. 202 It is used to model the fact that recipes in VPlan can define multiple 203 VPValues. 204 205:VPInstruction: 206 A VPInstruction is a recipe characterized by a single opcode and optional 207 flags, free of ingredients or other meta-data. VPInstructions also extend 208 LLVM IR's opcodes with idiomatic operations that enrich the Vectorizer's 209 semantics. 210 211:VPTransformState: 212 Stores information used for generating output IR, passed from 213 LoopVectorizationPlanner to its selected VPlan for execution, and used to pass 214 additional information down to VPBlocks and VPRecipes. 215 216The Planning Process and VPlan Roadmap 217====================================== 218 219Transforming the Loop Vectorizer to use VPlan follows a staged approach. First, 220VPlan was only used to record the final vectorization decisions, and to execute 221them: the Hierarchical CFG models the planned control-flow, and Recipes capture 222decisions taken inside basic-blocks. Currently, VPlan is used also as the basis 223for taking these decisions, effectively turning them into a series of 224VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process 225itself including cost-based analyses for making these decisions, to fully 226support compositional and iterative decision making. 227 228Some decisions are local to an instruction in the loop, such as whether to widen 229it into a vector instruction or replicate it, keeping the generated instructions 230in place. Other decisions, however, involve moving instructions, replacing them 231with other instructions, and/or introducing new instructions. For example, a 232cast may sink past a later instruction and be widened to handle first-order 233recurrence; an interleave group of strided gathers or scatters may effectively 234move to one place where they are replaced with shuffles and a common wide vector 235load or store; new instructions may be introduced to compute masks, shuffle the 236elements of vectors, and pack scalar values into vectors or vice-versa. 237 238In order for VPlan to support making instruction-level decisions and analyses, 239it needs to model the relevant instructions along with their def/use relations. 240This too follows a staged approach: first, the new instructions that compute 241masks are modeled as VPInstructions, along with their induced def/use subgraph. 242This effectively models masks in VPlan, facilitating VPlan-based predication. 243Next, the logic embedded within each Recipe for generating its instructions at 244VPlan execution time, will instead take part in the planning process by modeling 245them as VPInstructions. Finally, only logic that applies to instructions as a 246group will remain in Recipes, such as interleave groups and potentially other 247idiom groups having synergistic cost. 248 249Related LLVM components 250----------------------- 2511. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP 252 tree, where TSLP [3]_ adds Plan Step 2.b. 253 2542. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by 255 Polly [7]_. 256 2573. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of 258 the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_. 259 260References 261---------- 262.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit 263 Nuzman and Ayal Zaks, PACT 2008. 264 265.. [2] "Proposal for function vectorization and loop vectorization with function 266 calls", Xinmin Tian, [`cfe-dev 267 <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., 268 March 2, 2016. 269 See also `review <https://reviews.llvm.org/D22792>`_. 270 271.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios 272 Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. 273 274.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization 275 overhead", Hao Zhou and Jingling Xue, CGO 2016. 276 277.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and 278 Brian Koblenz, PLDI 1991 279 280.. [6] "Structural analysis: A new approach to flow analysis in optimizing 281 compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 282 283.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma 284 thesis, 2011. 285 286.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, 287 European LLVM Developers' Meeting 2017. 288 289.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop 290 Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016. 291 292.. [10] "VPlan: Status Update and Future Roadmap", Ayal Zaks and Florian Hahn, 293 LLVM Developers' Meeting 2023, https://www.youtube.com/watch?v=SzGP4PgMuLE 294