1d32e68aeSFlorian Hahn================== 2d32e68aeSFlorian HahnVectorization Plan 3d32e68aeSFlorian Hahn================== 4d32e68aeSFlorian Hahn 5d32e68aeSFlorian Hahn.. contents:: 6d32e68aeSFlorian Hahn :local: 7d32e68aeSFlorian Hahn 8d32e68aeSFlorian HahnAbstract 9d32e68aeSFlorian Hahn======== 10d32e68aeSFlorian HahnThe vectorization transformation can be rather complicated, involving several 11d32e68aeSFlorian Hahnpotential alternatives, especially for outer-loops [1]_ but also possibly for 12d32e68aeSFlorian Hahninnermost loops. These alternatives may have significant performance impact, 13d32e68aeSFlorian Hahnboth positive and negative. A cost model is therefore employed to identify the 14d32e68aeSFlorian Hahnbest alternative, including the alternative of avoiding any transformation 15d32e68aeSFlorian Hahnaltogether. 16d32e68aeSFlorian Hahn 17d32e68aeSFlorian HahnThe Vectorization Plan is an explicit model for describing vectorization 18d32e68aeSFlorian Hahncandidates. It serves for both optimizing candidates including estimating their 19d32e68aeSFlorian Hahncost reliably, and for performing their final translation into IR. This 20d32e68aeSFlorian Hahnfacilitates dealing with multiple vectorization candidates. 21d32e68aeSFlorian Hahn 2299de3a68SFlorian HahnCurrent Status 2399de3a68SFlorian Hahn============== 2499de3a68SFlorian HahnVPlan is currently used to drive code-generation in LoopVectorize. VPlans are 2599de3a68SFlorian Hahnconstructed after all cost-based and most legality-related decisions have been 2699de3a68SFlorian Hahntaken. As def-use chains between recipes are now fully modeled in VPlan, 2799de3a68SFlorian HahnVPlan-based analyses and transformations are used to simplify and modularize 2899de3a68SFlorian Hahnthe vectorization process [10]_. Those include transformations to 2999de3a68SFlorian Hahn 3099de3a68SFlorian Hahn1. Legalize the initial VPlan, e.g. by introducing specialized recipes for 3199de3a68SFlorian Hahn reductions and interleave groups. 3299de3a68SFlorian Hahn 3399de3a68SFlorian Hahn2. Optimize the legalized VPlan, e.g. by removing redundant recipes or 3499de3a68SFlorian Hahn introducing active-lane-masks. 3599de3a68SFlorian Hahn 3699de3a68SFlorian Hahn3. Apply unroll- and vectorization-factor specific optimizations, e.g. removing 3799de3a68SFlorian Hahn the backedge to reiterate the vector loop based on VF & UF. 3899de3a68SFlorian Hahn 3999de3a68SFlorian HahnRefer to :numref:`fig-vplan-transform-pipeline` for an overview of the current 4099de3a68SFlorian Hahntransformation pipeline. 4199de3a68SFlorian Hahn 4299de3a68SFlorian HahnNote that some legality checks are already done in VPlan, including checking if 4399de3a68SFlorian Hahnall users of a fixed-order recurrence can be re-ordered. This is implemented as 4499de3a68SFlorian Hahna VPlan-to-VPlan transformation that either applies a valid re-ordering or 4599de3a68SFlorian Hahnbails out marking the VPlan as invalid. 4699de3a68SFlorian Hahn 4799de3a68SFlorian Hahn.. _fig-vplan-transform-pipeline: 4899de3a68SFlorian Hahn.. figure:: ./vplan-transform-pipeline.png 4999de3a68SFlorian Hahn :width: 800 px 5099de3a68SFlorian Hahn 5199de3a68SFlorian Hahn VPlan Transformation Pipeline in 2024 5299de3a68SFlorian Hahn 5399de3a68SFlorian Hahn 5499de3a68SFlorian HahnVPlan currently models the complete vector loop, as well as additional parts of 5599de3a68SFlorian Hahnthe vectorization skeleton. Refer to :numref:`fig-vplan-scope` for an overview 5699de3a68SFlorian Hahnof the scope covered by VPlan. 5799de3a68SFlorian Hahn 5899de3a68SFlorian Hahn.. _fig-vplan-scope: 5999de3a68SFlorian Hahn.. figure:: ./vplan-scope.png 6099de3a68SFlorian Hahn :width: 800 px 6199de3a68SFlorian Hahn 6299de3a68SFlorian Hahn Scope modeled in VPlan in 2024 6399de3a68SFlorian Hahn 6499de3a68SFlorian Hahn 65d32e68aeSFlorian HahnHigh-level Design 66d32e68aeSFlorian Hahn================= 67d32e68aeSFlorian Hahn 68d32e68aeSFlorian HahnVectorization Workflow 69d32e68aeSFlorian Hahn---------------------- 70d32e68aeSFlorian HahnVPlan-based vectorization involves three major steps, taking a "scenario-based 71d32e68aeSFlorian Hahnapproach" to vectorization planning: 72d32e68aeSFlorian Hahn 73d32e68aeSFlorian Hahn1. Legal Step: check if a loop can be legally vectorized; encode constraints and 74d32e68aeSFlorian Hahn artifacts if so. 75d32e68aeSFlorian Hahn2. Plan Step: 76d32e68aeSFlorian Hahn 77d32e68aeSFlorian Hahn a. Build initial VPlans following the constraints and decisions taken by 78d32e68aeSFlorian Hahn Legal Step 1, and compute their cost. 79d32e68aeSFlorian Hahn b. Apply optimizations to the VPlans, possibly forking additional VPlans. 80d32e68aeSFlorian Hahn Prune sub-optimal VPlans having relatively high cost. 81d32e68aeSFlorian Hahn3. Execute Step: materialize the best VPlan. Note that this is the only step 82d32e68aeSFlorian Hahn that modifies the IR. 83d32e68aeSFlorian Hahn 84d32e68aeSFlorian HahnDesign Guidelines 85d32e68aeSFlorian Hahn----------------- 86d32e68aeSFlorian HahnIn what follows, the term "input IR" refers to code that is fed into the 87d32e68aeSFlorian Hahnvectorizer whereas the term "output IR" refers to code that is generated by the 88d32e68aeSFlorian Hahnvectorizer. The output IR contains code that has been vectorized or "widened" 89d32e68aeSFlorian Hahnaccording to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed 90d32e68aeSFlorian Hahnaccording to an Unroll Factor (UF). 91d32e68aeSFlorian HahnThe design of VPlan follows several high-level guidelines: 92d32e68aeSFlorian Hahn 93d32e68aeSFlorian Hahn1. Analysis-like: building and manipulating VPlans must not modify the input IR. 94d32e68aeSFlorian Hahn In particular, if the best option is not to vectorize at all, the 95d32e68aeSFlorian Hahn vectorization process terminates before reaching Step 3, and compilation 96d32e68aeSFlorian Hahn should proceed as if VPlans had not been built. 97d32e68aeSFlorian Hahn 98d32e68aeSFlorian Hahn2. Align Cost & Execute: each VPlan must support both estimating the cost and 99d32e68aeSFlorian Hahn generating the output IR code, such that the cost estimation evaluates the 100d32e68aeSFlorian Hahn to-be-generated code reliably. 101d32e68aeSFlorian Hahn 102d32e68aeSFlorian Hahn3. Support vectorizing additional constructs: 103d32e68aeSFlorian Hahn 104d32e68aeSFlorian Hahn a. Outer-loop vectorization. In particular, VPlan must be able to model the 105d32e68aeSFlorian Hahn control-flow of the output IR which may include multiple basic-blocks and 106d32e68aeSFlorian Hahn nested loops. 107d32e68aeSFlorian Hahn b. SLP vectorization. 108d32e68aeSFlorian Hahn c. Combinations of the above, including nested vectorization: vectorizing 109d32e68aeSFlorian Hahn both an inner loop and an outer-loop at the same time (each with its own 110d32e68aeSFlorian Hahn VF and UF), mixed vectorization: vectorizing a loop with SLP patterns 111d32e68aeSFlorian Hahn inside [4]_, (re)vectorizing input IR containing vector code. 112d32e68aeSFlorian Hahn d. Function vectorization [2]_. 113d32e68aeSFlorian Hahn 114d32e68aeSFlorian Hahn4. Support multiple candidates efficiently. In particular, similar candidates 115d32e68aeSFlorian Hahn related to a range of possible VF's and UF's must be represented efficiently. 116d32e68aeSFlorian Hahn Potential versioning needs to be supported efficiently. 117d32e68aeSFlorian Hahn 118d32e68aeSFlorian Hahn5. Support vectorizing idioms, such as interleaved groups of strided loads or 119d32e68aeSFlorian Hahn stores. This is achieved by modeling a sequence of output instructions using 120d32e68aeSFlorian Hahn a "Recipe", which is responsible for computing its cost and generating its 121d32e68aeSFlorian Hahn code. 122d32e68aeSFlorian Hahn 123d32e68aeSFlorian Hahn6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization 124d32e68aeSFlorian Hahn such regions may need to be, for example, predicated and linearized, or 125d32e68aeSFlorian Hahn replicated VF*UF times to handle scalarized and predicated instructions. 126d32e68aeSFlorian Hahn Innerloops are also modelled as SESE regions. 127d32e68aeSFlorian Hahn 128d32e68aeSFlorian Hahn7. Support instruction-level analysis and transformation, as part of Planning 129d32e68aeSFlorian Hahn Step 2.b: During vectorization instructions may need to be traversed, moved, 130d32e68aeSFlorian Hahn replaced by other instructions or be created. For example, vector idiom 131d32e68aeSFlorian Hahn detection and formation involves searching for and optimizing instruction 132d32e68aeSFlorian Hahn patterns. 133d32e68aeSFlorian Hahn 134d32e68aeSFlorian HahnDefinitions 135d32e68aeSFlorian Hahn=========== 136d32e68aeSFlorian HahnThe low-level design of VPlan comprises of the following classes. 137d32e68aeSFlorian Hahn 138d32e68aeSFlorian Hahn:LoopVectorizationPlanner: 139d32e68aeSFlorian Hahn A LoopVectorizationPlanner is designed to handle the vectorization of a loop 140d32e68aeSFlorian Hahn or a loop nest. It can construct, optimize and discard one or more VPlans, 141d32e68aeSFlorian Hahn each VPlan modelling a distinct way to vectorize the loop or the loop nest. 142d32e68aeSFlorian Hahn Once the best VPlan is determined, including the best VF and UF, this VPlan 143d32e68aeSFlorian Hahn drives the generation of output IR. 144d32e68aeSFlorian Hahn 145d32e68aeSFlorian Hahn:VPlan: 146d32e68aeSFlorian Hahn A model of a vectorized candidate for a given input IR loop or loop nest. This 147d32e68aeSFlorian Hahn candidate is represented using a Hierarchical CFG. VPlan supports estimating 148d32e68aeSFlorian Hahn the cost and driving the generation of the output IR code it represents. 149d32e68aeSFlorian Hahn 150d32e68aeSFlorian Hahn:Hierarchical CFG: 151d32e68aeSFlorian Hahn A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The 152d32e68aeSFlorian Hahn Hierarchical CFG data structure is similar to the Tile Tree [5]_, where 153d32e68aeSFlorian Hahn cross-Tile edges are lifted to connect Tiles instead of the original 154d32e68aeSFlorian Hahn basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms 155d32e68aeSFlorian Hahn Region and Block are used rather than Tile [5]_ to avoid confusion with loop 156d32e68aeSFlorian Hahn tiling. 157d32e68aeSFlorian Hahn 158d32e68aeSFlorian Hahn:VPBlockBase: 159d32e68aeSFlorian Hahn The building block of the Hierarchical CFG. A pure-virtual base-class of 160d32e68aeSFlorian Hahn VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical 161d32e68aeSFlorian Hahn control-flow relations with other VPBlocks. Note that in contrast to the IR 162d32e68aeSFlorian Hahn BasicBlock, a VPBlockBase models its control-flow successors and predecessors 163d32e68aeSFlorian Hahn directly, rather than through a Terminator branch or through predecessor 164d32e68aeSFlorian Hahn branches that "use" the VPBlockBase. 165d32e68aeSFlorian Hahn 166d32e68aeSFlorian Hahn:VPBasicBlock: 167d32e68aeSFlorian Hahn VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the 168d32e68aeSFlorian Hahn Hierarchical CFG. It represents a sequence of output IR instructions that will 169d32e68aeSFlorian Hahn appear consecutively in an output IR basic-block. The instructions of this 170d32e68aeSFlorian Hahn basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a 171d32e68aeSFlorian Hahn sequence of zero or more VPRecipes that model the cost and generation of the 172d32e68aeSFlorian Hahn output IR instructions. 173d32e68aeSFlorian Hahn 174d32e68aeSFlorian Hahn:VPRegionBlock: 175d32e68aeSFlorian Hahn VPRegionBlock is a subclass of VPBlockBase. It models a collection of 176d32e68aeSFlorian Hahn VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR 177d32e68aeSFlorian Hahn CFG. A VPRegionBlock may indicate that its contents are to be replicated a 178d32e68aeSFlorian Hahn constant number of times when output IR is generated, effectively representing 179d32e68aeSFlorian Hahn a loop with constant trip-count that will be completely unrolled. This is used 180d32e68aeSFlorian Hahn to support scalarized and predicated instructions with a single model for 181d32e68aeSFlorian Hahn multiple candidate VF's and UF's. 182d32e68aeSFlorian Hahn 183d32e68aeSFlorian Hahn:VPRecipeBase: 184d32e68aeSFlorian Hahn A pure-virtual base class modeling a sequence of one or more output IR 185d32e68aeSFlorian Hahn instructions, possibly based on one or more input IR instructions. These 186d32e68aeSFlorian Hahn input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe 187d32e68aeSFlorian Hahn may specify how its ingredients are to be transformed to produce the output IR 188d32e68aeSFlorian Hahn instructions; e.g., cloned once, replicated multiple times or widened 189d32e68aeSFlorian Hahn according to selected VF. 190d32e68aeSFlorian Hahn 191d32e68aeSFlorian Hahn:VPValue: 192d32e68aeSFlorian Hahn The base of VPlan's def-use relations class hierarchy. When instantiated, it 193d32e68aeSFlorian Hahn models a constant or a live-in Value in VPlan. It has users, which are of type 194d32e68aeSFlorian Hahn VPUser, but no operands. 195d32e68aeSFlorian Hahn 196d32e68aeSFlorian Hahn:VPUser: 197d32e68aeSFlorian Hahn A VPUser represents an entity that uses a number of VPValues as operands. 198d32e68aeSFlorian Hahn VPUser is similar in some aspects to LLVM's User class. 199d32e68aeSFlorian Hahn 200d32e68aeSFlorian Hahn:VPDef: 201d32e68aeSFlorian Hahn A VPDef represents an entity that defines zero, one or multiple VPValues. 202d32e68aeSFlorian Hahn It is used to model the fact that recipes in VPlan can define multiple 203d32e68aeSFlorian Hahn VPValues. 204d32e68aeSFlorian Hahn 205d32e68aeSFlorian Hahn:VPInstruction: 20699de3a68SFlorian Hahn A VPInstruction is a recipe characterized by a single opcode and optional 20799de3a68SFlorian Hahn flags, free of ingredients or other meta-data. VPInstructions also extend 20899de3a68SFlorian Hahn LLVM IR's opcodes with idiomatic operations that enrich the Vectorizer's 20999de3a68SFlorian Hahn semantics. 210d32e68aeSFlorian Hahn 211d32e68aeSFlorian Hahn:VPTransformState: 212d32e68aeSFlorian Hahn Stores information used for generating output IR, passed from 213d32e68aeSFlorian Hahn LoopVectorizationPlanner to its selected VPlan for execution, and used to pass 214d32e68aeSFlorian Hahn additional information down to VPBlocks and VPRecipes. 215d32e68aeSFlorian Hahn 216d32e68aeSFlorian HahnThe Planning Process and VPlan Roadmap 217d32e68aeSFlorian Hahn====================================== 218d32e68aeSFlorian Hahn 219d32e68aeSFlorian HahnTransforming the Loop Vectorizer to use VPlan follows a staged approach. First, 22099de3a68SFlorian HahnVPlan was only used to record the final vectorization decisions, and to execute 22199de3a68SFlorian Hahnthem: the Hierarchical CFG models the planned control-flow, and Recipes capture 22299de3a68SFlorian Hahndecisions taken inside basic-blocks. Currently, VPlan is used also as the basis 223d32e68aeSFlorian Hahnfor taking these decisions, effectively turning them into a series of 224d32e68aeSFlorian HahnVPlan-to-VPlan algorithms. Finally, VPlan will support the planning process 225d32e68aeSFlorian Hahnitself including cost-based analyses for making these decisions, to fully 226d32e68aeSFlorian Hahnsupport compositional and iterative decision making. 227d32e68aeSFlorian Hahn 228d32e68aeSFlorian HahnSome decisions are local to an instruction in the loop, such as whether to widen 229d32e68aeSFlorian Hahnit into a vector instruction or replicate it, keeping the generated instructions 230d32e68aeSFlorian Hahnin place. Other decisions, however, involve moving instructions, replacing them 231d32e68aeSFlorian Hahnwith other instructions, and/or introducing new instructions. For example, a 232d32e68aeSFlorian Hahncast may sink past a later instruction and be widened to handle first-order 233d32e68aeSFlorian Hahnrecurrence; an interleave group of strided gathers or scatters may effectively 234d32e68aeSFlorian Hahnmove to one place where they are replaced with shuffles and a common wide vector 235d32e68aeSFlorian Hahnload or store; new instructions may be introduced to compute masks, shuffle the 236d32e68aeSFlorian Hahnelements of vectors, and pack scalar values into vectors or vice-versa. 237d32e68aeSFlorian Hahn 238d32e68aeSFlorian HahnIn order for VPlan to support making instruction-level decisions and analyses, 239d32e68aeSFlorian Hahnit needs to model the relevant instructions along with their def/use relations. 240d32e68aeSFlorian HahnThis too follows a staged approach: first, the new instructions that compute 241d32e68aeSFlorian Hahnmasks are modeled as VPInstructions, along with their induced def/use subgraph. 242d32e68aeSFlorian HahnThis effectively models masks in VPlan, facilitating VPlan-based predication. 243d32e68aeSFlorian HahnNext, the logic embedded within each Recipe for generating its instructions at 244d32e68aeSFlorian HahnVPlan execution time, will instead take part in the planning process by modeling 245d32e68aeSFlorian Hahnthem as VPInstructions. Finally, only logic that applies to instructions as a 246d32e68aeSFlorian Hahngroup will remain in Recipes, such as interleave groups and potentially other 247d32e68aeSFlorian Hahnidiom groups having synergistic cost. 248d32e68aeSFlorian Hahn 249d32e68aeSFlorian HahnRelated LLVM components 250d32e68aeSFlorian Hahn----------------------- 251d32e68aeSFlorian Hahn1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP 252d32e68aeSFlorian Hahn tree, where TSLP [3]_ adds Plan Step 2.b. 253d32e68aeSFlorian Hahn 254d32e68aeSFlorian Hahn2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by 255d32e68aeSFlorian Hahn Polly [7]_. 256d32e68aeSFlorian Hahn 257d32e68aeSFlorian Hahn3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of 258d32e68aeSFlorian Hahn the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_. 259d32e68aeSFlorian Hahn 260d32e68aeSFlorian HahnReferences 261d32e68aeSFlorian Hahn---------- 262d32e68aeSFlorian Hahn.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit 263d32e68aeSFlorian Hahn Nuzman and Ayal Zaks, PACT 2008. 264d32e68aeSFlorian Hahn 265d32e68aeSFlorian Hahn.. [2] "Proposal for function vectorization and loop vectorization with function 266d32e68aeSFlorian Hahn calls", Xinmin Tian, [`cfe-dev 267d32e68aeSFlorian Hahn <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., 268d32e68aeSFlorian Hahn March 2, 2016. 269d32e68aeSFlorian Hahn See also `review <https://reviews.llvm.org/D22792>`_. 270d32e68aeSFlorian Hahn 271d32e68aeSFlorian Hahn.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios 272d32e68aeSFlorian Hahn Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. 273d32e68aeSFlorian Hahn 274d32e68aeSFlorian Hahn.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization 275d32e68aeSFlorian Hahn overhead", Hao Zhou and Jingling Xue, CGO 2016. 276d32e68aeSFlorian Hahn 277d32e68aeSFlorian Hahn.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and 278d32e68aeSFlorian Hahn Brian Koblenz, PLDI 1991 279d32e68aeSFlorian Hahn 280d32e68aeSFlorian Hahn.. [6] "Structural analysis: A new approach to flow analysis in optimizing 281d32e68aeSFlorian Hahn compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 282d32e68aeSFlorian Hahn 283d32e68aeSFlorian Hahn.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma 284d32e68aeSFlorian Hahn thesis, 2011. 285d32e68aeSFlorian Hahn 286d32e68aeSFlorian Hahn.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, 287d32e68aeSFlorian Hahn European LLVM Developers' Meeting 2017. 288d32e68aeSFlorian Hahn 289d32e68aeSFlorian Hahn.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop 290d32e68aeSFlorian Hahn Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016. 29199de3a68SFlorian Hahn 292*309a881dSFlorian Hahn.. [10] "VPlan: Status Update and Future Roadmap", Ayal Zaks and Florian Hahn, 293*309a881dSFlorian Hahn LLVM Developers' Meeting 2023, https://www.youtube.com/watch?v=SzGP4PgMuLE 294