# Generic DAG Rewriter Infrastructure Rationale

This document details the rationale behind a general DAG-to-DAG rewrite
infrastructure for MLIR. For up-to-date documentation on the user facing API,
please look at the main [Pattern Rewriting document](../PatternRewriter.md).

## Introduction and Motivation

The goal of a compiler IR is to represent code - at various levels of
abstraction which pose different sets of tradeoffs in terms of representational
capabilities and ease of transformation. However, the ability to represent code
is not itself very useful - you also need to be able to implement those
transformations.

There are many different types of compiler transformations, but this document
focuses on a particularly important class of transformation that comes up
repeatedly at scale, and is important for the goals of MLIR: matching one DAG
of operations and replacing it with another. This is an integral part of many
compilers and is necessary for peephole optimizations like "eliminate identity
nodes" or "replace x+0 with x", for a generalized canonicalization framework
(e.g. the Instruction Combiner in LLVM), and as a useful abstraction for
implementing optimization algorithms for IR at multiple levels.

A particular strength of MLIR (and a major difference vs other compiler
infrastructures like LLVM, GCC, XLA, TensorFlow, etc) is that it uses a single
compiler IR to represent code at multiple levels of abstraction: an MLIR
operation can be a "TensorFlow operation", an "XLA HLO", an Affine Loop Nest,
an LLVM IR instruction (transitively including X86, Lanai, PTX, and other
target specific instructions), or anything else that the MLIR operation system
can reasonably express. Given that MLIR spans such a wide range of different
problem scopes, a single infrastructure for performing graph-to-graph rewrites
can help solve many diverse domain challenges.

[Static single assignment](https://en.wikipedia.org/wiki/Static_single_assignment_form)
(SSA) representations like MLIR make it easy to access the operands and "users"
of an operation. As such, a natural abstraction for these graph-to-graph
rewrites is that of DAG pattern matching: clients define DAG tile patterns
(where a tile is a sequence of operations defining a subgraph of the DAG), and
each pattern includes a result DAG to produce and the cost of the result (or,
inversely, the benefit of doing the replacement). A common infrastructure
efficiently finds and performs the rewrites.

While this concept is simple, the details are more nuanced. This document
defines and explores a set of abstractions that can solve a wide range of
problems that MLIR faces - and is expected to face - over time. We do this by
separating the pattern application algorithm from the "driver" of the
computation loop, and by making space for the patterns to be defined
declaratively.
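To make this concrete, here is a minimal sketch of such a client-defined
pattern using the C++ API described in the Pattern Rewriting document. The
`AddOp` type and its accessors here are hypothetical stand-ins for a concrete
dialect operation:

```c++
#include "mlir/IR/Matchers.h"
#include "mlir/IR/PatternMatch.h"

// Rewrite "add(x, 0)" into "x". AddOp is a hypothetical dialect op used
// purely for illustration; real code would name a concrete op class.
struct SimplifyAddZero : public mlir::OpRewritePattern<AddOp> {
  using OpRewritePattern<AddOp>::OpRewritePattern;

  mlir::LogicalResult
  matchAndRewrite(AddOp op, mlir::PatternRewriter &rewriter) const override {
    // The "match" half: bail out unless the right operand is a constant zero.
    if (!mlir::matchPattern(op.getRhs(), mlir::m_Zero()))
      return mlir::failure();
    // The "rewrite" half: replace all uses of the add with its left operand.
    rewriter.replaceOp(op, op.getLhs());
    return mlir::success();
  }
};
```

Note that the pattern only describes the DAG tile to match and the replacement
to produce; finding candidate roots and scheduling rewrites is left entirely to
the driver.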
### Constant folding

A degenerate but pervasive case of DAG-to-DAG pattern matching is constant
folding: an operation whose operands contain constants can often be folded to
a result constant value.

MLIR operations may override a
[`fold`](../Canonicalization.md/#canonicalizing-with-the-fold-method) routine,
which exposes a simpler API than a general DAG-to-DAG pattern matcher and is
applicable in cases where a generic matcher would not be. For example, a
DAG-rewrite can remove arbitrary nodes in the current function, which could
invalidate iterators. Constant folding as an API does not remove any nodes; it
just provides a (list of) constant values and allows the clients to update
their data structures as necessary.
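As a rough sketch of what this looks like in practice (the authoritative hook
is described in the Canonicalization document linked above; `MyAddOp` and its
accessors are hypothetical):

```c++
// Fold "myadd(x, 0)" to "x", and "myadd(c1, c2)" to a constant attribute.
// MyAddOp is a hypothetical integer-add op used purely for illustration.
mlir::OpFoldResult MyAddOp::fold(FoldAdaptor adaptor) {
  // Identity: x + 0 == x. Returning an existing SSA value asks the caller
  // to use it in place of this op's result; no IR is erased here.
  if (mlir::matchPattern(getRhs(), mlir::m_Zero()))
    return getLhs();

  // Full fold: both operands are known constants, so compute the result.
  auto lhs = llvm::dyn_cast_if_present<mlir::IntegerAttr>(adaptor.getLhs());
  auto rhs = llvm::dyn_cast_if_present<mlir::IntegerAttr>(adaptor.getRhs());
  if (lhs && rhs)
    return mlir::IntegerAttr::get(lhs.getType(),
                                  lhs.getValue() + rhs.getValue());

  // Not foldable; the client decides what (if anything) to do next.
  return {};
}
```

Because the hook only returns values (an existing `Value` or an `Attribute`)
and never mutates the containing function, clients such as the canonicalizer
can invoke it without worrying about iterator invalidation.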
## Related Work

There is a huge amount of related work to consider, given that nearly every
compiler in existence has to solve this problem many times over. One unifying
problem is that all of these systems are designed to solve one particular, and
usually narrow, problem: MLIR, on the other hand, would like to solve many of
these problems within a single infrastructure. Here are a few related graph
rewrite systems, along with the pros and cons of their work (the design most
similar to the infrastructure present in MLIR is the LLVM DAG-to-DAG
instruction selection algorithm).

### AST-Level Pattern Matchers

The literature is full of source-to-source translators which transform
identities in order to improve performance (e.g. transforming `X*0` into `0`).
One large example is the GCC `fold` function, which performs
[many optimizations](https://github.com/gcc-mirror/gcc/blob/master/gcc/fold-const.c)
on ASTs. Clang has
[similar routines](https://clang.llvm.org/docs/InternalsManual.html#constant-folding-in-the-clang-ast)
for simple constant folding of expressions (as required by the C++ standard)
but doesn't perform general optimizations on its ASTs.

The primary downside of AST optimizers is that you can't see across operations
that have multiple uses. It is
[well known in literature](https://llvm.org/pubs/2008-06-LCTES-ISelUsingSSAGraphs.pdf)
that DAG pattern matching is more powerful than tree pattern matching, but, on
the other hand, DAG pattern matching can lead to duplication of computation,
which needs to be checked for.

### "Combiners" and other peephole optimizers

Compilers end up with a lot of peephole optimizers for various things, e.g. the
GCC
["combine" routines](https://github.com/gcc-mirror/gcc/blob/master/gcc/combine.c)
(which try to merge two machine instructions into a single one), the LLVM
[InstCombine](https://github.com/llvm/llvm-project/tree/main/llvm/lib/Transforms/InstCombine)
[pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions),
LLVM's
[DAG Combiner](https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/DAGCombiner.cpp),
the Swift compiler's
[SIL Combiner](https://github.com/apple/swift/tree/main/lib/SILOptimizer/SILCombiner),
etc. These generally match one or more operations and produce zero or more
operations as a result. The LLVM
[Legalization](https://github.com/llvm/llvm-project/tree/main/llvm/lib/CodeGen/SelectionDAG)
infrastructure has a different outer loop but otherwise works the same way.

These passes have a lot of diversity, but also have a unifying structure: they
mostly have a worklist outer loop which visits operations. They then use a
visitor pattern (or equivalent) to switch over the class of operation and
dispatch to a method. That method contains a long list of hand-written C++ code
that pattern-matches various special cases. LLVM introduced a "match" function
that allows writing patterns in a somewhat more declarative style using
template metaprogramming (MLIR has similar facilities). Here's a simple
example:

```c++
  // Y - (X + 1) --> ~X + Y
  if (match(Op1, m_OneUse(m_Add(m_Value(X), m_One()))))
    return BinaryOperator::CreateAdd(Builder.CreateNot(X), Op0);
```

Here is a somewhat more complicated one (this is not the biggest or most
complicated :)

```c++
  // C2 is ODD
  // LHS = XOR(Y,C1), Y = AND(Z,C2), C1==(C2+1) => LHS == NEG(OR(Z, ~C2))
  // ADD(LHS, RHS) == SUB(RHS, OR(Z, ~C2))
  if (match(LHS, m_Xor(m_Value(Y), m_APInt(C1))))
    if (C1->countTrailingZeros() == 0)
      if (match(Y, m_And(m_Value(Z), m_APInt(C2))) && *C1 == (*C2 + 1)) {
        Value NewOr = Builder.CreateOr(Z, ~(*C2));
        return Builder.CreateSub(RHS, NewOr, "sub");
      }
```

These systems are simple to set up, and pattern matching templates have some
advantages (they are extensible for new sorts of sub-patterns and look compact
at the point of use). On the other hand, they have lots of well-known problems,
for example:

*   These patterns are very error prone to write, and contain lots of
    redundancies.
*   The IR being matched often has identities (e.g. when matching commutative
    operators) and the C++ code has to handle it manually - take a look at
    [the full code](https://github.com/llvm/llvm-project/blob/c0b5000bd848303320c03f80fbf84d71e74518c9/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L767)
    for `checkForNegativeOperand` that defines the second pattern.
*   The matching code compiles slowly, both because it generates tons of code
    and because the templates instantiate slowly.
*   Adding new patterns (e.g. for count leading zeros in the example above) is
    awkward and doesn't often happen.
*   The cost model for these patterns is not really defined - it is emergent
    based on the order in which the patterns are matched in the code.
*   They are non-extensible without rebuilding the compiler.
*   It isn't practical to apply theorem provers and other tools to these
    patterns - they cannot be reused for other purposes.

In addition to structured "combiners" like these, there are lots of ad-hoc
systems like the
[LLVM Machine code peephole optimizer](http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/PeepholeOptimizer.cpp?view=markup)
which are related.

### LLVM's DAG-to-DAG Instruction Selection Infrastructure

The instruction selection subsystem in LLVM is the result of many years worth
of iteration and discovery, driven by the need for LLVM to support code
generation for lots of targets, the complexity of code generators for modern
instruction sets (e.g. X86), and the fanatical pursuit of reusing code across
targets. Eli Bendersky wrote a
[nice short overview](https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1)
of how this works, and the
[LLVM documentation](https://llvm.org/docs/CodeGenerator.html#select-instructions-from-dag)
describes it in more depth including its advantages and limitations. It allows
writing patterns like this:
```
def : Pat<(or GR64:$src, (not (add GR64:$src, 1))),
          (BLCI64rr GR64:$src)>;
```

This example defines a matcher for the
["blci" instruction](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_\(Trailing_Bit_Manipulation\))
in the
[X86 target description](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrInfo.td);
there are many others in that file (look for `Pat<>` patterns, since they
aren't entangled in details of the compiler like assembler/disassembler
generation logic).

For the purposes of MLIR, there is much to like about this system, for example:

*   It is defined in a declarative format.
*   It is extensible to target-defined operations.
*   It automates matching across identities, like commutative patterns.
*   It allows custom abstractions and intense factoring of target-specific
    commonalities.
*   It generates compact code - it compiles into a state machine, which is
    interpreted.
*   It allows the instruction patterns to be defined and reused for multiple
    purposes.
*   The patterns are "type checked" at compile time, detecting lots of bugs
    early and eliminating redundancy from the pattern specifications.
*   It allows the use of general C++ code for weird/complex cases.

While there is a lot that is good here, there are also a few undesirable bits:

*   The representation is specifically designed and only applicable for
    instruction selection, meaning that directly adjacent problems like the
    DAGCombiner and Legalizer can't use it.
*   This isn't extensible at compiler runtime; you have to rebuild the
    compiler to extend it.
*   The error messages when failing to match a pattern
    [are not exactly optimal](https://www.google.com/search?q=llvm+cannot+select).
*   It has lots of implementation problems and limitations (e.g. can't write a
    pattern for a multi-result operation) as a result of working with the
    awkward SelectionDAG representation and being designed and implemented on
    demand.
*   Organic growth over time has left lots of sharp edges.

### Summary

MLIR faces a wide range of pattern matching and graph rewrite problems, and
one of the major advantages of having a common representation for code at
multiple levels is that it allows for investing in - and highly leveraging - a
single infrastructure for doing this sort of work.

## Goals

We'd like this infrastructure to encompass many problems in the MLIR space,
including 1-to-N expansions (e.g. in type legalization during instruction
selection, when an add of one bit width may be split into multiple adds of a
smaller bit width), M-to-1 patterns (e.g. when converting a multiply+add into a
single muladd operation), as well as general M-to-N patterns (e.g. instruction
selection for target instructions). Patterns have a benefit associated with
them, and the common infrastructure should be responsible for sorting out the
highest benefit match for a given application.

We separate the task of picking a particular optimal pattern from a given root
node, the algorithm used to rewrite an entire graph given a particular set of
goals, and the definition of the patterns themselves. We do this because DAG
tile pattern matching is NP-complete. Additionally, we would like to support
iterative rewrite algorithms that progressively transform the input program
through multiple steps. Furthermore, we would like to support many different
sorts of clients across the MLIR stack, and they may have different tolerances
for compile time cost, different demands for optimality, and other algorithmic
goals or constraints.
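This separation shows up directly in the API. As a minimal sketch (reusing the
hypothetical `SimplifyAddZero` pattern from earlier alongside an equally
hypothetical `FuseMulAdd`), each pattern declares a benefit when registered,
and the driver that applies the set is chosen independently:

```c++
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

void simplify(mlir::Operation *root) {
  mlir::RewritePatternSet patterns(root->getContext());
  // Each pattern carries a benefit; when several patterns match at the same
  // root operation, the infrastructure prefers the highest-benefit one.
  patterns.add<FuseMulAdd>(root->getContext(), /*benefit=*/2);
  patterns.add<SimplifyAddZero>(root->getContext(), /*benefit=*/1);

  // The driver is interchangeable: here a greedy worklist driver iterates
  // until no pattern applies, but clients with different compile-time or
  // optimality constraints can supply a different driver.
  (void)mlir::applyPatternsAndFoldGreedily(root, std::move(patterns));
}
```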
We aim for MLIR transformations to be easy to implement and to reduce the
likelihood of compiler bugs. We expect there to be a very large number of
patterns that are defined over time, and we believe that these sorts of
patterns will have a very large number of legality/validity constraints - many
of which are difficult to reason about in a consistent way, may be target
specific, and whose implementation may be particularly bug-prone. As such, we
aim to design the API around pattern definition to be simple, resilient to
programmer errors, and to allow a separation of concerns between the legality
of the generated nodes and the idea of the pattern being defined.

Finally, error handling is a topmost concern: we want pattern match failures
to be diagnosable in a reasonable way. This is a difficult problem in general,
as the space of malfunction is too great to be fully enumerated and handled
optimally, but MLIR is already designed to represent the provenance of an
operation well. The aim of the pattern rewriting infrastructure is simply to
propagate that provenance information precisely, as well as to diagnose
pattern match failures with the rationale for why a set of patterns does not
apply.

### Non goals

The pattern infrastructure does not aim to solve all compiler problems; it is
simply a DAG-to-DAG pattern matching system. Compiler algorithms that require
global dataflow analysis (e.g. common subexpression elimination, conditional
constant propagation, and many, many others) will not be directly solved by
this infrastructure.

This infrastructure is limited to DAG patterns, which (by definition) prevent
the patterns from seeing across cycles in a graph. In an SSA-based IR like
MLIR, this means that these patterns don't see across basic block arguments.
We consider this acceptable given the set of problems we are trying to solve -
we don't know of any other system that attempts to do so, and consider the
payoff of worrying about this to be low.

This design includes the ability for DAG patterns to have associated benefits,
but those benefits are defined in terms of magic numbers (typically equal to
the number of nodes being replaced). For any given application, the units of
these magic numbers will have to be defined.