# Generic DAG Rewriter Infrastructure Rationale

This document details the rationale behind a general DAG-to-DAG rewrite
infrastructure for MLIR. For up-to-date documentation on the user facing API,
please look at the main [Pattern Rewriting document](../PatternRewriter.md).

## Introduction and Motivation

The goal of a compiler IR is to represent code - at various levels of
abstraction which pose different sets of tradeoffs in terms of representational
capabilities and ease of transformation. However, the ability to represent code
is not itself very useful - you also need to be able to implement those
transformations.

There are many different types of compiler transformations, but this document
focuses on a particularly important class of transformation that comes up
repeatedly at scale, and is important for the goals of MLIR: matching one DAG of
operations, and replacing it with another. This is an integral part of many
compilers and necessary for peephole optimizations like "eliminate identity
nodes" or "replace x+0 with x", a generalized canonicalization framework (e.g.
Instruction Combiner in LLVM), as well as a useful abstraction to implement
optimization algorithms for IR at multiple levels.

A particular strength of MLIR (and a major difference vs other compiler
infrastructures like LLVM, GCC, XLA, TensorFlow, etc) is that it uses a single
compiler IR to represent code at multiple levels of abstraction: an MLIR
operation can be a "TensorFlow operation", an "XLA HLO", an Affine Loop Nest, an
LLVM IR instruction (transitively including X86, Lanai, PTX, and other target
specific instructions), or anything else that the MLIR operation system can
reasonably express. Given that MLIR spans such a wide range of different problem
scopes, a single infrastructure for performing graph-to-graph rewrites can help
solve many diverse domain challenges.

[Static single assignment](https://en.wikipedia.org/wiki/Static_single_assignment_form)
(SSA) representations like MLIR make it easy to access the operands and "users"
of an operation. As such, a natural abstraction for these graph-to-graph
rewrites is that of DAG pattern matching: clients define DAG tile patterns
(where a tile is a sequence of operations defining a subgraph of the DAG), and
each pattern includes a result DAG to produce and the cost of the result (or,
inversely, the benefit of doing the replacement). A common infrastructure
efficiently finds and performs the rewrites.

While this concept is simple, the details are more nuanced. This document
defines and explores a set of abstractions that can solve a wide range of
problems that MLIR is - and is expected to be - faced with over time. We do
this by separating the pattern application algorithm from the "driver" of the
computation loop, and by making space for the patterns to be defined
declaratively.

### Constant folding

A degenerate but pervasive case of DAG-to-DAG pattern matching is constant
folding: an operation whose operands contain constants can often be folded to a
result constant value.

MLIR operations may override a
[`fold`](../Canonicalization.md/#canonicalizing-with-the-fold-method) routine, which
exposes a simpler API compared to a general DAG-to-DAG pattern matcher, and
allows for it to be applicable in cases where a generic matcher would not be.
For example, a DAG rewrite can remove arbitrary nodes in the current function,
which could invalidate iterators. Constant folding as an API does not remove
any nodes; it just provides a (list of) constant values and allows the clients
to update their data structures as necessary.

## Related Work

There is a huge amount of related work to consider, given that nearly every
compiler in existence has to solve this problem many times over. One unifying
problem is that all of these systems are designed to solve one particular,
usually narrow, problem; MLIR on the other hand would like to solve many of
these problems within a single infrastructure. Here are a few related graph
rewrite systems, along with the pros and cons of their work (the most similar
design to the infrastructure present in MLIR is the LLVM DAG-to-DAG instruction
selection algorithm).

### AST-Level Pattern Matchers

The literature is full of source-to-source translators which transform
identities in order to improve performance (e.g. transforming `X*0` into `0`).
One large example is the GCC `fold` function, which performs
[many optimizations](https://github.com/gcc-mirror/gcc/blob/master/gcc/fold-const.c)
on ASTs. Clang has
[similar routines](https://clang.llvm.org/docs/InternalsManual.html#constant-folding-in-the-clang-ast)
for simple constant folding of expressions (as required by the C++ standard) but
doesn't perform general optimizations on its ASTs.

The primary downside of AST optimizers is that you can't see across operations
that have multiple uses. It is
[well known in the literature](https://llvm.org/pubs/2008-06-LCTES-ISelUsingSSAGraphs.pdf)
that DAG pattern matching is more powerful than tree pattern matching, but on
the other hand, DAG pattern matching can lead to duplication of computation,
which needs to be checked for.

### "Combiners" and other peephole optimizers

Compilers end up with a lot of peephole optimizers for various things, e.g. the
GCC
["combine" routines](https://github.com/gcc-mirror/gcc/blob/master/gcc/combine.c)
(which try to merge two machine instructions into a single one), the LLVM
[Inst Combine](https://github.com/llvm/llvm-project/tree/main/llvm/lib/Transforms/InstCombine)
[pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions),
LLVM's
[DAG Combiner](https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/DAGCombiner.cpp),
the Swift compiler's
[SIL Combiner](https://github.com/apple/swift/tree/main/lib/SILOptimizer/SILCombiner),
etc. These generally match one or more operations and produce zero or more
operations as a result. The LLVM
[Legalization](https://github.com/llvm/llvm-project/tree/main/llvm/lib/CodeGen/SelectionDAG)
infrastructure has a different outer loop but otherwise works the same way.

These passes have a lot of diversity, but also have a unifying structure: they
mostly have a worklist outer loop which visits operations. They then use a
visitor pattern (or equivalent) to switch over the class of operation and
dispatch to a method. That method contains a long list of hand-written C++ code
that pattern-matches various special cases. LLVM introduced a "match" function
that allows writing patterns in a somewhat more declarative style using template
metaprogramming (MLIR has similar facilities). Here's a simple example:

```c++
  // Y - (X + 1) --> ~X + Y
  if (match(Op1, m_OneUse(m_Add(m_Value(X), m_One()))))
    return BinaryOperator::CreateAdd(Builder.CreateNot(X), Op0);
```

Here is a somewhat more complicated one (this is not the biggest or most
complicated :)

```c++
  // C2 is ODD
  // LHS = XOR(Y,C1), Y = AND(Z,C2), C1==(C2+1) => LHS == NEG(OR(Z, ~C2))
  // ADD(LHS, RHS) == SUB(RHS, OR(Z, ~C2))
  if (match(LHS, m_Xor(m_Value(Y), m_APInt(C1))))
    if (C1->countTrailingZeros() == 0)
      if (match(Y, m_And(m_Value(Z), m_APInt(C2))) && *C1 == (*C2 + 1)) {
        Value *NewOr = Builder.CreateOr(Z, ~(*C2));
        return Builder.CreateSub(RHS, NewOr, "sub");
      }
```

These systems are simple to set up, and pattern matching templates have some
advantages (they are extensible for new sorts of sub-patterns, and look compact
at the point of use). On the other hand, they have lots of well-known problems,
for example:

*   These patterns are very error-prone to write, and contain lots of
    redundancies.
*   The IR being matched often has identities (e.g. when matching commutative
    operators) and the C++ code has to handle them manually (take a look at
    [the full code](https://github.com/llvm/llvm-project/blob/c0b5000bd848303320c03f80fbf84d71e74518c9/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L767)
    for `checkForNegativeOperand` that defines the second pattern).
*   The matching code compiles slowly, both because it generates tons of code
    and because the templates instantiate slowly.
*   Adding new patterns (e.g. for count leading zeros in the example above) is
    awkward and doesn't often happen.
*   The cost model for these patterns is not really defined - it is emergent
    based on the order the patterns are matched in code.
*   They are non-extensible without rebuilding the compiler.
*   It isn't practical to apply theorem provers and other tools to these
    patterns - they cannot be reused for other purposes.

In addition to structured "combiners" like these, there are lots of ad-hoc
systems like the
[LLVM Machine code peephole optimizer](http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/PeepholeOptimizer.cpp?view=markup)
which are related.

### LLVM's DAG-to-DAG Instruction Selection Infrastructure

The instruction selection subsystem in LLVM is the result of many years' worth
of iteration and discovery, driven by the need for LLVM to support code
generation for lots of targets, the complexity of code generators for modern
instruction sets (e.g. X86), and the fanatical pursuit of reusing code across
targets. Eli Bendersky wrote a
[nice short overview](https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1)
of how this works, and the
[LLVM documentation](https://llvm.org/docs/CodeGenerator.html#select-instructions-from-dag)
describes it in more depth, including its advantages and limitations. It allows
writing patterns like this:

```
def : Pat<(or GR64:$src, (not (add GR64:$src, 1))),
          (BLCI64rr GR64:$src)>;
```

This example defines a matcher for the
["blci" instruction](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_\(Trailing_Bit_Manipulation\))
in the
[X86 target description](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrInfo.td);
there are many others in that file (look for `Pat<>` patterns, since they aren't
entangled in details of the compiler like assembler/disassembler generation
logic).

For the purposes of MLIR, there is much to like about this system, for example:

*   It is defined in a declarative format.
*   It is extensible to target-defined operations.
*   It automates matching across identities, like commutative patterns.
*   It allows custom abstractions and intense factoring of target-specific
    commonalities.
*   It generates compact code - it compiles into a state machine, which is
    interpreted.
*   It allows the instruction patterns to be defined and reused for multiple
    purposes.
*   The patterns are "type checked" at compile time, detecting lots of bugs
    early and eliminating redundancy from the pattern specifications.
*   It allows the use of general C++ code for weird/complex cases.

While there is a lot that is good here, there are also a few undesirable bits:

*   The representation is specifically designed for, and only applicable to,
    instruction selection, meaning that directly adjacent problems like the
    DAGCombiner and Legalizer can't use it.
*   This isn't extensible at compiler runtime; you have to rebuild the compiler
    to extend it.
*   The error messages when failing to match a pattern
    [are not exactly optimal](https://www.google.com/search?q=llvm+cannot+select).
*   It has lots of implementation problems and limitations (e.g. can't write a
    pattern for a multi-result operation) as a result of working with the
    awkward SelectionDAG representation and being designed and implemented on
    demand.
*   Organic growth over time has left lots of sharp edges.

### Summary

MLIR faces a wide range of pattern matching and graph rewrite problems, and one
of the major advantages of having a common representation for code at multiple
levels is that it allows for investing in - and highly leveraging - a single
infrastructure for doing this sort of work.

## Goals

We'd like this infrastructure to encompass many problems in the MLIR space,
including 1-to-N expansions (e.g. in type legalization during instruction
selection, when an add of one bit width may be split into multiple adds of a
smaller bit width), M-to-1 patterns (e.g. when converting a multiply+add into a
single muladd operation), as well as general M-to-N patterns (e.g. instruction
selection for target instructions). Patterns have a benefit associated with
them, and the common infrastructure should be responsible for sorting out the
highest-benefit match for a given application.

We separate the task of picking a particular optimal pattern from a given root
node, the algorithm used to rewrite an entire graph given a particular set of
goals, and the definition of the patterns themselves. We do this because DAG
tile pattern matching is NP-complete. Additionally, we would like to support
iterative rewrite algorithms that progressively transform the input program
through multiple steps. Furthermore, we would like to support many different
sorts of clients across the MLIR stack, and they may have different tolerances
for compile time cost, different demands for optimality, and other algorithmic
goals or constraints.

We aim for MLIR transformations to be easy to implement and to reduce the
likelihood of compiler bugs. We expect there to be a very large number of
patterns that are defined over time, and we believe that these sorts of patterns
will have a very large number of legality/validity constraints - many of which
are difficult to reason about in a consistent way, may be target specific, and
whose implementation may be particularly bug-prone. As such, we aim to design
the API around pattern definition to be simple, resilient to programmer errors,
and to allow separation of concerns between the legality of the nodes generated
and the idea of the pattern being defined.

Finally, error handling is a topmost concern: we want pattern match failures to
be diagnosable in a reasonable way. This is a difficult problem in general, as
the space of malfunction is too great to be fully enumerated and handled
optimally, but MLIR is already designed to represent the provenance of an
operation well. The aim of the pattern rewriting infrastructure is simply to
propagate that provenance information precisely, as well as to diagnose pattern
match failures with the rationale for why a set of patterns does not apply.

### Non goals

The pattern infrastructure does not aim to solve all compiler problems; it is
simply a DAG-to-DAG pattern matching system. Compiler algorithms that require
global dataflow analysis (e.g. common subexpression elimination, conditional
constant propagation, and many, many others) will not be directly solved by
this infrastructure.

275
276This infrastructure is limited to DAG patterns, which (by definition) prevent
277the patterns from seeing across cycles in a graph. In an SSA-based IR like MLIR,
278this means that these patterns don't see across basic block arguments. We
279consider this acceptable given the set of problems we are trying to solve - we
280don't know of any other system that attempts to do so, and consider the payoff
281of worrying about this to be low.
282
This design includes the ability for DAG patterns to have associated benefits,
but those benefits are defined in terms of magic numbers (typically equal to the
number of nodes being replaced). For any given application, the units of magic
numbers will have to be defined.