# MLIR Rationale

This document is intended to capture some of the alternatives considered and
open debates in the design of MLIR, along with the rationale for certain
decisions we made. This is not intended to be a "finely groomed" document - we
prefer the ability to dump in interesting tidbits without worrying too much
about their consistency or readability.

[TOC]

## Abstract

MLIR is a compiler intermediate representation with similarities to traditional
three-address SSA representations (like
[LLVM IR](http://llvm.org/docs/LangRef.html) or
[SIL](https://github.com/apple/swift/blob/main/docs/SIL.rst)), but which
introduces notions from the polyhedral loop optimization literature as first
class concepts. This hybrid design is optimized to represent, analyze, and
transform high level dataflow graphs as well as target-specific code generated
for high performance data parallel systems. Beyond its representational
capabilities, its single continuous design provides a framework to lower from
dataflow graphs to high performance target specific code.

MLIR stands for one of "Multi-Level IR", "Multi-dimensional Loop IR", "Machine
Learning IR", or "Mid Level IR"; we prefer the first. This document only
provides the rationale behind MLIR -- its actual
[specification document](../LangRef.md) and other content are hosted elsewhere.

## Introduction and Motivation

The Multi-Level Intermediate Representation (MLIR) is intended for easy
expression and optimization of computations involving deep loop nests and dense
matrices of high dimensionality. It is thus well-suited to deep learning
computations in particular. Yet it is general enough to also represent arbitrary
sequential computation. The representation allows high-level optimization and
parallelization for a wide range of parallel architectures including those with
deep memory hierarchies --- general-purpose multicores, GPUs, and specialized
neural network accelerators.

MLIR uses ideas drawn from the IRs of LLVM and Swift for lower level constructs
while combining them with ideas from the polyhedral abstraction to represent
loop nests, multidimensional data (tensors), and transformations on these
entities as first class concepts in the IR.

MLIR is a multi-level IR: it can represent code at levels ranging from
domain-specific representations such as HLO or TensorFlow graphs all the way
down to the machine level. MLIR is able to represent arbitrary control flow and
arbitrary data accesses, and is general enough to represent nearly all
sequential computation. This is a key distinction from existing polyhedral
representation implementations (such as LLVM [Polly](https://polly.llvm.org/))
that are able to use the polyhedral abstraction in a way isolated from the LLVM
IR and only for affine loop nests, i.e., portions of the code where array
accesses, loop bounds, and conditionals are regular (involve linear functions
of loop iterators and constant symbols). The presence of statically
unpredictable data accesses or control flow does not preclude representation in
MLIR, but only limits, to a certain extent, the ability to reason about and
apply transformations using the polyhedral abstraction.

Maps, sets, and relations with affine constraints are the core structures
underlying a polyhedral representation of high-dimensional loop nests and
multidimensional arrays. These structures are represented as textual
expressions in a form close to their mathematical form. They are used to
capture loop nests, tensor data structures, and how they are reordered and
mapped for a target architecture. All structured or "conforming" loops are
captured as part of the polyhedral information, and so are tensor variables,
their layouts, and subscripted accesses to these tensors in memory.
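
For instance, the following sketch (illustrative names and tile size) shows the
textual form of an affine map that reorders a 2-d iteration space into 128x128
tiles, and an integer set that holds exactly when both dimensions lie within a
symbolic bound `N`:

```mlir
// Reorder a 2-d iteration space into 128x128 tiles.
#tile = affine_map<(d0, d1) -> (d0 floordiv 128, d1 floordiv 128,
                                d0 mod 128, d1 mod 128)>
// The points (d0, d1) lying within the bounds [0, N) x [0, N).
#valid = affine_set<(d0, d1)[N] : (d0 >= 0, N - 1 - d0 >= 0,
                                   d1 >= 0, N - 1 - d1 >= 0)>
```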

The information captured in the IR allows a compact expression of all loop
transformations, data remappings, explicit copying necessary for explicitly
addressed memory in accelerators, mapping to pre-tuned expert-written
primitives, and mapping to specialized vector instructions. Loop
transformations that can be easily implemented include the full family of
affine transformations: these subsume all traditional loop transformations
(unimodular and non-unimodular) such as loop tiling, interchange, permutation,
skewing, scaling, relative shifting, reversal, fusion, and
distribution/fission. Transformations on data layout such as padding and
transforming to blocked layouts are also represented well via affine layout
maps.

MLIR's design allows a progressive lowering to target-specific forms. Besides
high-level transformations for loop nests and data layouts that a typical
mid-level optimizer is expected to deal with, MLIR is also designed to perform
certain low-level scheduling and mapping decisions that a typical backend IR is
entrusted with: these include mapping to specialized vector instructions,
auto-vectorization, and software pipelining. The need to support these
transformations stems from the fact that neural network accelerators have
specialized units that deal with large chunks of data whose computation maps
back to chunks of more than one loop of the loop nests as viewed by a program
at a level closer to the original specification. Such specialized units or
instructions operate on multidimensional data chunks from a programmer's
viewpoint. This makes it hard or infeasible for a backend operating on a very
low-level IR close to assembly to lift and reconstruct loops and perform such a
mapping. This is in contrast to classic instruction selection and scheduling in
today's compilers, which primarily deal only with the body of the innermost
loop. MLIR also facilitates automatic mapping to expert pre-tuned primitives or
vendor libraries operating on data at higher levels (or at the highest level)
of the memory hierarchy.

In summary, MLIR is convenient for and closed under the kind of transformations
needed to lower to general-purpose as well as specialized accelerators. It also
allows one to build modular and reusable target independent and target
dependent passes.

## Design Decisions

This section sheds light on some of the design decisions -- some of these are
indirectly implied by the specification document.

### Loads and stores

The 'load' and 'store' operations are specifically crafted to fully resolve to
an element of a memref. These operations take as arguments a memref and n
indices for an n-ranked memref. This disallows the equivalent of pointer
arithmetic or the ability to index into the same memref in other ways
(something which C arrays allow for example). Furthermore, for the affine
constructs, the compiler can follow use-def chains (e.g. through
[affine.apply operations](../Dialects/Affine.md/#affineapply-affineapplyop) or
through the map attributes of
[affine operations](../Dialects/Affine.md/#operations)) to precisely analyze
references at compile-time using polyhedral techniques. This is possible
because of the
[restrictions on dimensions and symbols](../Dialects/Affine.md/#restrictions-on-dimensions-and-symbols).

A scalar of element-type (a primitive type or a vector type) that is stored in
memory is modeled as a 0-d memref. This is also necessary for scalars that are
live out of for loops and if conditionals in a function, for which we don't yet
have an SSA representation --
[an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to allow
that is described later in this doc.
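
A minimal sketch of both rules, using illustrative function and value names:
the 2-ranked memref is addressed with exactly two subscripts, and the scalar
lives in a 0-d memref:

```mlir
func.func @access(%M : memref<16x16xf32>, %scalar : memref<f32>) {
  %c0 = arith.constant 0 : index
  // A 2-ranked memref is fully resolved with exactly two indices.
  %v = memref.load %M[%c0, %c0] : memref<16x16xf32>
  // A scalar stored in memory is modeled as a 0-d memref, indexed with
  // an empty list of subscripts.
  memref.store %v, %scalar[] : memref<f32>
  return
}
```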

### Symbols and types

The current MLIR disallows use of symbols in types. For example, when a tensor
or memref dimension is statically unknown, it is denoted in the type as '?'. An
SSA symbol is then bound to it when a memref is created. The actual value of
the unknown dimension can be queried using the "dim" operation as shown below.

Example:

```mlir
func.func @foo(%N : index) {
  %A = memref.alloc(%N) : memref<8x?xf32, #lmap>
  ...
  call @bar(%A) : (memref<8x?xf32, #lmap>) -> ()
}

func.func @bar(%A : memref<8x?xf32, #lmap>) {
  // The type of %A indicates that %A has a dynamic shape with 8 rows and an
  // unknown number of columns. The number of columns is queried dynamically
  // using the dim operation.
  %c1 = arith.constant 1 : index
  %N = memref.dim %A, %c1 : memref<8x?xf32, #lmap>

  affine.for %i = 0 to 8 {
    affine.for %j = 0 to %N {
      // A[i, j] += 1
      %cf1 = arith.constant 1.0 : f32
      %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
      %s2 = arith.addf %s1, %cf1 : f32
      affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
    }
  }
  return
}
```

An alternative design is to embed the reference to symbols directly in the
type - `memref<8x%Nxf32>`. We went for the current approach in MLIR because it
simplifies the design --- types remain immutable when the values of symbols
change.

### Block Arguments vs PHI nodes

MLIR Regions represent SSA using "[block arguments](../LangRef.md/#blocks)"
rather than the [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi)
used in LLVM. This choice is representationally identical (the same constructs
can be represented in either form) but block arguments have several advantages
(a small example follows the list):

1.  LLVM PHI nodes always have to be kept at the top of a block, and
    transformations frequently have to manually skip over them. This is defined
    away with BB arguments.
1.  LLVM has a separate function Argument node. This is defined away with BB
    arguments, because the arguments to the entry block serve this purpose.
1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising to
    compiler engineers and makes it easy to introduce bugs (this is closely
    related to the
    "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
    problem in the SSA lowering literature). With the BB argument
    representation, this confusion is defined away.
1.  The entry list of PHI nodes in LLVM is unordered, and some blocks have
    thousands of predecessors (e.g. unwind blocks). This can cause long compile
    times because transformations have to linearly scan this list. This is
    defined away with the BB argument representation.
1.  LLVM has no way to represent values that are available only in one
    successor but not the other, e.g. its invoke instruction cannot produce the
    exception value JUST on the exception edge. Instead, the
    [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
    is a hack used to represent this. MLIR doesn't make use of this capability,
    but SIL uses it extensively, e.g. in the
    [switch_enum instruction](https://github.com/apple/swift/blob/main/docs/SIL.rst#switch-enum).
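
As a small illustration (hypothetical function), the merge point that would be
a PHI node in LLVM is simply an argument of the successor block, with each
predecessor edge passing its own value:

```mlir
func.func @select_value(%cond : i1, %a : i32, %b : i32) -> i32 {
  // ^bb1's argument %x plays the role of an LLVM PHI node.
  cf.cond_br %cond, ^bb1(%a : i32), ^bb1(%b : i32)
^bb1(%x : i32):
  return %x : i32
}
```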

For more context, block arguments were previously used in the Swift
[SIL Intermediate Representation](https://github.com/apple/swift/blob/main/docs/SIL.rst),
and described in
[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
interest
[starts here](https://www.youtube.com/watch?v=Ntj8ab-5cvE&t=596s).

### Index type usage and limitations

Index types are intended to be used for platform-specific "size" values and may
appear in subscripts, sizes of aggregate types, and affine expressions. They
are also tightly coupled with `affine.apply` and `affine.load`/`affine.store`
operations; having `index` type is a necessary precondition for a value to be
acceptable to these operations.

We allow `index` types in tensors, vectors, and memrefs because a code
generation strategy has to map `index` to an implementation type and hence
needs to be able to materialize corresponding values. However, the target might
lack support for `vector` values with the target specific equivalent of the
`index` type.
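
A short sketch (illustrative names) showing `index` values flowing through a
`memref.dim` query, an affine loop bound, and subscripts:

```mlir
func.func @scale(%buf : memref<?xf32>) {
  %c0 = arith.constant 0 : index
  %cf2 = arith.constant 2.0 : f32
  %n = memref.dim %buf, %c0 : memref<?xf32>  // %n has 'index' type.
  affine.for %i = 0 to %n {                  // So does the IV %i.
    %v = affine.load %buf[%i] : memref<?xf32>
    %s = arith.mulf %v, %cf2 : f32
    affine.store %s, %buf[%i] : memref<?xf32>
  }
  return
}
```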

### Data layout of non-primitive types

Data layout information such as the bit width or the alignment of types may be
target and ABI-specific and thus should be configurable rather than imposed by
the compiler. In particular, the layout of compound or `index` types may vary.
MLIR specifies default bit widths for certain primitive *types*, in particular
for integers and floats. It is equal to the number that appears in the type
definition, e.g. the bit width of `i32` is `32`, and so is the bit width of
`f32`. The bit width is not *necessarily* related to the amount of memory (in
bytes) or the register size (in bits) that is necessary to store the value of
the given type. For example, `vector<3xi57>` is likely to be lowered to a
vector of four 64-bit integers, so that its storage requirement is
`4 x 64 / 8 = 32` bytes, rather than the `(3 x 57) ceildiv 8 = 22` bytes that
can be naively computed from the bit width. MLIR makes such
[data layout information](../DataLayout.md) configurable using attributes that
can be queried during lowering, for example, when allocating a compound type.

The data layout of dialect-specific types is undefined at the MLIR level. Yet
dialects are free to define their own quantities and make them available via
the data layout infrastructure.

### Integer signedness semantics

Integers in the builtin MLIR type system have a bitwidth (note that the `index`
type has a symbolic width equal to the machine word size), and they *may*
additionally have signedness semantics. The purpose is to satisfy the needs of
different dialects, which can model different levels of abstraction. Certain
abstractions, especially those closer to source languages, might want to
differentiate signedness with integer types, while others, especially those
closer to machine instructions, might want signless integers. Instead of
forcing each abstraction to adopt the same integer modelling or develop its own
in-house, the integer type provides this as an option to help code reuse and
consistency.

For the standard dialect, the choice is to have signless integer types. An
integer value does not have an intrinsic sign, and it's up to the specific op
to decide how to interpret it. For example, ops like `arith.addi` and
`arith.muli` do two's complement arithmetic, while other operations encode the
sign in the opcode, e.g. `arith.divsi` vs `arith.divui`.
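
For example (illustrative function), the same `arith.addi` serves both
interpretations, while division comes in signed and unsigned variants:

```mlir
func.func @signless(%a : i32, %b : i32) -> (i32, i32, i32) {
  %sum = arith.addi %a, %b : i32   // One two's complement add, no sign.
  %q_s = arith.divsi %a, %b : i32  // Signed interpretation.
  %q_u = arith.divui %a, %b : i32  // Unsigned interpretation.
  return %sum, %q_s, %q_u : i32, i32, i32
}
```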

LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
which was introduced in a revamp rolled out
[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
Prior to that, from
[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
used signed types like "sbyte" and "ubyte". This shift was important and has
served LLVM well over the years, because it is a good thing for an intermediate
representation to represent the same computation with the same instruction.
Signed types got in the way: (e.g.) an "add of an sbyte" does the same
computation as an "add of a ubyte", but the type system made them look
artificially different. This split also required casts like "cast from sbyte to
ubyte" which do nothing at the machine level. Removing signs from the type
system eliminated these problems, making the compiler simpler.

More information about this split is available in an old
[talk on YouTube](https://www.youtube.com/watch?v=VeRaLPupGks) about LLVM 2.0.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as
possible.

### Splitting floating point vs integer operations

The MLIR "Arith" dialect splits many integer and floating point operations into
different categories, for example `arith.addf` vs `arith.addi` and `arith.cmpf`
vs `arith.cmpi`
([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
These operations *are* polymorphic on the number of elements in the type
though, for example `addf` is used with scalar floats, vectors of floats, and
tensors of floats (LLVM does the same thing with its scalar/vector types).
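
For instance (illustrative function), the same `arith.addf` opcode applies to
scalars, vectors, and tensors of floats:

```mlir
func.func @addf_shapes(%a : f32, %b : f32,
                       %v0 : vector<4xf32>, %v1 : vector<4xf32>,
                       %t0 : tensor<4xf32>, %t1 : tensor<4xf32>)
    -> (f32, vector<4xf32>, tensor<4xf32>) {
  %s = arith.addf %a, %b : f32
  %v = arith.addf %v0, %v1 : vector<4xf32>
  %t = arith.addf %t0, %t1 : tensor<4xf32>
  return %s, %v, %t : f32, vector<4xf32>, tensor<4xf32>
}
```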

This split is important because floating point and integer operations are quite
different in practice: for example, floating point values include NaNs, so
[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
should use different comparison opcodes. On the arithmetic side of things,
floating point operations support rounding modes, floating point contractions,
and ["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), while
integers may want to have two's complement overflow behavior or be undefined on
[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
for performance.

We are a long way from this sort of thing being a priority to care about in
MLIR, but since we have experience and know the right way to do this, we'd
rather design it in from the beginning.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as
possible.

### Specifying sign in integer comparison operations

Since integers are [signless](#integer-signedness-semantics), it is necessary
to define the sign for integer comparison operations. This sign indicates how
to treat the most significant bit of the integer: as a sign bit or as a regular
value bit. For example, comparing two `i4` values `0b1000` and `0b0010` yields
different results for unsigned (`8 > 2`) and signed (`-8 < 2`) interpretations.
This difference is only significant for *order* comparisons, not for *equality*
comparisons: for the latter, all bits must have the same value regardless of
sign. Since both arguments have exactly the same bit width and are not padded
by this operation, it is impossible for two values with different bit
representations to be interpreted as equal.
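
A sketch of that `i4` example (hypothetical function): with `%x = 0b1000` and
`%y = 0b0010`, `sgt` is false (`-8 > 2` does not hold), `ugt` is true
(`8 > 2`), and `eq` is the same under either interpretation:

```mlir
func.func @compare(%x : i4, %y : i4) -> (i1, i1, i1) {
  %sgt = arith.cmpi sgt, %x, %y : i4  // Signed order comparison.
  %ugt = arith.cmpi ugt, %x, %y : i4  // Unsigned order comparison.
  %eq  = arith.cmpi eq,  %x, %y : i4  // Equality is sign-independent.
  return %sgt, %ugt, %eq : i1, i1, i1
}
```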

### Specifying comparison kind as attribute

Unlike arithmetic, comparison operators share several common properties, e.g.
they cannot be considered associative. In practice, comparisons are sometimes
implemented by the same instruction or its variants so it makes sense to group
them together at the IR level.

An alternative would be to introduce ten distinct operators for all currently
supported kinds of integer comparisons. These operators would have increased
the number of "reserved" names used by standard operations as well as the size
of the C++ API while their implementations would have been mostly identical.

The comparison kind is internally an integer attribute. However, for the sake
of human readability, the custom assembly form accepts string literals that are
mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` conveys
integer equality comparison better than `cmpi 0, %lhs, %rhs`, where it is
unclear what gets compared to what. This syntactic sugar is possible thanks to
parser logic redefinitions for the custom assembly form of non-builtin
operations. Supporting it in the full notation would have required changing how
the main parsing algorithm works and may have had unexpected repercussions.
While it would have been possible to store the predicate as a string attribute,
that would have made it impossible to implement switching logic based on the
comparison kind and would have made attribute validity checks (one out of ten
possible kinds) more complex.
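
In today's `arith` dialect, the custom form spells the predicate as a bare
keyword rather than a quoted string; the generic form below exposes the
underlying integer attribute (a sketch, assuming `eq` maps to predicate
value 0):

```mlir
func.func @cmp_forms(%lhs : i32, %rhs : i32) -> (i1, i1) {
  // Custom assembly form: the comparison kind reads as a keyword.
  %0 = arith.cmpi eq, %lhs, %rhs : i32
  // Generic form: the same comparison with the raw attribute spelled out.
  %1 = "arith.cmpi"(%lhs, %rhs) {predicate = 0 : i64} : (i32, i32) -> i1
  return %0, %1 : i1, i1
}
```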

### Regions

#### Attributes of type 'Block'

We considered representing regions through `ArrayAttr`s containing a list of a
special type `IRBlockAttr`, which in turn would contain a list of operations.
All attributes in MLIR are uniqued within the context, which would make the IR
inside the regions immortal for no good reason.

#### Use "inlined" functions as regions

We considered attaching a "force-inline" attribute on a function and/or a
function `call` operation. Even the minimal region support (the use cases in
`affine.for` and `affine.if` existing before regions) requires access to the
values defined in the dominating block, which is not supported by functions.
Conceptually, function bodies are instances of regions rather than the inverse;
regions can also be device kernels, alternative sections, etc.

#### Dedicated `region` operation

This would mean we have a special kind of operation that is allowed to have
regions while other operations are not. Such a distinction is similar to the
Stmt/Op difference we used to have and chose to remove to make the IR simpler
and more flexible. It would also require analyses and passes to consider the
interplay between operations (e.g., an `affine.for` operation must be followed
by a region operation). Finally, a region operation can be introduced using the
current implementation, among other operations and without being special in any
sense.

#### Explicit capture of the values used in a region

Being able to use values defined outside the region implies that use-def chains
may contain uses from different nested regions. Consequently, IR
transformations and analyses can pull the instruction defining the value across
region boundaries, for example in the case of TableGen-defined canonicalization
patterns. This would not be the case if all used values had been passed as
region arguments. One of the motivations for introducing regions in the IR is
precisely to enable cross-region analyses and transformations that are simpler
than inter-procedural transformations. Having uses from different regions
appear in the same use-def chain, as opposed to an additional data structure
maintaining the correspondence between function call arguments (as uses of the
original definitions) and formal arguments (as new definitions), enables this
simplification. Since individual operations now belong to blocks, which belong
to regions, it is always possible to check if the definition of a value belongs
to the same region as its particular use. The risk is that any IR traversal
will need to handle this situation explicitly and it is easy to forget a check
(or conversely it isn't easy to design the right check in a tablegen pattern,
for example): traversing use-def chains potentially crosses semantic barriers
implicitly, making it possible to unknowingly break region semantics. This is
expected to be caught in the verifier after the transformation.

At the same time, one may choose to pass certain or all values as region
arguments to explicitly break the use-def chains in the current proposal. This
can be combined with an attribute-imposed semantic requirement disallowing the
body of the region from referring to any value defined outside it.
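
A small sketch (hypothetical function) of the implicit capture discussed above:
`%scale` is defined outside the `affine.for` region but used inside it, so its
use-def chain crosses the region boundary:

```mlir
func.func @use_outer(%buf : memref<128xf32>, %scale : f32) {
  affine.for %i = 0 to 128 {
    // %scale is not passed as a region argument; it is captured implicitly.
    %v = affine.load %buf[%i] : memref<128xf32>
    %s = arith.mulf %v, %scale : f32
    affine.store %s, %buf[%i] : memref<128xf32>
  }
  return
}
```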

### Dialect type extensions

This section describes the design decisions that shaped the dialect extensible
type system present in MLIR.

#### Interactions between dialects

There are two different interactions between dialects that are important to
understand. When types of a dialect are:

*   In operations of other dialects

    -   For standard/builtin operations, only builtin types are allowed. This
        restriction allows for operations to clearly understand the invariants
        that they are working under.
    -   Outside of standard/builtin operations, dialects are expected to verify
        the allowable operation types per operation.

*   In types of other dialects

    -   For builtin types, these types are allowed to contain types from other
        dialects (see the sketch after this list). This simplifies the type
        system and removes the need for dialects to redefine all of the builtin
        aggregate types, e.g. tensor, as well as the memref type. Dialects are
        expected to verify that a specific type is valid within a builtin type,
        e.g. if a type can be an element of a tensor.
    -   For dialect types, the dialect is expected to verify any type
        invariants, e.g. if the tensor type can contain a specific type of that
        dialect.
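
A sketch of a builtin type containing a dialect type; the "foo" dialect and its
op are hypothetical, and it would be up to "foo" to verify that `!foo.value` is
a valid tensor element type:

```mlir
func.func @mixed() -> tensor<4x!foo.value> {
  // The builtin tensor type wraps an element type from the "foo" dialect.
  %0 = "foo.make"() : () -> tensor<4x!foo.value>
  return %0 : tensor<4x!foo.value>
}
```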

#### Separating builtin and standard types

Following the separation between the built-in and standard dialect, it makes
sense to separate built-in types and standard dialect types. Built-in types are
required for the validity of the IR itself, e.g. the function type (which
appears in function signatures and generic assembly forms of operations).
Integer, float, vector, memref and tensor types, while important, are not
necessary for IR validity.

#### Unregistered types

MLIR supports unregistered operations in generic assembly form. MLIR also
supports a similar concept for types. When parsing, if the dialect for a
dialect type has not been registered, the type is modeled as an 'OpaqueType'.
This allows types to be round-tripped without needing to link in the dialect
library that defined them. No additional information about opaque types,
outside of parsing/printing, will be available.
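
For instance (a sketch; the "bar" dialect is hypothetical and assumed to be
unregistered), the following can be parsed and printed back unchanged, with
`!bar<"baz">` modeled as an `OpaqueType`:

```mlir
func.func @opaque_roundtrip(%arg : !bar<"baz">) -> !bar<"baz"> {
  return %arg : !bar<"baz">
}
```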

#### Dialect type syntax

Dialect extended types are represented as string literals wrapped inside the
dialect namespace. This means that the parser delegates to the dialect for
parsing specific type instances. This differs from the representation of
dialect defined operations, which have an identifier name that the parser uses
to identify and parse them.

This representation was chosen for several reasons:

##### Dialects must provide custom type parsers

Dialect type parsing cannot plug into the existing parser infrastructure as
operations do with the OpAsmParser/Printer. Operations have a defined syntax
structure that is the same across all dialects. Types, on the other hand, may
have many different, and sometimes conflicting, parsing constraints that would
be difficult/unmaintainable to provide within a single interface.

This also has the added benefit of encouraging dialects to reuse existing
external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
type that is simply a wrapper around LLVM types. The LLVM dialect would then
use the existing LLVM type parsing infrastructure.

Example:

```mlir
%s = "foo"() : () -> !llvm<"i32*">
```

##### Types do not always have canonical names

Unlike operations, types generally do not have a formal canonical name. For
example, function types have no defined keyword and integer types are defined
by a regular expression to support arbitrary bitwidth. Dialects with existing
type systems, e.g. LLVM, are likely to provide wrappers around their existing
type systems. For these wrapper types there is no simple canonical name; it's
logical to think of these types as existing within the namespace of the
dialect. If a dialect wishes to assign a canonical name to a type, it can be
done via [type aliases](../LangRef.md/#type-aliases).
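
A sketch of a type alias giving a canonical-looking name to a type (the alias
name is illustrative; older MLIR versions spelled this with an extra `type`
keyword, i.e. `!avx_m128 = type vector<4xf32>`):

```mlir
!avx_m128 = vector<4xf32>

func.func @use_alias(%a : !avx_m128, %b : !avx_m128) -> !avx_m128 {
  %0 = arith.addf %a, %b : !avx_m128
  return %0 : !avx_m128
}
```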

### Tuple types

The MLIR type system provides first class support for defining
[tuple types](../Dialects/Builtin/#tupletype). This is due to the fact that
`Tuple` represents a universal concept that is likely to, and has already begun
to, present itself in many different dialects. Though this type is first class
in the type system, it merely serves to provide a common mechanism in which to
represent this concept in MLIR. As such, MLIR provides no standard operations
for interfacing with `tuple` types. It is up to dialect authors to provide
operations, e.g. `extract_tuple_element`, to interpret and manipulate them.
When possible, operations should prefer to use multiple results instead. These
provide a myriad of benefits, such as alleviating any need for tuple-extract
operations that merely get in the way of analysis and transformation.
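
A sketch with a hypothetical dialect ("mydialect" and its op are illustrative,
since MLIR itself provides no tuple operations):

```mlir
func.func @tuple_user(%t : tuple<i32, f32>) -> f32 {
  // A dialect-provided op extracts element #1 from the builtin tuple type.
  %e = "mydialect.extract_tuple_element"(%t) {index = 1 : i32}
         : (tuple<i32, f32>) -> f32
  return %e : f32
}
```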

### Assembly forms

MLIR supports both generic and custom assembly forms based on the following
considerations:

MLIR is an open system; it is designed to support modular and pluggable
dialects. Depending on whether there exists a corresponding dialect and whether
the dialect is plugged in, operations may or may not be registered into the
MLIR system. Yet we still need a way to investigate these operations. So the
generic assembly form is mandated by this aspect of the MLIR system: it
provides a default textual form for operations.

On the other hand, an assembly form is for assisting developers in
investigating the IR. The generic form serves as a safe fallback but it can be
too verbose for certain ops. Therefore, MLIR gives each dialect the choice to
define a custom assembly form for each operation according to the operation's
semantics and specific needs. The custom assembly form can de-duplicate
information from the operation to derive a more concise form, thus better
facilitating the comprehension of the IR.
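
For example (illustrative function), the same operation in both forms; the
generic form remains available even when the `arith` dialect is not registered:

```mlir
func.func @both_forms(%a : i32, %b : i32) -> i32 {
  %0 = arith.addi %a, %b : i32                   // Custom assembly form.
  %1 = "arith.addi"(%0, %b) : (i32, i32) -> i32  // Generic assembly form.
  return %1 : i32
}
```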

## Examples

This section describes a few very simple examples that help understand how MLIR
represents computation.

### Non-affine control flow

```c
// A simple linear search in every row of a matrix
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    // dynamic control flow
    if (a[i][j] == key) {
      s[i] = j;
      break;
    }
  }
}
```

The presence of dynamic control flow leads to an inner non-affine function
nested in an outer function that uses affine loops.

```mlir
func.func @search(%A: memref<?x?xi32>, %S: memref<?xi32>, %key : i32) {
  %c0 = arith.constant 0 : index
  %ni = memref.dim %A, %c0 : memref<?x?xi32>
  // This loop can be parallelized
  affine.for %i = 0 to %ni {
    call @search_body(%A, %S, %key, %i)
      : (memref<?x?xi32>, memref<?xi32>, i32, index) -> ()
  }
  return
}

func.func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32,
                       %i : index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %nj = memref.dim %A, %c1 : memref<?x?xi32>
  cf.br ^bb1(%c0 : index)

^bb1(%j: index):
  %p1 = arith.cmpi slt, %j, %nj : index
  cf.cond_br %p1, ^bb2, ^bb5

^bb2:
  %v = memref.load %A[%i, %j] : memref<?x?xi32>
  %p2 = arith.cmpi eq, %v, %key : i32
  cf.cond_br %p2, ^bb3(%j : index), ^bb4

^bb3(%pos: index):
  %pos_i32 = arith.index_cast %pos : index to i32
  memref.store %pos_i32, %S[%i] : memref<?xi32>
  cf.br ^bb5

^bb4:
  %jinc = arith.addi %j, %c1 : index
  cf.br ^bb1(%jinc : index)

^bb5:
  return
}
```

As per the [MLIR spec](../LangRef.md), the restrictions on dimension and symbol
identifiers to be used with the affine.apply operation only apply to accesses
inside `affine.for` and `affine.if` operations. However, an analysis of
accesses inside the called function (`@search_body`) is necessary to determine
if the `%i` loop could be parallelized: such function access analysis is
calling-context sensitive.

### Non-affine loop bounds

Loop bounds that are not affine lead to a nesting of functions as shown below.

```c
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    // Non-affine loop bound for k loop.
    for (k = 0; k < pow(2, j); k++)
       for (l = 0; l < N; l++) {
        // block loop body
        ...
       }
```

```mlir
func.func @outer_nest(%n : index) {
  %c2 = arith.constant 2 : index
  affine.for %i = 0 to %n {
    affine.for %j = 0 to %n {
      %pow = call @pow(%c2, %j) : (index, index) -> index
      call @inner_nest(%pow, %n) : (index, index) -> ()
    }
  }
  return
}

func.func @inner_nest(%m : index, %n : index) {
  affine.for %k = 0 to %m {
    affine.for %l = 0 to %n {
      ...
    }
  }
  return
}
```

### Reference 2D Convolution

The following example illustrates a reference implementation of a 2D
convolution, which uses an integer set `#domain` to represent valid input data
in a dilated convolution.

```mlir
// Dilation factors S0 and S1 can be constant folded if constant at compile time.
#domain = (d0, d1)[S0,S1,S2,S3] : (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0,
                                   S2 - d0 - 1 >= 0, S3 - d1 - 1 >= 0)
// Identity map (shown here for illustration).
#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)

// Affine map from output to input coordinate space.
// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
// S4 = h_pad_low, S5 = w_pad_low
//     %out0 = %oh * %h_stride + %kh * %h_kernel_dilation - %h_pad_low
//     %out1 = %ow * %w_stride + %kw * %w_kernel_dilation - %w_pad_low
#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - S4)
#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - S5)

// Semi-affine map to undilated input coordinate space.
// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
#map2_0 = (d0, d1) [S0, S1] -> (d0 floordiv S0)
#map2_1 = (d0, d1) [S0, S1] -> (d1 floordiv S1)

// Conv2D shapes:
// input:  [batch, input_height, input_width, input_feature]
// kernel: [kernel_height, kernel_width, input_feature, output_feature]
// output: [batch, output_height, output_width, output_feature]
func.func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
             %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
             %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
  affine.for %b = 0 to %batch {
    affine.for %oh = 0 to %output_height {
      affine.for %ow = 0 to %output_width {
        affine.for %of = 0 to %output_feature {
          affine.for %kh = 0 to %kernel_height {
            affine.for %kw = 0 to %kernel_width {
              affine.for %if = 0 to %input_feature {
                // Calculate input indices.
                %1_0 = affine.apply #map1_0 (%oh, %ow, %kh, %kw)
                  [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
                   %h_pad_low, %w_pad_low]
                %1_1 = affine.apply #map1_1 (%oh, %ow, %kh, %kw)
                  [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
                   %h_pad_low, %w_pad_low]

                // Check if the access is not in the padding.
                affine.if #domain(%1_0, %1_1)
                    [%h_base_dilation, %w_base_dilation, %h_bound, %w_bound] {
                  %2_0 = affine.apply #map2_0 (%1_0, %1_1)
                    [%h_base_dilation, %w_base_dilation]
                  %2_1 = affine.apply #map2_1 (%1_0, %1_1)
                    [%h_base_dilation, %w_base_dilation]
                  // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
                  call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
                }
              }
            }
          }
        }
      }
    }
  }
  return
}
```

TODO: Add more examples showing the IR for a variety of interesting cases.

## Design alternatives and extensions

This is a list of some design alternatives and extensions that we discussed in
detail but did not include in the spec or postponed for future consideration on
demand. We will revisit these discussions when we have more implementation
experience and learn more about the challenges and limitations of our current
design in practice.

### Polyhedral code representation alternatives: schedule lists vs schedule trees vs affine loop/if forms

The current MLIR uses a representation of polyhedral schedules using a tree of
if/for loops. We extensively debated the tradeoffs involved in the typical
unordered polyhedral instruction representation (where each instruction has
multidimensional schedule information), discussed the benefits of schedule tree
forms, and eventually decided to go with a syntactic tree of affine if/else
conditionals and affine for loops. Discussion of the tradeoff was captured in
this document:
[MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).

At a high level, we have two alternatives here:

1.  Schedule tree representation instead of an affine loop AST form: The
    current proposal uses an affine loop and conditional tree form, which is
    syntactic and has no separation of domains (as sets) and schedules (as
    multidimensional affine functions). A schedule tree form however makes
    polyhedral domains and schedules first class concepts in the IR, allowing
    compact expression of transformations through the schedule tree without
    changing the domains of instructions. Such a representation also hides
    prologues, epilogues, partial tiles, complex loop bounds and conditionals,
    making loop nests free of "syntax". Cost models instead look at domains and
    schedules. In addition, if necessary, such a domain/schedule representation
    can be normalized to explicitly propagate the schedule into domains and
    model all the cleanup code. An example and more detail on the schedule tree
    form is in the next section.
1.  Having two different forms of "affine regions": an affine loop tree form
    and a polyhedral schedule tree form. In the latter, ops could carry
    attributes capturing domain, scheduling, and other polyhedral code
    generation options with IntegerSet, AffineMap, and other attributes.

#### Schedule Tree Representation for Affine Regions

This representation is based on a simplified form of the domain/schedule
representation used by the polyhedral compiler community. Domains represent
what has to be executed while schedules represent the order in which domain
elements are interleaved. We model domains as non-piece-wise convex integer
sets, and schedules as affine functions; however, the former can be
disjunctive, and the latter can be piece-wise affine relations. In the schedule
tree representation, the domain and schedules for instructions are represented
in a tree-like structure called a schedule tree. Each non-leaf node of the tree
is an abstract polyhedral dimension corresponding to an abstract fused loop for
each ML instruction that appears in that branch. Each leaf node is an ML
Instruction.

```mlir
// A tiled matmul code (128x128x128) represented in schedule tree form

// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
#intset_ij = (i, j) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N - 1 >= 0
#intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
                                    -j + M - 1 >= 0, k >= 0, -k + N - 1 >= 0
func.func @matmul(%A, %B, %C, %M, %N, %K) : (...)  { // %M, N, K are symbols
  // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops
  mldim %t1 : {S1,S2,S3,S4,S5}  floordiv (i, 128) {
    mldim %t2 : {S1,S2,S3,S4,S5}  floordiv (j, 128) {
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call @dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
          with #intset_ij(%i, %j) [%M, %N, %K]
      mldim %t3 :   {S2,S3,S4,S5} floordiv (k, 128) {
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
        call @dma_mem_to_scratchpad(%A, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
        call @dma_mem_to_scratchpad(%B, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        mldim %t4 : {S4} i mod 128 {
          mldim %t5 : {S4} j mod 128 {
            mldim %t6 : {S4} k mod 128 {
              // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
              call @matmul_body(%A, %B, %C, %i, %j, %k, %M, %N, %K)
                  with #intset_ijk(%i, %j, %k) [%M, %N, %K]
            } // end mldim t6
          } // end mldim t5
        } // end mldim t4
      } // end mldim t3
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call @dma_scratchpad_to_mem_C ... with #intset_ij(%i, %j) [%M, %N, %K]
    }  // end mldim t2
  } // end mldim t1
  return
}
```

### Affine Relations

The current MLIR spec includes affine maps and integer sets, but not affine
relations. Affine relations are a natural way to model read and write access
information, which can be very useful to capture the behavior of external
library calls where no implementation is available, high-performance vendor
libraries, or user-provided / user-tuned routines.

An affine relation is a relation between input and output dimension identifiers
while being symbolic on a list of symbolic identifiers and with affine
constraints on the identifiers.

Syntax:

```
// Affine relation definition at the top of file
affine-rel-def ::= affine-rel-id `=` affine-relation-inline

affine-rel-id ::= `##` prefixed-id

affine-relation-inline ::=
       `(` input-dims `)` (`[` symbols `]`)? `->`
       `(` output-dims `)` :  affine-constraint-conjunction

input-dims ::= bare-id-list
output-dims ::= bare-id-list
symbols ::= bare-id-list

affine-rel ::= affine-rel-id | affine-relation-inline

// Usage
affine-rel-spec ::= affine-rel dim-and-symbol-use-list
```

All identifiers appearing in input-dims, output-dims, and symbol-dims are
pairwise distinct. All affine-constraint non-terminals in the above syntax are
allowed to contain identifiers only from input-dims, output-dims, and
symbol-dims.

Affine relations are used to model read, write, may_read, and may_write sets of
functions in the IR. The output dimension identifiers correspond to the data
dimensions.

Example:

```mlir
// read relation: two elements ( d0 <= r0 <= d0+1 )
##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0

func.func @count(%A : memref<128xf32>, %pos : index) -> f32
  reads: {%A ##aff_rel9 (%pos)}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */ {
  %val0 = affine.load %A[%pos] : memref<128xf32>
  %val1 = affine.load %A[%pos + 1] : memref<128xf32>
  %p = arith.mulf %val0, %val1 : f32
  return %p : f32
}
```

### Regions

#### Making function definition an operation

MLIR supports values of a Function type. Instead of having a first-class IR
concept for functions, one could define an operation with a body region that
defines a function value. The particularity of functions is that their names
are globally visible and can be referred to before being defined, unlike SSA
values that must be defined first. Implementing a "function definition"
operation would require relaxing some of the SSA constraints in a region, and
also making the IR Module a region as well. It would also affect the core
infrastructure (e.g., function passes) only for the sake of concept
unification.

#### Having types on a region

Instead of inspecting the types of arguments of the first block, one could give
the region itself a type. This type would be redundant with block argument
types, which must be provided anyway, and would create room for type
mismatches. While functions do have types that are partly redundant with the
arguments of the first block in the function, this is necessary to support
function declarations that do not have a body which we can refer to in order
to obtain the argument types. A region is always contained in an operation or
a function that can be queried to obtain the "type" of the region if necessary.

A type on a region can be justified if Regions were to be considered separately
from the enclosing entity (operation or function) and had their own semantics
that should be checked.

#### Attaching attributes to regions

Regions could be annotated with dialect attributes to use attribute
verification hooks. An operation could take multiple regions as arguments, and
each of them may require different attributes. However, there are currently
very few practical cases where this would be necessary. Instead, one could
simulate per-region attributes with array attributes attached to the entity
containing the region (operation or function). This decreases the overall
complexity of the IR and enables more concise and op-specific forms, e.g., when
all regions of an op have the same attribute that can be only mentioned once.
Since the semantics of the region is entirely defined by the enclosing entity,
it also makes sense to have attributes attached to that entity rather than to
the region itself.

This can be reconsidered in the future if we see a non-negligible number of use
cases.

### Read/Write/May_Read/May_Write sets for External Functions

Having read, write, may_read, and may_write sets for external functions, which
include opaque ones, high-performance vendor libraries such as cuDNN, CUB, and
MKL, FFT libraries, user-provided/optimized functions, or data movement
runtimes such as DMA ones, is a powerful feature. It allows the compiler to
perform analysis and composition/transformation in the presence of such calls
and with loops around such calls on sub-tensors. For user-provided or custom
hand-tuned functions, the read/write/may_read/may_write sets could be provided
a priori by a user as part of the external function signature or they could be
part of a database.

TODO: Design this, and update to use function attribute syntax.

Example:

```mlir
##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1

func.func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
  -> f32 [
  reads: {%M ##rel9()}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

func.func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
    %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32,
    #layout_map0>) [
  reads: {%a ##rel9()}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
 ]
```

### Memref Extensions

1.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
    dimensions where there is symmetry: use integer sets (affine constraints)
    to model tensor data space (instead of just extents). Requires some changes
    to the IR and the in-memory form.
1.  Layout maps

    1.  Allow piece-wise affine maps for layouts: allows clean modeling of
        boundary cases for images/tensors through padding, wrapping, mirroring,
        padding where padded values are the results of computation as opposed
        to data, padding in the interior as opposed to just boundaries.
    1.  Allow many-to-one layout maps: Index and layout maps in the current
        proposal are bijective. Extending them to many-to-one layout maps
        allows cleaner(?) modeling of broadcast/reduce style computations while
        reusing memory.

    Proposal 2(a) requires non-trivial changes to the IR and the in-memory
    representation. 2(b) requires no change, but impacts how cost models look
    at index and layout maps.

### `affine.if` and `affine.for` Extensions for "Escaping Scalars"

We considered providing a representation for SSA values that are live out of
`if/else` conditional bodies and loop carried in `affine.for` loops. We
ultimately abandoned this approach due to its complexity. In the current design
of MLIR, scalar variables cannot escape for loops or if instructions. In
situations where escaping is necessary, we use zero-dimensional tensors and
memrefs instead of scalars.
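
A sketch of that workaround (illustrative function): the running sum lives in a
0-d memref because the scalar itself cannot escape the loop body:

```mlir
func.func @sum_with_0d_memref(%A : memref<128xf32>) -> f32 {
  %acc = memref.alloc() : memref<f32>
  %cf0 = arith.constant 0.0 : f32
  memref.store %cf0, %acc[] : memref<f32>
  affine.for %i = 0 to 128 {
    %v = affine.load %A[%i] : memref<128xf32>
    %partial = memref.load %acc[] : memref<f32>
    %sum = arith.addf %partial, %v : f32
    memref.store %sum, %acc[] : memref<f32>
  }
  %res = memref.load %acc[] : memref<f32>
  return %res : f32
}
```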

**TODO**: This whole section is obsolete and should be updated to use block
arguments and a yield-like terminator in for/if instructions.

The abandoned design of supporting escaping scalars is as follows:

#### affine.for Instruction

Syntax:

```
[<out-var-list> =]
for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
   [with <in-var-list>] { <loop-instruction-list> }
```

out-var-list is a comma-separated list of SSA values defined in the loop body
and used outside the loop body. in-var-list is a comma-separated list of SSA
values used inside the loop body and their initializers. loop-instruction-list
is a list of instructions that may also include a yield instruction.

Example:

```mlir
// Return the sum of the elements in 1-dimensional memref A.
func.func @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
   %init = arith.constant 0 : i32
   %result = affine.for %i = 0 to %N with %tmp(%init) {
      %value = affine.load %A[%i] : memref<?xi32>
      %sum = arith.addi %value, %tmp : i32
      yield %sum : i32
   }
   return %result : i32
}
```

#### affine.if/else Instruction

Syntax:

```
<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
```

out-var-list is a list of SSA values defined by the if-instruction. The values
are arguments to the yield-instruction that occurs in both the then and else
clauses when the else clause is present. When the if instruction contains only
the if clause, the escaping value defined in the then clause should be merged
with the value the variable had before the if instruction. The design captured
here does not handle this situation.

Example:

```mlir
// Compute the sum of half of the array
func.func @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
   %s0 = arith.constant 0 : i32
   %s1 = affine.for %i = 1 ... %N step 1 with %s2 (%s0) {
       %s3 = affine.if (%i >= %N / 2) {
          %v0 = affine.load %A[%i] : memref<?xi32>
          %s4 = arith.addi %s2, %v0 : i32
          yield %s4 : i32
       }
       yield %s3 : i32
   }
   return %s1 : i32
}
```

### Multithreading the compiler

People want compilers to go fast, and one simple way to do that is to
multi-thread them. There are multiple strategies for this, but a simple one is
to optimize and compile separate functions in parallel. LLVM's original pass
manager anticipated this demand, and the CallGraphSCCPass manager is even
designed to support this, but unfortunately a few early design decisions in
LLVM prevent this from ever happening. Instead, things like ThinLTO are forced
to split programs into separate LLVM modules/contexts and optimize those chunks
independently.

The problem is that LLVM has several objects in its IR that are globally
uniqued and also mutable: notably constants like `i32 0`. In LLVM, these
constants are `Value`s, which allows them to be used as operands to
instructions, and means that they also have SSA use lists. Because these things
are uniqued, every `i32 0` in any function shares a use list. This means that
optimizing multiple functions in parallel won't work (at least without some
sort of synchronization on the use lists, which would be unbearably
inefficient).

MLIR now supports a multithreaded pass manager. We do this through several
design choices:

1.  MLIR makes extensive use of uniqued immutable data structures (affine
    expressions, types, etc. are all immutable, uniqued, and immortal).
2.  Constants are defined in per-operation pools, instead of being globally
    uniqued.
3.  Functions, and other global-like operations, themselves are not SSA values
    either, so they don't have the same problem as constants.
4.  Passes are copied (through their copy ctor) into one instance per
    thread, avoiding sharing of local state across threads.

This allows MLIR passes to support efficient multithreaded compilation
and code generation.
1066