# MLIR Rationale

This document is intended to capture some of the alternatives considered and
open debates in the design of MLIR, along with the rationale for certain
decisions we made. This is not intended to be a "finely groomed" document - we
prefer the ability to dump in interesting tidbits without worrying too much
about their consistency or readability.

[TOC]

## Abstract

MLIR is a compiler intermediate representation with similarities to traditional
three-address SSA representations (like
[LLVM IR](http://llvm.org/docs/LangRef.html) or
[SIL](https://github.com/apple/swift/blob/main/docs/SIL.rst)), but which
introduces notions from polyhedral loop optimization as first class concepts.
This hybrid design is optimized to represent, analyze, and transform high level
dataflow graphs as well as target-specific code generated for high performance
data parallel systems. Beyond its representational capabilities, its single
continuous design provides a framework to lower from dataflow graphs to high
performance target specific code.

MLIR stands for one of "Multi-Level IR", "Multi-dimensional Loop IR", "Machine
Learning IR", or "Mid Level IR"; we prefer the first. This document only
provides the rationale behind MLIR -- its actual
[specification document](../LangRef.md) and other content is hosted elsewhere.

## Introduction and Motivation

The Multi-Level Intermediate Representation (MLIR) is intended for easy
expression and optimization of computations involving deep loop nests and dense
matrices of high dimensionality. It is thus well-suited to deep learning
computations in particular. Yet it is general enough to also represent arbitrary
sequential computation. The representation allows high-level optimization and
parallelization for a wide range of parallel architectures including those with
deep memory hierarchies --- general-purpose multicores, GPUs, and specialized
neural network accelerators.

MLIR uses ideas drawn from the IRs of LLVM and Swift for lower level constructs
while combining them with ideas from the polyhedral abstraction to represent
loop nests, multidimensional data (tensors), and transformations on these
entities as first class concepts in the IR.

MLIR is a multi-level IR, i.e., it represents code at levels ranging from
domain-specific representations such as HLO or TensorFlow graphs all the way
down to the machine level. MLIR is able to represent arbitrary control flow and
arbitrary data accesses, and is general enough to represent nearly all
sequential computation. This is a key distinction from existing polyhedral
representation implementations (such as LLVM [Polly](https://polly.llvm.org/))
that are able to use the polyhedral abstraction in a way isolated from the LLVM
IR and only for affine loop nests, i.e., portions of the code where array
accesses, loop bounds, and conditionals are regular (involve linear functions of
loop iterators and constant symbols). The presence of statically unpredictable
data accesses or control flow does not preclude representation in MLIR, but only
limits to a certain extent the ability to reason about and apply transformations
using the polyhedral abstraction.

Maps, sets, and relations with affine constraints are the core structures
underlying a polyhedral representation of high-dimensional loop nests and
multidimensional arrays. These structures are represented as textual expressions
in a form close to their mathematical form.
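
For instance, an affine map and an integer set could be written as follows (an
illustrative sketch; the names `#tile` and `#triangle` are made up):

```mlir
// Maps a point (i, j) to its tile position and its position within the tile.
#tile = affine_map<(i, j) -> (i floordiv 128, j floordiv 128,
                              i mod 128, j mod 128)>
// The triangular set of points (i, j) with j <= i, bounded by the symbol N.
#triangle = affine_set<(i, j)[N] : (i >= 0, j >= 0, N - 1 - i >= 0, i - j >= 0)>
```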

These structures are used to capture loop nests, tensor data structures, and how
they are reordered and mapped for a target architecture. All structured or
"conforming" loops are captured as part of the polyhedral information, and so
are tensor variables, their layouts, and subscripted accesses to these tensors
in memory.

The information captured in the IR allows a compact expression of all loop
transformations, data remappings, explicit copying necessary for explicitly
addressed memory in accelerators, mapping to pre-tuned expert-written
primitives, and mapping to specialized vector instructions. Loop transformations
that can be easily implemented include the full set of affine transformations:
these subsume all traditional loop transformations (unimodular and
non-unimodular) such as loop tiling, interchange, permutation, skewing, scaling,
relative shifting, reversal, fusion, and distribution/fission. Transformations
on data layout such as padding and transforming to blocked layouts are also
represented well via affine layout maps.

MLIR's design allows a progressive lowering to target-specific forms. Besides
high-level transformations for loop nests and data layouts that a typical
mid-level optimizer is expected to deal with, MLIR is also designed to perform
certain low-level scheduling and mapping decisions that a typical backend IR is
entrusted with: these include mapping to specialized vector instructions,
auto-vectorization, and software pipelining. The need to support these
transformations stems from the fact that neural network accelerators have
specialized units that deal with large chunks of data whose computation maps
back to chunks of more than one loop of the loop nests as viewed by a program at
a level closer to the original specification. Such specialized units or
instructions operate on multidimensional data chunks from a programmer's
viewpoint. This makes it hard or infeasible for a backend operating on a very
low-level IR close to assembly to lift and reconstruct loops and perform such a
mapping. This is in contrast to classic instruction selection and scheduling in
today's compilers, which primarily deal with the body of the innermost loop.
MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor
libraries operating on data at higher levels (or at the highest level) of the
memory hierarchy.

In summary, MLIR is convenient for and closed under the kind of transformations
needed to lower to general-purpose as well as specialized accelerators. It also
allows one to build modular and reusable target independent and target dependent
passes.

## Design Decisions

This section sheds light on some of the design decisions -- some of these are
indirectly implied by the specification document.

### Loads and stores

The 'load' and 'store' instructions are specifically crafted to fully resolve to
an element of a memref. These instructions take as arguments the memref plus n
indices for an n-ranked memref. This disallows the equivalent of pointer
arithmetic or the ability to index into the same memref in other ways (something
which C arrays allow for example). Furthermore, for the affine constructs, the
compiler can follow use-def chains (e.g. through
[affine.apply operations](../Dialects/Affine.md/#affineapply-affineapplyop) or
through the map attributes of
[affine operations](../Dialects/Affine.md/#operations)) to precisely analyze
references at compile-time using polyhedral techniques. This is possible because
of the
[restrictions on dimensions and symbols](../Dialects/Affine.md/#restrictions-on-dimensions-and-symbols).
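
For example, the subscript below flows through an `affine.apply`, so the access
fully resolves to a single element and can be analyzed exactly (an illustrative
sketch):

```mlir
func.func @shift_read(%A: memref<100xf32>, %i: index) -> f32 {
  // The subscript is an affine function of %i that the compiler can recover
  // by following the use-def chain of %idx; no pointer arithmetic exists.
  %idx = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
  %v = memref.load %A[%idx] : memref<100xf32>
  return %v : f32
}
```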

A scalar of element-type (a primitive type or a vector type) that is stored in
memory is modeled as a 0-d memref. This is also necessary for scalars that are
live out of for loops and if conditionals in a function, for which we don't yet
have an SSA representation --
[an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to allow
that is described later in this doc.

### Symbols and types

The current MLIR disallows use of symbols in types. For example, when a tensor
or memref dimension is statically unknown, it is denoted in the type as '?'. An
SSA symbol is then bound to it when a memref is created. The actual value of the
unknown dimension can be queried using the `memref.dim` operation as shown
below.

Example:

```mlir
func.func @foo(...) {
  %A = memref.alloc(%N) : memref<8x?xf32, #lmap>
  ...
  call @bar(%A) : (memref<8x?xf32, #lmap>) -> ()
}

func.func @bar(%A : memref<8x?xf32, #lmap>) {
  // The type of %A indicates that %A has 8 rows and an unknown number of
  // columns. The number of columns is queried dynamically using memref.dim.
  %N = memref.dim %A, 1 : memref<8x?xf32, #lmap>
  %cf1 = arith.constant 1.0 : f32

  affine.for %i = 0 to 8 {
    affine.for %j = 0 to %N {
      // A[i,j] += 1
      %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
      %s2 = arith.addf %s1, %cf1 : f32
      affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
    }
  }
  return
}
```

An alternative design is to embed the reference to symbols directly in the
type - memref<8x%Nxf32>. We went for the current approach in MLIR because it
simplifies the design --- types remain immutable when the values of symbols
change.

### Block Arguments vs PHI nodes

MLIR Regions represent SSA using "[block arguments](../LangRef.md/#blocks)"
rather than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in
LLVM. This choice is representationally identical (the same constructs can be
represented in either form) but block arguments have several advantages:

1.  LLVM PHI nodes always have to be kept at the top of a block, and
    transformations frequently have to manually skip over them. This is defined
    away with BB arguments.
1.  LLVM has a separate function Argument node. This is defined away with BB
    arguments, because the arguments to the entry block serve this purpose.
1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising and
    confusing to compiler engineers, and it is easy to introduce bugs with this
    (closely related to the
    "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
    problem in the SSA lowering literature). With the BB argument
    representation, this confusion is defined away.
1.  The entry list of PHI nodes in LLVM is unordered, and some blocks have
    thousands of predecessors (e.g. unwind blocks). This can cause long compile
    time problems because transformations have to linearly scan this list. This
    is defined away with the BB argument representation.
1.  LLVM has no way to represent values that are available only in one successor
    but not the other, e.g. its invoke instruction cannot produce the exception
    value JUST on the exception edge. Instead, the
    [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
    is a hack used to represent this. MLIR doesn't make use of this capability,
    but SIL uses it extensively, e.g. in the
    [switch_enum instruction](https://github.com/apple/swift/blob/main/docs/SIL.rst#switch-enum).

For more context, block arguments were previously used in the Swift
[SIL Intermediate Representation](https://github.com/apple/swift/blob/main/docs/SIL.rst),
and described in
[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
interest
[starts here](https://www.youtube.com/watch?v=Ntj8ab-5cvE&t=596s).
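
For example, a loop-carried value is simply an argument of the loop header
block, where LLVM would need a PHI node (an illustrative sketch):

```mlir
func.func @count_up(%n: index) -> index {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  cf.br ^loop(%c0 : index)
^loop(%i: index):                       // would be a PHI node in LLVM
  %done = arith.cmpi sge, %i, %n : index
  cf.cond_br %done, ^exit, ^body
^body:
  %next = arith.addi %i, %c1 : index
  cf.br ^loop(%next : index)            // passes the next value explicitly
^exit:
  return %i : index
}
```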

### Index type usage and limitations

Index types are intended to be used for platform-specific "size" values and may
appear in subscripts, sizes of aggregate types and affine expressions. They are
also tightly coupled with `affine.apply` and affine.load/store operations;
having `index` type is a necessary precondition for a value to be acceptable to
these operations.

We allow `index` types in tensors, vectors, and memrefs because a code
generation strategy has to map `index` to an implementation type and hence needs
to be able to materialize corresponding values. However, the target might lack
support for `vector` values with the target specific equivalent of the `index`
type.
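
For instance, memref sizes and affine induction variables are all of `index`
type (an illustrative sketch):

```mlir
func.func @fill(%A: memref<?xf32>, %v: f32) {
  %c0 = arith.constant 0 : index
  %n = memref.dim %A, %c0 : memref<?xf32>  // sizes always have `index` type
  affine.for %i = 0 to %n {                // the induction variable does too
    affine.store %v, %A[%i] : memref<?xf32>
  }
  return
}
```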

### Data layout of non-primitive types

Data layout information such as the bit width or the alignment of types may be
target and ABI-specific and thus should be configurable rather than imposed by
the compiler. In particular, the layout of compound or `index` types may vary.
MLIR specifies default bit widths for certain primitive *types*, in particular
for integers and floats. It is equal to the number that appears in the type
definition, e.g. the bit width of `i32` is `32`, so is the bit width of `f32`.
The bit width is not *necessarily* related to the amount of memory (in bytes) or
the register size (in bits) that is necessary to store the value of the given
type. For example, `vector<3xi57>` is likely to be lowered to a vector of four
64-bit integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes,
rather than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the
bit width. MLIR makes such [data layout information](../DataLayout.md)
configurable using attributes that can be queried during lowering, for example,
when allocating a compound type.

The data layout of dialect-specific types is undefined at MLIR level. Yet
dialects are free to define their own quantities and make them available via the
data layout infrastructure.

### Integer signedness semantics

Integers in the builtin MLIR type system have a bitwidth (note that the `index`
type has a symbolic width equal to the machine word size), and they *may*
additionally have signedness semantics. The purpose is to satisfy the needs of
different dialects, which can model different levels of abstraction. Certain
abstractions, especially those closer to source languages, might want to
differentiate signedness on integer types, while others, especially those closer
to machine instructions, might want signless integers. Instead of forcing each
abstraction to adopt the same integer model or develop its own in house, the
integer type provides signedness as an option, helping code reuse and
consistency.

For the standard dialect, the choice is to have signless integer types. An
integer value does not have an intrinsic sign, and it's up to the specific op
for interpretation. For example, ops like `arith.addi` and `arith.muli` do two's
complement arithmetic, but some other operations get a sign, e.g. `arith.divsi`
vs `arith.divui`.
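
For example, the same signless operands can be divided either way; the op, not
the type, selects the interpretation (an illustrative sketch):

```mlir
func.func @div_both_ways(%a: i8, %b: i8) -> (i8, i8) {
  %s = arith.divsi %a, %b : i8   // treats the operand bits as signed
  %u = arith.divui %a, %b : i8   // treats the same bits as unsigned
  return %s, %u : i8, i8
}
```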

LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
which was introduced in a revamp rolled out
[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
Prior to that, from
[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
used signed types like "sbyte" and "ubyte". This shift was important and has
served LLVM well over the years. The reason this is important is that it is a
good thing for an intermediate representation to represent the same computation
with the same instruction. Signed types got in the way, because (e.g.) an "add
of an sbyte" does the same computation as an "add of a ubyte", but the type
system made them look artificially different. This split also required casts
like "cast from sbyte to ubyte" which do nothing at the machine level. Removing
signs from the type system eliminated these problems, making the compiler
simpler.

More information about this split is available in an old
[talk on youtube](https://www.youtube.com/watch?v=VeRaLPupGks) talking about
LLVM 2.0.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as possible.

### Splitting floating point vs integer operations

The MLIR "Arith" dialect splits many integer and floating point operations
into different categories, for example `arith.addf` vs `arith.addi` and
`arith.cmpf` vs `arith.cmpi`
([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
These instructions *are* polymorphic on the number of elements in the type
though, for example `addf` is used with scalar floats, vectors of floats, and
tensors of floats (LLVM does the same thing with its scalar/vector types).

This split is important because floating point and integer operations are quite
different in practice: for example, floating point values include NaN's, so
[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
should use different comparison opcodes. On the arithmetic side of things,
floating point operations support rounding modes, floating point contractions,
["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers
may want to have two's complement overflow behavior or be undefined on
[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
for performance.

We are a long way from this sort of thing being a priority to care about in
MLIR, but since we have experience and know the right way to do this, we'd
rather design it in from the beginning.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as possible.

### Specifying sign in integer comparison operations

Since integers are [signless](#integer-signedness-semantics), it is necessary to
define the sign for integer comparison operations. This sign indicates how to
treat the foremost bit of the integer: as a sign bit or as a regular most
significant bit. For example, comparing two `i4` values `0b1000` and `0b0010`
yields different results for unsigned (`8 > 2`) and signed (`-8 < 2`)
interpretations. This difference is only significant for *order* comparisons,
but not for *equality* comparisons. Indeed, for the latter all bits must have
the same value independently of the sign. Since both arguments have exactly the
same bit width and cannot be padded by this operation, it is impossible to
compare two values whose bit representations would differ while the values are
interpreted as equal.

### Specifying comparison kind as attribute

Unlike arithmetic, comparison operators share several common properties, e.g.
they cannot be considered associative. In practice, comparisons are sometimes
implemented by the same instruction or its variants so it makes sense to group
them together at the IR level.

An alternative would be introducing ten distinct operators for all currently
supported kinds of integer comparisons. These operators would have increased the
number of "reserved" names used by standard operations as well as the size of
the C++ API while their implementations would have been mostly identical.

The comparison kind is internally an integer attribute. However, for the sake of
readability by humans, custom assembly form accepts string literals that are
mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies
integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what
gets compared to what else. This syntactic sugar is possible thanks to parser
logic redefinitions for custom assembly form of non-builtin operations.
Supporting it in the full notation would have required changing how the main
parsing algorithm works and may have unexpected repercussions. While it would
have been possible to store the predicate as a string attribute, that would have
made it impossible to implement switching logic based on the comparison kind and
would have made attribute validity checks (one out of ten possible kinds) more
complex.
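
Revisiting the `i4` example above with the predicate spelled as an attribute (an
illustrative sketch):

```mlir
%a = arith.constant -8 : i4              // bit pattern 0b1000
%b = arith.constant 2 : i4               // bit pattern 0b0010
%signed   = arith.cmpi slt, %a, %b : i4  // true:  -8 < 2
%unsigned = arith.cmpi ult, %a, %b : i4  // false:  8 > 2
%equal    = arith.cmpi eq,  %a, %b : i4  // false, independently of the sign
```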

### Regions

#### Attributes of type 'Block'

We considered representing regions through `ArrayAttr`s containing a list of a
special type `IRBlockAttr`, which in turn would contain a list of operations.
All attributes in MLIR are uniqued within the context, which would make the IR
inside the regions immortal for no good reason.

#### Use "inlined" functions as regions

We considered attaching a "force-inline" attribute on a function and/or a
function `call` operation. Even the minimal region support (use cases in
affine.for and affine.if existing before the regions) requires access to the
values defined in the dominating block, which is not supported by functions.
Conceptually, function bodies are instances of regions rather than the inverse;
regions can also be device kernels, alternative sections, etc.

#### Dedicated `region` operation

This would mean we have a special kind of operation that is allowed to have
regions while other operations are not. Such a distinction is similar to the
Stmt/Op difference we used to have and chose to remove to make the IR simpler
and more flexible. It would also require analyses and passes to consider the
interplay between operations (e.g., an `affine.for` operation must be followed
by a region operation). Finally, a region operation can be introduced using the
current implementation, among other operations and without being special in any
sense.

#### Explicit capture of the values used in a region

Being able to use values defined outside the region implies that use-def chains
may contain uses from different nested regions. Consequently, IR transformations
and analyses can pull the instruction defining the value across region
boundaries, for example in case of TableGen-defined canonicalization patterns.
This would not be the case if all used values had been passed as region
arguments. One of the motivations for introducing regions in the IR is precisely
to enable cross-region analyses and transformations that are simpler than
inter-procedural transformations. Having uses from different regions appear in
the same use-def chain, contrary to an additional data structure maintaining
correspondence between function call arguments as uses of the original
definitions and formal arguments as new definitions, enables such
simplification. Since individual operations now belong to blocks, which belong
to regions, it is always possible to check if the definition of the value
belongs to the same region as its particular use. The risk is that any IR
traversal will need to handle this situation explicitly, and it is easy to
forget a check (or, conversely, it isn't easy to design the right check in a
TableGen pattern): traversing use-def chains potentially crosses implicit
semantic barriers, making it possible to unknowingly break region semantics.
This is expected to be caught in the verifier after the transformation.

At the same time, one may choose to pass certain or all values as region
arguments to explicitly break the use-def chains in the current proposal. This
can be combined with an attribute-imposed semantic requirement disallowing the
body of the region to refer to any value from outside it.
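
For example, a value defined in the enclosing region can be used directly inside
an `affine.for` body, so its use-def chain crosses the region boundary (an
illustrative sketch):

```mlir
%x = arith.constant 42.0 : f32
affine.for %i = 0 to 10 {
  // %x is captured implicitly: this use and its definition belong to
  // different regions on the same use-def chain.
  %y = arith.addf %x, %x : f32
}
```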

### Dialect type extensions

This section describes the design decisions that shaped the dialect extensible
type system present in MLIR.

#### Interactions between dialects

There are two different interactions between dialects that are important to
understand. When types of a dialect are:

*   In operations of other dialects

    -   For standard/builtin operations, only builtin types are allowed. This
        restriction allows for operations to clearly understand the invariants
        that they are working under.
    -   Outside of standard/builtin operations, dialects are expected to verify
        the allowable operation types per operation.

*   In types of other dialects

    -   For builtin types, these types are allowed to contain types from other
        dialects. This simplifies the type system and removes the need for
        dialects to redefine all of the builtin aggregate types, e.g. tensor, as
        well as the memref type. Dialects are expected to verify that a specific
        type is valid within a builtin type, e.g. whether a type can be an
        element of a tensor.
    -   For dialect types, the dialect is expected to verify any type
        invariants, e.g. whether the tensor type can contain a specific type of
        that dialect.

#### Separating builtin and standard types

Following the separation between the built-in and standard dialect, it makes
sense to separate built-in types and standard dialect types. Built-in types are
required for the validity of the IR itself, e.g. the function type (which
appears in function signatures and generic assembly forms of operations).
Integer, float, vector, memref and tensor types, while important, are not
necessary for IR validity.

#### Unregistered types

MLIR supports unregistered operations in generic assembly form. MLIR also
supports a similar concept for types. When parsing, if the dialect for the
dialect type has not been registered, the type is modeled as an 'OpaqueType'.
This allows for types to be round-tripped without needing to link in the dialect
library that defined them. No additional information about opaque types, outside
of parsing/printing, will be available.

#### Dialect type syntax

Dialect extended types are represented as string literals wrapped inside of the
dialect namespace. This means that the parser delegates to the dialect for
parsing specific type instances. This differs from the representation of dialect
defined operations, which have an identifier name that the parser uses to
identify and parse them.

This representation was chosen for several reasons:

##### Dialects must provide custom type parsers

Dialect type parsing cannot plug into the existing parser infrastructure as
operations do with the OpAsmParser/Printer. Operations have a defined syntax
structure that is the same across all dialects. Types, on the other hand, may
have many different, and sometimes conflicting, parsing constraints that would
be difficult/unmaintainable to provide within a single interface.

This also has the added benefit of encouraging dialects to reuse existing
external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
type that is simply a wrapper around LLVM types. The LLVM dialect would then use
the existing LLVM type parsing infrastructure.

Example:

```mlir
%s = "foo"() : () -> !llvm<"i32*">
```

##### Types do not always have canonical names

Unlike operations, types generally do not have a formal canonical name. For
example, function types have no defined keyword and integer types are defined by
a regular expression to support arbitrary bitwidth. Dialects with existing type
systems, e.g. LLVM, are likely to provide wrappers around their existing type
systems. For these wrapper types there is no simple canonical name; it's logical
to think of these types as existing within the namespace of the dialect. If a
dialect wishes to assign a canonical name to a type, it can be done via
[type aliases](../LangRef.md/#type-aliases).
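
For instance, a type alias can give a type a short canonical-looking name (an
illustrative sketch; `!avx_m128` is a made-up alias, and the exact alias syntax
is described in the specification):

```mlir
// Alias definition at the top of the file ...
!avx_m128 = vector<4xf32>
// ... lets operations refer to the type by the aliased name.
%v = "foo.load_vector"() : () -> !avx_m128
```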

### Tuple types

The MLIR type system provides first class support for defining
[tuple types](../Dialects/Builtin/#tupletype). This is due to the fact that
`Tuple` represents a universal concept that is likely to, and has already begun
to, present itself in many different dialects. Though this type is first class
in the type system, it merely serves to provide a common mechanism in which to
represent this concept in MLIR. As such, MLIR provides no standard operations
for interfacing with `tuple` types. It is up to dialect authors to provide
operations, e.g. extract_tuple_element, to interpret and manipulate them. When
possible, operations should prefer to use multiple results instead. These
provide a myriad of benefits, such as alleviating any need for tuple-extract
operations that merely get in the way of analysis and transformation.

### Assembly forms

MLIR decides to support both generic and custom assembly forms under the
following considerations:

MLIR is an open system; it is designed to support modular and pluggable
dialects. Depending on whether there exists a corresponding dialect and whether
the dialect is plugged in, operations may or may not be registered into the MLIR
system. Yet we still need a way to investigate these operations. So the generic
assembly form is mandated by this aspect of the MLIR system. It provides a
default textual form for operations.

On the other hand, an assembly form is for assisting developers to investigate
the IR. The generic form serves as a safe fallback but it can be too verbose for
certain ops. Therefore, MLIR gives each dialect the choice to define a custom
assembly form for each operation according to the operation's semantics and
specific needs. The custom assembly form can de-duplicate information from the
operation to derive a more concise form, thus better facilitating the
comprehension of the IR.
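
For example, the same addition can be written in both forms (an illustrative
sketch):

```mlir
// Custom (pretty) assembly form defined by the arith dialect:
%sum = arith.addf %a, %b : f32
// The same operation in the generic form, which is available for every
// operation, whether its dialect is registered or not:
%sum2 = "arith.addf"(%a, %b) : (f32, f32) -> f32
```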

## Examples

This section describes a few very simple examples that help understand how MLIR
represents computation.

### Non-affine control flow

```c
// A simple linear search in every row of a matrix
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    // dynamic control flow
    if (a[i][j] == key) {
      s[i] = j;
      break;
    }
  }
}
```

The presence of dynamic control flow leads to an inner non-affine function
nested in an outer function that uses affine loops.

```mlir
func.func @search(%A: memref<?x?xi32>, %S: memref<?xi32>, %key : i32) {
  %ni = memref.dim %A, 0 : memref<?x?xi32>
  // This loop can be parallelized
  affine.for %i = 0 to %ni {
    call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32) -> ()
  }
  return
}

func.func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i : i32) {
  %nj = memref.dim %A, 1 : memref<?x?xi32>
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  cf.br ^bb1(%c0 : i32)

^bb1(%j: i32):
  %p1 = arith.cmpi slt, %j, %nj : i32
  cf.cond_br %p1, ^bb2, ^bb5

^bb2:
  %v = affine.load %A[%i, %j] : memref<?x?xi32>
  %p2 = arith.cmpi eq, %v, %key : i32
  cf.cond_br %p2, ^bb3(%j : i32), ^bb4

^bb3(%j0: i32):
  affine.store %j0, %S[%i] : memref<?xi32>
  cf.br ^bb5

^bb4:
  %jinc = arith.addi %j, %c1 : i32
  cf.br ^bb1(%jinc : i32)

^bb5:
  return
}
```

As per the [MLIR spec](../LangRef.md), the restrictions on dimensions and symbol
identifiers to be used with the affine.apply operation only apply to accesses
inside `affine.for` and `affine.if` operations. However, an analysis of accesses
inside the called function (`@search_body`) is necessary to determine if the
`%i` loop could be parallelized: such function access analysis is calling
context sensitive.

### Non-affine loop bounds

Loop bounds that are not affine lead to a nesting of functions as shown below.

```c
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    // Non-affine loop bound for k loop.
    for (k = 0; k < pow(2, j); k++)
      for (l = 0; l < N; l++) {
        // block loop body
        ...
      }
```

```mlir
func.func @outer_nest(%n : index) {
  %c2 = arith.constant 2 : index
  affine.for %i = 0 to %n {
    affine.for %j = 0 to %n {
      %pow = call @pow(%c2, %j) : (index, index) -> index
      call @inner_nest(%pow, %n) : ...
    }
  }
  return
}

func.func @inner_nest(%m : index, %n : index) {
  affine.for %k = 0 to %m {
    affine.for %l = 0 to %n {
      ...
    }
  }
  return
}
```

### Reference 2D Convolution

The following example illustrates a reference implementation of a 2D
convolution, which uses an integer set `#domain` to represent valid input data
in a dilated convolution.

```mlir
// Dilation factors S0 and S1 can be constant folded if constant at compile time.
#domain = (d0, d1)[S0, S1, S2, S3] : (d0 mod S0 == 0, d1 mod S1 == 0,
                                      d0 >= 0, d1 >= 0,
                                      S2 - d0 - 1 >= 0, S3 - d1 - 1 >= 0)
// Identity map (shown here for illustration).
#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)

// Affine map from output to input coordinate space.
// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
// S4 = h_pad_low, S5 = w_pad_low
//   %1_0 = %oh * %h_stride + %kh * %h_kernel_dilation - %h_pad_low
//   %1_1 = %ow * %w_stride + %kw * %w_kernel_dilation - %w_pad_low
#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - S4)
#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - S5)

// Semi-affine map to undilated input coordinate space.
// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
#map2_0 = (d0, d1) [S0, S1] -> (d0 floordiv S0)
#map2_1 = (d0, d1) [S0, S1] -> (d1 floordiv S1)

// Conv2D shapes:
// input:  [batch, input_height, input_width, input_feature]
// kernel: [kernel_height, kernel_width, input_feature, output_feature]
// output: [batch, output_height, output_width, output_feature]
func.func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
             %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
             %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
  affine.for %b = 0 to %batch {
    affine.for %oh = 0 to %output_height {
      affine.for %ow = 0 to %output_width {
        affine.for %of = 0 to %output_feature {
          affine.for %kh = 0 to %kernel_height {
            affine.for %kw = 0 to %kernel_width {
              affine.for %if = 0 to %input_feature {
                // Calculate input indices.
                %1_0 = affine.apply #map1_0 (%oh, %ow, %kh, %kw)
                    [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
                     %h_pad_low, %w_pad_low]
                %1_1 = affine.apply #map1_1 (%oh, %ow, %kh, %kw)
                    [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
                     %h_pad_low, %w_pad_low]

                // Check if access is not in padding.
                affine.if #domain(%1_0, %1_1)
                    [%h_base_dilation, %w_base_dilation, %h_bound, %w_bound] {
                  %2_0 = affine.apply #map2_0 (%1_0, %1_1)
                      [%h_base_dilation, %w_base_dilation]
                  %2_1 = affine.apply #map2_1 (%1_0, %1_1)
                      [%h_base_dilation, %w_base_dilation]
                  // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
                  call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
                }
              }
            }
          }
        }
      }
    }
  }
  return
}
```

TODO: (Add more examples showing the IR for a variety of interesting cases)

## Design alternatives and extensions

This is a list of some design alternatives and extensions that we discussed in
detail but did not include in the spec or postponed them for future
consideration on demand. We will revisit these discussions when we have more
implementation experience and learn more about the challenges and limitations of
our current design in practice.

### Polyhedral code representation alternatives: schedule lists vs schedule trees vs affine loop/if forms

The current MLIR uses a representation of polyhedral schedules using a tree of
if/for loops. We extensively debated the tradeoffs involved in the typical
unordered polyhedral instruction representation (where each instruction has
multidimensional schedule information), discussed the benefits of schedule tree
forms, and eventually decided to go with a syntactic tree of affine if/else
conditionals and affine for loops. Discussion of the tradeoff was captured in
this document:
[MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).

At a high level, we have two alternatives here:

1.  Schedule tree representation instead of an affine loop AST form: The current
    proposal uses an affine loop and conditional tree form, which is syntactic
    and with no separation of domains as sets and schedules as multidimensional
    affine functions. A schedule tree form however makes polyhedral domains and
    schedules a first class concept in the IR allowing compact expression of
    transformations through the schedule tree without changing the domains of
    instructions. Such a representation also hides prologues, epilogues, partial
    tiles, complex loop bounds and conditionals making loop nests free of
    "syntax". Cost models instead look at domains and schedules. In addition, if
    necessary such a domain schedule representation can be normalized to
    explicitly propagate the schedule into domains and model all the cleanup
    code. An example and more detail on the schedule tree form is in the next
    section.
1.  Having two different forms of "affine regions": an affine loop tree form and
    a polyhedral schedule tree form. In the latter, ops could carry attributes
    capturing domain, scheduling, and other polyhedral code generation options
    with IntegerSet, AffineMap, and other attributes.

#### Schedule Tree Representation for Affine Regions

This representation is based on a simplified form of the domain/schedule
representation used by the polyhedral compiler community. Domains represent what
has to be executed while schedules represent the order in which domain elements
are interleaved. We model domains as non-piece-wise convex integer sets, and
schedules as affine functions; however, the former can be disjunctive, and the
latter can be piece-wise affine relations.
In the schedule tree representation,
domain and schedules for instructions are represented in a tree-like structure
which is called a schedule tree. Each non-leaf node of the tree is an abstract
polyhedral dimension corresponding to an abstract fused loop for each ML
instruction that appears in that branch. Each leaf node is an ML Instruction.

```mlir
// A tiled matmul code (128x128x128) represented in schedule tree form

// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
#intset_ij = (i, j) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N - 1 >= 0
#intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
                                    -j + M - 1 >= 0, k >= 0, -k + N - 1 >= 0
func.func @matmul(%A, %B, %C, %M, %N, %K) : (...) { // %M, N, K are symbols
  // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops
  mldim %t1 : {S1, S2, S3, S4, S5} floordiv (i, 128) {
    mldim %t2 : {S1, S2, S3, S4, S5} floordiv (j, 128) {
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
          with #intset_ij(%i, %j) [%M, %N, %K]
      mldim %t3 : {S2, S3, S4, S5} floordiv (k, 128) {
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
        call dma_mem_to_scratchpad(%A, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
        call dma_mem_to_scratchpad(%B, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        mldim %t4 : {S4} i mod 128 {
          mldim %t5 : {S4} j mod 128 {
            mldim %t6 : {S4} k mod 128 {
              // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
              call matmul_body(%A, %B, %C, %i, %j, %k, %M, %N, %K)
                  with #intset_ijk(%i, %j, %k) [%M, %N, %K]
            } // end mldim t6
          } // end mldim t5
        } // end mldim t4
      } // end mldim t3
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call dma_scratchpad_to_mem(%C, ...) with #intset_ij(%i, %j) [%M, %N, %K]
    } // end mldim t2
  } // end mldim t1
  return
}
```

### Affine Relations

The current MLIR spec includes affine maps and integer sets, but not affine
relations. Affine relations are a natural way to model read and write access
information, which can be very useful to capture the behavior of external
library calls where no implementation is available, high-performance vendor
libraries, or user-provided / user-tuned routines.

An affine relation is a relation between input and output dimension identifiers
while being symbolic on a list of symbolic identifiers and with affine
constraints on the identifiers.

Syntax:

```
// Affine relation definition at the top of file
affine-rel-def ::= affine-rel-id `=` affine-relation-inline

affine-rel-id ::= `##` prefixed-id

affine-relation-inline ::=
  `(` input-dims `)` (`[` symbols `]`)? `->`
  `(` output-dims `)` : affine-constraint-conjunction

input-dims ::= bare-id-list
output-dims ::= bare-id-list
symbols ::= bare-id-list

affine-rel ::= affine-rel-id | affine-relation-inline

// Usage
affine-rel-spec ::= affine-rel dim-and-symbol-use-list
```

All identifiers appearing in input-dims, output-dims, and symbol-dims are
pairwise distinct. All affine-constraint non-terminals in the above syntax are
allowed to contain identifiers only from input-dims, output-dims, and
symbol-dims.

Affine relations are used to model read, write, may_read, and may_write sets of
functions in the IR. The output dimension identifiers correspond to the data
dimensions.

Example:

```mlir
// read relation: two elements ( d0 <= r0 <= d0+1 )
##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0

func.func @count (%A : memref<128xf32>, %pos : i32) -> f32
  reads: {%A ##aff_rel9 (%pos)}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */ {
bb0 (%0 : memref<128xf32>, %1 : i32):
  %val0 = affine.load %A [%pos]
  %val1 = affine.load %A [%pos + 1]
  %p = arith.mulf %val0, %val1 : f32
  return %p : f32
}
```

### Regions

#### Making function definition an operation

MLIR supports values of a Function type. Instead of having a first-class IR
concept for functions, one could define an operation with a body region that
defines a function value. The particularity of functions is that their names are
globally visible and can be referred to before being defined, unlike SSA values
that must be defined first. Implementing a "function definition" operation would
require relaxing some of the SSA constraints in a region, and also making the IR
Module a region as well. It would also affect the core infrastructure (e.g.,
function passes) only for the sake of concept unification.

#### Having types on a region

Instead of inspecting the types of arguments of the first block, one could give
the region itself a type. This type would be redundant with the block argument
types, which must still be present, and would create room for type mismatches.
While functions do have types that are partly redundant with the arguments of
the first block in the function, this is necessary to support function
declarations that do not have a body, for which the type is the only way to
obtain the argument types. A region is always contained in an operation or a
function that can be queried to obtain the "type" of the region if necessary.

A type on a region can be justified if regions were to be considered separately
from the enclosing entity (operation or function) and had their own semantics
that should be checked.

#### Attaching attributes to regions

Regions could be annotated with dialect attributes to use attribute verification
hooks. An operation could take multiple regions as arguments, and each of them
may require different attributes. However, there are currently very few
practical cases where this would be necessary. Instead, one could simulate
per-region attributes with array attributes attached to the entity containing
the region (operation or function), as sketched below. This decreases the
overall complexity of the IR and enables more concise and op-specific forms,
e.g., when all regions of an op have the same attribute that can be only
mentioned once. Since the semantics of the region is entirely defined by the
enclosing entity, it also makes sense to have attributes attached to that entity
rather than to the region itself.

This can be reconsidered in the future if we see a non-negligible amount of use
cases.
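
A minimal sketch of that simulation, assuming a hypothetical
`mydialect.dispatch` operation with two regions and an array attribute holding
one entry per region:

```mlir
"mydialect.dispatch"() ({
  // Region #0, described by the first array entry.
  "mydialect.yield"() : () -> ()
}, {
  // Region #1, described by the second array entry.
  "mydialect.yield"() : () -> ()
}) {region_kinds = ["host", "device"]} : () -> ()
```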

### Read/Write/May_Read/May_Write sets for External Functions

Having read, write, may_read, and may_write sets for external functions which
include opaque ones, high-performance vendor libraries such as cuDNN, CUB, MKL,
and FFT libraries, user-provided/optimized functions, or data movement runtimes
such as DMA ones is a powerful feature. It allows the compiler to perform
analysis and composition/transformation in the presence of such calls and with
loops around such calls on sub-tensors. For user-provided or custom hand-tuned
functions, the read/write/may_read/may_write sets could be provided a priori by
a user as part of the external function signature or they could be part of a
database.

TODO: Design this, and update to use function attribute syntax.

Example:

```mlir
##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1

func.func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
    -> f32 [
  reads: {%M, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

func.func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
    %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32,
    #layout_map0>) [
  reads: {%a, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]
```

### Memref Extensions

1.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
    dimensions where there is symmetry: use integer set (affine constraints) to
    model tensor data space (instead of just extents). Requires some changes to
    the IR and the in-memory form.
1.  Layout maps

    1.  Allow piece-wise affine maps for layouts: allows clean modeling of
        boundary cases for images/tensors through padding, wrapping, mirroring,
        padding where padded values are the results of computation as opposed to
        data, padding in the interior as opposed to just boundaries.
    1.  Allow many-to-one layout maps: Index and layout maps in the current
        proposal are bijective. Extending them to many-to-one layout maps allows
        cleaner(?) modeling of broadcast/reduce style computations while reusing
        memory.

    Proposal 2(a) requires non-trivial changes to the IR and the in-memory
    representation. 2(b) requires no change, but impacts how cost models look at
    index and layout maps.

### `affine.if` and `affine.for` Extensions for "Escaping Scalars"

We considered providing a representation for SSA values that are live out of
`if/else` conditional bodies and loop carried in `affine.for` loops. We
ultimately abandoned this approach due to its complexity. In the current design
of MLIR, scalar variables cannot escape for loops or if instructions. In
situations where escaping is necessary, we use zero-dimensional tensors and
memrefs instead of scalars.

**TODO**: This whole section is obsolete and should be updated to use block
arguments and a yield like terminator in for/if instructions.

The abandoned design of supporting escaping scalars is as follows:

#### affine.for Instruction

Syntax:

```
[<out-var-list> =]
for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
  [with <in-var-list>] { <loop-instruction-list> }
```

out-var-list is a comma separated list of SSA values defined in the loop body
and used outside the loop body.
in-var-list is a comma separated list of SSA
values used inside the loop body and their initializers. loop-instruction-list
is a list of instructions that may also include a yield instruction.

Example:

```mlir
// Return sum of elements in 1-dimensional memref A
func.func @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
  %init = arith.constant 0 : i32
  %result = affine.for %i = 0 to %N with %tmp(%init) {
    %value = affine.load %A[%i]
    %sum = arith.addi %value, %tmp : i32
    yield %sum
  }
  return %result : i32
}
```

#### affine.if/else Instruction

Syntax:

```
<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
```

Out-var-list is a list of SSA values defined by the if-instruction. The values
are arguments to the yield-instruction that occurs in both then and else clauses
when the else clause is present. When the if instruction contains only the if
clause, the escaping value defined in the then clause should be merged with the
value the variable had before the if instruction. The design captured here does
not handle this situation.

Example:

```mlir
// Compute sum of half of the array
func.func @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
  %s0 = arith.constant 0 : i32
  %s1 = affine.for %i = 1 ... %N step 1 with %s2 (%s0) {
    %s3 = affine.if (%i >= %N / 2) {
      %v0 = affine.load %A[%i]
      %s4 = arith.addi %s2, %v0 : i32
      yield %s4
    }
    yield %s3
  }
  return %s1 : i32
}
```

### Multithreading the compiler

People want compilers to go fast, and one simple way to do that is to
multi-thread them. There are multiple strategies for this, but a simple one is
to optimize and compile separate functions in parallel. LLVM's original pass
manager anticipated this demand, and the CallGraphSCCPass manager is even
designed to support this as well, but unfortunately, a few early design
decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO
are forced to split programs into separate LLVM modules/contexts and optimize
those chunks independently.

The problem is that LLVM has several objects in its IR that are globally uniqued
and also mutable: notably constants like `i32 0`. In LLVM, these constants are
`Value`s, which allows them to be used as operands to instructions and gives
them SSA use lists. Because these things are uniqued, every `i32 0` in any
function shares a use list. This means that optimizing multiple functions in
parallel won't work (at least without some sort of synchronization on the use
lists, which would be unbearably inefficient).

MLIR now supports a multithreaded pass manager. We do this through several
design choices:

1.  MLIR makes extensive use of uniqued immutable data structures (affine
    expressions, types, etc. are all immutable, uniqued, and immortal).
2.  Constants are defined in per-operation pools, instead of being globally
    uniqued.
3.  Functions, and other global-like operations, themselves are not SSA values
    either, so they don't have the same problem as constants.
4.  Passes are copied (through their copy ctor) into one instance per thread,
    avoiding sharing of local state across threads.

This allows MLIR passes to support efficient multithreaded compilation and code
generation.
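
To illustrate the second point: textually identical constants in different
functions are distinct operations with distinct use lists, so the two functions
below can be transformed in parallel (an illustrative sketch; `@f` and `@g` are
placeholder names):

```mlir
func.func @f() -> i32 {
  // This constant operation is local to @f ...
  %c = arith.constant 0 : i32
  return %c : i32
}
func.func @g() -> i32 {
  // ... while this one is a separate operation local to @g; they share no
  // use list even though both represent `i32 0`.
  %c = arith.constant 0 : i32
  return %c : i32
}
```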