
# Linalg Dialect Rationale: The Case For Compiler-Friendly Custom Operations
<img alt="MLIR Codegen Flow" src="…">
the tradeoffs involved when building higher-level Intermediate
Linalg is designed to solve the High-level Hierarchical Optimization
This work is inspired by a wealth of [prior art](#prior-art) in
working group for discussing the [Development of high-level Tensor Compute
…tions](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-
path to defining these High-Level Tensor Compute Primitives.
*linear-algebra like* semantics (`pointwise`, `matmul`, `conv`...). This
approach enables building a high-level productivity-first codegen solution that
However, as the design of Linalg co-evolved with the design of MLIR, it became
The design and evolution of Linalg follow a *codegen-friendly* approach where
the IR and the transformations evolve hand-in-hand.
[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
https://drive.google.com/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
design principles as Linalg. (Vector dialect includes the higher-level
operations on multi-dimensional vectors and abstracts away the lowering to
single-dimensional vectors).
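As a small illustration of this design, a Vector dialect op can operate directly on a multi-dimensional vector type; the sketch below is illustrative only (`%v` is an assumed SSA value, and op syntax may vary across MLIR versions):

```mlir
// A 2-D vector operation; progressive lowering patterns take care of
// decomposing it into operations on 1-D vectors.
%t = vector.transpose %v, [1, 0] : vector<4x8xf32> to vector<8x4xf32>
```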
The Linalg dialect itself grew beyond linear algebra-like operations to become
potential of growing beyond *dense* linear-algebra to support richer data
Linalg design remains open to evolution and cross-pollination with other
for code generation-related abstractions, spinning off the generalization of
- the `!linalg.view` type folded into the *"Strided MemRef"* type while
- the `linalg.view` and `linalg.subview` ops evolved into the standard dialect;
- the `linalg.for`, `linalg.load` and `linalg.store` ops evolved into a prelude
problems. It does aim at driving thinking and implementations of domain-specific
pragmatic solution. The following non-exhaustive list refers to some of the
- [ONNX](https://onnx.ai/),
- [LIFT](https://www.lift-project.org/),
- [XLA](https://www.tensorflow.org/xla/architecture),
- [Halide](https://halide-lang.org/) and [TVM](https://tvm.apache.org/),
- [TACO](http://tensor-compiler.org/),
- [Darkroom](http://darkroom-lang.org/) and [Terra](http://terralang.org/),
- [Sigma-LL](http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/cgo16-preprint_248.pdf),
- [Tensor Comprehensions](https://arxiv.org/abs/1802.04730),
- [Polyhedral Compilers](https://en.wikipedia.org/wiki/Polytope_model),
- the [Affine dialect](https://mlir.llvm.org/docs/Dialects/Affine/) in MLIR,
- Generic Loop Transformations (see Ken Kennedy's
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-
- Traditional compiler CFGs with SSA forms.
- the [Torch](http://torch.ch/) machine-learning framework,
- the LLVM compiler, specifically in JIT mode,
- high-performance libraries (MKL, CUBLAS, FBFFT)
- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler,
- current and potentially upcoming hardware ISAs.
- facilitate frontend-compiler co-design by taking into account compiler
- minimize the set of available ops by making them non-overlapping with each
[LIFT](https://www.lift-project.org/) is a system to write computational
rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
- transformations are either separated from the representation or expressed as
- abstractions are split into smaller components (e.g., control flow and data
particular, extending the data structure abstractions to support non-dense
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
post-Theano ML compilers that was introduced as a pragmatic compilation
code, and (3) complying with stringent requirements imposed by energy-efficient
- HLOs are expressive as a whole, but each op has very limited and fixed
- Reliance on perfect op knowledge leads to situations where transformations and
- XLA lacks an independent IR that can be inspected, unit tested and used
[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
machine learning and deep-neural network space, based on HalideIR.
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
accessible to $\Omega$(10-100) users, at a time when polyhedral tools are
still only accessible to $\Omega$(1-10) users. Halide makes heavy usage of
- Halide scheduling is powerful and explores a large swath of possible
- Halide raises rather than lowers in two ways, going counter-current to the
design goals we set for high-level codegen abstractions in MLIR. First,
canonical Halide front-end code uses explicit indexing and math on scalar
ingests whole-tensor operations.
should start with higher-level primitives than Halide.
high-level language to express tensor computations with a syntax
generalizing the Einstein notation, coupled to an end-to-end
src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464…
and uses both HalideIR and the ISL *schedule-tree* IR.
algorithms to perform fusion and favor multi-level parallelism and
to find good implementations in the ***non-compute bound regime***.
Halide high-level transformations with polyhedral mid-level
- Halide was never properly used in Tensor Comprehensions beyond shape
transformations and building a usable end-to-end system. MLIR was
- The early gains provided by reusing established infrastructures
- Tensor Comprehensions emitted CUDA code which was then JIT compiled
short-term solution it made it hard to perform low-level rewrites that
would have helped with register reuse in the ***compute-bound regime***.
- The same reliance on emitting CUDA code made it difficult to
The polyhedral model has been on the cutting edge of loop-level optimization for
[Polly](https://polly.llvm.org) for LLVM. Although it has proved crucial to
generate efficient code from domain-specific languages such as
[PolyMage](http://mcl.csa.iisc.ac.in/polymage.html) and [Tensor
fully included into mainstream general-purpose optimization pipelines. Detailed
- The transformed code (or IR) quickly gets complex and thus hard to analyze and
- Code generation from the mathematical form used in the polyhedral model relies
on non-trivial exponentially complex algorithms.
- The mathematical form is rarely composable with the SSA representation and
- Expressiveness limitations, although addressed in the scientific literature
the transformation, decreasing the need for one-shot conversion between
combine polyhedral and SSA-based transformations.
conventional SSA representation. It addresses several long-standing integration
from a C language-level abstraction.
MLIR makes it possible to start from a higher-level abstraction than C, for
avoid complex analyses (data-flow analysis across loop iterations is
expensive pattern-matching algorithms.
- **Discourage loop skewing**: the loop skewing transformation, that is
effects on performance. In particular, polyhedral auto-transformation can be
multi-for loops with induction variables independent of each other (referred
to as hyper-rectangular iteration domains in the literature) such as the
[affine.parallel](https://llvm.discourse.group/t/rfc-add-affine-parallel/350)
- **Declarative Tiling**: the *tiling* transformation is ubiquitous in HPC code
become non-affine or require exponentially complex algorithms to reason about
- **Preserve high-level information**: Linalg maintains the information provided
multiplication. Even with pattern-matching on top of the Affine dialect, this
would have required another step of pattern-matching after the transformation.
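To illustrate the last point, a named op such as `linalg.matmul` keeps the "matmul-ness" explicit in the IR rather than encoding it in loops that must later be pattern-matched. A minimal sketch, assuming recent Linalg syntax (`%A`, `%B`, `%C` are illustrative SSA values):

```mlir
// The semantics are carried by the op name itself; no pattern-matching
// on loop nests is needed to recover that this is a matrix multiplication.
linalg.matmul ins(%A, %B : memref<?x?xf32>, memref<?x?xf32>)
              outs(%C : memref<?x?xf32>)
```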
Given these choices, Linalg intends to be a better fit for **high-level
abstractions. Affine remains a strong abstraction for mid-level transformation
and combination of semantically-loaded and lower-level inputs. As such, Linalg
- develop a set of key transformations, and
- make them correct by construction by carefully curating the set of
- make them very simple to implement, apply, verify and especially
The problem at hand is fundamentally driven by compilation of domain-specific
workloads for high-performance and parallel hardware architectures: **this is
The selection of relevant transformations follows a co-design approach and
- concrete current and future needs of the application domain,
- concrete current and future hardware properties and ISAs,
- understanding of strengths and limitations of [existing approaches](#prior-art),
- taking advantage of the coexistence of multiple levels of IR in MLIR,
simplicity and maintainability aspects must be first-order concerns. Without
The last two decades have seen a proliferation of Domain-Specific Languages
Compiler transformations need static structural information (e.g. loop-nests,
form or from SSA IR into a higher-level representation that is more amenable
to convey user-intent directly into the proper abstraction.
This creates non-trivial phase ordering issues. For instance, loop fusion may
is to perform loop fusion, tiling, intra-tile loop distribution and then hope to
detect the BLAS pattern. Such a scheme presents difficult phase-ordering
Instead, certain Linalg ops are designed to maintain high-level information
defined in terms of number of low-level instructions in a particular
Linalg-based codegen and transformations start from higher-level IR
potential by introducing lower-level IR ops and *smaller* Linalg ops.
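Tiling is a good example of this progressive-lowering style: the transformation introduces explicit loops while leaving a *smaller* instance of the same high-level op on each tile. The sketch below is hypothetical; loop bounds and tile names are illustrative, and the subview types are simplified (real subviews would carry strided layouts):

```mlir
// Inter-tile control flow is materialized as scf.for loops, while the
// intra-tile computation remains a smaller linalg.matmul on subviews.
scf.for %i = %c0 to %M step %c32 {
  scf.for %j = %c0 to %N step %c32 {
    scf.for %k = %c0 to %K step %c32 {
      // %sA, %sB, %sC: tile-sized subviews of the original operands.
      linalg.matmul ins(%sA, %sB : memref<32x32xf32>, memref<32x32xf32>)
                    outs(%sC : memref<32x32xf32>)
    }
  }
}
```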
maintain. Mixing XLA-style high-level op semantics knowledge with generic
- Design transformations that are correct by construction, easy to
- Provide a way to specify transformations and the units of IR they manipulate
- Allow creating customizable passes declaratively by simply selecting rewrite
Compiler heuristics are hand-crafted, human-engineered features: an area
ripe for disruption by machine-learning techniques.
To enable search, compiler transformations should be fine-grained,
immediately: low-level compilation and machine models are still quite performant
in LLVM. However, for the high-level and mid-level optimization problems,
models need to be conditioned (probabilistically) on the low-level
design of IR and transformations with search-friendly properties over
prefer to invest in infrastructure that will enable [ML-based
### Extensibility and Future-Proofness<a name="future"></a>
In particular, the `MemRefType` represents dense non-contiguous memory regions.
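For instance, a strided layout lets a `memref` describe a non-contiguous view of a larger buffer without copying; a hedged sketch of such a type (the stride values are illustrative):

```mlir
// 4x8 f32 elements whose rows are 16 elements apart in memory:
// dense within each row, non-contiguous across rows.
memref<4x8xf32, strided<[16, 1], offset: 0>>
```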
For such more advanced data types, the control-flow required to traverse the
stand a chance of evolving into runtime-adaptive computations (e.g.
inspector-executor in which an *inspector* runs a cheap runtime
reconcile [core guiding principles](#guiding_principles) with real-world
design of Linalg: control-flow does not exist in a vacuum, independently of
data. On the contrary, there is a very strong relationship between control-flow
transformations are better done:

- as control flow or data structure manipulation,
- on Linalg ops attributes or on loops after some partial lowering occurred,
- as extensions to the Linalg dialect in terms of new ops or
This is probably the most surprising and counter-intuitive
observation. When one designs IR for transformations, closed-ness is
often a non-negotiable property.
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
[ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
dialect closed-ness wasn't necessary and could be relaxed. Previous
[Progressive Lowering](#progressive_lowering), closed-ness under
Art](#prior-art)---when viewed under the lens of our [Core Guiding
Principles](#guiding_principles)---with the following picture.
src="https://user-images.githubusercontent.com/10148468/73613904-2f720a00-45c8-11ea-8265-1c856c0252…
codegen-friendly angle. Unsurprisingly, the
to a position in the top-right of this map.