# Chapter 0: A Primer on “Structured” Linalg Operations

Before starting the tutorial on the Transform dialect, let us take a brief look at the concept of Structured operations and their implementation in the Linalg dialect. Note that the Transform dialect does not require Structured operations, nor vice versa. The two co-evolved during the early development of the Transform dialect, which makes the subset of transformations for Structured operations the most mature and the most suitable for this tutorial. If you are already familiar with this concept, skip to Chapter 1.

Structured code generation intends to preserve the structure of the computation for as long as necessary to enable transformations, up to and including the design of IR abstractions that support specific transformations.

## Uniform Elementwise Extension

Consider a simple scalar arithmetic addition operation in MLIR, which maps directly to a machine instruction on most architectures that support floating-point operations:

```mlir
%2 = arith.addf %0, %1 : f32
```

This operation can be easily extended to apply uniformly to the elements of a 1D vector, which is also often available as an instruction on vector machines:

```mlir
%2 = arith.addf %0, %1 : vector<8xf32>
```

Only a few modern instruction sets offer instructions for two- or more-dimensional vectors. In MLIR, however, it is possible to transparently extend the uniform elementwise application to vectors of arbitrary rank.

```mlir
%2 = arith.addf %0, %1 : vector<8x4xf32>
%5 = arith.addf %3, %4 : vector<2x2x2x2x2x2x2xf32>
```

As you can notice, MLIR’s arithmetic operations on vectors preserve the structure of uniform elementwise application. This structure can be leveraged by the compiler, for example, to produce smaller-rank operations available on the target or to fuse multiplication and addition when such a fused instruction is available (which becomes complicated when there are a hundred multiplications followed by a hundred additions).
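
The semantics of this rank-polymorphic elementwise application can be sketched in Python (an illustration of the concept only, not of MLIR itself; the function name is made up for this example): the same scalar operation is applied recursively over nested sequences of any rank, so the nesting structure of the result mirrors that of the operands.

```python
def elementwise_add(a, b):
    """Apply scalar addition uniformly over nested lists of any rank."""
    if isinstance(a, list):
        # Recurse one rank down, preserving the structure.
        return [elementwise_add(x, y) for x, y in zip(a, b)]
    return a + b  # scalar case, corresponding to arith.addf on f32

# Rank 1, like vector<2xf32>.
print(elementwise_add([1.0, 2.0], [3.0, 4.0]))          # [4.0, 6.0]
# Rank 2, like vector<2x1xf32>: same code, structure preserved.
print(elementwise_add([[1.0], [2.0]], [[3.0], [4.0]]))  # [[4.0], [6.0]]
```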

## Reduction

Sometimes it is necessary to add the elements of a vector together to obtain a scalar. Some platforms provide specific instructions for this operation; others provide instructions that can be combined to achieve the desired effect, such as addition of adjacent elements and element shuffles.
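
For instance, an add reduction can be assembled from log2(n) rounds of "shuffle the upper half next to the lower half, then add lanes". The following Python sketch illustrates the idea for power-of-two lengths (it models the combination scheme only and is not tied to any particular instruction set):

```python
def tree_reduce_add(v):
    """Reduce a power-of-two-length vector in log2(n) shuffle+add rounds."""
    assert len(v) & (len(v) - 1) == 0, "power-of-two length expected"
    while len(v) > 1:
        half = len(v) // 2
        # "Shuffle" brings v[i + half] next to v[i]; the add combines lanes.
        v = [v[i] + v[i + half] for i in range(half)]
    return v[0]

print(tree_reduce_add([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # 36.0
```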

The Vector dialect in MLIR defines an operation to explicitly denote a within-vector reduction:

```mlir
%1 = vector.reduction <add>, %0 : vector<8xf32> into f32
```

When no target support is available, such an operation can be transformed into a loop:

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%init = arith.constant 0.0 : f32
%result = scf.for %i = %c0 to %c8 step %c1 iter_args(%partial = %init) -> (f32) {
  %element = vector.extractelement %0[%i : index] : vector<8xf32>
  %updated = arith.addf %partial, %element : f32
  scf.yield %updated : f32
}
```

Even when special instructions are available, it may still be desirable to use the loop form (with unrolling), depending on instruction latency and register pressure. Preserving the structure of the operation as a single reduction gives the compiler an understanding that a within-vector reduction is performed and, therefore, a choice in implementation.

## Contraction

Contraction is a generalization of reduction that multiplies elements from two vectors before adding them up. A simple “add” reduction can be thought of as a contraction where one of the vectors contains `1.0`, the neutral element of multiplication. Contractions offer even more flexibility to the compiler, and are represented by a dedicated operation in MLIR:

```mlir
// Neutral initializer for the addition.
%init = arith.constant 0.0 : f32
// Neutral element of multiplication.
%ones = arith.constant dense<1.0> : vector<8xf32>
// Actual contraction.
%result = vector.contract {
  indexing_maps = [affine_map<(i) -> (i)>,
                   affine_map<(i) -> (i)>,
                   affine_map<(i) -> ()>],
  iterator_types = ["reduction"]
} %0, %ones, %init : vector<8xf32>, vector<8xf32> into f32
```

Note the `affine_map` expressions indicating how vector elements are indexed. Their meaning is perhaps most evident when writing the loop form in pseudo-code equivalent to this contraction:

```mlir
for i in 0 to 8:
  init += %0[i] * ones[i]
```

where both `%0` and `%ones` use the loop induction variable `i`, as noted on the right-hand side of the corresponding affine maps, `(i) -> (i)`, and `%init` does not, as reflected on the right-hand side of its affine map, `(i) -> ()`.

Similarly to the uniform elementwise extension, MLIR vector contractions are not limited to 1D cases. In the 2D+ case, one can additionally specify which of the vector dimensions are being reduced and which ones are being preserved. This is achieved by using the `iterator_types` attribute that specifies, for each dimension, whether it is being reduced (`"reduction"`) or preserved (`"parallel"`). Consider the following 3D contraction that encodes a matrix-matrix multiplication:

```mlir
%result = vector.contract {
  indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                   affine_map<(i, j, k) -> (k, j)>,
                   affine_map<(i, j, k) -> (i, j)>],
  iterator_types = ["parallel", "parallel", "reduction"]
} %lhs, %rhs, %init : vector<8x10xf32>, vector<10x16xf32> into vector<8x16xf32>
```

Looking at the indexing maps, it is easy to recognize the loop form:

```mlir
for i in 0 to 8:
  for j in 0 to 16:
    for k in 0 to 10:
      init[i, j] += lhs[i, k] * rhs[k, j]
```

Preserving this higher-level structure of a contraction makes it significantly easier for the compiler to recognize operations such as matrix multiplications and dot products, and gives it freedom to produce lower-level operations that leverage the most advanced instructions or even pre-generated microkernels.

## Generic Operation on Memory

Until now, we have been considering operations on vectors stored in virtual registers. A similar contraction abstraction can be defined on data in memory:

```mlir
linalg.generic {
  indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                   affine_map<(i, j, k) -> (k, j)>,
                   affine_map<(i, j, k) -> (i, j)>],
  iterator_types = ["parallel", "parallel", "reduction"]
} ins(%lhs, %rhs : memref<8x10xf32>, memref<10x16xf32>)
  outs(%init : memref<8x16xf32>) {
^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
  %0 = arith.mulf %lhs_one, %rhs_one : f32
  %1 = arith.addf %init_one, %0 : f32
  linalg.yield %1 : f32
}
```

This looks more complicated, so let us unpack. The `indexing_maps` and `iterator_types` are _exactly_ the same as we have seen above for vector contractions. The operands are now split into two lists:

*   `in` operands containing the buffers that are only read by the operation;
*   `out` operands that are read and updated by the operation.

This separation wasn’t necessary on vectors because, in MLIR, vectors are read-only (SSA, or functional, form) and operations mutating a vector in fact produce a new one instead.

Furthermore, the operation now contains a region that explicitly specifies the multiplication and the addition operations that were implicit in the contraction. Block arguments in the region correspond to individual elements read from the buffers: the first two correspond to the `in` operands and the last one corresponds to the `out` operand. The value yielded from the region is “written” to the `out` operand and is available as the last block argument for future executions of the region. Note that the order in which the region is executed for various tuples of elements read from the buffers is not specified, and the `out` buffer is written as a whole at the end of the operation.
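
This execution model can be captured in a few lines of Python (a sketch of the semantics, not of the actual Linalg implementation; the `generic` helper and its signature are invented for this illustration): for every point of the iteration space, the indexing maps select one element from each operand, the body is applied, and the result is written back through the output map.

```python
import itertools

def generic(bounds, indexing_maps, body, ins, out):
    """Interpret a linalg.generic-like op: `bounds` define the iteration
    space; each indexing map projects an iteration point onto the index
    of one operand (the last map addresses the output)."""
    for point in itertools.product(*(range(b) for b in bounds)):
        # Read one element from each input and from the output.
        args = [buf[m(point)] for buf, m in zip(ins + [out], indexing_maps)]
        # The yielded value is written back through the output's map.
        out[indexing_maps[-1](point)] = body(*args)

# Matmul: iteration space (i, j, k) = (8, 16, 10), maps as in the MLIR above.
lhs = {(i, k): 1.0 for i in range(8) for k in range(10)}
rhs = {(k, j): 2.0 for k in range(10) for j in range(16)}
init = {(i, j): 0.0 for i in range(8) for j in range(16)}
generic((8, 16, 10),
        [lambda p: (p[0], p[2]),   # (i, j, k) -> (i, k)
         lambda p: (p[2], p[1]),   # (i, j, k) -> (k, j)
         lambda p: (p[0], p[1])],  # (i, j, k) -> (i, j)
        lambda a, b, c: c + a * b,
        [lhs, rhs], init)
print(init[(0, 0)])  # 20.0: ten products of 1.0 * 2.0 accumulated
```

The Python loop fixes one particular traversal order, whereas MLIR leaves it unspecified; the result is the same here because floating-point addition order only affects rounding, and the accumulation pattern is order-independent for this body.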

## “Loop” Fusion

Since the region of the `linalg.generic` operation can contain arbitrarily many operations, we can use it to express “fusion” of the implicit loops by simply chaining more operations in the region. For example, the common machine learning rectified linear unit layer (ReLU), which can be defined as `relu(x) = max(0, x)`, can be expressed using the “compare-and-select” idiom in a single `linalg.generic` operation, without a temporary buffer for the comparison result and without repeating the outer operation:

```mlir
linalg.generic {
  indexing_maps = [affine_map<(i) -> (i)>, affine_map<(i) -> (i)>],
  iterator_types = ["parallel"]
} ins(%in : memref<?xf32>) outs(%out : memref<?xf32>) {
^bb0(%in_one : f32, %out_one : f32):
  %c0 = arith.constant 0.0 : f32
  %0 = arith.cmpf ogt, %in_one, %c0 : f32
  %1 = arith.select %0, %in_one, %c0 : f32
  linalg.yield %1 : f32
}
```

Such operations can be converted to loops or lowered into vector forms after splitting into multiple operations, each of which maps to a Vector dialect primitive. This modeling, again, gives the compiler more choice in selecting the code generation strategy.
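
In loop form, the fused compare-and-select body corresponds to a single pass over the data with no intermediate buffer, as in this Python sketch (illustrative only; the function name is made up):

```python
def relu_fused(inp):
    """Single-pass ReLU: the compare and the select are fused in one loop
    body, so no buffer of comparison results is ever materialized."""
    out = [0.0] * len(inp)
    for i in range(len(inp)):
        x = inp[i]
        gt = x > 0.0               # corresponds to arith.cmpf ogt
        out[i] = x if gt else 0.0  # corresponds to arith.select
    return out

print(relu_fused([-1.5, 0.0, 2.5]))  # [0.0, 0.0, 2.5]
```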

## Generic Operation on Tensors

Let us take one last step up the abstraction ladder. MLIR provides a tensor abstraction that makes it easy for the compiler to reason about multidimensional yet regular data without having to solve complex problems such as alias analysis and dependency satisfaction, which would be necessary on multidimensional buffers. The tensor abstraction is very similar to the vector abstraction (major differences include the availability of unranked tensors, tensor layouts, and vectors being usable as elemental types of tensors but not of other vectors). Tensors are read-only, and operations updating a tensor produce a new tensor.

The `linalg.generic` operation from above can be lifted to operate on tensors instead of buffers:

```mlir
%result = linalg.generic {
  indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                   affine_map<(i, j, k) -> (k, j)>,
                   affine_map<(i, j, k) -> (i, j)>],
  iterator_types = ["parallel", "parallel", "reduction"]
} ins(%lhs, %rhs : tensor<8x10xf32>, tensor<10x16xf32>)
  outs(%init : tensor<8x16xf32>) {
^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
  %0 = arith.mulf %lhs_one, %rhs_one : f32
  %1 = arith.addf %init_one, %0 : f32
  linalg.yield %1 : f32
} -> tensor<8x16xf32>
```

As you can notice, most components of this operation remain identical to its buffer version. It has been specifically designed this way. The main difference, besides the operand types, is that the operation now produces a new result instead of updating the `out` buffer. The `out` operand is used only as the initialization value.
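
The value semantics of the tensor version can be sketched in Python (an illustration of the read-only behavior, with a made-up elementwise body): the operation returns a fresh result, and the `out` operand serves only to seed the initial values.

```python
def add_into_fresh(in_tensor, init):
    """Tensor-semantics update: `init` provides the initial values but is
    never mutated; a brand-new result tensor is returned instead."""
    result = dict(init)  # copy the initialization value
    for idx, v in in_tensor.items():
        result[idx] = result[idx] + v
    return result

init = {(0,): 1.0, (1,): 1.0}
out = add_into_fresh({(0,): 2.0, (1,): 3.0}, init)
print(out[(0,)], init[(0,)])  # 3.0 1.0 -- init is unchanged
```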

If the `linalg.generic` operation had existed on vectors, it would have had the exact same structure.

## Tiling and Loop Materialization

At this level of abstraction, it becomes easy for the compiler to perform more advanced transformations usually required for high-performance code generation, such as [tiling](https://en.wikipedia.org/wiki/Loop_nest_optimization). Tiling, in general, can be seen as partitioning the iteration space into smaller parts, or tiles, so that, for example, the data required by each part fits into a level of cache. The order in which tiles are executed must preserve the original data dependencies.

In the case of `linalg.generic` operations, the iteration space is implicit and is defined by the shape of the operands. Therefore, a tile can be expressed by performing the _same_ operation on a subset (slice) of the original data. Since the order in which the body of `linalg.generic` is applied to different tuples of the input elements is unspecified, tiles can be executed in any order, without the need for dependence analysis. In order to control the execution of different tiles, the implementation of tiling produces loops. Thus tiling `linalg.generic` operations can also be seen as materializing the loops that have been implicit until now.

For example, tiling the matrix multiplication presented above with tile sizes `(2, 8)` yields a loop nest around a `linalg.generic` that expresses the same operation on a `2x8` tensor.

```mlir
// A special "multi-for" loop that supports tensor-insertion semantics
// as opposed to implicit updates. The resulting 8x16 tensor will be produced
// by this loop.
// The trip count of the iterators is computed by dividing the original tensor
// size, 8x16, by the tile size, 2x8, to obtain 4x2.
// When tensor sizes are dynamic, the trip count computation is emitted as IR
// and is computed at runtime.
%0 = scf.forall (%i, %j) in (4, 2)
     shared_outs(%shared = %init) -> (tensor<8x16xf32>) {

  // Scale the loop induction variables by the tile sizes.
  %3 = affine.apply affine_map<(d0) -> (d0 * 2)>(%i)
  %4 = affine.apply affine_map<(d0) -> (d0 * 8)>(%j)

  // Take slices of inputs and outputs. Only the "i" and "j" dimensions are sliced.
  %lhs_slice = tensor.extract_slice %lhs[%3, 0] [2, 10] [1, 1]
             : tensor<8x10xf32> to tensor<2x10xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %4] [10, 8] [1, 1]
             : tensor<10x16xf32> to tensor<10x8xf32>
  %result_slice = tensor.extract_slice %shared[%3, %4] [2, 8] [1, 1]
                : tensor<8x16xf32> to tensor<2x8xf32>

  // This is exactly the same operation as before, but now operating on smaller
  // slices of data.
  %partial = linalg.generic {
    indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                     affine_map<(i, j, k) -> (k, j)>,
                     affine_map<(i, j, k) -> (i, j)>],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%lhs_slice, %rhs_slice : tensor<2x10xf32>, tensor<10x8xf32>)
    outs(%result_slice : tensor<2x8xf32>) {
  ^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
    %0 = arith.mulf %lhs_one, %rhs_one : f32
    %1 = arith.addf %init_one, %0 : f32
    linalg.yield %1 : f32
  } -> tensor<2x8xf32>

  // Terminator for the loop with tensor-insertion semantics. Inserts a slice
  // into a larger tensor, potentially in parallel.
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %partial into %shared[%3, %4] [2, 8] [1, 1]
        : tensor<2x8xf32> into tensor<8x16xf32>
  }
}
```
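
The effect of tiling is easy to check against the untiled form in Python (a sketch of the iteration-space partitioning only, not of the MLIR lowering; both function names are invented): each `(2, 8)` tile of the result is computed by the same matmul body restricted to slices, and the tiles together reproduce the full product.

```python
def matmul(lhs, rhs, M, N, K):
    """Reference untiled matmul over list-of-list matrices."""
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                out[i][j] += lhs[i][k] * rhs[k][j]
    return out

def tiled_matmul(lhs, rhs, M, N, K, tm, tn):
    """Partition the (i, j) iteration space into tm x tn tiles; each tile
    runs the same matmul body on its slice, with the full k dimension."""
    out = [[0.0] * N for _ in range(M)]
    for ti in range(0, M, tm):      # the materialized tile loops
        for tj in range(0, N, tn):
            for i in range(ti, ti + tm):
                for j in range(tj, tj + tn):
                    for k in range(K):
                        out[i][j] += lhs[i][k] * rhs[k][j]
    return out

lhs = [[float(i + k) for k in range(10)] for i in range(8)]
rhs = [[float(k * j) for j in range(16)] for k in range(10)]
assert tiled_matmul(lhs, rhs, 8, 16, 10, 2, 8) == matmul(lhs, rhs, 8, 16, 10)
```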

## Producer/Consumer Fusion and Rematerialization

After materializing loops with tiling, another key code generation transformation becomes simple: fusion. Unlike loop fusion, the Structured operations approach allows for producer/consumer fusion even when the (implicit) iteration spaces of the operations do not match. Given a high-level structured operation on tensors, such as `linalg.generic`, one can follow use-def chains to identify:

1. the subset (slice) of the operand that is used by the tile, and
2. the tensor-level structured operation producing the whole tensor that is being sliced.

By inverting the `indexing_map` and applying it to the set of elements accessed through the slice, we can compute the part of the iteration space of the operation defining the full tensor necessary to compute the tile. Thus fusion boils down to replacing the `tensor.extract_slice` operation with the tile of the `linalg.generic` producing the original operand.

Let us assume that the matrix multiplication operation is followed by another operation that multiplies each element of the resulting matrix with itself. This trailing elementwise operation has a 2D iteration space, unlike the 3D one in matrix multiplication. Nevertheless, it is possible to tile the trailing operation and then fuse the producer of its operand, the matmul, into the loop generated by tiling. The untiled dimension will be used in its entirety.

```mlir
// Same loop as before.
%0 = scf.forall (%i, %j) in (4, 2)
     shared_outs(%shared = %init)
     -> (tensor<8x16xf32>) {
  // Scale the loop induction variables by the tile sizes.
  %1 = affine.apply affine_map<(d0) -> (d0 * 2)>(%i)
  %2 = affine.apply affine_map<(d0) -> (d0 * 8)>(%j)

  // Take slices of inputs and outputs. Only the "i" and "j" dimensions are sliced.
  %lhs_slice = tensor.extract_slice %lhs[%1, 0] [2, 10] [1, 1]
             : tensor<8x10xf32> to tensor<2x10xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %2] [10, 8] [1, 1]
             : tensor<10x16xf32> to tensor<10x8xf32>
  %result_slice = tensor.extract_slice %result[%1, %2] [2, 8] [1, 1]
                : tensor<8x16xf32> to tensor<2x8xf32>

  // This is exactly the same matmul tile as before. It replaces the slice
  // extraction for the generic operation below.
  %partial = linalg.generic {
    indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                     affine_map<(i, j, k) -> (k, j)>,
                     affine_map<(i, j, k) -> (i, j)>],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%lhs_slice, %rhs_slice : tensor<2x10xf32>, tensor<10x8xf32>)
    outs(%result_slice : tensor<2x8xf32>) {
  ^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
    %5 = arith.mulf %lhs_one, %rhs_one : f32
    %6 = arith.addf %init_one, %5 : f32
    linalg.yield %6 : f32
  } -> tensor<2x8xf32>

  // Take the slice of the final result. Note that we don't need to take
  // a slice of the elementwise operand: the matmul above already computes
  // exactly the required tile.
  %shared_slice = tensor.extract_slice %shared[%1, %2] [2, 8] [1, 1]
                : tensor<8x16xf32> to tensor<2x8xf32>

  // The elementwise operation that we tiled.
  %elemwise = linalg.generic {
    indexing_maps = [affine_map<(i, j) -> (i, j)>,
                     affine_map<(i, j) -> (i, j)>],
    iterator_types = ["parallel", "parallel"]
  } ins(%partial : tensor<2x8xf32>)
    outs(%shared_slice : tensor<2x8xf32>) {
  ^bb0(%in: f32, %out: f32):
    %5 = arith.mulf %in, %in : f32
    linalg.yield %5 : f32
  } -> tensor<2x8xf32>

  // Terminator for the loop with tensor-insertion semantics. Inserts a slice
  // into a larger tensor, potentially in parallel.
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %elemwise into %shared[%1, %2] [2, 8] [1, 1]
        : tensor<2x8xf32> into tensor<8x16xf32>
  }
}
```

This process may result in some elements of the operand tensors being (re)computed on every iteration of the loop. This is also known as _rematerialization_ and expresses the tradeoff between performing redundant computations and storing their results in (slow) memory.
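
The fusion just described can be seen concretely in a Python sketch (illustrative only; both function names are invented): the producer's tile is recomputed inside each iteration of the consumer's tile loop, so the full matmul result is never stored, yet the values are the same as in the unfused computation.

```python
def matmul_tile(lhs, rhs, i0, j0, tm, tn, K):
    """Compute only the tm x tn matmul tile the consumer needs, starting at
    (i0, j0); the reduction dimension k is used in its entirety."""
    return [[sum(lhs[i0 + i][k] * rhs[k][j0 + j] for k in range(K))
             for j in range(tn)] for i in range(tm)]

def fused_square(lhs, rhs, M, N, K, tm, tn):
    """Tile the elementwise square and fuse its producer (the matmul):
    the producer tile is rematerialized inside the loop body instead of
    being read from a stored full-size matmul result."""
    out = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            tile = matmul_tile(lhs, rhs, i0, j0, tm, tn, K)  # fused producer
            for i in range(tm):
                for j in range(tn):
                    out[i0 + i][j0 + j] = tile[i][j] * tile[i][j]
    return out

lhs = [[1.0] * 10 for _ in range(8)]
rhs = [[2.0] * 16 for _ in range(10)]
print(fused_square(lhs, rhs, 8, 16, 10, 2, 8)[0][0])  # 400.0 (= 20.0 squared)
```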

## Shorthand “Named” Forms of Linalg Ops

Linalg provides a set of predefined operations for common cases such as matrix multiplication, dot product, convolution, etc. These operations are equivalent to the `generic` ones but spare the need to spell out the access patterns and the bodies. For example, matrix multiplication is simply:

```mlir
%matmul = linalg.matmul ins(%lhs, %rhs : tensor<8x10xf32>, tensor<10x16xf32>)
                        outs(%init : tensor<8x16xf32>) -> tensor<8x16xf32>
```