# Chapter H: Reproducing Halide Schedule

This chapter demonstrates how a schedule from the [Halide
DSL](http://halide-lang.org) can be implemented using the Transform dialect for
structured ops.

Note that the IR below is pseudo-code with types removed for brevity. It may
also get out of sync with the current syntax. Always refer to the source code in
[mlir/examples/transform/ChH](https://github.com/llvm/llvm-project/tree/main/mlir/test/Examples/transform/ChH)
as the source of truth.

## Channeled Convolution

The Transform dialect provides a substrate for implementing “transformation
directive” domain-specific languages (DSLs) in MLIR. Such a DSL, at least in its
scheduling part, can target the operations in the Transform dialect that are
later applied by the compiler. Sets of transform operations, or even new
dialects leveraging the same interfaces and infrastructure, can be added to
support a specific DSL for a particular scheduling model. In this chapter, we
will revisit the Halide DSL that (re)popularized the separate specification of
schedules, originally for image processing programs.

Two approaches to mapping Halide to the Transform dialect are possible:

*   Create a new dialect that corresponds to the computational part of the
    Halide DSL, and define a set of transformations wrapped into Transform
    dialect operations that correspond to the scheduling part of the DSL.
*   Map the Halide abstractions to the existing MLIR abstractions, for both
    parts of the DSL.

We will consider the latter approach because the computational part of the DSL
easily maps to the structured ops in the Linalg dialect. This also gives us the
opportunity to discuss how Linalg transformations on the so-called structured
operations are similar to or different from the existing transformations.

We will consider the 2D channeled convolution example extracted from Halide
[application
examples](https://github.com/halide/Halide/tree/294f80c49bf3bb8582446613c25fcce03b82bcd8/apps/conv_layer).

```cpp
// Sizes of the problem.
const int N = 5, CI = 128, CO = 128, W = 100, H = 80;

// Sized inputs. Note that the order of dimensions is
// inverted in Halide with respect to C++, so the last dimension
// in the list (N for input, CI for filter) is the least
// frequently varying. The C++ equivalent is input[N][H+2][W+2][CI].
Buffer<float, 4> input({CI, W+2, H+2, N}, "input");
Buffer<float, 4> filter({CO, 3, 3, CI}, "filter");
Buffer<float, 1> bias(std::vector<int>{CO}, "bias");

// ... data initialization happens here ...

// Declarations of "mathematical functions" for convolution and relu.
Func conv("conv"), relu("relu");

// Iterators/subscripts.
Var x("x"), y("y"), c("c"), n("n");

// 3D reduction domain (channels and 2 window dimensions),
// dimensions are later referred to as r.x, r.y, r.z.
RDom r(0, CI, 0, 3, 0, 3);

// Core convolution with the result initialized to the bias value.
// Note that the order of iterators is inverted in Halide DSL,
// i.e. `n` corresponds to the least frequently-varying (outermost) dimension
// here and below.
conv(c, x, y, n) = bias(c);
conv(c, x, y, n) += filter(c, r.y, r.z, r.x) * input(r.x, x + r.y, y + r.z, n);

// ReLU rectification, an elementwise operation.
relu(c, x, y, n) = max(0, conv(c, x, y, n));
```

This can be almost directly converted to the Linalg dialect operating on
tensors, which is conceptually closer to the “mathematical function” abstraction
and is where the majority of transformations are available.

```mlir
// Bias. Using a named Linalg operation for brevity.
%bias_init = tensor.empty() : !toutput
%biased = linalg.broadcast ins(%bias : !tbias)
                          outs(%bias_init : !toutput) dimensions = [0, 1, 2]

// Convolution proper. While Linalg has named operations for 2D convolutions,
// the one in the Halide example has an uncommon order of filter dimensions
// and is not supported. It also takes the filter as first argument. This
// code recreates it faithfully using the generic form.
%convolved = linalg.generic {
  iterator_types = ["parallel", "parallel", "parallel", "parallel",
                    "reduction", "reduction", "reduction"],
  indexing_maps = [
    affine_map<(n, y, x, c, rz, ry, rx) -> (rx, rz, ry, c)>,
    affine_map<(n, y, x, c, rz, ry, rx) -> (n, y+rz, x+ry, rx)>,
    affine_map<(n, y, x, c, rz, ry, rx) -> (n, y, x, c)>
  ]
} ins(%filter, %input: !tfilter, !tinput)
  outs(%biased : !toutput) {
^bb0(%in: f32, %f: f32, %b: f32):
  // Note the fastmath attributes that allow operations to be recombined into
  //   %0 = math.fma %in, %f, %b : f32
  // later on and to reorder reductions.
  %m1 = arith.mulf %in, %f  {fastmath = #arith.fastmath<fast>} : f32
  %0 = arith.addf %b, %m1  {fastmath = #arith.fastmath<fast>} : f32
  linalg.yield %0 : f32
} -> !toutput

// ReLU is just a max(0, x).
%c0 = arith.constant 0.0 : f32
%relued = linalg.generic {
  iterator_types = ["parallel", "parallel", "parallel", "parallel"],
  indexing_maps = [
    affine_map<(d0, d1, d2, d3) -> ()>,
    affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
    affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
  ]
} ins(%c0, %convolved : f32, !toutput)
  outs(%output : !toutput) {
^bb0(%cst: f32, %in: f32, %out: f32):
  %0 = llvm.intr.maxnum(%cst, %in) : (f32, f32) -> f32
  linalg.yield %0 : f32
} -> !toutput
```

In Halide, a function such as `conv` may consist of two parts: a “functional”
initialization computation and an in-place update for reductions. This is
expressed as two C++ statements in the embedded DSL, but internally is
represented in a single object. Linalg doesn’t have such a capability, so the
initialization and the update are represented as two distinct Linalg operations
that are not connected to each other. Furthermore, the `x`, `y`, `c`, `n`
variables in the Halide DSL correspond to implicit loops iterating over the
corresponding objects, which implies that functions sharing these variables in
their definitions also share the corresponding loops. In other words, the loop
equivalent of the Halide definition starts in a fully-fused form. The Linalg
model is the opposite, with each structured operation corresponding to its own
loop nest, resulting in a fully-distributed form. This will affect how the
schedule is constructed later on.

The loop structure for the Halide computation resembles the following (adapted
from a debug dump with `HL_DEBUG_CODEGEN=1`):

```python
for n
  for y
    for x
      for c
        conv[n, y, x, c] = bias[c]
        for rz
          for ry
            for rx
              conv[n, y, x, c] += filter[rx, rz, ry, c] * input[n, y+rz, x+ry, rx]
        relu[n, y, x, c] = max(0, conv[n, y, x, c])
```

The loop structure for the Linalg computation is as follows (obtained by
`mlir-opt --linalg-generalize-named-ops --empty-tensor-to-alloc-tensor
--one-shot-bufferize --convert-linalg-to-loops`):

```python
for n
  for y
    for x
      for c
        init[n, y, x, c] = bias[c]
for n
  for y
    for x
      for c
        for rz
          for ry
            for rx
              conv[n, y, x, c] += filter[rx, rz, ry, c] * input[n, y+rz, x+ry, rx]
for n
  for y
    for x
      for c
        relu[n, y, x, c] = max(0, conv[n, y, x, c])
```
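
To make the relationship between the two forms concrete, here is a small Python
model of both loop structures, using hypothetical toy sizes (not the ones from
the example), checking that the fully-fused and fully-distributed nests compute
the same values:

```python
import itertools
import random

random.seed(0)
N, H, W, C, CI = 2, 3, 3, 4, 4  # hypothetical toy sizes

bias = {c: random.random() for c in range(C)}
filt = {k: random.random()
        for k in itertools.product(range(CI), range(3), range(3), range(C))}
inp = {k: random.random()
       for k in itertools.product(range(N), range(H + 2), range(W + 2), range(CI))}

# Fully-fused form (Halide): one loop nest computes init, update and relu.
relu_fused = {}
for n in range(N):
    for y in range(H):
        for x in range(W):
            for c in range(C):
                acc = bias[c]
                for rz in range(3):
                    for ry in range(3):
                        for rx in range(CI):
                            acc += filt[rx, rz, ry, c] * inp[n, y + rz, x + ry, rx]
                relu_fused[n, y, x, c] = max(0.0, acc)

# Fully-distributed form (Linalg): one loop nest per structured operation.
conv = {}
for n, y, x, c in itertools.product(range(N), range(H), range(W), range(C)):
    conv[n, y, x, c] = bias[c]
for n, y, x, c in itertools.product(range(N), range(H), range(W), range(C)):
    for rz, ry, rx in itertools.product(range(3), range(3), range(CI)):
        conv[n, y, x, c] += filt[rx, rz, ry, c] * inp[n, y + rz, x + ry, rx]
relu_dist = {k: max(0.0, v) for k, v in conv.items()}

# Per-element, both forms perform the same additions in the same order.
assert relu_fused == relu_dist
```

Because each element sees the same sequence of floating-point operations in
both forms, the results here are bitwise identical.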

## Mapping Halide Scheduling Primitives to Linalg Structured Transforms

The complete Halide schedule listed in the example is as follows:

```cpp
Var co, ci, xo, xi;
relu.split(c, co, ci, vec * tile_w)
  .split(x, xo, xi, tile_h)
  .reorder(ci, xi, xo, y, n, co)
  .vectorize(ci, vec)
  .unroll(ci)
  .unroll(xi)
  .parallel(y)
  .parallel(n)
  .parallel(co);

conv.compute_at(relu, xo)
  .vectorize(c, vec)
  .unroll(c)
  .unroll(x)
  .unroll(y)
  .update()
  .reorder(c, x, y, r.x, r.y, r.z, n)
  .vectorize(c, vec)
  .unroll(c)
  .unroll(x)
  .unroll(y)
  .unroll(r.x, 2);
```

We will consider only the case without parallelization to avoid the difference
in parallel runtimes generated by Halide and used by MLIR. This schedule
corresponds to a sequence of loop manipulations, unrolling and vectorization.
The following directives are present and can be mapped to transformations on
Linalg as described below.

*   `split` decomposes a loop dimension into two immediately nested loops with
    the inner loop having at most the given number of iterations. This can be
    understood as loop _strip-mining_ or a degenerate case of tiling a single
    dimension using any of the `linalg.tile_` transform ops. We will be using
    `transform.structured.tile_using_forall` as this kind of loop is best
    supported by bufferization and can also be turned into a parallel loop later
    on. Unlike Halide, this doesn’t add new dimensions to the original
    operation, but rather creates a loop around it and rewrites the operation
    itself to operate on a subset of the original data.
*   `reorder` rearranges the loops arbitrarily. In the Linalg representation,
    loops are implicit and are intended to remain so as long as possible to
    target microkernels. The order of implicit loops in a `linalg.generic`
    operation can be changed by using `transform.structured.interchange`, but
    this does not apply to named operations, which need to be “generalized”
    first by calling `transform.structured.generalize`. However, interchange can
    only reorder implicit dimensions and not the explicit loops materialized by
    tiling operations, which can no longer be “folded” into the original
    operation. Instead, we can leverage this behavior by materializing loops
    directly in the desired order by “tiling” to size 1.
*   `vectorize` indicates that the given dimension should be vectorized with the
    given factor; if the loop extent is larger than the factor, the loop is
    effectively split into two parts and the inner one is vectorized. On the
    contrary, structured Linalg op vectorization applies as a global
    transformation to all suitable operations at, e.g., a function scope via
    `transform.structured.vectorize_children_and_apply_patterns`. It relies on
    MLIR’s support for multidimensional vectors to directly map multidimensional
    tensors, which are later decomposed into operations on smaller
    hardware-compatible vectors during lowering.
*   `unroll` performs loop unrolling, fully or up to the given factor. It is
    equivalent to `transform.loop.unroll`.
*   `compute_at` indicates that the value of the function must be computed
    within the given loop that will be produced for another function; depending
    on the relation between loops surrounding functions, this corresponds to
    either a loop distribution or a producer/consumer fusion. Given that the
    Linalg representation starts in the fully distributed form, it can be
    represented as a sequence of `transform.structured.fuse_into_containing_op`
    that operates on `forall` loops materialized by tiling beforehand.

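
The strip-mining semantics of `split` can be sketched in Python; the extent and
size below are hypothetical, chosen so that the size divides the extent evenly:

```python
# Strip-mining: one loop over `extent` iterations becomes two nested loops,
# with the inner loop running at most `size` iterations. When `size` divides
# `extent` evenly, the set and order of visited indices are unchanged.
extent, size = 12, 4  # hypothetical values

original = list(range(extent))

split = []
for co in range(extent // size):  # outer loop over "tiles"
    for ci in range(size):        # inner loop within a tile
        split.append(co * size + ci)

assert split == original
```

When the size does not divide the extent, Halide and the tiling transforms must
additionally handle the leftover partial tile, which this sketch omits.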

## Recreating the Loop Structure

The first three transformation directives for `relu` in the Halide schedule aim
at producing the following loop structure.

```python
for co
  for n
    for y
      for xo
        for xi
          for ci
            relu[n, y, xo*tile_h + xi, co*tile_w*vec + ci] = ...
```

Note that the outer part of the `c` dimension gets hoisted out of all of the
surrounding loops. The implicit loop order for the operation is `n, y, x, c`, so
the `co` loop needs to be materialized first in order to achieve the desired
reordering. The remaining dimensions can be materialized as loops in one
transformation.

```mlir
    //                                                             [n  y  x  c]
    %co, %relu2 = transform.structured.tile_using_forall %relu
                                                        tile_sizes [0, 0, 0, 64]
    %n_y_xo, %relu3 = transform.structured.tile_using_forall %relu2
                                                        tile_sizes [1, 1, 5, 0]
```

This will result in the following loops being created in the IR, with the nested
elementwise operation operating on a smaller subset of the original data via
implicit loops.

```mlir
scf.forall (%co) in (2) {
  scf.forall (%n, %y, %xo) in (5, 80, 20) {
    tensor.extract_slice
    // Implicit dimensions [ni=0:1, y=0:1, xi=0:5, ci=0:64]
    %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> } // ...
    scf.forall.in_parallel {
      tensor.parallel_insert_slice // ...
    }
  }
}
```

The next loop restructuring transformations are `compute_at` and `reorder` on
the `conv` function, which need to happen before the loops are destroyed by
unrolling and vectorization. They are intended to produce the final desired loop
structure.

```python
for co
  for n
    for y
      for xo
        for xi
          for ci
            conv[n, y, xo*tile_h + xi, co*tile_w*vec + ci] = ...
        for rz
          for ry
            for rx
              for xi
                for ci
                  conv[n, y, xo*tile_h + xi, co*tile_w*vec + ci] += ...
        for xi
          for ci
            relu[n, y, xo*tile_h + xi, co*tile_w*vec + ci] = ...
```

Practically, this corresponds to fusing the convolution initialization and
update into the `co, n, y, xo` loops materialized by tiling earlier. The
structured op transformation set supports fusing the producer of a value into
its consumer, so fusion happens in two stages:

*   first the main convolution update is fused into the ReLU that uses it and
    has loops materialized;
*   then the bias initialization is fused into the convolution+relu loop nest.

Each stage consists of two transformations fusing the computational operation
into the outer loop, then into the inner loop.

```mlir
%conv2, %co2 = transform.structured.fuse_into_containing_op %conv into %co
%conv3, %n_y_xo2 = transform.structured.fuse_into_containing_op %conv2
  into %n_y_xo

%bias2, %co3 = transform.structured.fuse_into_containing_op %bias into %co2
%bias3, %n_y_xo3 = transform.structured.fuse_into_containing_op %bias2
  into %n_y_xo2
```

To complete the structure, we need to put the `rz, ry, rx` loops outside the
“tile” loops `xi, ci`. This can be achieved by materializing the corresponding
loops from the convolution operation. However, these are reduction loops and it
wouldn’t be valid to materialize them as intrinsically parallel “forall” loops.
Instead, we use the dedicated “reduction tiling” transformation and produce
sequential `scf.for` loops. (`scf.forall` loops can also express parallel
reductions, but the corresponding transformation doesn’t handle reductions along
more than one dimension at the moment of writing.)

```mlir
%rz_ry_rx, %red_fill, %conv4, %comb
  = transform.structured.tile_reduction_using_for %conv3
//               n  y  x  c  rz ry rx
  by tile_sizes=[0, 0, 0, 0, 1, 1, 1]
```

This transformation materializes the desired loops around the convolution
operation. It is also more capable than merely producing (reduction) loops: the
transformed code performs `tile_size` partial reductions of `N / tile_size`
elements, potentially in parallel by changing the dimension kind of the
structured operation inside the loop, and then performs a final reduction of
these partial results by producing a new “combiner” structured operation after
the loops. In our case, `tile_size = 1` along all dimensions, so the reduction
is entirely performed by the generated loops. The combiner structured operation
is still produced and adds up the reduction result with the initial value. This
changes the order of floating point operations (as would reduction tiling with
non-unit size) and may affect the final result due to the non-associativity of
these operations, but is explicitly allowed by `fastmath` flags. Halide also
emits LLVM IR with full `fastmath` flags.
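
The partial-reduction scheme can be modeled in Python; the values below are
hypothetical, and exactly representable binary fractions are used so the final
equality holds bitwise:

```python
# Reduction tiling: `tile_size` partial accumulators each reduce
# len(data) // tile_size elements; a final "combiner" folds the partial
# results into the initial value.
data = [0.5, 1.25, -2.0, 3.5, 0.25, -1.0, 2.0, 0.5]  # hypothetical values
init = 10.0
tile_size = 4

partial = [0.0] * tile_size
for i, v in enumerate(data):      # loop produced by reduction tiling
    partial[i % tile_size] += v

result = init
for p in partial:                 # combiner: folds partials into the init value
    result += p

# Same value as the untiled reduction for these exact inputs; in general the
# different association order may change floating-point results, which is
# what the fastmath flags permit.
assert result == init + sum(data)
```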

Finally, we need to produce the innermost loops `xi` and `ci` that are still not
explicit. As our next step is going to be vectorization along `ci`, we need to
take into account the way it operates on MLIR structured operations: rather than
selecting a specific vector size and loop/dimension to vectorize, it directly
substitutes multidimensional vector types for tensor types and updates the
operations accordingly. Therefore, our tensor type should not become trivial,
i.e. size-1, but should retain a `vector_size`-sized dimension along the desired
axis, `ci`. This can be achieved by tiling with `vector_size` as the tile size
in that dimension:

```mlir
//                                                                  n  y  xi ci
%1, %c5 = transform.structured.tile_using_forall %conv4 tile_sizes [0, 0, 1, 16]
%2, %b4 = transform.structured.tile_using_forall %bias3 tile_sizes [0, 0, 1, 16]
%3, %r4 = transform.structured.tile_using_forall %relu3 tile_sizes [0, 0, 1, 16]
%4, %c2 = transform.structured.tile_using_forall %comb  tile_sizes [0, 0, 1, 16]
```

Note that the combiner operation produced by reduction tiling is also tiled here.


3986841eff1SOleksandr "Alex" Zinenko## Explicit Loop Unrolling
3996841eff1SOleksandr "Alex" Zinenko
4006841eff1SOleksandr "Alex" ZinenkoThe remaining unhandled loop transformation is unrolling. Specifically,
4016841eff1SOleksandr "Alex" Zinenkounrolling is requested for the innermost loops that form the 4x5 tile of
4026841eff1SOleksandr "Alex" Zinenko16-element vector operations to ensure a contiguous sequence of `vfma`
4036841eff1SOleksandr "Alex" Zinenkoinstructions using 20 512-bit vector registers as accumulators. Unrolling
4046841eff1SOleksandr "Alex" Zinenkoadditional loops,, `unroll(y)` and `unroll(r.x, 2)`, is requested in the
4056841eff1SOleksandr "Alex" Zinenkoschedule but _has no practical effect_. That is, the code, and all intermediate
4066841eff1SOleksandr "Alex" Zinenkorepresentations, produced by Halide with these directives removed is _strictly
4076841eff1SOleksandr "Alex" Zinenkoidentical_ to the code with the full schedule. Therefore, we will only unroll
4086841eff1SOleksandr "Alex" Zinenkothe corresponding loops corresponding to `xi` and `ci` dimensions that actually
4096841eff1SOleksandr "Alex" Zinenkoget unrolled by Halide.

As tiling in the Transform dialect produces handles to the loops materialized by
tiling, unrolling those loops is just a matter of chaining the corresponding
transformation. Note that the inner loop must be unrolled first as unrolling the
outer loop will invalidate the handles to the inner loop.

```mlir
transform.loop.unroll %bias_ci {factor = 4}
transform.loop.unroll %bias_xi {factor = 5}
transform.loop.unroll %conv_ci {factor = 4}
transform.loop.unroll %conv_xi {factor = 5}
transform.loop.unroll %relu_ci {factor = 4}
transform.loop.unroll %relu_xi {factor = 5}
transform.loop.unroll %comb_ci {factor = 4}
transform.loop.unroll %comb_xi {factor = 5}
```

## Vectorization

These transformations produced the desired loop structure and we are now ready
to vectorize. Before proceeding, it is desirable to simplify the code, as tiling
and fusion may have produced a lot of operations computing tensor subsets and
loop ranges, some of which may be duplicated or excessively complex.
Simplification involving canonicalization, common subexpression elimination,
loop-invariant code motion and various rewrite patterns can be applied directly
from the transform dialect. Furthermore, an arbitrary combination of rewrite
patterns can be applied _in one sweep_ to a given scope, a functionality that
_cannot be achieved with conventional compiler passes_ that apply each group of
patterns separately (at least without creating a new pass for each combination
of pattern groups).

```mlir
%f00 = transform.structured.match ops{["func.func"]} in %arg0
transform.apply_patterns to %f00 {
  transform.apply_patterns.canonicalization
  transform.apply_patterns.linalg.tiling_canonicalization
}
transform.apply_cse to %f00

%all_loops = transform.structured.match interface{LoopLikeInterface} in %arg0
transform.apply_licm to %all_loops
```

One final simplification is necessary to produce good vectorized code.
Tiling-by-one as a way of materializing loops produced structured (`linalg`)
operations processing 4D types where only one dimension isn’t unit-sized, e.g.,
`tensor<1x1x1x16xf32>` where 16 is the vector size corresponding to AVX512,
because structured tiling doesn’t modify the rank of the operation in order to
preserve the original structure. Even though the core computation is the same,
the produced code may end up more complicated than necessary, in particular when
decomposing multidimensional vectors into single-dimensional vectors supported
by hardware. Such unit dimensions can be explicitly folded away using the
corresponding pattern set before vectorization.

```mlir
transform.apply_patterns to %f00 {
  transform.apply_patterns.linalg.fold_unit_extent_dims_via_reshapes
}

%fv = transform.structured.vectorize_children_and_apply_patterns %f00
```

This produces the desired code performing arithmetic operations on
`vector<16xf32>` types that can be easily lowered to AVX512 instructions by the
downstream compiler. Vectorization may have created new opportunities for code
simplification, in particular combining tensor subsetting and vector slicing
operations. Another round of simplification can be applied after vectorization.

```mlir
transform.apply_patterns to %fv {
  transform.apply_patterns.canonicalization
  transform.apply_patterns.tensor.fold_tensor_subset_ops_into_vector_transfers
}
transform.apply_cse to %fv
transform.structured.hoist_redundant_vector_transfers %fv
```

## Lowering to LLVM and The Bufferization Hurdle

With the loop restructuring done, the program now needs to be converted to an
executable form. The first step in doing so is _bufferization_, the process that
associates a memory buffer with every tensor in the payload IR. MLIR’s one-shot
bufferization is directly available as a transform operation.

```mlir
%arg1 = transform.bufferization.one_shot_bufferize %arg0 {
  bufferize_function_boundaries = true,
  function_boundary_type_conversion = 1 : i32 }
```

One-shot bufferization itself does not produce buffer deallocations, which may
lead to leaks. So we have to run the buffer deallocation pass pipeline to avoid
them. Note that the Transform dialect seamlessly runs named passes and pass
pipelines: if desired, one could replace complex `--pass-pipeline` expressions
with operations. Note that we apply the pipeline to functions rather than the
entire module to avoid running it on the transform IR that is contained in the
module.

```mlir
%f = transform.structured.match ops{["func.func"]} in %arg1
  : (!transform.any_op) -> !transform.any_op
transform.apply_registered_pass "buffer-deallocation-pipeline" to %f
  : (!transform.any_op) -> !transform.any_op
```

In this particular case, the transformed IR could be directly bufferized. This
is not always the case in general, as some operations, in particular
`tensor.empty`, may not be bufferizable. Such operations need to be removed
before running the bufferization, which can often be achieved by sufficient
fusion (as in our case), or by running dedicated transformations:
`transform.bufferization.eliminate_empty_tensors` removes the `tensor.empty`
operations that only serve to define the size of a computation, and
`transform.bufferization.empty_tensor_to_alloc_tensor` materializes a new
temporary buffer for empty tensors to be used as local caches.

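For illustration, these operations could be invoked as follows. This is a
hedged sketch in the same pseudo-code style as the rest of the chapter: the
handle names are invented and the result types are omitted.

```mlir
// Fold `tensor.empty` ops into the destination operands of subsequent
// destination-style operations, where possible.
transform.bufferization.eliminate_empty_tensors %arg0

// For any remaining `tensor.empty`, materialize an explicit temporary
// buffer (`bufferization.alloc_tensor`) instead.
%empty = transform.structured.match ops{["tensor.empty"]} in %arg0
%alloc = transform.bufferization.empty_tensor_to_alloc_tensor %empty
```
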
```mlir
// Apply general canonicalization and CSE to each function after
// bufferization as new simplification opportunities may have appeared.
%fb = transform.structured.match ops{["func.func"]} in %arg1
transform.apply_patterns to %fb {
  transform.apply_patterns.canonicalization
}
transform.apply_cse to %fb

// Lower complex, multidimensional vector operations into simpler
// primitives. This particular selection of the pattern groups corresponds
// to vector dialect operations present in the payload IR at this stage.
// Many of these groups can be parameterized to use different strategies or
// lower-level primitives offering performance trade-offs. In this case, we
// are selecting the simplest strategies.
transform.apply_patterns to %fb {
  transform.apply_patterns.vector.lower_contraction
    lowering_strategy = parallelarith
  transform.apply_patterns.vector.lower_transfer
    max_transfer_rank = 1
  transform.apply_patterns.vector.lower_transpose
    lowering_strategy = eltwise
  transform.apply_patterns.vector.lower_shape_cast
}

// These patterns are applied in a separate sweep to avoid transfer-to-scf
// patterns overlapping with lower-transfer patterns, as they apply to the
// same kind of operations. These patterns may produce local allocations to
// act as temporary caches deep inside loops, which could lead to
// catastrophic performance. Such allocations are moved onto the stack and
// hoisted from all the surrounding loops.
transform.apply_patterns to %fb {
  transform.apply_patterns.vector.transfer_to_scf
  transform.apply_patterns.memref.alloc_to_alloca
}
transform.bufferization.buffer_loop_hoisting %fb

// A final round of cleanups additionally includes patterns to simplify
// buffer aliasing operations that may have been introduced during
// bufferization and could result in excessively complex address
// computation.
transform.apply_patterns to %fb {
  transform.apply_patterns.memref.fold_memref_alias_ops
  transform.apply_patterns.canonicalization
}
transform.apply_cse to %fb
```


Due to its inter-procedural nature, one-shot bufferization processes the entire
payload module and thus invalidates all previously created handles. Therefore,
it is typically a late step in the transformation sequence where precise
targeting of transformations is no longer required. The following
transformations are typically module- or function-wide rewrites that are often
pattern-based lowerings. This part of the sequence can be seen as a pass
pipeline specified directly in the transform dialect, with pattern-based
lowering passes constructed _on-the-fly_ from named groups of patterns.

The resulting IR can be further completely lowered to the LLVM dialect, then to
LLVM IR, and processed by the LLVM compiler to produce an executable or to be
JIT-compiled.

The generated code runs in ~420ms on an Intel processor with Skylake
microarchitecture clocked at 2.0GHz. Given that the computation performs
$`5 \cdot 80 \cdot 100 \cdot 128 \cdot (2 \cdot 3 \cdot 3 \cdot 128 + 2) \approx 5.9 \cdot 10^9`$
floating point operations, it reaches ~14 GFlops. With 1 FMA unit available,
the single-core performance of the test processor is 64 GFlops
($`16 \cdot 2 \cdot 2 \cdot 10^9`$, where 16 is the vector width), so only
22% of the theoretical peak is achieved.

The code produced by Halide runs in ~120ms on the same processor, a 3.5x
improvement and 77% of peak. Let us analyze the generated assembly to understand
the source of the difference. The main computational effort is expected to
happen around floating point multiplications and additions in the convolution.
In both cases, the assembly features AVX512 `vfmadd231ps` instructions operating
on `%zmm` 512-bit vector registers. In the MLIR-generated code, they are
interspersed with memory accesses loading _two_ of the `fma` operands before
each operation and leading to increased latency.

```asm
vmovups       -192(%r10), %zmm0
vbroadcastss  -1536(%rdi,%r9), %zmm1
vmovups       112(%rsp), %zmm2
vfmadd231ps   %zmm1, %zmm0, %zmm2     # zmm2 = (zmm0 * zmm1) + zmm2
vmovups       %ymm2, 112(%rsp)
vextractf64x4 $1, %zmm2, 144(%rsp)
// 19 more blocks of either
//  (a) vmovups,vbroadcast,vfma(z,z),vextract,
//  (b) vbroadcast,vfma(z,mem),vextract
```

The Halide-generated code, however, features compact blocks of `vfmadd231ps`
and `vbroadcastss` instructions, with the broadcast loading one of the operands
while the other two are resident in registers, having been loaded before the
`fma`.

```asm
vbroadcastss    -1536(%rsi,%rbx), %zmm25
vmovups         -192(%rdi), %zmm26
vmovups         -128(%rdi), %zmm27
vmovups         -64(%rdi), %zmm28
vmovups         (%rdi), %zmm29
vfmadd231ps     %zmm25, %zmm26, %zmm24  # zmm24 = (zmm26 * zmm25) + zmm24
vfmadd231ps     %zmm25, %zmm27, %zmm23  # zmm23 = (zmm27 * zmm25) + zmm23
vfmadd231ps     %zmm25, %zmm28, %zmm22  # zmm22 = (zmm28 * zmm25) + zmm22
vfmadd231ps     %zmm25, %zmm29, %zmm21  # zmm21 = (zmm29 * zmm25) + zmm21
vbroadcastss    -1024(%rsi,%rbx), %zmm25
vfmadd231ps     %zmm25, %zmm26, %zmm20  # zmm20 = (zmm26 * zmm25) + zmm20
vfmadd231ps     %zmm25, %zmm27, %zmm19  # zmm19 = (zmm27 * zmm25) + zmm19
vfmadd231ps     %zmm25, %zmm28, %zmm18  # zmm18 = (zmm28 * zmm25) + zmm18
vfmadd231ps     %zmm25, %zmm29, %zmm17  # zmm17 = (zmm29 * zmm25) + zmm17
vbroadcastss    -512(%rsi,%rbx), %zmm25

// 3 more blocks of 4 vfmadd231 followed by a vbroadcast
```

Inspecting the progressive intermediate representations produced by MLIR, one
can observe the load(transfer)/fma interspersing at all levels starting after
schedule application. The repeated tensor subsetting operations, which are later
transformed into vector transfer operations and then vector memory loads, are
produced by the loop unrolling that was explicitly requested in the schedule!
The issue is the single-assignment model of tensors (and vectors) that results
in complex chains of access and update operations that become so long that the
lower-level transformations and the downstream compiler can no longer simplify
them. In fact, unrolling loops early in the transformation sequence can lead to
all sorts of compiler-performance problems, including the compiler failing to
perform some optimizations due to excessive code length.

It is therefore desirable to perform loop unrolling at a later stage,
specifically after bufferization and the relevant simplification. However,
bufferization invalidates all loop handles, including those to the loops that
we want to unroll. This hurdle can be overcome by matching the payload IR
operations after bufferization to produce new handles. We will first change the
kind of loops produced in the schedule from `scf.for` to `scf.forall` to have
fewer operations to match, by using `transform.structured.tile_using_forall`
instead of `transform.structured.tile` when tiling with sizes `[0, 0, 1, 16]`.
Then we can match all `scf.forall` operations in the payload IR and transform
them into single-iterator `scf.for` loops _after bufferization_.

```mlir
%foralls = transform.structured.match ops{["scf.forall"]} in %arg1
%xi_bias, %ci_bias = transform.loop.forall_to_for %xi_ci_bias
%xi_conv, %ci_conv = transform.loop.forall_to_for %xi_ci_conv
%xi_relu, %ci_relu = transform.loop.forall_to_for %xi_ci_relu
%xi_comb, %ci_comb = transform.loop.forall_to_for %xi_ci_comb
```

We can then move our loop unrolling transformations later in the transformation
sequence, as desired. Compiling this new version to assembly produces exactly
the same core computation around `vfmadd231ps` as Halide’s version, which only
differs slightly in allocated registers. Unsurprisingly, this version runs in
roughly 120ms on the same machine.


## Multi-Dimensional Vectors to the Rescue

While we managed to produce similar code to Halide in the previous section, we
did so by rematching generated loops after bufferization, which partially defies
the purpose of using handles to chain transformations in the Transform dialect.
Luckily, this step is not really necessary. It only served as an exercise in
producing the desired loop structure.

Multidimensional structured operations on vectors are lowered to target-specific
vectors by unrolling and splitting. For example, an elementwise arithmetic
operation on `vector<5x64xf32>` is replaced with 5 operations on
`vector<64xf32>` and additional vector value manipulations to recreate the
required type at the MLIR level. Each of these operations is then split into 4
operations on `vector<16xf32>` at the LLVM level, where the information about
the target vector width becomes available. Collectively, this has exactly the
same effect as first materializing the 5x4 loop nest, and then fully unrolling
these loops. Therefore, the last stage of tiling, re-matching and unrolling can
be removed from the schedule.

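For instance, the first (MLIR-level) step of this decomposition can be pictured
as follows. This is a hedged sketch with invented value names, in the same
pseudo-code style as the rest of the chapter; the exact `vector.extract` and
`vector.insert` syntax varies between MLIR versions.

```mlir
// Original elementwise operation on a two-dimensional vector type.
%sum = arith.addf %lhs, %rhs : vector<5x64xf32>

// After unrolling: one one-dimensional operation per outermost index, with
// extract/insert operations recreating the 2-D vector value (%init is the
// value being rebuilt).
%l0 = vector.extract %lhs[0] : vector<64xf32> from vector<5x64xf32>
%r0 = vector.extract %rhs[0] : vector<64xf32> from vector<5x64xf32>
%s0 = arith.addf %l0, %r0 : vector<64xf32>
%v0 = vector.insert %s0, %init [0] : vector<64xf32> into vector<5x64xf32>
// ...and similarly for indices 1 through 4, threading the partially
// rebuilt vector through subsequent `vector.insert` ops.
```
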
The resulting assembly has all `vbroadcast` instructions grouped together before
the `vfmadd231ps` instructions but otherwise has a similar structure. This
grouping is due to each multi-dimensional vector operation being “unrolled”
separately. When executed, it runs in ~110ms, an improvement of roughly 8% over
both the previous version and Halide, and reaches ~53.7 GFlop/s, or 84% of peak
single-core performance. The improvement is largely due to the intermediate
representation being shorter and simpler in the presence of large-vector
operations, which allowed for more aggressive address computation and load
placement optimization.

The final transformation strategy is checked into the repository at
[mlir/examples/transform/ChH/full.mlir](
https://github.com/llvm/llvm-project/tree/main/mlir/test/Examples/transform/ChH/full.mlir).