# Bufferization

[TOC]

## Overview

Bufferization in MLIR is the process of converting ops with `tensor` semantics
to ops with `memref` semantics. There are multiple MLIR passes that are related
to bufferization. These passes typically run as one of the last steps in a
pass pipeline, right before lowering `memref` ops to LLVM. That is because
many transformations are easier or only supported in tensor land; e.g.,
[tile/fuse/… on tensors first](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
then bufferize the remaining IR.

![bufferization passes](/includes/img/bufferization_passes.svg)

The most important bufferization pass is *One-Shot Bufferize*: This pass
rewrites `tensor` IR to `memref` IR. There are additional helper passes that
preprocess IR (e.g., so that IR can be bufferized more efficiently), perform
buffer-level optimizations such as allocation hoisting, and
[insert buffer deallocation ops](OwnershipBasedBufferDeallocation.md) so that
the resulting `memref` IR has no memory leaks.

## Deprecated Passes

The buffer deallocation pass has been deprecated in favor of the ownership-based
buffer deallocation pipeline. The deprecated pass has some limitations that may
cause memory leaks in the resulting IR.

## What is One-Shot Bufferize?

One-Shot Bufferize is a tensor bufferization pass designed for IR in
[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
and with aggressive in-place bufferization.

One-Shot Bufferize is:

*   **Monolithic**: A single MLIR pass does the entire work.

*   **Extensible** via an op interface: All ops that implement
    `BufferizableOpInterface` can be bufferized.

*   A **whole-function-at-a-time analysis**. In-place bufferization decisions
    are made by analyzing SSA use-def chains on tensors. Op interface
    implementations not only provide the rewrite logic from tensor ops to memref
    ops, but also helper methods for One-Shot Bufferize's analysis to query
    information about an op's bufferization/memory semantics.

*   **2-Phase**: Bufferization is internally broken down into 2 steps: First,
    analyze the entire IR and make bufferization decisions. Then, bufferize
    (rewrite) the IR. The analysis has access to exact SSA use-def information.
    It incrementally builds alias and equivalence sets and does not rely on an
    a-posteriori alias analysis of preallocated memory.

*   **Greedy**: Operations are analyzed one-by-one and it is decided on the spot
    whether a tensor OpOperand must be copied or not. Heuristics determine the
    order of analysis.

*   **Modular**: The current One-Shot Analysis can be replaced with a different
    analysis. The results of the analysis are queried by the bufferization via
    `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
    of `AnalysisState` that implements a small number of virtual functions can
    serve as a custom analysis. It is even possible to run One-Shot Bufferize
    without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
    Bufferize copies every buffer before writing to it.

Note that One-Shot Bufferize does not deallocate buffers. That is done by the
[Ownership-based Buffer Deallocation passes](OwnershipBasedBufferDeallocation.md).

## Goals of Bufferization

The high-level goal of every bufferization technique is to:

1. Use as little memory as possible.
2. Copy as little memory as possible.

This implies reusing already allocated buffers when possible, turning
bufferization into an algorithmically complex problem with similarities to
register allocation.

Depending on the concrete use case, there may be additional bufferization
requirements. If the contents of a buffer are expensive to compute, there could
be a tradeoff between *recomputation* and *compute once and copy*. Conversely,
it may not even be possible to allocate new buffers at runtime on some
architectures.

## Destination-Passing Style

Bufferization is an algorithmically complex problem. Given an op with a tensor
result, bufferization has to choose a memref buffer in which the result can be
stored. It is always safe to allocate a brand new buffer, but such a
bufferization strategy would be unacceptable for high-performance codegen. When
choosing an already existing buffer, we must be careful not to accidentally
overwrite data that is still needed later in the program.

To simplify this problem, One-Shot Bufferize was designed to take advantage of
*destination-passing style* (DPS). In MLIR, DPS ops should implement the
[`DestinationStyleOpInterface`](https://github.com/llvm/llvm-project/blob/792d437b56adfb3416daf8105942d4899fb82763/mlir/include/mlir/Interfaces/DestinationStyleOpInterface.td).
DPS exists independently of bufferization and is tied to SSA semantics: many
ops are "updating" a part of their input SSA variables. For example, the LLVM
instruction
[`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
inserts an element into a vector. Since SSA values are immutable, the
operation returns a copy of the input vector with the element inserted.
Another example in MLIR is `linalg.generic` on tensors, which always has an
extra `outs` operand for each result that provides the initial values to
update (for example, when the operation is performing a reduction).

`outs` operands are referred to as "destinations" in the following (the quotes
are important, as this operand is not modified in place but copied) and come
into play in the context of bufferization as a possible "anchor" for the
bufferization algorithm. This allows the user to shape the input in a form that
guarantees a close-to-optimal bufferization result when carefully choosing the
SSA value used as the "destination".

For every tensor result, a DPS op has a corresponding tensor operand. If there
aren't any other conflicting uses of this tensor, the bufferization can alias
it with the op result and perform the operation "in-place" by reusing the buffer
allocated for this "destination" input.

As an example, consider the following op: `%r = tensor.insert %f into
%t[%idx] : tensor<5xf32>`

![tensor.insert example](/includes/img/bufferization_tensor_insert_dst.svg)

`%t` is the "destination" in this example. When choosing a buffer for the result
`%r`, denoted as `buffer(%r)`, One-Shot Bufferize considers only two options:

1.  `buffer(%r) = buffer(%t)`: store the result in the existing `buffer(%t)`.
    Note that this is not always possible, e.g., if the old contents of
    `buffer(%t)` are still needed. One-Shot Bufferize's main task is to detect
    such cases and fall back to the second option when necessary.
2.  `buffer(%r)` is a newly allocated buffer. (Both options are sketched below.)
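
To make this concrete, the following sketch shows what the two options could
bufferize to for the `tensor.insert` example above. `%t_buf` is a placeholder
name for `buffer(%t)`, not actual pass output:

```mlir
// Option 1: in-place. The result reuses buffer(%t).
memref.store %f, %t_buf[%idx] : memref<5xf32>

// Option 2: out-of-place. A new buffer is allocated and the old contents of
// buffer(%t) are copied first, so that they remain available.
%alloc = memref.alloc() : memref<5xf32>
memref.copy %t_buf, %alloc : memref<5xf32> to memref<5xf32>
memref.store %f, %alloc[%idx] : memref<5xf32>
```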

There may be other buffers in the same function that could potentially be used
for `buffer(%r)`, but those are not considered by One-Shot Bufferize to keep the
bufferization simple. One-Shot Bufferize could be extended to consider such
buffers in the future to achieve a better quality of bufferization.

Tensor ops that are not in destination-passing style always bufferize to a
memory allocation. E.g.:

```mlir
%0 = tensor.generate %sz {
^bb0(%i : index):
  %cst = arith.constant 0.0 : f32
  tensor.yield %cst : f32
} : tensor<?xf32>
```

The result of `tensor.generate` does not have a "destination" operand, so
bufferization allocates a new buffer. This could be avoided by instead using an
op such as `linalg.generic`, which can express the same computation with a
"destination" operand, specified via `outs`:

```mlir
#map = affine_map<(i) -> (i)>
%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
                    outs(%t : tensor<?xf32>) {
  ^bb0(%arg0 : f32):
    %cst = arith.constant 0.0 : f32
    linalg.yield %cst : f32
} -> tensor<?xf32>
```

At first glance, the above `linalg.generic` op may not seem very useful because
the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
operand in the first place? As an example, this can be useful for overwriting a
slice of a tensor:

```mlir
%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
%0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
%1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
    : tensor<?xf32> into tensor<?xf32>
```

The above example bufferizes to a `memref.subview`, followed by a
"`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming
that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes
to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`).
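
A rough sketch of the bufferized IR for this in-place case; buffer names such as
`%s_buf` and `%t_buf` are placeholders, not actual pass output:

```mlir
%t_buf = memref.subview %s_buf[%idx] [%sz] [1]
    : memref<?xf32> to memref<?xf32, strided<[1], offset: ?>>
// The "linalg.generic on memrefs" overwrites the memory of the subview.
linalg.generic ... outs(%t_buf : memref<?xf32, strided<[1], offset: ?>>) { ... }
// tensor.insert_slice becomes a no-op: the data is already in buffer(%s).
```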

RaW conflicts are detected with an analysis of SSA use-def chains (details
later). One-Shot Bufferize works best if there is a single SSA use-def chain,
where the result of a tensor op is the operand of the next tensor op, e.g.:

```mlir
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
%2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
```

Buffer copies are likely inserted if the SSA use-def chain splits at some point,
e.g.:

```mlir
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)

// "yet_another_op" likely needs to read the data of %0, so "another_op" cannot
// write in-place to buffer(%0).
%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
```
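
A sketch of what the bufferized IR could look like in that case. All names are
placeholders and the memref form of the unknown dialect ops is hypothetical:

```mlir
// buffer(%0), as produced by the bufferized "some_op".
%0_buf = ...
// Copy buffer(%0) so that "another_op" can write without clobbering it.
%alloc = memref.alloc(%size) : memref<?xf32>
memref.copy %0_buf, %alloc : memref<?xf32> to memref<?xf32>
"my_dialect.another_op"(%alloc) : (memref<?xf32>) -> ()
// "yet_another_op" still reads the unmodified buffer(%0).
"my_dialect.yet_another_op"(%0_buf) : (memref<?xf32>) -> ()
```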

## Tensor / MemRef Boundary

The bufferization dialect provides a few helper ops to connect tensor IR (that
should be bufferized) with existing buffers (that may be allocated/provided by
a different runtime/library/etc.).

`bufferization.to_memref %t` returns the future buffer of a tensor SSA value.
`bufferization.to_tensor %m` returns a tensor SSA value for a given MemRef
buffer. `bufferization.materialize_in_destination` indicates that a tensor value
should materialize in a certain buffer.

Consider the following example, where a TOSA matmul result should materialize in
an existing buffer `%C`:

```mlir
// Batched TOSA matrix multiplication. %A and %B are the
// inputs, %C is the output.
func.func @test_matmul(%A: memref<1x17x19xf32>,
                       %B: memref<1x19x29xf32>,
                       %C: memref<1x17x29xf32>) {

  %A_tensor = bufferization.to_tensor %A restrict : memref<1x17x19xf32> to tensor<1x17x19xf32>
  %B_tensor = bufferization.to_tensor %B restrict : memref<1x19x29xf32> to tensor<1x19x29xf32>

  %0 = tosa.matmul %A_tensor, %B_tensor
      : (tensor<1x17x19xf32>, tensor<1x19x29xf32>) ->
         tensor<1x17x29xf32>

  bufferization.materialize_in_destination
    %0 in restrict writable %C
      : (tensor<1x17x29xf32>, memref<1x17x29xf32>) -> ()

  return
}
```

Note that all bufferization ops in this example have the `restrict` unit
attribute set. This attribute is similar to the C restrict keyword and indicates
that there is no other `to_tensor` or `materialize_in_destination` op with
the same or an aliasing MemRef operand. Only such
`to_tensor`/`materialize_in_destination` ops are supported. The `restrict`
attribute gives strong aliasing guarantees to the bufferization analysis and
allows us to look only at the tensor IR in a program. (Ops that do not operate
on tensors are ignored by One-Shot Bufferize.)

Also note that `tosa.matmul` cannot be bufferized as is: there is no
`BufferizableOpInterface` implementation for that op. However, the op can be
lowered to a combination of `tensor.empty` and `linalg.matmul`, which can be
bufferized.
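
A rough sketch of such a lowering; for the batched TOSA op the Linalg op would
typically be `linalg.batch_matmul`, and the exact sequence depends on the
lowering pass:

```mlir
%empty = tensor.empty() : tensor<1x17x29xf32>
%zero = arith.constant 0.0 : f32
%init = linalg.fill ins(%zero : f32) outs(%empty : tensor<1x17x29xf32>)
    -> tensor<1x17x29xf32>
%0 = linalg.batch_matmul
    ins(%A_tensor, %B_tensor : tensor<1x17x19xf32>, tensor<1x19x29xf32>)
    outs(%init : tensor<1x17x29xf32>) -> tensor<1x17x29xf32>
```

Because `%init` is the matmul's "destination" and has no other uses, One-Shot
Bufferize can then bufferize the matmul in-place into that buffer.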

## Using One-Shot Bufferize

MLIR provides a pass
[`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize)
that performs an analysis and bufferizes all ops with tensor semantics that
implement `BufferizableOpInterface`. For modularity reasons, these op interface
implementations are typically external models that live in a dialect's
"Transforms" build unit. (External models are a mechanism for implementing an op
interface in a different build unit.) It is the user's responsibility to ensure
that all needed external models are registered before running One-Shot
Bufferize.


By default, One-Shot Bufferize fails when it encounters an op with tensor
semantics (i.e., tensor result or tensor operand) that is not bufferizable
(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
`to_memref`/`to_tensor` ops around the bufferization boundary.

One-Shot Bufferize can be configured to bufferize only ops from a set of
dialects with `dialect-filter`.

One-Shot Bufferize can also be called programmatically with
[`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
Alternatively,
[`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78)
skips the analysis and inserts a copy on every buffer write.

By default, function boundaries are not bufferized. This is because there are
currently limitations around function graph bufferization: recursive
calls are not supported. As long as there are no recursive calls, function
boundary bufferization can be enabled with `bufferize-function-boundaries`. Each
tensor function argument and tensor function result is then turned into a
memref. The layout map of the memref type can be controlled with
`function-boundary-type-conversion`.
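
As an illustration, a function signature with tensor types is rewritten roughly
as follows. This is a sketch; the exact layout maps depend on
`function-boundary-type-conversion`:

```mlir
// Before bufferization:
func.func @foo(%t: tensor<?xf32>) -> tensor<?xf32> {
  return %t : tensor<?xf32>
}

// After bufferization (sketch): tensor arguments and results become memrefs,
// here with fully dynamic layout maps.
func.func @foo(%m: memref<?xf32, strided<[?], offset: ?>>)
    -> memref<?xf32, strided<[?], offset: ?>> {
  return %m : memref<?xf32, strided<[?], offset: ?>>
}
```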

## Memory Layouts

One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
ops are bufferizable. However, when encountering a non-bufferizable op with
`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
bufferization boundary and decide on a memref type. By default, One-Shot
Bufferize chooses the most dynamic memref type wrt. layout maps. E.g.:

```mlir
%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
%1 = tensor.extract %0[%idx1, %idx2] : tensor<?x?xf32>
```

When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
dynamic offset and strides:

```mlir
%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
%0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>>
%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>>
```

All users of `%0` have fully dynamic layout maps. This ensures that the
bufferized IR composes well with future bufferizations of `unbufferizable_op`
(maybe bufferized by another pass), regardless of the exact memref type of the
future bufferization. If the op turns out to be bufferized to an op with a
simpler memref type (e.g., identity layout map), we expect that canonicalization
patterns would clean up unnecessarily dynamic layout maps. (Some of these
canonicalization patterns may not be implemented yet.)

One-Shot Bufferize tries to infer the most precise memref type when bufferizing
an op. If the entire IR is bufferizable, we do not have to resort to
conservatively using fully dynamic layout maps. In that case, we also do not
have to rely on canonicalization patterns to clean up the bufferized IR.

Note: There are some bufferizable ops for which a precise layout map cannot be
inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>`
must be bufferized to a `memref.cast` with a memref type that has a fully
dynamic layout map.
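
A sketch of that case (value names are placeholders):

```mlir
// Before bufferization:
%1 = tensor.cast %0 : tensor<*xf32> to tensor<?x?xf32>

// After bufferization (sketch): no static layout can be inferred from the
// unranked source, so the result uses a fully dynamic layout map.
%1_m = memref.cast %0_m
    : memref<*xf32> to memref<?x?xf32, strided<[?, ?], offset: ?>>
```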

One-Shot Bufferize has an option `unknown-type-conversion` to control the
generation of layout maps when no precise layout can be inferred (both settings
are sketched below):

*   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default
    behavior. This composes well when IR is partially bufferized.
*   `identity-layout-map` uses static identity layout maps. This option can be
    useful for legacy code that cannot handle memref types with layout maps.
    Note that this setting can lead to additional buffer copies when folding a
    `to_tensor`/`to_memref` pair with memref types that are not cast-compatible.
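
A sketch of the difference, reusing the `unbufferizable_op` example from the
beginning of this section:

```mlir
// unknown-type-conversion=fully-dynamic-layout-map (default):
%m1 = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>>

// unknown-type-conversion=identity-layout-map:
%m2 = bufferization.to_memref %0 : memref<?x?xf32>
```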

Note: The `unknown-type-conversion` option does not affect layout maps of
function signatures. There is a separate `function-boundary-type-conversion`
option that controls layout maps of function parameters and function results.

## Extending One-Shot Bufferize

Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
must at least implement the following interface methods.

*   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
    OpOperand is read.
*   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
    OpOperand is written (if bufferizing in-place).
*   `getAliasingOpResult`: Return the OpResults that may share the same buffer
    as the given OpOperand. This interface method describes the
    OpOperand-to-OpResult mapping wrt. destination-passing style.
*   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
    is the exact same memref as the aliasing OpOperand after bufferization (in
    case of in-place bufferization). Otherwise (e.g., if they overlap but are
    not necessarily the exact same memrefs), `BufferRelation::Unknown` should be
    returned. Additional buffer relations will be added in the future, but
    `BufferRelation::Unknown` is always safe.
*   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced
    with `bufferization::replaceOpWithBufferizedValues`.

To get a better intuition of the interface methods, we invite users to take a
look at existing implementations in MLIR, e.g., the implementation of
`tensor.insert` or `tensor.extract`.

Interface implementations of DPS ops (that implement
`DestinationStyleOpInterface`) can derive from
`DstBufferizableOpInterfaceExternalModel`, which provides all necessary
method implementations except for `bufferize`.

## Debugging Buffer Copies

To get a better understanding of why One-Shot Bufferize introduced a buffer
copy, users can run the pass with `test-analysis-only print-conflicts`. Every
tensor op is then annotated with an attribute that has a boolean value for each
tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
means that the OpOperand bufferizes out-of-place and a buffer copy will be
inserted.

There are two reasons why a buffer copy may be inserted.

1.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
    overwritten data is still needed.
2.  The buffer is not writable. E.g., `memref.global` buffers that result from
    bufferizing `arith.constant` ops are never modified (see the sketch below).
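
A sketch of the second case: a constant tensor typically bufferizes to a
read-only `memref.global`, so an insertion must go into a copy. The global
symbol name and value names here are hypothetical:

```mlir
// Before bufferization:
%cst = arith.constant dense<0.0> : tensor<4xf32>
%0 = tensor.insert %f into %cst[%idx] : tensor<4xf32>

// After bufferization (sketch): the global buffer is never modified; the
// insertion is performed on a fresh copy instead.
%m = memref.get_global @__constant_4xf32 : memref<4xf32>
%alloc = memref.alloc() : memref<4xf32>
memref.copy %m, %alloc : memref<4xf32> to memref<4xf32>
memref.store %f, %alloc[%idx] : memref<4xf32>
```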

In the first case, `print-conflicts` illustrates the conflict in the form of a
("read", "conflicting write", "last write") tuple.

A RaW conflict consists of three parts, in the following order according to
op dominance:

1. **Definition:** A tensor `%t` is defined.
2. **Conflicting Write:** An operation writes to `buffer(%t)`.
3. **Read:** An operation reads `%t`.

When such a RaW conflict is detected during the analysis phase, One-Shot
Bufferize will insert a buffer copy for the conflicting write.

**Example**

```mlir
// RUN: mlir-opt %s -one-shot-bufferize="bufferize-function-boundaries test-analysis-only print-conflicts"
func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
  // Create a new tensor with [%arg0, %arg0, %arg0].
  %0 = tensor.from_elements %arg0, %arg0, %arg0 : tensor<3xf32>

  // Insert something into the new tensor.
  %1 = tensor.insert %arg1 into %0[%arg2] : tensor<3xf32>

  // Read from the old tensor.
  %r = tensor.extract %0[%arg3] : tensor<3xf32>

  // Return the extracted value and the result of the insertion.
  func.return %r, %1 : f32, tensor<3xf32>
}
```

The output IR is as follows:

```mlir
func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
  %from_elements = tensor.from_elements %arg0, %arg0, %arg0 {"C_0[DEF: result 0]"} : tensor<3xf32>
  %inserted = tensor.insert %arg1 into %from_elements[%arg2] {"C_0[CONFL-WRITE: 1]", __inplace_operands_attr__ = ["none", "false", "none"]} : tensor<3xf32>
  %extracted = tensor.extract %from_elements[%arg3] {"C_0[READ: 0]", __inplace_operands_attr__ = ["true", "none"]} : tensor<3xf32>
  return {__inplace_operands_attr__ = ["none", "true"]} %extracted, %inserted : f32, tensor<3xf32>
}
```

Note that the IR was not bufferized. It was merely annotated with the results
of the bufferization analysis. Every operation with tensor semantics has a
`__inplace_operands_attr__` attribute with one value per operand. If an operand
is not a tensor, the respective value is `none`. Otherwise, if the operand was
decided to be bufferized in-place, the value is `true`. A value of `false`
indicates a buffer copy. In the above example, a buffer copy would be inserted
for `tensor.insert`, so that it does not overwrite `buffer(%from_elements)`,
which is still needed for `tensor.extract`.

For each RaW conflict (there is only one in the example), three `C_i` attributes
were added:

* `C_0[DEF: result 0]`: A tensor is defined: 0th result of
  `tensor.from_elements`.
* `C_0[CONFL-WRITE: 1]`: An operation (if bufferized in-place) would write into
  the future buffer of the defined tensor: 1st operand of `tensor.insert`.
* `C_0[READ: 0]`: An operation reads the tensor definition: 0th operand of
  `tensor.extract`.

The fully bufferized IR (with the inserted buffer copy) is as follows:

```mlir
func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, memref<3xf32>) {
  %c2 = arith.constant 2 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %alloc = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
  memref.store %arg0, %alloc[%c0] : memref<3xf32>
  memref.store %arg0, %alloc[%c1] : memref<3xf32>
  memref.store %arg0, %alloc[%c2] : memref<3xf32>
  %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
  memref.copy %alloc, %alloc_0 : memref<3xf32> to memref<3xf32>
  memref.store %arg1, %alloc_0[%arg2] : memref<3xf32>
  %0 = memref.load %alloc[%arg3] : memref<3xf32>
  return %0, %alloc_0 : f32, memref<3xf32>
}
```

To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
conflict detection algorithm, interested users may want to refer to:

* [Original design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
* [ODM talk](https://youtu.be/TXEo59CYS9A) ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf))
* [LLVM Dev Meeting 2023 tutorial slides](https://m-sp.org/downloads/llvm_dev_2023.pdf)
477