xref: /llvm-project/mlir/docs/Bufferization.md (revision ced2fc7819d5ddea616ec330f18e08ff284c1868)
1# Bufferization
2
3[TOC]
4
5## Overview
6
7Bufferization in MLIR is the process of converting ops with `tensor` semantics
8to ops with `memref` semantics. There are multiple MLIR passes that are related
9to bufferization. These passes typically run as one of the last steps in a
10pass pipeline, right before lowering to `memref` ops to LLVM. That is because
11many transformations are easier or only supported in tensor land; e.g.,
12[tile/fuse/… on tensors first](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
13then bufferize the remaining IR.
14
15![bufferization passes](/includes/img/bufferization_passes.svg)
16
17The most important bufferization pass is *One-Shot Bufferize*: This pass
18rewrites `tensor` IR to `memref` IR. There are additional helper passes that
19preprocess IR (e.g., so that IR can be bufferized more efficiently), perform
20buffer-level optimizations such as allocation hoisting, and
21[insert buffer deallocation ops](OwnershipBasedBufferDeallocation.md) so that
22the resulting `memref` IR has no memory leaks.
23
24## Deprecated Passes
25
26The buffer deallocation pass has been deprecated in favor of the ownership-based
27buffer deallocation pipeline. The deprecated pass has some limitations that may
28cause memory leaks in the resulting IR.
29
30## What is One-Shot Bufferize?
31
32One-Shot Bufferize is a tensor bufferization pass designed for IR in
33[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
34and with aggressive in-place bufferization.
35
36One-Shot Bufferize is:
37
38*   **Monolithic**: A single MLIR pass does the entire work.
39
40*   **Extensible** via an op interface: All ops that implement
41    `BufferizableOpInterface` can be bufferized.
42
43*   A **whole-function at a time analysis**. In-place bufferization decisions
44    are made by analyzing SSA use-def chains on tensors. Op interface
45    implementations not only provide the rewrite logic from tensor ops to memref
46    ops, but also helper methods for One-Shot Bufferize's analysis to query
47    information about an op's bufferization/memory semantics.
48
49*   **2-Phase**: Bufferization is internally broken down into 2 steps: First,
50    analyze the entire IR and make bufferization decisions. Then, bufferize
51    (rewrite) the IR. The analysis has access to exact SSA use-def information.
52    It incrementally builds alias and equivalence sets and does not rely on a
53    posteriori-alias analysis from preallocated memory.
54
55*   **Greedy**: Operations are analyzed one-by-one and it is decided on the spot
56    whether a tensor OpOperand must be copied or not. Heuristics determine the
57    order of analysis.
58
59*   **Modular**: The current One-Shot Analysis can be replaced with a different
60    analysis. The result of the analysis are queried by the bufferization via
61    `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
62    of `AnalysisState` that implements a small number virtual functions can
63    serve as a custom analysis. It is even possible to run One-Shot Bufferize
64    without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
65    Bufferize copies every buffer before writing to it.
66
67Note that One-Shot Bufferize does not deallocate buffers. That is done by the
68[Ownership-based Buffer Deallocation passes](OwnershipBasedBufferDeallocation.md).
69
70## Goals of Bufferization
71
72The high-level goal of every bufferization technique is to:
73
741. Use as little memory as possible.
752. Copy as little memory as possible.
76
77This implies reusing already allocated buffers when possible, turning
78bufferization into an algorithmically complex problem with similarities to
79register allocation.
80
81Depending on the concrete use case, there may be additional bufferization
82requirements. If the contents of a buffer are expensive to compute, there could
83be a tradeoff between *recomputation* and *compute once and copy*. On the
84contrary, it may not even be possible to allocate new buffers at runtime on some
85architectures.
86
87## Destination-Passing Style
88
89Bufferization is an algorithmically complex problem. Given an op with a tensor
90result, bufferization has to choose a memref buffer in which the result can be
91stored. It is always safe to allocate a brand new buffer, but such a
92bufferization strategy would be unacceptable for high-performance codegen. When
93choosing an already existing buffer, we must be careful not to accidentally
94overwrite data that is still needed later in the program.
95
96To simplify this problem, One-Shot Bufferize was designed to take advantage of
97*destination-passing style* (DPS). In MLIR, DPS op should implement the
98[`DestinationStyleOpInterface`](https://github.com/llvm/llvm-project/blob/792d437b56adfb3416daf8105942d4899fb82763/mlir/include/mlir/Interfaces/DestinationStyleOpInterface.td).
99DPS exists in itself independently of bufferization and is tied to SSA
100semantics: many ops are "updating" a part of their input SSA variables. For
101example the LLVM instruction
102[`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
103is inserting an element inside a vector. Since SSA values are immutable, the
104operation returns a copy of the input vector with the element inserted.
105Another example in MLIR is `linalg.generic` on tensors, which always has an
106extra `outs` operand for each result, which provides the initial values to
107update (for example when the operation is doing a reduction).
108
109`outs` operands are referred to as "destinations" in the following (quotes are
110important as this operand isn't modified in place but copied) and comes into
111place in the context of bufferization as a possible "anchor" for the
112bufferization algorithm. This allows the user to shape the input in a form that
113guarantees close to optimal bufferization result when carefully choosing the
114SSA value used as "destination".
115
116For every tensor result, a DPS op has a corresponding tensor operand. If there
117aren't any other conflicting uses of this tensor, the bufferization can alias
118it with the op result and perform the operation "in-place" by reusing the buffer
119allocated for this "destination" input.
120
121As an example, consider the following op: `%r = tensor.insert %f into
122%t[%idx] : tensor<5xf32>`
123
124![tensor.insert example](/includes/img/bufferization_tensor_insert_dst.svg)
125
126`%t` is the "destination" in this example. When choosing a buffer for the result
127`%r`, denoted as `buffer(%r)`, One-Shot Bufferize considers only two options:
128
1291.  `buffer(%r) = buffer(%t)`: store the result in the existing `buffer(%t)`.
130    Note that this is not always possible. E.g., if the old contents of
131    `buffer(%t)` are still needed. One-Shot Bufferize's main task is to detect
132    such cases and fall back to the second option when necessary.
1332.  `buffer(%r)` is a newly allocated buffer.
134
135There may be other buffers in the same function that could potentially be used
136for `buffer(%r)`, but those are not considered by One-Shot Bufferize to keep the
137bufferization simple. One-Shot Bufferize could be extended to consider such
138buffers in the future to achieve a better quality of bufferization.
139
140Tensor ops that are not in destination-passing style always bufferized to a
141memory allocation. E.g.:
142
143```mlir
144%0 = tensor.generate %sz {
145^bb0(%i : index):
146  %cst = arith.constant 0.0 : f32
147  tensor.yield %cst : f32
148} : tensor<?xf32>
149```
150
151The result of `tensor.generate` does not have a "destination" operand, so
152bufferization allocates a new buffer. This could be avoided by instead using an
153op such as `linalg.generic`, which can express the same computation with a
154"destination" operand, as specified behind outputs (`outs`):
155
156```mlir
157#map = affine_map<(i) -> (i)>
158%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
159                    outs(%t : tensor<?xf32>) {
160  ^bb0(%arg0 : f32):
161    %cst = arith.constant 0.0 : f32
162    linalg.yield %cst : f32
163} -> tensor<?xf32>
164```
165
166At first glance, the above `linalg.generic` op may not seem very useful because
167the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
168operand in the first place? As an example, this can be useful for overwriting a
169slice of a tensor:
170
171```mlir
172%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
173%0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
174%1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
175    : tensor<?xf32> into tensor<?xf32>
176```
177
178The above example bufferizes to a `memref.subview`, followed by a
179"`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming
180that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes
181to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`).
182
183RaW conflicts are detected with an analysis of SSA use-def chains (details
184later). One-Shot Bufferize works best if there is a single SSA use-def chain,
185where the result of a tensor op is the operand of the next tensor ops, e.g.:
186
187```mlir
188%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
189%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
190%2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
191```
192
193Buffer copies are likely inserted if the SSA use-def chain splits at some point,
194e.g.:
195
196```mlir
197%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
198%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
199
200// "yet_another_op" likely needs to read the data of %0, so "another_op" cannot
201// in-place write to buffer(%0).
202%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
203```
204
205## Tensor / MemRef Boundary
206
207The bufferization dialect provides a few helper ops to connect tensor IR (that
208should be bufferized) with existing buffers (that may be allocated/provided by
209a different runtime/library/etc.).
210
211`bufferization.to_memref %t` returns the future buffer of a tensor SSA value.
212`bufferization.to_tensor %m` returns a tensor SSA value for a given MemRef
213buffer. `bufferization.materialize_in_destination` indicates that a tensor value
214should materialize in a certain buffer.
215
216Consider the following example, where a TOSA matmul result should materialize in
217an existing buffer `%C`:
218
219```mlir
220// Batched TOSA matrix multiplication. %A and %B are the
221// inputs, %C is the output.
222func.func @test_matmul(%A: memref<1x17x19xf32>,
223                       %B: memref<1x19x29xf32>,
224                       %C: memref<1x17x29xf32>) {
225
226  %A_tensor = bufferization.to_tensor %A restrict : memref<1x17x19xf32> to tensor<1x17x19xf32>
227  %B_tensor = bufferization.to_tensor %B restrict : memref<1x19x29xf32> to tensor<1x19x29xf32>
228
229  %0 = tosa.matmul %A_tensor, %B_tensor
230      : (tensor<1x17x19xf32>, tensor<1x19x29xf32>) ->
231         tensor<1x17x29xf32>
232
233  bufferization.materialize_in_destination
234    %0 in restrict writable %C
235      : (tensor<1x17x29xf32>, memref<1x17x29xf32>) -> ()
236
237  return
238}
239```
240
241Note that all bufferization ops in this example have the `restrict` unit
242attribute set. This attribute is similar to the C restrict keyword and indicates
243that there is no other `to_tensor` or `materialize_in_destination` op with
244the same or an aliasing MemRef operand. Only such
245`to_tensor`/`materialize_in_destination` ops are supported. The `restrict`
246attribute gives strong aliasing guarantees to the bufferization analysis and
247allows us to look only at the tensor IR in a program. (Ops that do not operate
248on tensors are ignored by the One-Shot Bufferize.)
249
250Also note that `tosa.matmul` cannot be bufferized as is: there is no
251`BufferizableOpInterface` implementation for that op. However, the op can be
252lowered to a combination of `tensor.empty` and `linalg.matmul`, which can be
253bufferized.
254
255## Using One-Shot Bufferize
256
257MLIR provides a pass
258[`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize)
259that performs an analysis and bufferizes all ops with tensor semantics that
260implement `BufferizableOpInterface`. For modularity reasons, these op interface
261implementations are typically external models that live in a dialect's
262"Transforms" build unit. (External models are a mechanism for implementing an op
263interface in a different build unit.) It is the user's responsibility to ensure
264that all needed external models are registered before running One-Shot
265Bufferize.
266
267By default, One-Shot Bufferize fails when it encounters an op with tensor
268semantics (i.e., tensor result or tensor operand) that is not bufferizable
269(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
270`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
271`to_memref`/`to_tensor` ops around the bufferization boundary.
272
273One-Shot Bufferize can be configured to bufferize only ops from a set of
274dialects with `dialect-filter`.
275
276One-Shot Bufferize can also be called programmatically with
277[`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
278Alternatively,
279[`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78)
280skips the analysis and inserts a copy on every buffer write.
281
282By default, function boundaries are not bufferized. This is because there are
283currently limitations around function graph bufferization: recursive
284calls are not supported. As long as there are no recursive calls, function
285boundary bufferization can be enabled with `bufferize-function-boundaries`. Each
286tensor function argument and tensor function result is then turned into a
287memref. The layout map of the memref type can be controlled with
288`function-boundary-type-conversion`.
289
290## Memory Layouts
291
292One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
293ops are bufferizable. However, when encountering a non-bufferizable tensor with
294`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
295bufferization boundary and decide on a memref type. By default, One-Shot
296Bufferize choose the most dynamic memref type wrt. layout maps. E.g.:
297
298```mlir
299%0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
300%1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32>
301```
302
303When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` ops with
304dynamic offset and strides:
305
306```mlir
307%0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
308%0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>>
309%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>>
310```
311
312All users of `%0` have fully dynamic layout maps. This ensures that the
313bufferized IR composes well with future bufferizations of `unbufferizable_op`
314(maybe bufferized by another pass), regardless of the exact memref type of the
315future bufferization. If the op turns out to be bufferized to an op with a
316simpler memref type (e.g., identity layout map), we expect that canonicalization
317patterns would clean up unnecessarily dynamic layout maps. (Some of these
318canonicalization patterns may not be implemented yet.)
319
320One-Shot Bufferize tries to infer the most precise memref type when bufferizing
321an op. If the entire IR is bufferizable, we do not have to resort to
322conservatively use fully dynamic layout maps. In that case, we also do not have
323to rely on canonicalization patterns to clean up the bufferized IR.
324
325Note: There are some bufferizable ops for which a percise layout map cannot be
326inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>`
327must be bufferized to a `memref.cast` with a memref type that has a fully
328dynamic layout map.
329
330One-Shot Bufferize has an option `unknown-type-conversion` to control the
331generation of layout maps when no precise layout can be inferred:
332
333*   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default
334    behavior. This composes well when IR is partially bufferized.
335*   `identity-layout-map` uses static identity layout maps. This option can be
336    useful for legacy code that cannot handle memref types with layout maps.
337    Note that this setting can lead to additional buffer copies when folding a
338    `to_tensor`/`to_memref` pair with memref types that are not cast-compatible.
339
340Note: The `unknown-type-conversion` option does not affect layout maps of
341function signatures. There is a separate `function-signature-type-conversion`
342option that controls layout maps of function parameters and function results.
343
344## Extending One-Shot Bufferize
345
346Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
347must at least implement the following interface methods.
348
349*   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
350    OpOperand is read.
351*   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
352    OpOperand is written (if bufferizing in-place).
353*   `getAliasingOpResult`: Return the OpResults that may share the same buffer
354    as the given OpOperand. This interface method describes to
355    OpOperand-to-OpResult mapping wrt. destination-passing style.
356*   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
357    is the exact same memref as the aliasing OpOperand after bufferization (in
358    case of in-place bufferization). Otherwise, (e.g., they overlap but are not
359    necessarily the exact same memrefs), `BufferRelation::Unknown` should be
360    returned. Additional buffer relations will be added in the future, but
361    `BufferRelation::Unknown` is always safe.
362*   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced
363    with `bufferization::replaceOpWithBufferizedValues`.
364
365To get a better intuition of the interface methods, we invite users to take a
366look at existing implementations in MLIR, e.g., the implementation of
367`tensor.insert` or `tensor.extract`.
368
369Interface implementations of DPS ops (that implement
370`DestinationStyleOpInterface`) can derive from
371`DstBufferizableOpInterfaceExternalModel`, which provides all necessary
372method implementations except for `bufferize`.
373
374## Debugging Buffer Copies
375
376To get a better understanding of why One-Shot Bufferize introduced a buffer
377copy, users can run the pass with `test-analysis-only print-conflicts`. Every
378tensor op is then annotated with an attribute that has a boolean value for each
379tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
380means that the OpOperand bufferizes out-of-place and a buffer copy will be
381inserted.
382
383There are two reasons why a buffer copy may be inserted.
384
3851.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
386    overwritten data is still needed.
3872.  The buffer is not writable. E.g., `memref.global` buffers that are the
388    result of `arith.constant` ops are never modified.
389
390In the first case, `print-conflicts` illustrates the conflict in the form of a
391("read", "conflicting write", "last write") tuple.
392
393A RaW conflict consists of three parts, in the following order according to
394op dominance:
395
3961. **Definition:** A tensor `%t` is defined.
3972. **Conflicting Write:** An operation writes to `buffer(%t)`.
3983. **Read:** An operation reads `%t`.
399
400When such a RaW conflict is detected during the analysis phase, One-Shot
401Bufferize will insert a buffer copy for the conflicting write.
402
403**Example**
404
405```mlir
406// RUN: mlir-opt %s -one-shot-bufferize="bufferize-function-boundaries test-analysis-only print-conflicts"
407func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
408  // Create a new tensor with [%arg0, %arg0, %arg0].
409  %0 = tensor.from_elements %arg0, %arg0, %arg0 : tensor<3xf32>
410
411  // Insert something into the new tensor.
412  %1 = tensor.insert %arg1 into %0[%arg2] : tensor<3xf32>
413
414  // Read from the old tensor.
415  %r = tensor.extract %0[%arg3] : tensor<3xf32>
416
417  // Return the extracted value and the result of the insertion.
418  func.return %r, %1 : f32, tensor<3xf32>
419}
420```
421
422The output IR is as follows:
423
424```mlir
425func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
426  %from_elements = tensor.from_elements %arg0, %arg0, %arg0 {"C_0[DEF: result 0]"} : tensor<3xf32>
427  %inserted = tensor.insert %arg1 into %from_elements[%arg2] {"C_0[CONFL-WRITE: 1]", __inplace_operands_attr__ = ["none", "false", "none"]} : tensor<3xf32>
428  %extracted = tensor.extract %from_elements[%arg3] {"C_0[READ: 0]", __inplace_operands_attr__ = ["true", "none"]} : tensor<3xf32>
429  return {__inplace_operands_attr__ = ["none", "true"]} %extracted, %inserted : f32, tensor<3xf32>
430}
431```
432
433Note that the IR was not bufferized. It was merely annotated with the results
434of the bufferization analysis. Every operation with tensor semantics has a
435`__inplace_operands_attr__` attribute with one value per operand. If an operand
436is not a tensor, the respective value is `none`. Otherwise, if the operand was
437decided to be bufferized in-place, the value is `true`. A value of `false`
438indicates a buffer copy. In the above example, a buffer copy would be inserted
439for `tensor.insert`, so that it does not overwrite `buffer(%from_elements)`,
440which is still needed for `tensor.extract`.
441
442For each RaW (there is only one in the example), three `C_i` attributes were
443added:
444
445* `C_0[DEF: result 0]`: A tensor is defined: 0-th result of
446  `tensor.from_elements`.
447* `C_0[CONFL-WRITE: 1]`: An operation (if bufferized in-place) would write into
448  the future buffer of the defined tensor: 1-st operand of `tensor.insert`.
449* `C_0[READ: 0]`: An operation reads the tensor definition: 0-th operand of
450  `tensor.extract`.
451
452The fully bufferized IR (with the inserted buffer copy) is as follows:
453
454```mlir
455func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, memref<3xf32>) {
456  %c2 = arith.constant 2 : index
457  %c1 = arith.constant 1 : index
458  %c0 = arith.constant 0 : index
459  %alloc = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
460  memref.store %arg0, %alloc[%c0] : memref<3xf32>
461  memref.store %arg0, %alloc[%c1] : memref<3xf32>
462  memref.store %arg0, %alloc[%c2] : memref<3xf32>
463  %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
464  memref.copy %alloc, %alloc_0 : memref<3xf32> to memref<3xf32>
465  memref.store %arg1, %alloc_0[%arg2] : memref<3xf32>
466  %0 = memref.load %alloc[%arg3] : memref<3xf32>
467  return %0, %alloc_0 : f32, memref<3xf32>
468}
469```
470
471To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
472conflict detection algorithm, interested users may want to refer to:
473
474* [Original design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
475* [ODM talk](https://youtu.be/TXEo59CYS9A), ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)).
476* [LLVM Dev Meeting 2023 tutorial slides](https://m-sp.org/downloads/llvm_dev_2023.pdf)
477