1# Bufferization 2 3[TOC] 4 5## Overview 6 7Bufferization in MLIR is the process of converting ops with `tensor` semantics 8to ops with `memref` semantics. There are multiple MLIR passes that are related 9to bufferization. These passes typically run as one of the last steps in a 10pass pipeline, right before lowering to `memref` ops to LLVM. That is because 11many transformations are easier or only supported in tensor land; e.g., 12[tile/fuse/… on tensors first](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373), 13then bufferize the remaining IR. 14 15 16 17The most important bufferization pass is *One-Shot Bufferize*: This pass 18rewrites `tensor` IR to `memref` IR. There are additional helper passes that 19preprocess IR (e.g., so that IR can be bufferized more efficiently), perform 20buffer-level optimizations such as allocation hoisting, and 21[insert buffer deallocation ops](OwnershipBasedBufferDeallocation.md) so that 22the resulting `memref` IR has no memory leaks. 23 24## Deprecated Passes 25 26The buffer deallocation pass has been deprecated in favor of the ownership-based 27buffer deallocation pipeline. The deprecated pass has some limitations that may 28cause memory leaks in the resulting IR. 29 30## What is One-Shot Bufferize? 31 32One-Shot Bufferize is a tensor bufferization pass designed for IR in 33[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf), 34and with aggressive in-place bufferization. 35 36One-Shot Bufferize is: 37 38* **Monolithic**: A single MLIR pass does the entire work. 39 40* **Extensible** via an op interface: All ops that implement 41 `BufferizableOpInterface` can be bufferized. 42 43* A **whole-function at a time analysis**. In-place bufferization decisions 44 are made by analyzing SSA use-def chains on tensors. Op interface 45 implementations not only provide the rewrite logic from tensor ops to memref 46 ops, but also helper methods for One-Shot Bufferize's analysis to query 47 information about an op's bufferization/memory semantics. 48 49* **2-Phase**: Bufferization is internally broken down into 2 steps: First, 50 analyze the entire IR and make bufferization decisions. Then, bufferize 51 (rewrite) the IR. The analysis has access to exact SSA use-def information. 52 It incrementally builds alias and equivalence sets and does not rely on a 53 posteriori-alias analysis from preallocated memory. 54 55* **Greedy**: Operations are analyzed one-by-one and it is decided on the spot 56 whether a tensor OpOperand must be copied or not. Heuristics determine the 57 order of analysis. 58 59* **Modular**: The current One-Shot Analysis can be replaced with a different 60 analysis. The result of the analysis are queried by the bufferization via 61 `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class 62 of `AnalysisState` that implements a small number virtual functions can 63 serve as a custom analysis. It is even possible to run One-Shot Bufferize 64 without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot 65 Bufferize copies every buffer before writing to it. 66 67Note that One-Shot Bufferize does not deallocate buffers. That is done by the 68[Ownership-based Buffer Deallocation passes](OwnershipBasedBufferDeallocation.md). 69 70## Goals of Bufferization 71 72The high-level goal of every bufferization technique is to: 73 741. Use as little memory as possible. 752. Copy as little memory as possible. 76 77This implies reusing already allocated buffers when possible, turning 78bufferization into an algorithmically complex problem with similarities to 79register allocation. 80 81Depending on the concrete use case, there may be additional bufferization 82requirements. If the contents of a buffer are expensive to compute, there could 83be a tradeoff between *recomputation* and *compute once and copy*. On the 84contrary, it may not even be possible to allocate new buffers at runtime on some 85architectures. 86 87## Destination-Passing Style 88 89Bufferization is an algorithmically complex problem. Given an op with a tensor 90result, bufferization has to choose a memref buffer in which the result can be 91stored. It is always safe to allocate a brand new buffer, but such a 92bufferization strategy would be unacceptable for high-performance codegen. When 93choosing an already existing buffer, we must be careful not to accidentally 94overwrite data that is still needed later in the program. 95 96To simplify this problem, One-Shot Bufferize was designed to take advantage of 97*destination-passing style* (DPS). In MLIR, DPS op should implement the 98[`DestinationStyleOpInterface`](https://github.com/llvm/llvm-project/blob/792d437b56adfb3416daf8105942d4899fb82763/mlir/include/mlir/Interfaces/DestinationStyleOpInterface.td). 99DPS exists in itself independently of bufferization and is tied to SSA 100semantics: many ops are "updating" a part of their input SSA variables. For 101example the LLVM instruction 102[`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction) 103is inserting an element inside a vector. Since SSA values are immutable, the 104operation returns a copy of the input vector with the element inserted. 105Another example in MLIR is `linalg.generic` on tensors, which always has an 106extra `outs` operand for each result, which provides the initial values to 107update (for example when the operation is doing a reduction). 108 109`outs` operands are referred to as "destinations" in the following (quotes are 110important as this operand isn't modified in place but copied) and comes into 111place in the context of bufferization as a possible "anchor" for the 112bufferization algorithm. This allows the user to shape the input in a form that 113guarantees close to optimal bufferization result when carefully choosing the 114SSA value used as "destination". 115 116For every tensor result, a DPS op has a corresponding tensor operand. If there 117aren't any other conflicting uses of this tensor, the bufferization can alias 118it with the op result and perform the operation "in-place" by reusing the buffer 119allocated for this "destination" input. 120 121As an example, consider the following op: `%r = tensor.insert %f into 122%t[%idx] : tensor<5xf32>` 123 124 125 126`%t` is the "destination" in this example. When choosing a buffer for the result 127`%r`, denoted as `buffer(%r)`, One-Shot Bufferize considers only two options: 128 1291. `buffer(%r) = buffer(%t)`: store the result in the existing `buffer(%t)`. 130 Note that this is not always possible. E.g., if the old contents of 131 `buffer(%t)` are still needed. One-Shot Bufferize's main task is to detect 132 such cases and fall back to the second option when necessary. 1332. `buffer(%r)` is a newly allocated buffer. 134 135There may be other buffers in the same function that could potentially be used 136for `buffer(%r)`, but those are not considered by One-Shot Bufferize to keep the 137bufferization simple. One-Shot Bufferize could be extended to consider such 138buffers in the future to achieve a better quality of bufferization. 139 140Tensor ops that are not in destination-passing style always bufferized to a 141memory allocation. E.g.: 142 143```mlir 144%0 = tensor.generate %sz { 145^bb0(%i : index): 146 %cst = arith.constant 0.0 : f32 147 tensor.yield %cst : f32 148} : tensor<?xf32> 149``` 150 151The result of `tensor.generate` does not have a "destination" operand, so 152bufferization allocates a new buffer. This could be avoided by instead using an 153op such as `linalg.generic`, which can express the same computation with a 154"destination" operand, as specified behind outputs (`outs`): 155 156```mlir 157#map = affine_map<(i) -> (i)> 158%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} 159 outs(%t : tensor<?xf32>) { 160 ^bb0(%arg0 : f32): 161 %cst = arith.constant 0.0 : f32 162 linalg.yield %cst : f32 163} -> tensor<?xf32> 164``` 165 166At first glance, the above `linalg.generic` op may not seem very useful because 167the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an 168operand in the first place? As an example, this can be useful for overwriting a 169slice of a tensor: 170 171```mlir 172%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32> 173%0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32> 174%1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1] 175 : tensor<?xf32> into tensor<?xf32> 176``` 177 178The above example bufferizes to a `memref.subview`, followed by a 179"`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming 180that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes 181to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`). 182 183RaW conflicts are detected with an analysis of SSA use-def chains (details 184later). One-Shot Bufferize works best if there is a single SSA use-def chain, 185where the result of a tensor op is the operand of the next tensor ops, e.g.: 186 187```mlir 188%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) 189%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) 190%2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>) 191``` 192 193Buffer copies are likely inserted if the SSA use-def chain splits at some point, 194e.g.: 195 196```mlir 197%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) 198%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) 199 200// "yet_another_op" likely needs to read the data of %0, so "another_op" cannot 201// in-place write to buffer(%0). 202%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) 203``` 204 205## Tensor / MemRef Boundary 206 207The bufferization dialect provides a few helper ops to connect tensor IR (that 208should be bufferized) with existing buffers (that may be allocated/provided by 209a different runtime/library/etc.). 210 211`bufferization.to_memref %t` returns the future buffer of a tensor SSA value. 212`bufferization.to_tensor %m` returns a tensor SSA value for a given MemRef 213buffer. `bufferization.materialize_in_destination` indicates that a tensor value 214should materialize in a certain buffer. 215 216Consider the following example, where a TOSA matmul result should materialize in 217an existing buffer `%C`: 218 219```mlir 220// Batched TOSA matrix multiplication. %A and %B are the 221// inputs, %C is the output. 222func.func @test_matmul(%A: memref<1x17x19xf32>, 223 %B: memref<1x19x29xf32>, 224 %C: memref<1x17x29xf32>) { 225 226 %A_tensor = bufferization.to_tensor %A restrict : memref<1x17x19xf32> to tensor<1x17x19xf32> 227 %B_tensor = bufferization.to_tensor %B restrict : memref<1x19x29xf32> to tensor<1x19x29xf32> 228 229 %0 = tosa.matmul %A_tensor, %B_tensor 230 : (tensor<1x17x19xf32>, tensor<1x19x29xf32>) -> 231 tensor<1x17x29xf32> 232 233 bufferization.materialize_in_destination 234 %0 in restrict writable %C 235 : (tensor<1x17x29xf32>, memref<1x17x29xf32>) -> () 236 237 return 238} 239``` 240 241Note that all bufferization ops in this example have the `restrict` unit 242attribute set. This attribute is similar to the C restrict keyword and indicates 243that there is no other `to_tensor` or `materialize_in_destination` op with 244the same or an aliasing MemRef operand. Only such 245`to_tensor`/`materialize_in_destination` ops are supported. The `restrict` 246attribute gives strong aliasing guarantees to the bufferization analysis and 247allows us to look only at the tensor IR in a program. (Ops that do not operate 248on tensors are ignored by the One-Shot Bufferize.) 249 250Also note that `tosa.matmul` cannot be bufferized as is: there is no 251`BufferizableOpInterface` implementation for that op. However, the op can be 252lowered to a combination of `tensor.empty` and `linalg.matmul`, which can be 253bufferized. 254 255## Using One-Shot Bufferize 256 257MLIR provides a pass 258[`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize) 259that performs an analysis and bufferizes all ops with tensor semantics that 260implement `BufferizableOpInterface`. For modularity reasons, these op interface 261implementations are typically external models that live in a dialect's 262"Transforms" build unit. (External models are a mechanism for implementing an op 263interface in a different build unit.) It is the user's responsibility to ensure 264that all needed external models are registered before running One-Shot 265Bufferize. 266 267By default, One-Shot Bufferize fails when it encounters an op with tensor 268semantics (i.e., tensor result or tensor operand) that is not bufferizable 269(i.e., does not implement `BufferizableOpInterface`). This can be avoided with 270`allow-unknown-ops`. In that case, One-Shot Bufferize inserts 271`to_memref`/`to_tensor` ops around the bufferization boundary. 272 273One-Shot Bufferize can be configured to bufferize only ops from a set of 274dialects with `dialect-filter`. 275 276One-Shot Bufferize can also be called programmatically with 277[`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167). 278Alternatively, 279[`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78) 280skips the analysis and inserts a copy on every buffer write. 281 282By default, function boundaries are not bufferized. This is because there are 283currently limitations around function graph bufferization: recursive 284calls are not supported. As long as there are no recursive calls, function 285boundary bufferization can be enabled with `bufferize-function-boundaries`. Each 286tensor function argument and tensor function result is then turned into a 287memref. The layout map of the memref type can be controlled with 288`function-boundary-type-conversion`. 289 290## Memory Layouts 291 292One-Shot Bufferize bufferizes ops from top to bottom. This works well when all 293ops are bufferizable. However, when encountering a non-bufferizable tensor with 294`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the 295bufferization boundary and decide on a memref type. By default, One-Shot 296Bufferize choose the most dynamic memref type wrt. layout maps. E.g.: 297 298```mlir 299%0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) 300%1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32> 301``` 302 303When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` ops with 304dynamic offset and strides: 305 306```mlir 307%0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) 308%0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>> 309%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>> 310``` 311 312All users of `%0` have fully dynamic layout maps. This ensures that the 313bufferized IR composes well with future bufferizations of `unbufferizable_op` 314(maybe bufferized by another pass), regardless of the exact memref type of the 315future bufferization. If the op turns out to be bufferized to an op with a 316simpler memref type (e.g., identity layout map), we expect that canonicalization 317patterns would clean up unnecessarily dynamic layout maps. (Some of these 318canonicalization patterns may not be implemented yet.) 319 320One-Shot Bufferize tries to infer the most precise memref type when bufferizing 321an op. If the entire IR is bufferizable, we do not have to resort to 322conservatively use fully dynamic layout maps. In that case, we also do not have 323to rely on canonicalization patterns to clean up the bufferized IR. 324 325Note: There are some bufferizable ops for which a percise layout map cannot be 326inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>` 327must be bufferized to a `memref.cast` with a memref type that has a fully 328dynamic layout map. 329 330One-Shot Bufferize has an option `unknown-type-conversion` to control the 331generation of layout maps when no precise layout can be inferred: 332 333* `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default 334 behavior. This composes well when IR is partially bufferized. 335* `identity-layout-map` uses static identity layout maps. This option can be 336 useful for legacy code that cannot handle memref types with layout maps. 337 Note that this setting can lead to additional buffer copies when folding a 338 `to_tensor`/`to_memref` pair with memref types that are not cast-compatible. 339 340Note: The `unknown-type-conversion` option does not affect layout maps of 341function signatures. There is a separate `function-signature-type-conversion` 342option that controls layout maps of function parameters and function results. 343 344## Extending One-Shot Bufferize 345 346Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users 347must at least implement the following interface methods. 348 349* `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor 350 OpOperand is read. 351* `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor 352 OpOperand is written (if bufferizing in-place). 353* `getAliasingOpResult`: Return the OpResults that may share the same buffer 354 as the given OpOperand. This interface method describes to 355 OpOperand-to-OpResult mapping wrt. destination-passing style. 356* `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult 357 is the exact same memref as the aliasing OpOperand after bufferization (in 358 case of in-place bufferization). Otherwise, (e.g., they overlap but are not 359 necessarily the exact same memrefs), `BufferRelation::Unknown` should be 360 returned. Additional buffer relations will be added in the future, but 361 `BufferRelation::Unknown` is always safe. 362* `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced 363 with `bufferization::replaceOpWithBufferizedValues`. 364 365To get a better intuition of the interface methods, we invite users to take a 366look at existing implementations in MLIR, e.g., the implementation of 367`tensor.insert` or `tensor.extract`. 368 369Interface implementations of DPS ops (that implement 370`DestinationStyleOpInterface`) can derive from 371`DstBufferizableOpInterfaceExternalModel`, which provides all necessary 372method implementations except for `bufferize`. 373 374## Debugging Buffer Copies 375 376To get a better understanding of why One-Shot Bufferize introduced a buffer 377copy, users can run the pass with `test-analysis-only print-conflicts`. Every 378tensor op is then annotated with an attribute that has a boolean value for each 379tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false` 380means that the OpOperand bufferizes out-of-place and a buffer copy will be 381inserted. 382 383There are two reasons why a buffer copy may be inserted. 384 3851. Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the 386 overwritten data is still needed. 3872. The buffer is not writable. E.g., `memref.global` buffers that are the 388 result of `arith.constant` ops are never modified. 389 390In the first case, `print-conflicts` illustrates the conflict in the form of a 391("read", "conflicting write", "last write") tuple. 392 393A RaW conflict consists of three parts, in the following order according to 394op dominance: 395 3961. **Definition:** A tensor `%t` is defined. 3972. **Conflicting Write:** An operation writes to `buffer(%t)`. 3983. **Read:** An operation reads `%t`. 399 400When such a RaW conflict is detected during the analysis phase, One-Shot 401Bufferize will insert a buffer copy for the conflicting write. 402 403**Example** 404 405```mlir 406// RUN: mlir-opt %s -one-shot-bufferize="bufferize-function-boundaries test-analysis-only print-conflicts" 407func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) { 408 // Create a new tensor with [%arg0, %arg0, %arg0]. 409 %0 = tensor.from_elements %arg0, %arg0, %arg0 : tensor<3xf32> 410 411 // Insert something into the new tensor. 412 %1 = tensor.insert %arg1 into %0[%arg2] : tensor<3xf32> 413 414 // Read from the old tensor. 415 %r = tensor.extract %0[%arg3] : tensor<3xf32> 416 417 // Return the extracted value and the result of the insertion. 418 func.return %r, %1 : f32, tensor<3xf32> 419} 420``` 421 422The output IR is as follows: 423 424```mlir 425func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) { 426 %from_elements = tensor.from_elements %arg0, %arg0, %arg0 {"C_0[DEF: result 0]"} : tensor<3xf32> 427 %inserted = tensor.insert %arg1 into %from_elements[%arg2] {"C_0[CONFL-WRITE: 1]", __inplace_operands_attr__ = ["none", "false", "none"]} : tensor<3xf32> 428 %extracted = tensor.extract %from_elements[%arg3] {"C_0[READ: 0]", __inplace_operands_attr__ = ["true", "none"]} : tensor<3xf32> 429 return {__inplace_operands_attr__ = ["none", "true"]} %extracted, %inserted : f32, tensor<3xf32> 430} 431``` 432 433Note that the IR was not bufferized. It was merely annotated with the results 434of the bufferization analysis. Every operation with tensor semantics has a 435`__inplace_operands_attr__` attribute with one value per operand. If an operand 436is not a tensor, the respective value is `none`. Otherwise, if the operand was 437decided to be bufferized in-place, the value is `true`. A value of `false` 438indicates a buffer copy. In the above example, a buffer copy would be inserted 439for `tensor.insert`, so that it does not overwrite `buffer(%from_elements)`, 440which is still needed for `tensor.extract`. 441 442For each RaW (there is only one in the example), three `C_i` attributes were 443added: 444 445* `C_0[DEF: result 0]`: A tensor is defined: 0-th result of 446 `tensor.from_elements`. 447* `C_0[CONFL-WRITE: 1]`: An operation (if bufferized in-place) would write into 448 the future buffer of the defined tensor: 1-st operand of `tensor.insert`. 449* `C_0[READ: 0]`: An operation reads the tensor definition: 0-th operand of 450 `tensor.extract`. 451 452The fully bufferized IR (with the inserted buffer copy) is as follows: 453 454```mlir 455func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, memref<3xf32>) { 456 %c2 = arith.constant 2 : index 457 %c1 = arith.constant 1 : index 458 %c0 = arith.constant 0 : index 459 %alloc = memref.alloc() {alignment = 64 : i64} : memref<3xf32> 460 memref.store %arg0, %alloc[%c0] : memref<3xf32> 461 memref.store %arg0, %alloc[%c1] : memref<3xf32> 462 memref.store %arg0, %alloc[%c2] : memref<3xf32> 463 %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<3xf32> 464 memref.copy %alloc, %alloc_0 : memref<3xf32> to memref<3xf32> 465 memref.store %arg1, %alloc_0[%arg2] : memref<3xf32> 466 %0 = memref.load %alloc[%arg3] : memref<3xf32> 467 return %0, %alloc_0 : f32, memref<3xf32> 468} 469``` 470 471To get a better understanding of the SSA Use-Def Chain Analysis and the RaW 472conflict detection algorithm, interested users may want to refer to: 473 474* [Original design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf) 475* [ODM talk](https://youtu.be/TXEo59CYS9A), ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)). 476* [LLVM Dev Meeting 2023 tutorial slides](https://m-sp.org/downloads/llvm_dev_2023.pdf) 477