# 'gpu' Dialect

Note: this dialect is more likely to change than others in the near future; use
with caution.

This dialect provides middle-level abstractions for launching GPU kernels
following a programming model similar to that of CUDA or OpenCL. It provides
abstractions for kernel invocations (and may eventually provide those for device
management) that are not present at the lower level (e.g., as LLVM IR intrinsics
for GPUs). Its goal is to abstract away device- and driver-specific
manipulations to launch a GPU kernel and provide a simple path towards GPU
execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The
dialect uses `gpu` as its canonical prefix.

This dialect also abstracts away primitives commonly available in GPU code, such
as with `gpu.thread_id` (an operation that returns the ID of threads within
a thread block/workgroup along a given dimension). While the compilation
pipelines documented below expect such code to live inside a `gpu.module` and
`gpu.func`, these intrinsic wrappers may be used outside of this context.

Intrinsic-wrapping operations should not expect that they have a parent of type
`gpu.func`. However, operations that deal with compiling and launching GPU
functions, like `gpu.launch_func` or `gpu.binary`, may assume that the dialect's
full layering is being used.
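
For instance, such a wrapper may appear in a plain `func.func`; a minimal
sketch (the function name is hypothetical):

```
// An intrinsic wrapper used outside of the gpu.module/gpu.func layering.
func.func @thread_id_example() -> index {
  // Returns the thread ID along the x dimension of the enclosing block.
  %tid = gpu.thread_id x
  return %tid : index
}
```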

[TOC]

## GPU address spaces

The GPU dialect exposes the `gpu.address_space` attribute, which currently has
three values: `global`, `workgroup`, and `private`.

These address spaces represent the types of buffers commonly seen in GPU
compilation. `global` memory resides in the GPU's global memory space.
`workgroup` memory is a limited, per-workgroup resource: all threads in a
workgroup/thread block access the same values in `workgroup` memory. Finally,
`private` memory is used to represent `alloca`-like buffers that are private to
a single thread/workitem.

These address spaces may be used as the `memorySpace` attribute on `memref`
values. The `gpu.module`/`gpu.func` compilation pipeline will lower such memory
space usages to the correct address spaces on target platforms. Memory
attributions should be created with the correct memory space on the memref.
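
For example, here is a sketch of a function whose `memref` arguments carry GPU
address spaces (the function and argument names are illustrative):

```
// Copy one element from a global buffer into a workgroup buffer.
func.func @copy_first_element(
    %in: memref<16xf32, #gpu.address_space<global>>,
    %out: memref<16xf32, #gpu.address_space<workgroup>>) {
  %c0 = arith.constant 0 : index
  %v = memref.load %in[%c0] : memref<16xf32, #gpu.address_space<global>>
  memref.store %v, %out[%c0] : memref<16xf32, #gpu.address_space<workgroup>>
  return
}
```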

## Memory attribution

Memory buffers are defined at the function level, either in "gpu.launch" or in
"gpu.func" ops. This encoding makes it clear where the memory belongs and makes
the lifetime of the memory visible. The memory is only accessible while the
kernel is launched/the function is currently invoked. The latter is stricter
than actual GPU implementations, but using static memory at the function level
is just for convenience. It is also always possible to pass pointers to the
workgroup memory into other functions, provided they expect the correct memory
space.

The buffers are considered live throughout the execution of the GPU function
body. The absence of memory attribution syntax means that the function does not
require special buffers. Rationale: although the underlying models declare
memory buffers at the module level, we chose to do it at the function level to
provide some structuring for the lifetime of those buffers. This avoids the
incentive to use the buffers for communicating between different kernels or
launches of the same kernel, which should be done through function arguments
instead. We also chose not to use an `alloca`-style approach that would require
more complex lifetime analysis, following the principles of MLIR that promote
structure and representing analysis results in the IR.
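
As an illustration, here is a sketch of a kernel declaring one workgroup and
one private attribution (the module, function, and buffer names are
hypothetical):

```
// One buffer shared by all threads in the workgroup and one buffer private
// to each thread, both declared as attributions on the gpu.func.
gpu.module @attribution_example {
  gpu.func @kernel(%arg0: f32)
      workgroup(%buf: memref<32xf32, #gpu.address_space<workgroup>>)
      private(%tmp: memref<1xf32, #gpu.address_space<private>>)
      kernel {
    %c0 = arith.constant 0 : index
    memref.store %arg0, %buf[%c0] : memref<32xf32, #gpu.address_space<workgroup>>
    memref.store %arg0, %tmp[%c0] : memref<1xf32, #gpu.address_space<private>>
    gpu.return
  }
}
```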

## GPU Compilation
### Compilation overview
The compilation process in the GPU dialect has two main stages: GPU module
serialization and offloading operations translation. Together these stages can
produce GPU binaries and the necessary code to execute them.

An example of what the compilation workflow looks like:

```
mlir-opt example.mlir                   \
  --pass-pipeline="builtin.module(      \
    gpu-kernel-outlining,               \ # Outline gpu.launch body to a kernel.
    nvvm-attach-target{chip=sm_90 O=3}, \ # Attach an NVVM target to a gpu.module op.
    gpu.module(convert-gpu-to-nvvm),    \ # Convert GPU to NVVM.
    gpu-to-llvm,                        \ # Convert GPU to LLVM.
    gpu-module-to-binary                \ # Serialize GPU modules to binaries.
  )" -o example-nvvm.mlir
mlir-translate example-nvvm.mlir        \
  --mlir-to-llvmir                      \ # Obtain the translated LLVM IR.
  -o example.ll
```

This compilation process expects all GPU code to live in a `gpu.module` and
expects all kernels to be `gpu.func` operations. Non-kernel functions, like
device library calls, may be defined using `func.func` or other non-GPU dialect
operations. This permits downstream systems to use these wrappers without
requiring them to use the GPU dialect's function operations, which might not
include information those systems want to have as intrinsic values on their
functions. Additionally, this allows for using `func.func` for device-side
library functions in `gpu.module`s.
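
For example, here is a sketch of that layering, with a hypothetical `func.func`
helper called from a `gpu.func` kernel inside the same `gpu.module`:

```
// The kernel is a gpu.func, while a device-side helper is a plain
// func.func; both live in the same gpu.module.
gpu.module @kernels {
  func.func @helper(%arg0: f32) -> f32 {
    %0 = arith.addf %arg0, %arg0 : f32
    return %0 : f32
  }
  gpu.func @kernel(%arg0: f32) kernel {
    %0 = func.call @helper(%arg0) : (f32) -> f32
    gpu.return
  }
}
```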

### Default NVVM Compilation Pipeline: gpu-lower-to-nvvm-pipeline

The `gpu-lower-to-nvvm-pipeline` compilation pipeline serves as the default way
for NVVM target compilation within MLIR. This pipeline operates by lowering
primary dialects (arith, memref, scf, vector, gpu, and nvgpu) to the NVVM
target. It begins by lowering GPU code region(s) to the specified NVVM
compilation target and subsequently handles the host code.

This pipeline specifically requires explicitly parallel IR and doesn't do GPU
parallelization. To enable parallelism, the necessary transformations must be
applied before utilizing this pipeline.

It's designed to provide a generic solution for NVVM targets, generating NVVM
and LLVM dialect code compatible with `mlir-runner` or the execution engine.

#### Example:

Here's a snippet illustrating the use of primary dialects, including arith,
within GPU code execution:

```
func.func @main() {
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    gpu.launch
        blocks(%0, %1, %2) in (%3 = %c1, %4 = %c1, %5 = %c1)
        threads(%6, %7, %8) in (%9 = %c2, %10 = %c1, %11 = %c1) {
        gpu.printf "Hello from %d\n" %6 : index
        gpu.terminator
    }
    return
}
```

The `gpu-lower-to-nvvm-pipeline` compiles this input code to NVVM format as
below. It provides customization options like specifying SM capability, PTX
version, and optimization level. Once compiled, the resulting IR is ready for
execution using `mlir-runner`. Alternatively, it can be translated into LLVM
IR, expanding its utility within the system.

```
mlir-opt example.mlir -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_90a cubin-features=+ptx80 opt-level=3"
```

### Module serialization
Attributes implementing the GPU Target Attribute Interface handle the
serialization process and are called Target attributes. These attributes can be
attached to GPU modules, indicating the serialization scheme with which to
compile the module into a binary string.

The `gpu-module-to-binary` pass searches for all nested GPU modules and
serializes each module using the target attributes attached to it, producing a
binary with an object for every target.

Example:
```
// Input:
gpu.module @kernels [#nvvm.target<chip = "sm_90">, #nvvm.target<chip = "sm_60">] {
  ...
}
// mlir-opt --gpu-module-to-binary:
gpu.binary @kernels [
  #gpu.object<#nvvm.target<chip = "sm_90">, "sm_90 cubin">,
  #gpu.object<#nvvm.target<chip = "sm_60">, "sm_60 cubin">
]
```

### Offloading LLVM translation
Attributes implementing the GPU Offloading LLVM Translation Attribute Interface
handle the translation of GPU binaries and kernel launches into LLVM
instructions and are called Offloading attributes. These attributes are
attached to GPU binary operations.

During the LLVM translation process, GPU binaries get translated using the
scheme provided by the Offloading attribute, translating the GPU binary into
LLVM instructions. Meanwhile, kernel launches are translated by searching for
the appropriate binary and invoking the procedure provided by the Offloading
attribute in the binary for translating kernel launches into LLVM instructions.

Example:
```
// Input:
// Binary with multiple objects but selecting the second one for embedding.
gpu.binary @binary <#gpu.select_object<#rocdl.target<chip = "gfx90a">>> [
    #gpu.object<#nvvm.target, "NVPTX">,
    #gpu.object<#rocdl.target<chip = "gfx90a">, "AMDGPU">
  ]
llvm.func @foo() {
  ...
  // Launching a kernel inside the binary.
  gpu.launch_func @binary::@func blocks in (%0, %0, %0)
                                 threads in (%0, %0, %0) : i64
                                 dynamic_shared_memory_size %2
                                 args(%1 : i32, %1 : i32)
  ...
}
// mlir-translate --mlir-to-llvmir:
@binary_bin_cst = internal constant [6 x i8] c"AMDGPU", align 8
@binary_func_kernel_name = private unnamed_addr constant [5 x i8] c"func\00", align 1
...
define void @foo() {
  ...
  %module = call ptr @mgpuModuleLoad(ptr @binary_bin_cst)
  %kernel = call ptr @mgpuModuleGetFunction(ptr %module, ptr @binary_func_kernel_name)
  call void @mgpuLaunchKernel(ptr %kernel, ...) ; Launch the kernel
  ...
  call void @mgpuModuleUnload(ptr %module)
  ...
}
...
```

### The binary operation
From a semantic point of view, GPU binaries allow the implementation of many
concepts, from simple object files to fat binaries. By default, the binary
operation uses the `#gpu.select_object` offloading attribute; this attribute
embeds a single object in the binary as a global string. See the attribute docs
for more information.
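
For instance, `#gpu.select_object` may also pick the object to embed by index;
a minimal sketch (the binary name and objects are illustrative):

```
// Embed the object at index 1 (the ROCDL object) in the resulting module.
gpu.binary @my_binary <#gpu.select_object<1>> [
  #gpu.object<#nvvm.target, "NVPTX">,
  #gpu.object<#rocdl.target<chip = "gfx90a">, "AMDGPU">
]
```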

## Operations

[include "Dialects/GPUOps.md"]