.. _libc_gpu_usage:

===================
Using libc for GPUs
===================

.. contents:: Table of Contents
  :depth: 4
  :local:

Using the GPU C library
=======================

Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
can be used to run libc or libm functions directly on the GPU. Currently, not
all C standard functions are supported on the GPU. Consult the :ref:`list of
supported functions<libc_gpu_support>` for a comprehensive list.

The GPU C library supports two main usage modes. The first is as a supplementary
library for offloading languages such as OpenMP, CUDA, or HIP. These aim to
provide standard system utilities similarly to existing vendor libraries. The
second method treats the GPU as a hosted target by compiling C or C++ for it
directly. This is more similar to targeting OpenCL and is primarily used for
exported functions on the GPU and testing.

Offloading usage
----------------

Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
file for both the host target and a list of offloading devices. In order to
support standard compilation flows, the ``clang`` driver uses fat binaries,
described in the `clang documentation
<https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.

In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option, which
forwards an argument to a ``clang`` job used to create the final GPU executable.
The toolchain should pick up the C libraries automatically in most cases, so
this shouldn't be necessary.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
  $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc

This will automatically link in the needed function definitions if they were
required by the user's application. Normally, using the ``-fgpu-rdc`` option
results in sub-par performance because functions cannot be optimized across
translation unit boundaries at link time. However, the offloading toolchain
supports the ``-foffload-lto`` option to enable LTO on the target device.
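
As a short sketch, device LTO can be enabled by adding ``-foffload-lto`` to one
of the invocations above, for example:

.. code-block:: sh

  $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -foffload-lto -Xoffload-linker -lc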

Offloading languages require that functions present on the device be declared as
such. This is done with the ``__device__`` keyword in CUDA and HIP or the
``declare target`` pragma in OpenMP. This requires that the LLVM C library
expose its implemented functions to the compiler when building the application.
We support this by providing wrapper headers in the compiler's resource
directory. These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers``
in your installation.
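
To see exactly which wrapper headers your installation provides, one option
(shown as a sketch; it only assumes a standard ``clang`` installation) is to
list the resource directory directly:

.. code-block:: sh

  $> ls "$(clang -print-resource-dir)/include/llvm-libc-wrappers"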

The support for HIP and CUDA is more experimental, requiring manual intervention
to link and use the facilities. An example of this is shown in the :ref:`CUDA
server example<libc_gpu_cuda_server>`. The OpenMP Offloading toolchain, however,
is completely integrated with the LLVM C library. It will automatically handle
including the necessary libraries, defining the device-side interfaces, and
running the RPC server.

OpenMP Offloading example
^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of compiling an OpenMP program with the
GPU C library.

.. code-block:: c++

  #include <stdio.h>

  int main() {
    FILE *file = stderr;
  #pragma omp target teams num_teams(2) thread_limit(2)
  #pragma omp parallel num_threads(2)
    { fputs("Hello from OpenMP!\n", file); }
  }

This can simply be compiled like any other OpenMP application to print from two
threads and two blocks.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a
  $> ./a.out
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!

Including the wrapper headers, linking the C library, and running the :ref:`RPC
server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.

Direct compilation
------------------

Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable similarly to OpenCL. This is
done by targeting the GPU architecture using `clang's cross compilation
support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
method that the GPU C library uses both to build the library and to run tests.

This allows us to easily define GPU specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely heavily
on the compiler's intrinsic and built-in functions. For example, the following
function gets the thread identifier in the 'x' dimension on both supported
GPUs.

.. code-block:: c++

  #include <stdint.h>

  uint32_t get_thread_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_tid_x();
  #else
  #error "Unsupported platform"
  #endif
  }

We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
following commands. This will yield valid LLVM-IR for the given target just as
if we were using CUDA, OpenCL, or OpenMP.

.. code-block:: sh

  $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
  $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c
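
As a further illustration of this builtin-based approach (an extra sketch, not
part of the original example), a similar function can return the block index in
the 'x' dimension; if added to the same file it compiles with the same commands
shown above.

.. code-block:: c++

  #include <stdint.h>

  uint32_t get_block_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workgroup_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_ctaid_x();
  #else
  #error "Unsupported platform"
  #endif
  }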

We can also use this support to treat the GPU as a hosted environment by
providing a C library and startup object just like a standard C library running
on the host machine. Then, in order to execute these programs, we provide a
loader utility to launch the executable on the GPU similar to a cross-compiling
emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
GPU. This is clearly not the most efficient way to use a GPU, but it provides a
simple method to test execution on a GPU for debugging or development.

Building for AMDGPU targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AMDGPU target supports several features natively by virtue of using ``lld``
as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
the library. We can directly link against ``libc.a`` and use LTO to generate the
final executable.

.. code-block:: c++

  #include <stdio.h>

  int main() { printf("Hello from AMDGPU!\n"); }

This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be specified. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
if present. Additionally, the AMDGPU target always uses ``-flto`` because we
currently do not fully support ELF linking in ``lld``. Once built, we use the
``amdhsa-loader`` utility to launch execution on the GPU. This will be built if
the ``hsa_runtime64`` library was found at build time.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
  $> amdhsa-loader --threads 2 --blocks 2 a.out
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!

This will include the ``stdio.h`` header, which is found in the
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
will handle the necessary steps to execute the ``main`` function along with
global initializers and command line arguments. Finally, we link in the
``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
functions.

The search paths for the include directories and libraries are automatically
handled by the compiler. We use this support internally to run unit tests on the
GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
also provides ``libc.bc``, which is a single LLVM-IR bitcode blob that can be
used instead of the static library.
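
As a rough sketch (and assuming the same ``<install>`` layout as above), the
bitcode library can be passed directly on the link line in place of ``-lc`` when
compiling with LTO:

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto <install>/lib/amdgcn-amd-amdhsa/crt1.o <install>/lib/amdgcn-amd-amdhsa/libc.bc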

Building for NVPTX targets
^^^^^^^^^^^^^^^^^^^^^^^^^^

The infrastructure is the same as the AMDGPU example. However, the NVPTX binary
utilities are very limited and must be targeted directly. A utility called
``clang-nvlink-wrapper`` instead wraps around the standard link job to give the
illusion that ``nvlink`` is a functional linker.

.. code-block:: c++

  #include <stdio.h>

  int main(int argc, char **argv, char **envp) {
    printf("Hello from NVPTX!\n");
  }

Additionally, the NVPTX ABI requires that every function signature matches. This
requires us to declare ``main`` with its full prototype; using link time
optimization will help hide this requirement. The installation will contain the
``nvptx-loader`` utility if the CUDA driver was found during compilation.

.. code-block:: sh

  $> clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
  $> nvptx-loader --threads 2 --blocks 2 a.out
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!