.. _libc_gpu_usage:

===================
Using libc for GPUs
===================

.. contents:: Table of Contents
  :depth: 4
  :local:

Using the GPU C library
=======================

Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
can be used to run libc or libm functions directly on the GPU. Currently, not
all C standard functions are supported on the GPU. Consult the :ref:`list of
supported functions<libc_gpu_support>` for a comprehensive list.

The GPU C library supports two main usage modes. The first is as a
supplementary library for offloading languages such as OpenMP, CUDA, or HIP.
These aim to provide standard system utilities similar to existing vendor
libraries. The second method treats the GPU as a hosted target by compiling C
or C++ for it directly. This is more similar to targeting OpenCL and is
primarily used for exported functions on the GPU and testing.

Offloading usage
----------------

Offloading languages like CUDA, HIP, or OpenMP work by compiling a single
source file for both the host target and a list of offloading devices. In
order to support standard compilation flows, the ``clang`` driver uses fat
binaries, described in the `clang documentation
<https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is
used by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP
toolchains through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.

In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option,
which forwards an argument to the ``clang`` job used to create the final GPU
executable. The toolchain should pick up the C libraries automatically in most
cases, so passing ``-lc`` explicitly shouldn't be necessary.

.. code-block:: sh

   $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
   $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
   $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc

This will automatically link in the needed function definitions if they were
required by the user's application. Normally, using the ``-fgpu-rdc`` option
results in sub-par performance because device code is linked separately and
cannot be optimized across translation units. However, the offloading
toolchain supports the ``-foffload-lto`` option to enable LTO on the target
device.

Offloading languages require that functions present on the device be declared
as such. This is done with the ``__device__`` keyword in CUDA and HIP or the
``declare target`` pragma in OpenMP. This requires the LLVM C library to
expose its implemented functions to the compiler. We support this by providing
wrapper headers in the compiler's resource directory. These are located in
``<clang-resource-dir>/include/llvm-libc-wrappers`` in your installation; a
simplified sketch of the idea is shown below.

The support for HIP and CUDA is more experimental, requiring manual
intervention to link and use the facilities. An example of this is shown in
the :ref:`CUDA server example<libc_gpu_cuda_server>`. The OpenMP offloading
toolchain, however, is completely integrated with the LLVM C library. It will
automatically handle including the necessary libraries, define the device-side
interfaces, and run the RPC server.
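As a rough illustration of the mechanism, the wrapper headers conceptually
re-declare the implemented libc entrypoints so that the compiler considers
them present on the device. The following is a simplified sketch of the idea
for OpenMP, not the actual wrapper contents:

.. code-block:: c++

   #include <stdio.h>

   // Simplified sketch only. The real wrapper headers in
   // <clang-resource-dir>/include/llvm-libc-wrappers cover many more functions
   // and also handle the CUDA and HIP declarations.
   #pragma omp begin declare target
   int fputs(const char *__restrict str, FILE *__restrict stream);
   #pragma omp end declare target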
OpenMP Offloading example
^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of compiling an OpenMP program with the
GPU C library.

.. code-block:: c++

   #include <stdio.h>

   int main() {
     FILE *file = stderr;
   #pragma omp target teams num_teams(2) thread_limit(2)
   #pragma omp parallel num_threads(2)
     { fputs("Hello from OpenMP!\n", file); }
   }

This can simply be compiled like any other OpenMP application to print from
two threads in each of two blocks.

.. code-block:: sh

   $> clang openmp.c -fopenmp --offload-arch=gfx90a
   $> ./a.out
   Hello from OpenMP!
   Hello from OpenMP!
   Hello from OpenMP!
   Hello from OpenMP!

Including the wrapper headers, linking the C library, and running the
:ref:`RPC server<libc_gpu_rpc>` are all handled automatically by the compiler
and runtime.

Direct compilation
------------------

Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable similarly to OpenCL. This
is done by targeting the GPU architecture using `clang's cross compilation
support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
method that the GPU C library uses both to build the library and to run tests.

This allows us to easily define GPU-specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely
heavily on the compiler's intrinsic and built-in functions. For example, the
following function gets the thread identifier in the 'x' dimension on both
supported GPUs.

.. code-block:: c++

   #include <stdint.h>

   uint32_t get_thread_id_x() {
   #if defined(__AMDGPU__)
     return __builtin_amdgcn_workitem_id_x();
   #elif defined(__NVPTX__)
     return __nvvm_read_ptx_sreg_tid_x();
   #else
   #error "Unsupported platform"
   #endif
   }
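The same pattern extends to the other launch parameters. As a hedged sketch,
the block (workgroup) identifier in the 'x' dimension can be queried the same
way; ``get_block_id_x`` here is a hypothetical helper for illustration, not
part of the library's interface.

.. code-block:: c++

   #include <stdint.h>

   // Hypothetical helper following the same pattern as get_thread_id_x above.
   // Returns the index of the current block (NVPTX) or workgroup (AMDGPU).
   uint32_t get_block_id_x() {
   #if defined(__AMDGPU__)
     return __builtin_amdgcn_workgroup_id_x();
   #elif defined(__NVPTX__)
     return __nvvm_read_ptx_sreg_ctaid_x();
   #else
   #error "Unsupported platform"
   #endif
   }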
We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
following commands. This will yield valid LLVM-IR for the given target just as
if we were using CUDA, OpenCL, or OpenMP.

.. code-block:: sh

   $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
   $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c

We can also use this support to treat the GPU as a hosted environment by
providing a C library and startup object just like a standard C library
running on the host machine. Then, in order to execute these programs, we
provide a loader utility to launch the executable on the GPU similar to a
cross-compiling emulator. This is how we run :ref:`unit tests
<libc_gpu_testing>` targeting the GPU. This is clearly not the most efficient
way to use a GPU, but it provides a simple method to test execution on a GPU
for debugging or development.

Building for AMDGPU targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AMDGPU target supports several features natively by virtue of using
``lld`` as its linker. The installation will include the
``include/amdgcn-amd-amdhsa`` and ``lib/amdgcn-amd-amdhsa`` directories that
contain the necessary code to use the library. We can directly link against
``libc.a`` and use LTO to generate the final executable.

.. code-block:: c++

   #include <stdio.h>

   int main() { printf("Hello from AMDGPU!\n"); }

This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be defined. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use
``-mcpu=help`` for accepted arguments or ``-mcpu=native`` to target the
system's installed GPUs if present. Additionally, the AMDGPU target always
uses ``-flto`` because we currently do not fully support ELF linking in
``lld``. Once built, we use the ``amdhsa-loader`` utility to launch execution
on the GPU. This utility is built if the ``hsa_runtime64`` library was found
at build time.

.. code-block:: sh

   $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
   $> amdhsa-loader --threads 2 --blocks 2 a.out
   Hello from AMDGPU!
   Hello from AMDGPU!
   Hello from AMDGPU!
   Hello from AMDGPU!

This will include the ``stdio.h`` header, which is found in the
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like
a standard application. The startup utility in
``lib/amdgcn-amd-amdhsa/crt1.o`` will handle the necessary steps to execute
the ``main`` function along with global initializers and command line
arguments. Finally, we link in the ``libc.a`` library stored in
``lib/amdgcn-amd-amdhsa`` to define the standard C functions.

The search paths for the include directories and libraries are automatically
handled by the compiler. We use this support internally to run unit tests on
the GPU directly. See :ref:`libc_gpu_testing` for more information. The
installation also provides ``libc.bc``, a single LLVM-IR bitcode blob that
can be used instead of the static library.

Building for NVPTX targets
^^^^^^^^^^^^^^^^^^^^^^^^^^

The infrastructure is the same as in the AMDGPU example. However, the NVPTX
binary utilities are very limited and must be targeted directly. A utility
called ``clang-nvlink-wrapper`` instead wraps around the standard link job to
give the illusion that ``nvlink`` is a functional linker.

.. code-block:: c++

   #include <stdio.h>

   int main(int argc, char **argv, char **envp) {
     printf("Hello from NVPTX!\n");
   }

Additionally, the NVPTX ABI requires that every function signature matches
exactly, so we must pass the full prototype of ``main``. Using link time
optimization will help hide this. The installation will contain the
``nvptx-loader`` utility if the CUDA driver was found during compilation.

.. code-block:: sh

   $> clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
   $> nvptx-loader --threads 2 --blocks 2 a.out
   Hello from NVPTX!
   Hello from NVPTX!
   Hello from NVPTX!
   Hello from NVPTX!
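The compile commands above rely on ``-mcpu=native`` or ``-march=native`` to
detect the installed GPU. If you need the architecture string explicitly, the
``amdgpu-arch`` and ``nvptx-arch`` utilities that ship with ``clang`` print
the architectures of the GPUs visible on the system, one per line; the output
below is illustrative.

.. code-block:: sh

   $> amdgpu-arch
   gfx90a
   $> nvptx-arch
   sm_80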