.. _libc_gpu_usage:

===================
Using libc for GPUs
===================

.. contents:: Table of Contents
  :depth: 4
  :local:

Using the GPU C library
=======================

Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
can be used to run libc or libm functions directly on the GPU. Currently, not
all C standard functions are supported on the GPU. Consult the :ref:`list of
supported functions<libc_gpu_support>` for a comprehensive list.

The GPU C library supports two main usage modes. The first is as a supplementary
library for offloading languages such as OpenMP, CUDA, or HIP. These aim to
provide standard system utilities similarly to existing vendor libraries. The
second method treats the GPU as a hosted target by compiling C or C++ for it
directly. This is more similar to targeting OpenCL and is primarily used for
exported functions on the GPU and testing.

Offloading usage
----------------

Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
file for both the host target and a list of offloading devices.
In order to
support standard compilation flows, the ``clang`` driver uses fat binaries,
described in the `clang documentation
<https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.

In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option, which
forwards an argument to a ``clang`` job used to create the final GPU executable.
The toolchain should pick up the C libraries automatically in most cases, so
this shouldn't be necessary.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
  $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc

This will automatically link in the needed function definitions if they were
required by the user's application. Normally, using the ``-fgpu-rdc`` option
results in sub-par performance because device code is compiled and linked
separately, which blocks cross-module optimization. However, the offloading
toolchain supports the ``-foffload-lto`` option to enable LTO on the target
device.

Offloading languages require that functions present on the device be declared as
such.
This is done with the
``__device__`` keyword in CUDA and HIP or the ``declare target`` pragma in
OpenMP. This requires that the LLVM C library exposes its implemented functions
to the compiler when it is used to build. We support this by providing wrapper
headers in the compiler's resource directory. These are located in
``<clang-resource-dir>/include/llvm-libc-wrappers`` in your installation.

The support for HIP and CUDA is more experimental, requiring manual intervention
to link and use the facilities. An example of this is shown in the :ref:`CUDA
server example<libc_gpu_cuda_server>`. The OpenMP offloading toolchain, however,
is completely integrated with the LLVM C library. It will automatically handle
including the necessary libraries, defining device-side interfaces, and running
the RPC server.

OpenMP Offloading example
^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of compiling an OpenMP program with the
GPU C library.

.. code-block:: c++

  #include <stdio.h>

  int main() {
    FILE *file = stderr;
  #pragma omp target teams num_teams(2) thread_limit(2)
  #pragma omp parallel num_threads(2)
    { fputs("Hello from OpenMP!\n", file); }
  }

This can simply be compiled like any other OpenMP application to print from two
threads and two blocks.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a
  $> ./a.out
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!

Including the wrapper headers, linking the C library, and running the :ref:`RPC
server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.

Direct compilation
------------------

Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable similarly to OpenCL. This is
done by targeting the GPU architecture using `clang's cross compilation
support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
method that the GPU C library uses both to build the library and to run tests.

This allows us to easily define GPU-specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely heavily
on the compiler's intrinsic and built-in functions. For example, the following
function gets the thread identifier in the 'x' dimension on both supported GPUs.

.. code-block:: c++

  #include <stdint.h>

  uint32_t get_thread_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_tid_x();
  #else
  #error "Unsupported platform"
  #endif
  }

We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
following commands. This will yield valid LLVM-IR for the given target just like
if we were using CUDA, OpenCL, or OpenMP.

.. code-block:: sh

  $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
  $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c

We can also use this support to treat the GPU as a hosted environment by
providing a C library and startup object just like a standard C library running
on the host machine.
Then, in order to execute these programs, we provide a
loader utility to launch the executable on the GPU similar to a cross-compiling
emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
GPU. This is clearly not the most efficient way to use a GPU, but it provides a
simple method to test execution on a GPU for debugging or development.

Building for AMDGPU targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AMDGPU target supports several features natively by virtue of using ``lld``
as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
the library. We can directly link against ``libc.a`` and use LTO to generate the
final executable.

.. code-block:: c++

  #include <stdio.h>

  int main() { printf("Hello from AMDGPU!\n"); }

This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be defined. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
if present. Additionally, the AMDGPU target always uses ``-flto`` because we
currently do not fully support ELF linking in ``lld``.
Once built, we use the
``amdhsa-loader`` utility to launch execution on the GPU. This will be built if
the ``hsa_runtime64`` library was found during build time.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
  $> amdhsa-loader --threads 2 --blocks 2 a.out
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!

This will include the ``stdio.h`` header, which is found in the
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
will handle the necessary steps to execute the ``main`` function along with
global initializers and command line arguments. Finally, we link in the
``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
functions.

The search paths for the include directories and libraries are automatically
handled by the compiler. We use this support internally to run unit tests on the
GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
also provides ``libc.bc``, which is a single LLVM-IR bitcode blob that can be
used instead of the static library.
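
The bitcode version can be linked into the device compilation itself rather than
handed to the linker. What follows is only a sketch of one possible flow, not
the canonical one: it assumes clang's ``-mlink-builtin-bitcode`` frontend option
(forwarded with ``-Xclang``) and reuses the ``<install>`` placeholder for the
installation prefix.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto \
       -Xclang -mlink-builtin-bitcode -Xclang <install>/lib/amdgcn-amd-amdhsa/libc.bc \
       <install>/lib/amdgcn-amd-amdhsa/crt1.o

Linking the bitcode this way internalizes the library's definitions early, which
can give the optimizer more to work with than a late static-archive link.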

Building for NVPTX targets
^^^^^^^^^^^^^^^^^^^^^^^^^^

The infrastructure is the same as in the AMDGPU example. However, the NVPTX
binary utilities are very limited and must be targeted directly. A utility
called ``clang-nvlink-wrapper`` instead wraps around the standard link job to
give the illusion that ``nvlink`` is a functional linker.

.. code-block:: c++

  #include <stdio.h>

  int main(int argc, char **argv, char **envp) {
    printf("Hello from NVPTX!\n");
  }

Additionally, the NVPTX ABI requires that every function signature matches. This
requires us to pass the full prototype for ``main``. Using link time
optimization will help hide this. The installation will contain the
``nvptx-loader`` utility if the CUDA driver was found during compilation.

.. code-block:: sh

  $> clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
  $> nvptx-loader --threads 2 --blocks 2 a.out
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!
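
If ``-mcpu=native`` or ``-march=native`` fails because no GPU can be detected,
the ``amdgpu-arch`` and ``nvptx-arch`` tools shipped alongside ``clang`` can
query the installed architectures so one can be passed explicitly. A sketch,
assuming the tools are on ``PATH``; the printed architecture will vary by
system.

.. code-block:: sh

  $> nvptx-arch
  sm_80
  $> amdgpu-arch
  gfx90a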