.. _libc_gpu_usage:

===================
Using libc for GPUs
===================

.. contents:: Table of Contents
  :depth: 4
  :local:

Using the GPU C library
=======================

Once you have finished :ref:`building<libc_gpu_building>` the GPU C library, it
can be used to run libc or libm functions directly on the GPU. Currently, not
all C standard functions are supported on the GPU; consult the :ref:`list of
supported functions<libc_gpu_support>` to see what is available.

The GPU C library supports two main usage modes. The first is as a
supplementary library for offloading languages such as OpenMP, CUDA, or HIP,
where it aims to provide standard system utilities similar to existing vendor
libraries. The second mode treats the GPU as a hosted target by compiling C or
C++ for it directly. This is closer to targeting OpenCL and is primarily used
for exported functions on the GPU and for testing.

Offloading usage
----------------

Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
file for both the host target and a list of offloading devices. In order to
support standard compilation flows, the ``clang`` driver uses fat binaries,
described in the `clang documentation
<https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.

In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option, which
forwards an argument to the ``clang`` job used to create the final GPU
executable. In most cases the toolchain will pick up the C library
automatically, so passing it explicitly shouldn't be necessary.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
  $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc

This will automatically link in the needed function definitions if they were
required by the user's application. Normally, using the ``-fgpu-rdc`` option
results in sub-par performance because the device code is linked at the ABI
level, which blocks cross-module optimization. However, the offloading
toolchain supports the ``-foffload-lto`` option to enable LTO for the target
device.
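
For example, device LTO could be enabled on top of the OpenMP invocation shown
above like so:

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a -foffload-lto -Xoffload-linker -lc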

Offloading languages require that functions present on the device be declared as
such. This is done with the ``__device__`` keyword in CUDA and HIP or the
``declare target`` pragma in OpenMP. This requires the LLVM C library to expose
its implemented functions to the compiler when building offloading programs. We
support this by providing wrapper headers in the compiler's resource directory.
These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your
installation.
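
If you want to inspect these wrapper headers on your system, the resource
directory can be queried from the compiler itself, for example:

.. code-block:: sh

  $> ls $(clang -print-resource-dir)/include/llvm-libc-wrappers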

The support for HIP and CUDA is more experimental, requiring manual intervention
to link and use the facilities. An example of this is shown in the :ref:`CUDA
server example<libc_gpu_cuda_server>`. The OpenMP offloading toolchain, however,
is fully integrated with the LLVM C library. It automatically handles including
the necessary libraries, defining the device-side interfaces, and running the
RPC server.

OpenMP Offloading example
^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of compiling an OpenMP program with the
GPU C library.

.. code-block:: c++

  #include <stdio.h>

  int main() {
    FILE *file = stderr;
  #pragma omp target teams num_teams(2) thread_limit(2)
  #pragma omp parallel num_threads(2)
    { fputs("Hello from OpenMP!\n", file); }
  }

This can simply be compiled like any other OpenMP application to print from two
threads and two blocks.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a
  $> ./a.out
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!

Including the wrapper headers, linking the C library, and running the :ref:`RPC
server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.
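
If you are curious what the driver does on your behalf, passing ``-###`` will
print the individual compilation and link jobs, including the implicitly added
device-side arguments, without executing them:

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a -###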

Direct compilation
------------------

Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable, similar to OpenCL. This is
done by targeting the GPU architecture using `clang's cross compilation
support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
method that the GPU C library uses both to build the library and to run tests.

This allows us to easily define GPU specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely heavily
on the compiler's intrinsic and built-in functions. For example, the following
function gets the thread identifier in the 'x' dimension on both supported
GPUs.

.. code-block:: c++

  #include <stdint.h>

  uint32_t get_thread_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_tid_x();
  #else
  #error "Unsupported platform"
  #endif
  }

We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
following commands. This will yield valid LLVM-IR for the given target, just as
if we were using CUDA, OpenCL, or OpenMP.

.. code-block:: sh

  $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
  $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c
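
The same intrinsic-based pattern extends to other launch parameters. As a
sketch, an equivalent helper for the block index in the 'x' dimension can use
the analogous builtins under the same target macros:

.. code-block:: c++

  uint32_t get_block_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workgroup_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_ctaid_x();
  #else
  #error "Unsupported platform"
  #endif
  }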

We can also use this support to treat the GPU as a hosted environment by
providing a C library and startup object just like a standard C library running
on the host machine. Then, in order to execute these programs, we provide a
loader utility to launch the executable on the GPU similar to a cross-compiling
emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
GPU. This is clearly not the most efficient way to use a GPU, but it provides a
simple method to test execution on a GPU for debugging or development.

Building for AMDGPU targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AMDGPU target supports several features natively by virtue of using ``lld``
as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
the library. We can directly link against ``libc.a`` and use LTO to generate the
final executable.

.. code-block:: c++

  #include <stdio.h>

  int main() { printf("Hello from AMDGPU!\n"); }

This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be set. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
if present. Additionally, the AMDGPU target always uses ``-flto`` because we
currently do not fully support ELF linking in ``lld``. Once built, we use the
``amdhsa-loader`` utility to launch execution on the GPU. This utility is built
if the ``hsa_runtime64`` library was found at build time.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
  $> amdhsa-loader --threads 2 --blocks 2 a.out
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!

This will include the ``stdio.h`` header, which is found in the
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
will handle the necessary steps to execute the ``main`` function along with
global initializers and command line arguments. Finally, we link in the
``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
functions.
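
As a sketch of what the startup code provides, the following program uses a
global constructor and the command line arguments. This assumes the loader
forwards any arguments placed after the executable name to the program.

.. code-block:: c++

  #include <stdio.h>

  // Run by the startup code before main, just like on a hosted target.
  [[gnu::constructor]] void init() { puts("Hello from a global constructor!"); }

  int main(int argc, char **argv) {
    for (int i = 0; i < argc; ++i)
      printf("argv[%d]: %s\n", i, argv[i]);
  }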

The search paths for the include directories and libraries are automatically
handled by the compiler. We use this support internally to run unit tests on the
GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
also provides ``libc.bc``, a single LLVM-IR bitcode blob that can be used
instead of the static library.
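
As one possible way to consume the bitcode version, the sketch below forwards
clang's ``-mlink-builtin-bitcode`` frontend option with ``-Xclang``; the exact
location of ``libc.bc`` in your installation and the preferred way to link it
may differ.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto \
       -Xclang -mlink-builtin-bitcode -Xclang <install>/lib/amdgcn-amd-amdhsa/libc.bc \
       <install>/lib/amdgcn-amd-amdhsa/crt1.o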

Building for NVPTX targets
^^^^^^^^^^^^^^^^^^^^^^^^^^

The infrastructure is the same as the AMDGPU example. However, the NVPTX binary
utilities are very limited and must be targeted directly. A utility called
``clang-nvlink-wrapper`` instead wraps around the standard link job to give the
illusion that ``nvlink`` is a functional linker.

.. code-block:: c++

  #include <stdio.h>

  int main(int argc, char **argv, char **envp) {
    printf("Hello from NVPTX!\n");
  }

Additionally, the NVPTX ABI requires that every function signature matches
exactly, so we must declare ``main`` with its full prototype; using link time
optimization will help hide this. The installation will contain the
``nvptx-loader`` utility if the CUDA driver was found at build time.

.. code-block:: sh

  $> clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
  $> nvptx-loader --threads 2 --blocks 2 a.out
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!