.. _libc_gpu_usage:

===================
Using libc for GPUs
===================

.. contents:: Table of Contents
  :depth: 4
  :local:

Using the GPU C library
=======================

Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
can be used to run libc or libm functions directly on the GPU. Currently, not
all C standard functions are supported on the GPU. Consult the :ref:`list of
supported functions<libc_gpu_support>` for a comprehensive list.

The GPU C library supports two main usage modes. The first is as a supplementary
library for offloading languages such as OpenMP, CUDA, or HIP. These aim to
provide standard system utilities similarly to existing vendor libraries. The
second method treats the GPU as a hosted target by compiling C or C++ for it
directly. This is more similar to targeting OpenCL and is primarily used for
exported functions on the GPU and testing.

Offloading usage
----------------

Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
file for both the host target and a list of offloading devices.
In order to
support standard compilation flows, the ``clang`` driver uses fat binaries,
described in the `clang documentation
<https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.

In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option, which
forwards an argument to a ``clang`` job used to create the final GPU executable.
The toolchain should pick up the C libraries automatically in most cases, so
this shouldn't be necessary.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
  $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc

This will automatically link in the needed function definitions if they were
required by the user's application. Normally, using the ``-fgpu-rdc`` option
results in sub-par performance because device code is compiled and linked
separately, which blocks cross-module optimization. However, the offloading
toolchain supports the ``-foffload-lto`` option to enable LTO on the target
device.

Offloading languages require that functions present on the device be declared as
such.
This is done with the
``__device__`` keyword in CUDA and HIP or the ``declare target`` pragma in
OpenMP. This requires that the LLVM C library exposes its implemented functions
to the compiler when it is used to build. We support this by providing wrapper
headers in the compiler's resource directory. These are located in
``<clang-resource-dir>/include/llvm-libc-wrappers`` in your installation.

The support for HIP and CUDA is more experimental, requiring manual intervention
to link and use the facilities. An example of this is shown in the :ref:`CUDA
server example<libc_gpu_cuda_server>`. The OpenMP offloading toolchain, however,
is completely integrated with the LLVM C library. It will automatically handle
including the necessary libraries, defining device-side interfaces, and running
the RPC server.

OpenMP Offloading example
^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of compiling an OpenMP program with the
GPU C library.

.. code-block:: c++

  #include <stdio.h>

  int main() {
    FILE *file = stderr;
  #pragma omp target teams num_teams(2) thread_limit(2)
  #pragma omp parallel num_threads(2)
    { fputs("Hello from OpenMP!\n", file); }
  }

This can simply be compiled like any other OpenMP application to print from two
threads and two blocks.

.. code-block:: sh

  $> clang openmp.c -fopenmp --offload-arch=gfx90a
  $> ./a.out
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!
  Hello from OpenMP!

Including the wrapper headers, linking the C library, and running the :ref:`RPC
server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.

Direct compilation
------------------

Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable similarly to OpenCL. This is
done by targeting the GPU architecture using `clang's cross compilation
support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
method that the GPU C library uses both to build the library and to run tests.

This allows us to easily define GPU-specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely heavily
on the compiler's intrinsic and built-in functions. For example, the following
function gets the thread identifier in the 'x' dimension on both supported GPUs.

.. code-block:: c++

  #include <stdint.h>

  uint32_t get_thread_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_tid_x();
  #else
  #error "Unsupported platform"
  #endif
  }

We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
following commands. This will yield valid LLVM-IR for the given target just like
if we were using CUDA, OpenCL, or OpenMP.

.. code-block:: sh

  $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
  $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c

We can also use this support to treat the GPU as a hosted environment by
providing a C library and startup object just like a standard C library running
on the host machine.
Then, in order to execute these programs, we provide a
loader utility to launch the executable on the GPU similar to a cross-compiling
emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
GPU. This is clearly not the most efficient way to use a GPU, but it provides a
simple method to test execution on a GPU for debugging or development.

Building for AMDGPU targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AMDGPU target supports several features natively by virtue of using ``lld``
as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
the library. We can directly link against ``libc.a`` and use LTO to generate the
final executable.

.. code-block:: c++

  #include <stdio.h>

  int main() { printf("Hello from AMDGPU!\n"); }

This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be defined. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
if present. Additionally, the AMDGPU target always uses ``-flto`` because we
currently do not fully support ELF linking in ``lld``.
Once built, we use the
``amdhsa-loader`` utility to launch execution on the GPU. This will be built if
the ``hsa_runtime64`` library was found during build time.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
  $> amdhsa-loader --threads 2 --blocks 2 a.out
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!
  Hello from AMDGPU!

This will include the ``stdio.h`` header, which is found in the
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
will handle the necessary steps to execute the ``main`` function along with
global initializers and command line arguments. Finally, we link in the
``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
functions.

The search paths for the include directories and libraries are automatically
handled by the compiler. We use this support internally to run unit tests on the
GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
also provides ``libc.bc``, which is a single LLVM-IR bitcode blob that can be
used instead of the static library.
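
The bitcode version can be linked into the device compilation itself rather than
handed to the linker. What follows is only a sketch of one possible flow, not
the canonical one: it assumes clang's ``-mlink-builtin-bitcode`` frontend option
(forwarded with ``-Xclang``) and reuses the ``<install>`` placeholder for the
installation prefix.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto \
       -Xclang -mlink-builtin-bitcode -Xclang <install>/lib/amdgcn-amd-amdhsa/libc.bc \
       <install>/lib/amdgcn-amd-amdhsa/crt1.o

Linking the bitcode this way internalizes the library's definitions early, which
can give the optimizer more to work with than a late static-archive link.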

Building for NVPTX targets
^^^^^^^^^^^^^^^^^^^^^^^^^^

The infrastructure is the same as in the AMDGPU example. However, the NVPTX
binary utilities are very limited and must be targeted directly. A utility
called ``clang-nvlink-wrapper`` instead wraps around the standard link job to
give the illusion that ``nvlink`` is a functional linker.

.. code-block:: c++

  #include <stdio.h>

  int main(int argc, char **argv, char **envp) {
    printf("Hello from NVPTX!\n");
  }

Additionally, the NVPTX ABI requires that every function signature matches. This
requires us to pass the full prototype for ``main``. Using link time
optimization will help hide this. The installation will contain the
``nvptx-loader`` utility if the CUDA driver was found during compilation.

.. code-block:: sh

  $> clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
  $> nvptx-loader --threads 2 --blocks 2 a.out
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!
  Hello from NVPTX!
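
If ``-mcpu=native`` or ``-march=native`` fails because no GPU can be detected,
the ``amdgpu-arch`` and ``nvptx-arch`` tools shipped alongside ``clang`` can
query the installed architectures so one can be passed explicitly. A sketch,
assuming the tools are on ``PATH``; the printed architecture will vary by
system.

.. code-block:: sh

  $> nvptx-arch
  sm_80
  $> amdgpu-arch
  gfx90a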