1============================= 2Offloading Design & Internals 3============================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11This document describes the Clang driver and code generation steps for creating 12offloading applications. Clang supports offloading to various architectures 13using programming models like CUDA, HIP, and OpenMP. The purpose of this 14document is to illustrate the steps necessary to create an offloading 15application using Clang. 16 17OpenMP Offloading 18================= 19 20Clang supports OpenMP target offloading to several different architectures such 21as NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated by 22Clang and then executed using the ``libomptarget`` runtime and the associated 23plugin for the target architecture, e.g. ``libomptarget.rtl.cuda``. This section 24describes the steps necessary to create a functioning device image that can be 25loaded by the OpenMP runtime. More information on the OpenMP runtimes can be 26found at the `OpenMP documentation page <https://openmp.llvm.org>`__. 27 28.. _Offloading Overview: 29 30Offloading Overview 31------------------- 32 33The goal of offloading compilation is to create an executable device image that 34can be run on the target device. OpenMP offloading creates executable images by 35compiling the input file for both the host and the target device. The output 36from the device phase then needs to be embedded into the host to create a fat 37object. A special tool then needs to extract the device code from the fat 38objects, run the device linking step, and embed the final image in a symbol the 39host runtime library can use to register the library and access the symbols on 40the device. 41 42Compilation Process 43^^^^^^^^^^^^^^^^^^^ 44 45The compiler performs the following high-level actions to generate OpenMP 46offloading code: 47 48* Compile the input file for the host to produce a bitcode file. Lower ``#pragma 49 omp target`` declarations to :ref:`offloading entries <Generating Offloading 50 Entries>` and create metadata to indicate which entries are on the device. 51* Compile the input file for the target :ref:`device <Device Compilation>` using 52 the :ref:`offloading entry <Generating Offloading Entries>` metadata created 53 by the host. 54* Link the OpenMP device runtime library and run the backend to create a device 55 object file. 56* Run the backend on the host bitcode file and create a :ref:`fat object file 57 <Creating Fat Objects>` using the device object file. 58* Pass the fat object file to the :ref:`linker wrapper tool <Device Linking>` 59 and extract the device objects. Run the device linking action on the extracted 60 objects. 61* :ref:`Wrap <Device Binary Wrapping>` the :ref:`device images <Device linking>` 62 and :ref:`offload entries <Generating Offloading Entries>` in a symbol that 63 can be accessed by the host. 64* Add the :ref:`wrapped binary <Device Binary Wrapping>` to the linker input and 65 run the host linking action. Link with ``libomptarget`` to register and 66 execute the images. 67 68 .. _Generating Offloading Entries: 69 70Generating Offloading Entries 71----------------------------- 72 73The first step in compilation is to generate offloading entries for the host. 74This information is used to identify function kernels or global values that will 75be provided by the device. Blocks contained in a ``#pragma omp target`` or 76symbols inside a ``#pragma omp declare target`` directive will have offloading 77entries generated. The following table shows the :ref:`offload entry structure 78<table-tgt_offload_entry_structure>`. 79 80 .. table:: __tgt_offload_entry Structure 81 :name: table-tgt_offload_entry_structure 82 83 +---------+------------+------------------------------------------------------------------------+ 84 | Type | Identifier | Description | 85 +=========+============+========================================================================+ 86 | void* | addr | Address of global symbol within device image (function or global) | 87 +---------+------------+------------------------------------------------------------------------+ 88 | char* | name | Name of the symbol | 89 +---------+------------+------------------------------------------------------------------------+ 90 | size_t | size | Size of the entry info (0 if it is a function) | 91 +---------+------------+------------------------------------------------------------------------+ 92 | int32_t | flags | Flags associated with the entry (see :ref:`table-offload_entry_flags`) | 93 +---------+------------+------------------------------------------------------------------------+ 94 | int32_t | reserved | Reserved, to be used by the runtime library. | 95 +---------+------------+------------------------------------------------------------------------+ 96 97The address of the global symbol will be set to the device pointer value by the 98runtime once the device image is loaded. The flags are set to indicate the 99handling required for the offloading entry. If the offloading entry is an entry 100to a target region it can have one of the following :ref:`entry flags 101<table-offload_entry_flags>`. 102 103 .. table:: Target Region Entry Flags 104 :name: table-offload_entry_flags 105 106 +----------------------------------+-------+-----------------------------------------+ 107 | Name | Value | Description | 108 +==================================+=======+=========================================+ 109 | OMPTargetRegionEntryTargetRegion | 0x00 | Mark the entry as generic target region | 110 +----------------------------------+-------+-----------------------------------------+ 111 | OMPTargetRegionEntryCtor | 0x02 | Mark the entry as a global constructor | 112 +----------------------------------+-------+-----------------------------------------+ 113 | OMPTargetRegionEntryDtor | 0x04 | Mark the entry as a global destructor | 114 +----------------------------------+-------+-----------------------------------------+ 115 116If the offloading entry is a global variable, indicated by a non-zero size, it 117will instead have one of the following :ref:`global 118<table-offload_global_flags>` flags. 119 120 .. table:: Target Region Global 121 :name: table-offload_global_flags 122 123 +-----------------------------+-------+---------------------------------------------------------------+ 124 | Name | Value | Description | 125 +=============================+=======+===============================================================+ 126 | OMPTargetGlobalVarEntryTo | 0x00 | Mark the entry as a 'to' attribute (w.r.t. the to clause) | 127 +-----------------------------+-------+---------------------------------------------------------------+ 128 | OMPTargetGlobalVarEntryLink | 0x01 | Mark the entry as a 'link' attribute (w.r.t. the link clause) | 129 +-----------------------------+-------+---------------------------------------------------------------+ 130 131The target offload entries are used by the runtime to access the device kernels 132and globals that will be provided by the final device image. Each offloading 133entry is set to use the ``omp_offloading_entries`` section. When the final 134application is created the linker will provide the 135``__start_omp_offloading_entries`` and ``__stop_omp_offloading_entries`` symbols 136which are used to create the :ref:`final image <Device Binary Wrapping>`. 137 138This information is used by the device compilation stage to determine which 139symbols need to be exported from the device. We use the ``omp_offload.info`` 140metadata node to pass this information device compilation stage. 141 142Accessing Entries on the Device 143^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 144 145Accessing the entries in the device is done using the address field in the 146:ref:`offload entry<table-tgt_offload_entry_structure>`. The runtime will set 147the address to the pointer associated with the device image during runtime 148initialization. This is used to call the corresponding kernel function when 149entering a ``#pragma omp target`` region. For variables, the runtime maintains a 150table mapping host pointers to device pointers. Global variables inside a 151``#pragma omp target declare`` directive are first initialized to the host's 152address. Once the device address is initialized we insert it into the table to 153map the host address to the device address. 154 155Debugging Information 156^^^^^^^^^^^^^^^^^^^^^ 157 158We generate structures to hold debugging information that is passed to 159``libomptarget``. This allows the front-end to generate information the runtime 160library uses for more informative error messages. This is done using the 161standard :ref:`identifier structure <table-ident_t_structure>` used in 162``libomp`` and ``libomptarget``. This is used to pass information and source 163locations to the runtime. 164 165 .. table:: ident_t Structure 166 :name: table-ident_t_structure 167 168 +---------+------------+-----------------------------------------------------------------------------+ 169 | Type | Identifier | Description | 170 +=========+============+=============================================================================+ 171 | int32_t | reserved | Reserved, to be used by the runtime library. | 172 +---------+------------+-----------------------------------------------------------------------------+ 173 | int32_t | flags | Flags used to indicate some features, mostly unused. | 174 +---------+------------+-----------------------------------------------------------------------------+ 175 | int32_t | reserved | Reserved, to be used by the runtime library. | 176 +---------+------------+-----------------------------------------------------------------------------+ 177 | int32_t | reserved | Reserved, to be used by the runtime library. | 178 +---------+------------+-----------------------------------------------------------------------------+ 179 | char* | psource | Program source information, stored as ";filename;function;line;column;;\\0" | 180 +---------+------------+-----------------------------------------------------------------------------+ 181 182If debugging information is enabled, we will also create strings to indicate the 183names and declarations of variables mapped in target regions. These have the 184same format as the source location in the :ref:`identifier structure 185<table-ident_t_structure>`, but the function name is replaced with the variable 186name. 187 188.. _Device Compilation: 189 190Offload Device Compilation 191-------------------------- 192 193The input file is compiled for each active device toolchain. The device 194compilation stage is performed differently from the host stage. Namely, we do 195not generate any offloading entries. This is set by passing the 196``-fopenmp-is-target-device`` flag to the front-end. We use the host bitcode to 197determine which symbols to export from the device. The bitcode file is passed in 198from the previous stage using the ``-fopenmp-host-ir-file-path`` flag. 199Compilation is otherwise performed as it would be for any other target triple. 200 201When compiling for the OpenMP device, we set the visibility of all device 202symbols to be ``protected`` by default. This improves performance and prevents a 203class of errors where a symbol in the target device could preempt a host 204library. 205 206The OpenMP runtime library is linked in during compilation to provide the 207implementations for standard OpenMP functionality. For GPU targets this is done 208by linking in a special bitcode library during compilation, (e.g. 209``libomptarget-nvptx64-sm_70.bc``) using the ``-mlink-builtin-bitcode`` flag. 210Other device libraries, such as CUDA's libdevice, are also linked this way. If 211the target is a standard architecture with an existing ``libomp`` 212implementation, that will be linked instead. Finally, device tools are used to 213create a relocatable device object file that can be embedded in the host. 214 215.. _Creating Fat Objects: 216 217Creating Fat Objects 218-------------------- 219 220A fat binary is a binary file that contains information intended for another 221device. We create a fat object by embedding the output of the device compilation 222stage into the host as a named section. The output from the device compilation 223is passed to the host backend using the ``-fembed-offload-object`` flag. This 224embeds the device image into the ``.llvm.offloading`` section using a special 225binary format that behaves like a string map. This binary format is used to 226bundle metadata about the image so the linker can associate the proper device 227linking action with the image. Each device image will start with the magic bytes 228``0x10FF10AD``. 229 230.. code-block:: llvm 231 232 @llvm.embedded.object = private constant [1 x i8] c"\00", section ".llvm.offloading" 233 234The device code will then be placed in the corresponding section one the backend 235is run on the host, creating a fat object. Using fat objects allows us to treat 236offloading objects as standard host objects. The final object file should 237contain the following :ref:`offloading sections <table-offloading_sections>`. We 238will use this information when :ref:`Device Linking`. 239 240 .. table:: Offloading Sections 241 :name: table-offloading_sections 242 243 +----------------------------------+------------------------------------------------------------------------------+ 244 | Section | Description | 245 +==================================+==============================================================================+ 246 | omp_offloading_entries | Offloading entry information (see :ref:`table-tgt_offload_entry_structure`) | 247 +----------------------------------+------------------------------------------------------------------------------+ 248 | .llvm.offloading | Embedded device object file for the target device and architecture | 249 +----------------------------------+------------------------------------------------------------------------------+ 250 251.. _Device Linking: 252 253Linking Target Device Code 254-------------------------- 255 256Objects containing :ref:`table-offloading_sections` require special handling to 257create an executable device image. This is done using a Clang tool, see 258:doc:`ClangLinkerWrapper` for more information. This tool works as a wrapper 259over the host linking job. It scans the input object files for the offloading 260section ``.llvm.offloading``. The device files stored in this section are then 261extracted and passed to the appropriate linking job. The linked device image is 262then :ref:`wrapped <Device Binary Wrapping>` to create the symbols used to load 263the device image and link it with the host. 264 265The linker wrapper tool supports linking bitcode files through link time 266optimization (LTO). This is used whenever the object files embedded in the host 267contain LLVM bitcode. Bitcode will be embedded for architectures that do not 268support a relocatable object format, such as AMDGPU or SPIR-V, or if the user 269requested it using the ``-foffload-lto`` flag. 270 271.. _Device Binary Wrapping: 272 273Device Binary Wrapping 274---------------------- 275 276Various structures and functions are used to create the information necessary to 277offload code on the device. We use the :ref:`linked device executable <Device 278Linking>` with the corresponding offloading entries to create the symbols 279necessary to load and execute the device image. 280 281Structure Types 282^^^^^^^^^^^^^^^ 283 284Several different structures are used to store offloading information. The 285:ref:`device image structure <table-device_image_structure>` stores a single 286linked device image and its associated offloading entries. The offloading 287entries are stored using the ``__start_omp_offloading_entries`` and 288``__stop_omp_offloading_entries`` symbols generated by the linker using the 289:ref:`table-tgt_offload_entry_structure`. 290 291 .. table:: __tgt_device_image Structure 292 :name: table-device_image_structure 293 294 +----------------------+--------------+----------------------------------------+ 295 | Type | Identifier | Description | 296 +======================+==============+========================================+ 297 | void* | ImageStart | Pointer to the target code start | 298 +----------------------+--------------+----------------------------------------+ 299 | void* | ImageEnd | Pointer to the target code end | 300 +----------------------+--------------+----------------------------------------+ 301 | __tgt_offload_entry* | EntriesBegin | Begin of table with all target entries | 302 +----------------------+--------------+----------------------------------------+ 303 | __tgt_offload_entry* | EntriesEnd | End of table (non inclusive) | 304 +----------------------+--------------+----------------------------------------+ 305 306The target :ref:`target binary descriptor <table-target_binary_descriptor>` is 307used to store all binary images and offloading entries in an array. 308 309 .. table:: __tgt_bin_desc Structure 310 :name: table-target_binary_descriptor 311 312 +----------------------+------------------+------------------------------------------+ 313 | Type | Identifier | Description | 314 +======================+==================+==========================================+ 315 | int32_t | NumDeviceImages | Number of device types supported | 316 +----------------------+------------------+------------------------------------------+ 317 | __tgt_device_image* | DeviceImages | Array of device images (1 per dev. type) | 318 +----------------------+------------------+------------------------------------------+ 319 | __tgt_offload_entry* | HostEntriesBegin | Begin of table with all host entries | 320 +----------------------+------------------+------------------------------------------+ 321 | __tgt_offload_entry* | HostEntriesEnd | End of table (non inclusive) | 322 +----------------------+------------------+------------------------------------------+ 323 324Global Variables 325---------------- 326 327:ref:`table-global_variables` lists various global variables, along with their 328type and their explicit ELF sections, which are used to store device images and 329related symbols. 330 331 .. table:: Global Variables 332 :name: table-global_variables 333 334 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 335 | Variable | Type | ELF Section | Description | 336 +================================+=====================+=========================+=========================================================+ 337 | __start_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | Begin symbol for the offload entries table. | 338 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 339 | __stop_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | End symbol for the offload entries table. | 340 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 341 | __dummy.omp_offloading.entry | __tgt_offload_entry | .omp_offloading_entries | Dummy zero-sized object in the offload entries | 342 | | | | section to force linker to define begin/end | 343 | | | | symbols defined above. | 344 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 345 | .omp_offloading.device_image | __tgt_device_image | .omp_offloading_entries | ELF device code object of the first image. | 346 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 347 | .omp_offloading.device_image.N | __tgt_device_image | .omp_offloading_entries | ELF device code object of the (N+1)th image. | 348 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 349 | .omp_offloading.device_images | __tgt_device_image | .omp_offloading_entries | Array of images. | 350 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 351 | .omp_offloading.descriptor | __tgt_bin_desc | .omp_offloading_entries | Binary descriptor object (see :ref:`binary_descriptor`) | 352 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 353 354.. _binary_descriptor: 355 356Binary Descriptor for Device Images 357^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 358 359This object is passed to the offloading runtime at program startup and it 360describes all device images available in the executable or shared library. It 361is defined as follows: 362 363.. code-block:: c 364 365 __attribute__((visibility("hidden"))) 366 extern __tgt_offload_entry *__start_omp_offloading_entries; 367 __attribute__((visibility("hidden"))) 368 extern __tgt_offload_entry *__stop_omp_offloading_entries; 369 static const char Image0[] = { <Bufs.front() contents> }; 370 ... 371 static const char ImageN[] = { <Bufs.back() contents> }; 372 static const __tgt_device_image Images[] = { 373 { 374 Image0, /*ImageStart*/ 375 Image0 + sizeof(Image0), /*ImageEnd*/ 376 __start_omp_offloading_entries, /*EntriesBegin*/ 377 __stop_omp_offloading_entries /*EntriesEnd*/ 378 }, 379 ... 380 { 381 ImageN, /*ImageStart*/ 382 ImageN + sizeof(ImageN), /*ImageEnd*/ 383 __start_omp_offloading_entries, /*EntriesBegin*/ 384 __stop_omp_offloading_entries /*EntriesEnd*/ 385 } 386 }; 387 static const __tgt_bin_desc BinDesc = { 388 sizeof(Images) / sizeof(Images[0]), /*NumDeviceImages*/ 389 Images, /*DeviceImages*/ 390 __start_omp_offloading_entries, /*HostEntriesBegin*/ 391 __stop_omp_offloading_entries /*HostEntriesEnd*/ 392 }; 393 394 395Global Constructor and Destructor 396^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 397 398The global constructor (``.omp_offloading.descriptor_reg()``) registers the 399device images with the runtime by calling the ``__tgt_register_lib()`` runtime 400function. The constructor is explicitly defined in ``.text.startup`` section and 401is run once when the program starts. Similarly, the global destructor 402(``.omp_offloading.descriptor_unreg()``) calls ``__tgt_unregister_lib()`` for 403the destructor and is also defined in ``.text.startup`` section and run when the 404program exits. 405 406Offloading Example 407------------------ 408 409This section contains a simple example of generating offloading code using 410OpenMP offloading. We will use a simple ``ZAXPY`` BLAS routine. 411 412.. code-block:: c++ 413 414 #include <complex> 415 416 using complex = std::complex<double>; 417 418 void zaxpy(complex *X, complex *Y, complex D, std::size_t N) { 419 #pragma omp target teams distribute parallel for 420 for (std::size_t i = 0; i < N; ++i) 421 Y[i] = D * X[i] + Y[i]; 422 } 423 424 int main() { 425 const std::size_t N = 1024; 426 complex X[N], Y[N], D; 427 #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N]) 428 zaxpy(X, Y, D, N); 429 } 430 431This code is compiled using the following Clang flags. 432 433.. code-block:: console 434 435 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 zaxpy.cpp -c 436 437The output section in the object file can be seen using the ``readelf`` utility. 438The ``.llvm.offloading`` section has the ``SHF_EXCLUDE`` flag so it will be 439removed from the final executable or shared library by the linker. 440 441.. code-block:: text 442 443 $ llvm-readelf -WS zaxpy.o 444 Section Headers: 445 [Nr] Name Type Address Off Size ES Flg Lk Inf Al 446 [11] omp_offloading_entries PROGBITS 0000000000000000 0001f0 000040 00 A 0 0 1 447 [12] .llvm.offloading PROGBITS 0000000000000000 000260 030950 00 E 0 0 8 448 449 450Compiling this file again will invoke the ``clang-linker-wrapper`` utility to 451extract and link the device code stored at the section named 452``.llvm.offloading`` and then use entries stored in 453the section named ``omp_offloading_entries`` to create the symbols necessary for 454``libomptarget`` to register the device image and call the entry function. 455 456.. code-block:: console 457 458 $ clang++ -fopenmp -fopenmp-targets=nvptx64 zaxpy.o -o zaxpy 459 $ ./zaxpy 460 461We can see the steps created by clang to generate the offloading code using the 462``-ccc-print-phases`` option in Clang. This matches the description in 463:ref:`Offloading Overview`. 464 465.. code-block:: console 466 467 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -ccc-print-phases zaxpy.cpp 468 # "x86_64-unknown-linux-gnu" - "clang", inputs: ["zaxpy.cpp"], output: "/tmp/zaxpy-host.bc" 469 # "nvptx64-nvidia-cuda" - "clang", inputs: ["zaxpy.cpp", "/tmp/zaxpy-e6a41b.bc"], output: "/tmp/zaxpy-07f434.s" 470 # "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/tmp/zaxpy-07f434.s"], output: "/tmp/zaxpy-0af7b7.o" 471 # "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/zaxpy-e6a41b.bc", "/tmp/zaxpy-0af7b7.o"], output: "/tmp/zaxpy-416cad.o" 472 # "x86_64-unknown-linux-gnu" - "Offload::Linker", inputs: ["/tmp/zaxpy-416cad.o"], output: "a.out" 473 474Relocatable Linking 475------------------- 476 477The offloading compilation pipeline normally will defer the final device linking 478and runtime registration until the ``clang-linker-wrapper`` is run to create the 479executable. This is the standard behaviour when compiling for OpenMP offloading 480or CUDA and HIP in ``-fgpu-rdc`` mode. However, there are some cases where the 481user may wish to perform this device handling prematurely. This is described in 482the :doc:`linker wrapper documentation<ClangLinkerWrapper>`. 483 484Effectively, this allows the user to handle offloading specific linking ahead of 485time when shipping objects or static libraries. This can be thought of as 486performing a standard ``-fno-gpu-rdc`` compilation on a subset of object files. 487This can be useful to reduce link time, prevent users from interacting with the 488library's device code, or for shipping libraries to incompatible compilers. 489 490Normally, if a relocatable link is done using ``clang -r`` it will simply merge 491the ``.llvm.offloading`` sections which will then be linked later when the 492executable is created. However, if the ``-r`` flag is used with the offloading 493toolchain, it will perform the device linking and registration phases and then 494merge the registration code into the final relocatable object file. 495 496The following example shows how using the relocatable link with the offloading 497pipeline can create a static library with offloading code that can be 498redistributed without requiring any additional handling. 499 500.. code-block:: console 501 502 $ clang++ -fopenmp -fopenmp-targets=nvptx64 foo.cpp -c 503 $ clang++ -lomptarget.devicertl --offload-link -r foo.o -o merged.o 504 $ llvm-ar rcs libfoo.a merged.o 505 # g++ app.cpp -L. -lfoo 506 507