.. raw:: html

  <style type="text/css">
    .none { background-color: #FFCCCC }
    .part { background-color: #FFFF99 }
    .good { background-color: #CCFF99 }
  </style>

.. role:: none
.. role:: part
.. role:: good

.. contents::
   :local:

=============
HIP Support
=============

HIP (Heterogeneous-Compute Interface for Portability) `<https://github.com/ROCm-Developer-Tools/HIP>`_ is
a C++ Runtime API and Kernel Language. It enables developers to create portable applications for
offloading computation to different hardware platforms from a single source code.

AMD GPU Support
===============

Clang provides HIP support on AMD GPUs via the ROCm platform `<https://rocm.docs.amd.com/en/latest/#>`_.
The ROCm runtime forms the base for HIP host APIs, while HIP device APIs are realized through HIP header
files and the ROCm device library. The Clang driver uses the HIPAMD toolchain to compile HIP device code
to AMDGPU ISA via the AMDGPU backend, or to SPIR-V via the workflow outlined below.
The compiled code is then bundled and embedded in the host executables.

Intel GPU Support
=================

Clang provides partial HIP support on Intel GPUs using the CHIP-Star project `<https://github.com/CHIP-SPV/chipStar>`_.
CHIP-Star implements the HIP runtime over the oneAPI Level Zero or OpenCL runtime. The Clang driver uses the HIPSPV
toolchain to compile HIP device code into LLVM IR, which is subsequently translated to SPIR-V via the SPIR-V
backend or the out-of-tree LLVM-SPIRV translator. The SPIR-V is then bundled and embedded into the host executables.

.. note::
   While Clang does not directly provide HIP support for NVIDIA GPUs and CPUs, these platforms are supported via other means:

   - NVIDIA GPUs: HIP support is offered through the HIP project `<https://github.com/ROCm-Developer-Tools/HIP>`_, which provides a header-only library for translating HIP runtime APIs into CUDA runtime APIs.
     The code is subsequently compiled using NVIDIA's ``nvcc``.

   - CPUs: HIP support is available through the HIP-CPU runtime library `<https://github.com/ROCm-Developer-Tools/HIP-CPU>`_. This header-only library enables CPUs to execute unmodified HIP code.


Example Usage
=============

To compile a HIP program, use the following command:

.. code-block:: shell

  clang++ -c --offload-arch=gfx906 -xhip sample.cpp -o sample.o

The ``-xhip`` option indicates that the source is a HIP program. If the file has a ``.hip`` extension,
Clang will automatically recognize it as a HIP program:

.. code-block:: shell

  clang++ -c --offload-arch=gfx906 sample.hip -o sample.o

To link a HIP program, use this command:

.. code-block:: shell

  clang++ --hip-link --offload-arch=gfx906 sample.o -o sample

In the above command, the ``--hip-link`` flag instructs Clang to link the HIP runtime library. However,
this flag is unnecessary if a HIP input file is already present among the inputs.

For convenience, Clang also supports compiling and linking in a single step:

.. code-block:: shell

  clang++ --offload-arch=gfx906 -xhip sample.cpp -o sample

In the above commands, ``gfx906`` is the GPU architecture that the code is being compiled for. The supported GPU
architectures can be found in the `AMDGPU Processor Table <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
Alternatively, you can use the ``amdgpu-arch`` tool that comes with Clang to list the GPU architectures on your system:

.. code-block:: shell

  amdgpu-arch

You can use ``--offload-arch=native`` to automatically detect the GPU architectures on your system:

.. code-block:: shell

  clang++ --offload-arch=native -xhip sample.cpp -o sample


Path Setting for Dependencies
=============================

Compiling a HIP program depends on the HIP runtime and device library.
The paths to the HIP runtime and device libraries
can be specified either using compiler options or environment variables. The paths can also be set through the ROCm path
if they follow the ROCm installation directory structure.

Order of Precedence for HIP Path
--------------------------------

1. ``--hip-path`` compiler option
2. ``HIP_PATH`` environment variable *(use with caution)*
3. ``--rocm-path`` compiler option
4. ``ROCM_PATH`` environment variable *(use with caution)*
5. Default automatic detection (relative to Clang or at the default ROCm installation location)

Order of Precedence for Device Library Path
-------------------------------------------

1. ``--hip-device-lib-path`` compiler option
2. ``HIP_DEVICE_LIB_PATH`` environment variable *(use with caution)*
3. ``--rocm-path`` compiler option
4. ``ROCM_PATH`` environment variable *(use with caution)*
5. Default automatic detection (relative to Clang or at the default ROCm installation location)

.. list-table::
   :header-rows: 1

   * - Compiler Option
     - Environment Variable
     - Description
     - Default Value
   * - ``--rocm-path=<path>``
     - ``ROCM_PATH``
     - Specifies the ROCm installation path.
     - Automatic detection
   * - ``--hip-path=<path>``
     - ``HIP_PATH``
     - Specifies the HIP runtime installation path.
     - Determined by ROCm directory structure
   * - ``--hip-device-lib-path=<path>``
     - ``HIP_DEVICE_LIB_PATH``
     - Specifies the HIP device library installation path.
     - Determined by ROCm directory structure

.. note::

   We recommend using the compiler options as the primary method for specifying these paths. While the environment variables ``ROCM_PATH``, ``HIP_PATH``, and ``HIP_DEVICE_LIB_PATH`` are supported, their use can lead to implicit dependencies that might cause issues in the long run. Use them with caution.


Predefined Macros
=================

.. list-table::
   :header-rows: 1

   * - Macro
     - Description
   * - ``__CLANG_RDC__``
     - Defined when Clang is compiling code in Relocatable Device Code (RDC) mode. RDC, enabled with the ``-fgpu-rdc`` compiler option, is necessary for linking device code across translation units.
   * - ``__HIP__``
     - Defined when compiling with HIP language support, indicating that the code targets the HIP environment.
   * - ``__HIPCC__``
     - Alias to ``__HIP__``.
   * - ``__HIP_DEVICE_COMPILE__``
     - Defined during device code compilation in Clang's separate compilation process for the host and each offloading GPU architecture.
   * - ``__HIP_MEMORY_SCOPE_SINGLETHREAD``
     - Represents single-thread memory scope in HIP (value is 1).
   * - ``__HIP_MEMORY_SCOPE_WAVEFRONT``
     - Represents wavefront memory scope in HIP (value is 2).
   * - ``__HIP_MEMORY_SCOPE_WORKGROUP``
     - Represents workgroup memory scope in HIP (value is 3).
   * - ``__HIP_MEMORY_SCOPE_AGENT``
     - Represents agent memory scope in HIP (value is 4).
   * - ``__HIP_MEMORY_SCOPE_SYSTEM``
     - Represents system-wide memory scope in HIP (value is 5).
   * - ``__HIP_NO_IMAGE_SUPPORT__``
     - Defined with a value of 1 when the target device lacks support for HIP image functions.
   * - ``__HIP_NO_IMAGE_SUPPORT``
     - Alias to ``__HIP_NO_IMAGE_SUPPORT__``. Deprecated.
   * - ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``
     - Defined when the GPU default stream is set to per-thread mode.
   * - ``HIP_API_PER_THREAD_DEFAULT_STREAM``
     - Alias to ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``. Deprecated.

Note that some architecture specific :doc:`AMDGPU macros <AMDGPUSupport>` will have default
values during HIP host compilation; for example, ``__AMDGCN_WAVEFRONT_SIZE__`` (deprecated)
defaults to 64.

Compilation Modes
=================

Each HIP source file contains intertwined device and host code.
Depending on the compilation mode chosen via the compiler options ``-fno-gpu-rdc`` and ``-fgpu-rdc``, these portions of code are compiled differently.

Device Code Compilation
-----------------------

**``-fno-gpu-rdc`` Mode (default)**:

- Compiles to a self-contained, fully linked offloading device binary for each offloading device architecture.
- Device code within a Translation Unit (TU) cannot call functions located in another TU.

**``-fgpu-rdc`` Mode**:

- Compiles to bitcode for each GPU architecture.
- For each offloading device architecture, the bitcode from different TUs is linked together to create a single offloading device binary.
- Device code in one TU can call functions located in another TU.

Host Code Compilation
---------------------

**Both Modes**:

- Compiles to a relocatable object for each TU.
- These relocatable objects are then linked together.
- Host code within a TU can call host functions and launch kernels from another TU.

Syntax Difference with CUDA
===========================

Clang's front end, used for both the CUDA and HIP programming models, shares the same parsing and semantic analysis mechanisms. This includes the resolution of overloads concerning device and host functions. While there exists comprehensive documentation on the syntax differences between Clang and NVCC for CUDA at `Dialect Differences Between Clang and NVCC <https://llvm.org/docs/CompileCudaWithLLVM.html#dialect-differences-between-clang-and-nvcc>`_, it is important to note that these differences also apply to HIP code compilation.

Predefined Macros for Differentiation
-------------------------------------

To facilitate differentiation between HIP and CUDA code, as well as between device and host compilations within HIP, Clang defines specific macros:

- ``__HIP__`` : This macro is defined only when compiling HIP code.
  It can be used to conditionally compile code specific to HIP, enabling developers to write portable code that can be compiled for both CUDA and HIP.

- ``__HIP_DEVICE_COMPILE__`` : Defined exclusively during HIP device compilation, this macro allows for conditional compilation of device-specific code. It provides a mechanism to segregate device and host code, ensuring that each can be optimized for its respective execution environment.

Function Pointers Support
=========================

Support for function pointers varies with the usage mode in Clang with HIP. The following table provides an overview of the support status across different use-cases and modes.

.. list-table:: Function Pointers Support Overview
   :widths: 25 25 25
   :header-rows: 1

   * - Use Case
     - ``-fno-gpu-rdc`` Mode (default)
     - ``-fgpu-rdc`` Mode
   * - Defined and used in the same TU
     - Supported
     - Supported
   * - Defined in one TU and used in another TU
     - Not Supported
     - Supported

In the ``-fno-gpu-rdc`` mode, the compiler calculates the resource usage of kernels based only on functions present within the same TU. This mode does not support the use of function pointers defined in a different TU because of the possibility of incorrect resource usage calculations, leading to undefined behavior.

On the other hand, the ``-fgpu-rdc`` mode allows the definition and use of function pointers across different TUs, as resource usage calculations can accommodate functions from disparate TUs.

Virtual Function Support
========================

In Clang with HIP, support for calling virtual functions of an object in device or host code is contingent on where the object is constructed.

- **Constructed in Device Code**: Virtual functions of an object can be called in device code on a specific offloading device if the object is constructed in device code on an offloading device with the same architecture.
- **Constructed in Host Code**: Virtual functions of an object can be called in host code if the object is constructed in host code.

In other scenarios, calling virtual functions is not allowed.

Explanation
-----------

An object constructed on the device side contains a pointer to the virtual function table on the device side, which is not accessible in host code, and vice versa. Thus, trying to invoke virtual functions from a context different from where the object was constructed is disallowed because the appropriate virtual table cannot be accessed. The virtual function tables for offloading devices with different architectures are different, therefore trying to invoke virtual functions from an offloading device with a different architecture than the one where the object was constructed is also disallowed.

Example Usage
-------------

.. code-block:: c++

  class Base {
  public:
    __device__ virtual void virtualFunction() {
      // Base virtual function implementation
    }
  };

  class Derived : public Base {
  public:
    __device__ void virtualFunction() override {
      // Derived virtual function implementation
    }
  };

  __global__ void kernel() {
    Derived obj;
    Base* basePtr = &obj;
    basePtr->virtualFunction(); // Allowed since obj is constructed in device code
  }

C++ Standard Parallelism Offload Support: Compiler And Runtime
==============================================================

Introduction
============

This section describes the implementation of support for offloading the
execution of standard C++ algorithms to accelerators that can be targeted via
HIP. Furthermore, it enumerates restrictions on user defined code, as well as
the interactions with runtimes.

Algorithm Offload: What, Why, Where
===================================

C++17 introduced overloads
`for most algorithms in the standard library <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0024r2.html>`_
which allow the user to specify a desired
`execution policy <https://en.cppreference.com/w/cpp/algorithm#Execution_policies>`_.
The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t>`_
maps relatively well to the execution model of AMD GPUs. This, coupled with the
availability and maturity of GPU accelerated algorithm libraries that
implement most / all corresponding algorithms in the standard library
(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`__), makes
it feasible to provide seamless accelerator offload for supported algorithms,
when an accelerated version exists. Thus, it becomes possible to easily access
the computational resources of an AMD accelerator, via a well specified,
familiar, algorithmic interface, without having to delve into low-level hardware
specific details. Putting it all together:

- **What**: standard library algorithms, when invoked with the
  ``parallel_unsequenced_policy``
- **Why**: democratise AMDGPU accelerator programming, without loss of user
  familiarity
- **Where**: only AMDGPU accelerators targeted by Clang/LLVM via HIP

Small Example
=============

Given the following C++ code:

.. code-block:: C++

  bool has_the_answer(const std::vector<int>& v) {
    return std::find(std::execution::par_unseq, std::cbegin(v), std::cend(v), 42) != std::cend(v);
  }

if Clang is invoked with the ``--hipstdpar --offload-arch=foo`` flags, the call
to ``find`` will be offloaded to an accelerator that is part of the ``foo``
target family. If either ``foo`` or its runtime environment does not support
transparent on-demand paging (such as that provided on Linux via
`HMM <https://docs.kernel.org/mm/hmm.html>`_), it is necessary to also include
the ``--hipstdpar-interpose-alloc`` flag. If the accelerator specific algorithm
library used by ``foo`` doesn't have an implementation of a particular algorithm,
execution seamlessly falls back to the host CPU. It is legal to specify multiple
``--offload-arch``\s. All the flags we introduce, as well as a thorough view of
the various restrictions and their implementation, will be provided below.

Implementation - General View
=============================

We built Algorithm Offload support atop the pre-existing HIP
infrastructure. More specifically, when one requests offload via ``--hipstdpar``,
compilation is switched to HIP compilation, as if ``-x hip`` had been specified.
Similarly, linking is also switched to HIP linking, as if ``--hip-link`` had been
specified. Note that these are implicit, and one should not assume that any
interop with HIP specific language constructs is available; e.g. ``__device__``
annotations are neither necessary nor guaranteed to work.

Since there are no language restriction mechanisms in place, it is necessary to
relax HIP language specific semantic checks performed by the FE; they would
identify otherwise valid, offloadable code as invalid HIP code. Given that we
know that the user intended only for certain algorithms to be offloaded, and
encoded this by specifying the ``parallel_unsequenced_policy``, we rely on a
pass over IR to clean up any and all code that was not "meant" for offload. If
requested, allocation interposition is also handled via a separate pass over IR.

To interface with the client HIP runtime, and to forward offloaded algorithm
invocations to the corresponding accelerator specific library implementation, an
implementation detail forwarding header is implicitly included by the driver
when compiling with ``--hipstdpar``.
In what follows, we will delve into each
component that contributes to implementing Algorithm Offload support.

Implementation - Driver
=======================

We augment the ``clang`` driver with the following flags:

- ``--hipstdpar`` enables algorithm offload, which, depending on the phase, has
  the following effects:

  - when compiling:

    - ``-x hip`` gets prepended to enable HIP support;
    - the ``ROCmToolchain`` component checks for the ``hipstdpar_lib.hpp``
      forwarding header,
      `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_ and
      `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ in
      their canonical locations, which can be overridden via the flags found
      below; if all are found, the forwarding header gets implicitly included,
      otherwise an error listing the missing component is generated;
    - the ``LangOpts.HIPStdPar`` member is set.

  - when linking:

    - ``--hip-link`` and ``-frtlib-add-rpath`` get appended to enable HIP
      support.

- ``--hipstdpar-interpose-alloc`` enables the interposition of standard
  allocation / deallocation functions with accelerator aware equivalents; the
  ``LangOpts.HIPStdParInterposeAlloc`` member is set;
- ``--hipstdpar-path=`` specifies a non-canonical path for the forwarding
  header; it must point to the folder where the header is located and not to the
  header itself;
- ``--hipstdpar-thrust-path=`` specifies a non-canonical path for
  `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_; it
  must point to the folder where the library is installed / built under a
  ``/thrust`` subfolder;
- ``--hipstdpar-prim-path=`` specifies a non-canonical path for
  `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_; it must
  point to the folder where the library is installed / built under a
  ``/rocprim`` subfolder.

The `--offload-arch <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
flag can be used to specify the accelerator for which offload code is to be
generated.

Implementation - Front-End
==========================

When ``LangOpts.HIPStdPar`` is set, we relax some of the HIP language specific
``Sema`` checks to account for the fact that we want to consume pure unannotated
C++ code:

1. ``__device__`` / ``__host__ __device__`` functions (which would originate in
   the accelerator specific algorithm library) are allowed to call implicitly
   ``__host__`` functions;
2. ``__global__`` functions (which would originate in the accelerator specific
   algorithm library) are allowed to call implicitly ``__host__`` functions;
3. resolving ``__builtin`` availability is deferred, because it is possible that
   a ``__builtin`` that is unavailable on the target accelerator is not
   reachable from any offloaded algorithm, and thus will be safely removed in
   the middle-end;
4. ASM parsing / checking is deferred, because it is possible that an ASM block
   that e.g. uses some constraints that are incompatible with the target
   accelerator is not reachable from any offloaded algorithm, and thus will be
   safely removed in the middle-end.

``CodeGen`` is similarly relaxed, with implicitly ``__host__`` functions being
emitted as well.

Implementation - Middle-End
===========================

We add two ``opt`` passes:

1. ``HipStdParAcceleratorCodeSelectionPass``

   - For all kernels in a ``Module``, compute reachability, where a function
     ``F`` is reachable from a kernel ``K`` if and only if there exists a direct
     call-chain rooted in ``K`` that includes ``F``;
   - Remove all functions that are not reachable from kernels;
   - This pass is only run when compiling for the accelerator.

The first pass assumes that the only code that the user intended to offload was
that which was directly or transitively invocable as part of an algorithm
execution. It also assumes that an accelerator aware algorithm implementation
would rely on accelerator specific special functions (kernels), and that these
effectively constitute the only roots for accelerator execution graphs. Both of
these assumptions are based on observing how widespread accelerators,
such as GPUs, work.

2. ``HipStdParAllocationInterpositionPass``

   - Iterate through all functions in a ``Module``, and replace standard
     allocation / deallocation functions with accelerator-aware equivalents,
     based on a pre-established table; the list of functions that can be
     interposed is available
     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`__;
   - This pass is only run when compiling for the host.

The second pass is optional.

Implementation - Forwarding Header
==================================

The forwarding header implements two pieces of functionality:

1. It forwards algorithms to a target accelerator, which is done by relying on
   C++ language rules around overloading:

   - overloads taking an explicit argument of type
     ``parallel_unsequenced_policy`` are introduced into the ``std`` namespace;
   - these will get preferentially selected versus the master template;
   - the body forwards to the equivalent algorithm from the accelerator specific
     library.

2. It provides allocation / deallocation functions that are equivalent to the
   standard ones, but obtain memory by invoking
   `hipMallocManaged <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html#gab8cfa0e292193fa37e0cc2e4911fa90a>`_
   and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.

Predefined Macros
=================

.. list-table::
   :header-rows: 1

   * - Macro
     - Description
   * - ``__HIPSTDPAR__``
     - Defined when Clang is compiling code in algorithm offload mode, enabled
       with the ``--hipstdpar`` compiler option.
   * - ``__HIPSTDPAR_INTERPOSE_ALLOC__``
     - Defined only when compiling in algorithm offload mode, when the user
       enables interposition mode with the ``--hipstdpar-interpose-alloc``
       compiler option, indicating that all dynamic memory allocation /
       deallocation functions should be replaced with accelerator aware
       variants.

Restrictions
============

We define two modes in which runtime execution can occur:

1. **HMM Mode** - this assumes that the
   `HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
   is used to provide transparent on-demand paging, i.e.
   memory obtained from a
   system / OS allocator such as via a call to ``malloc`` or ``operator new`` is
   directly accessible to the accelerator and follows the C++ memory model;
2. **Interposition Mode** - this is a fallback mode for cases where transparent
   on-demand paging is unavailable (e.g. on the Windows OS), which means that
   memory must be allocated via an accelerator aware mechanism, and system
   allocated memory is inaccessible to the accelerator.

The following restrictions imposed on user code apply to both modes:

1. Pointers to function, and all associated features, such as e.g. dynamic
   polymorphism, cannot be used (directly or transitively) by the user provided
   callable passed to an algorithm invocation;
2. Global / namespace scope / ``static`` / ``thread`` storage duration variables
   cannot be used (directly or transitively) by name by the user provided
   callable;

   - When executing in **HMM Mode** they can be used by address, e.g.:

     .. code-block:: C++

       namespace { int foo = 42; }

       bool never(const std::vector<int>& v) {
         return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v), [](auto&& x) {
           return x == foo;
         });
       }

       bool only_in_hmm_mode(const std::vector<int>& v) {
         return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
                            [p = &foo](auto&& x) { return x == *p; });
       }

3. Only algorithms that are invoked with the ``parallel_unsequenced_policy`` are
   candidates for offload;
4. Only algorithms that are invoked with iterator arguments that model
   `random_access_iterator <https://en.cppreference.com/w/cpp/iterator/random_access_iterator>`_
   are candidates for offload;
5. `Exceptions <https://en.cppreference.com/w/cpp/language/exceptions>`_ cannot
   be used by the user provided callable;
6. Dynamic memory allocation (e.g.
   ``operator new``) cannot be used by the user
   provided callable;
7. Selective offload is not possible, i.e. it is not possible to indicate that
   only some algorithms invoked with the ``parallel_unsequenced_policy`` are to
   be executed on the accelerator.

In addition to the above, using **Interposition Mode** imposes the following
additional restrictions:

1. All code that is expected to interoperate has to be recompiled with the
   ``--hipstdpar-interpose-alloc`` flag, i.e. it is not safe to compose libraries
   that have been independently compiled;
2. Automatic storage duration (i.e. stack allocated) variables cannot be used
   (directly or transitively) by the user provided callable, e.g.:

   .. code-block:: c++

     bool never(const std::vector<int>& v, int n) {
       return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
                          [p = &n](auto&& x) { return x == *p; });
     }

Current Support
===============

At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,
when the `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack is used, on the
Linux operating system. Support is summarised in the following table:

.. list-table::
   :header-rows: 1

   * - `Processor <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
     - HMM Mode
     - Interposition Mode
   * - GCN GFX9 (Vega)
     - YES
     - YES
   * - GCN GFX10.1 (RDNA 1)
     - *NO*
     - YES
   * - GCN GFX10.3 (RDNA 2)
     - *NO*
     - YES
   * - GCN GFX11 (RDNA 3)
     - *NO*
     - YES
   * - GCN GFX12 (RDNA 4)
     - *NO*
     - YES

The minimum Linux kernel version for running in HMM mode is 6.4.

The forwarding header can be obtained from
`its GitHub repository <https://github.com/ROCmSoftwarePlatform/roc-stdpar>`_.
It will be packaged with a future `ROCm <https://rocm.docs.amd.com/en/latest/>`_
release.
Because accelerated algorithms are provided via
`rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_, a
transitive dependency on
`rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ exists. Both
can be obtained either by installing their associated components of the
`ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack, or from their respective
repositories. The list of algorithms that can be offloaded is available
`here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#algorithm-support-status>`_.

HIP Specific Elements
---------------------

1. There is no defined interop with the
   `HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
   whilst things like using ``__device__`` annotations might accidentally "work",
   they are not guaranteed to, and thus cannot be relied upon by user code;

   - A consequence of the above is that both bitcode linking and linking
     relocatable object files will "work", but this is neither guaranteed to
     remain working nor actively tested at the moment; this restriction might be
     relaxed in the future.

2. Combining explicit HIP, CUDA or OpenMP Offload compilation with
   ``--hipstdpar`` based offloading is not allowed or supported in any way.
3. There is no way to target different accelerators via a standard algorithm
   invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
   an unsafe (per the point above) way of achieving this is to spawn new threads
   and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
   interface e.g.:

   .. code-block:: c++

     int accelerator_0 = ...;
     int accelerator_1 = ...;

     bool multiple_accelerators(const std::vector<int>& u, const std::vector<int>& v) {
       std::atomic<unsigned int> r{0u};

       std::thread t0{[&]() {
         hipSetDevice(accelerator_0);

         r += std::count(std::execution::par_unseq, std::cbegin(u), std::cend(u), 42);
       }};
       std::thread t1{[&]() {
         hipSetDevice(accelerator_1);

         r += std::count(std::execution::par_unseq, std::cbegin(v), std::cend(v), 314152);
       }};

       t0.join();
       t1.join();

       return r;
     }

   Note that this is a temporary, unsafe workaround for a deficiency in the C++
   Standard.

Open Questions / Future Developments
====================================

1. The restriction on the use of global / namespace scope / ``static`` /
   ``thread`` storage duration variables in offloaded algorithms will be lifted
   in the future, when running in **HMM Mode**;
2. The restriction on the use of dynamic memory allocation in offloaded
   algorithms will be lifted in the future;
3. The restriction on the use of pointers to function, and associated features
   such as dynamic polymorphism, might be lifted in the future, when running in
   **HMM Mode**;
4. Offload support might be extended to cases where the ``parallel_policy`` is
   used for some or all targets.

SPIR-V Support on HIPAMD ToolChain
==================================

The HIPAMD ToolChain supports targeting
`AMDGCN Flavoured SPIR-V <https://llvm.org/docs/SPIRVUsage.html#target-triples>`_.
The support for SPIR-V in the ROCm and HIPAMD ToolChain is under active
development.
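
For instance, assuming a working ROCm installation, a HIP program can be
compiled for AMDGCN flavoured SPIR-V via the ``--offload-arch=amdgcnspirv``
flag described below (the file name ``sample.cpp`` is illustrative):

```shell
clang++ --offload-arch=amdgcnspirv -xhip sample.cpp -o sample
```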

Compilation Process
-------------------

When compiling HIP programs with the intent of utilizing SPIR-V, the process
diverges from the traditional compilation flow:

Using ``--offload-arch=amdgcnspirv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Target Triple**: The ``--offload-arch=amdgcnspirv`` flag instructs the
  compiler to use the target triple ``spirv64-amd-amdhsa``. This approach
  generates generic AMDGCN SPIR-V which retains architecture specific elements
  without hardcoding them, thus allowing for optimal target specific code to be
  generated at run time, when the concrete target is known.

- **LLVM IR Translation**: The program is compiled to LLVM Intermediate
  Representation (IR), which is subsequently translated into SPIR-V. In the
  future, this translation step will be replaced by direct SPIR-V emission via
  the SPIR-V Back-end.

- **Clang Offload Bundler**: The resulting SPIR-V is embedded in the Clang
  offload bundler with the bundle ID ``hip-spirv64-amd-amdhsa--amdgcnspirv``.

Architecture Specific Macros
----------------------------

None of the architecture specific :doc:`AMDGPU macros <AMDGPUSupport>` are
defined when targeting SPIR-V. An alternative, more flexible mechanism enabling
per target / per feature code selection will be added in the future.