.. raw:: html

  <style type="text/css">
    .none { background-color: #FFCCCC }
    .part { background-color: #FFFF99 }
    .good { background-color: #CCFF99 }
  </style>

.. role:: none
.. role:: part
.. role:: good

.. contents::
   :local:

=============
HIP Support
=============

HIP (Heterogeneous-Compute Interface for Portability) `<https://github.com/ROCm-Developer-Tools/HIP>`_ is
a C++ Runtime API and Kernel Language. It enables developers to create portable applications for
offloading computation to different hardware platforms from a single source.

AMD GPU Support
===============

Clang provides HIP support on AMD GPUs via the ROCm platform `<https://rocm.docs.amd.com/en/latest/#>`_.
The ROCm runtime forms the base for the HIP host APIs, while the HIP device APIs are realized through HIP header
files and the ROCm device library. The Clang driver uses the HIPAMD toolchain to compile HIP device code
to AMDGPU ISA via the AMDGPU backend, or to SPIR-V via the workflow outlined below.
The compiled code is then bundled and embedded in the host executables.

Intel GPU Support
=================

Clang provides partial HIP support on Intel GPUs using the CHIP-Star project `<https://github.com/CHIP-SPV/chipStar>`_.
CHIP-Star implements the HIP runtime over the oneAPI Level Zero or OpenCL runtimes. The Clang driver uses the HIPSPV
toolchain to compile HIP device code into LLVM IR, which is subsequently translated to SPIR-V via the SPIR-V
backend or the out-of-tree LLVM-SPIRV translator. The SPIR-V is then bundled and embedded into the host executables.

.. note::
   While Clang does not directly provide HIP support for NVIDIA GPUs and CPUs, these platforms are supported via other means:

   - NVIDIA GPUs: HIP support is offered through the HIP project `<https://github.com/ROCm-Developer-Tools/HIP>`_, which provides a header-only library for translating HIP runtime APIs into CUDA runtime APIs. The code is subsequently compiled using NVIDIA's `nvcc`.

   - CPUs: HIP support is available through the HIP-CPU runtime library `<https://github.com/ROCm-Developer-Tools/HIP-CPU>`_. This header-only library enables CPUs to execute unmodified HIP code.


Example Usage
=============

To compile a HIP program, use the following command:

.. code-block:: shell

   clang++ -c --offload-arch=gfx906 -xhip sample.cpp -o sample.o

The ``-xhip`` option indicates that the source is a HIP program. If the file has a ``.hip`` extension,
Clang will automatically recognize it as a HIP program:

.. code-block:: shell

   clang++ -c --offload-arch=gfx906 sample.hip -o sample.o

To link a HIP program, use this command:

.. code-block:: shell

   clang++ --hip-link --offload-arch=gfx906 sample.o -o sample

In the above command, the ``--hip-link`` flag instructs Clang to link the HIP runtime library. However,
this flag is unnecessary if a HIP input file is already present among the inputs.

For convenience, Clang also supports compiling and linking in a single step:

.. code-block:: shell

   clang++ --offload-arch=gfx906 -xhip sample.cpp -o sample

In the above commands, ``gfx906`` is the GPU architecture that the code is being compiled for. The supported GPU
architectures can be found in the `AMDGPU Processor Table <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
Alternatively, you can use the ``amdgpu-arch`` tool that comes with Clang to list the GPU architectures on your system:

.. code-block:: shell

   amdgpu-arch

You can use ``--offload-arch=native`` to automatically detect the GPU architectures on your system:

.. code-block:: shell

   clang++ --offload-arch=native -xhip sample.cpp -o sample


Path Setting for Dependencies
=============================

Compiling a HIP program depends on the HIP runtime and the device library. The paths to the HIP runtime and device libraries
can be specified either using compiler options or environment variables. The paths can also be derived from the ROCm path
if they follow the ROCm installation directory structure.

Order of Precedence for HIP Path
--------------------------------

1. ``--hip-path`` compiler option
2. ``HIP_PATH`` environment variable *(use with caution)*
3. ``--rocm-path`` compiler option
4. ``ROCM_PATH`` environment variable *(use with caution)*
5. Default automatic detection (relative to Clang or at the default ROCm installation location)

Order of Precedence for Device Library Path
-------------------------------------------

1. ``--hip-device-lib-path`` compiler option
2. ``HIP_DEVICE_LIB_PATH`` environment variable *(use with caution)*
3. ``--rocm-path`` compiler option
4. ``ROCM_PATH`` environment variable *(use with caution)*
5. Default automatic detection (relative to Clang or at the default ROCm installation location)
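
For example, if ROCm is installed under ``/opt/rocm`` (the installation prefix
here is illustrative), the path can be specified explicitly:

.. code-block:: shell

   clang++ --rocm-path=/opt/rocm --offload-arch=gfx906 -xhip sample.cpp -o sample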

.. list-table::
   :header-rows: 1

   * - Compiler Option
     - Environment Variable
     - Description
     - Default Value
   * - ``--rocm-path=<path>``
     - ``ROCM_PATH``
     - Specifies the ROCm installation path.
     - Automatic detection
   * - ``--hip-path=<path>``
     - ``HIP_PATH``
     - Specifies the HIP runtime installation path.
     - Determined by ROCm directory structure
   * - ``--hip-device-lib-path=<path>``
     - ``HIP_DEVICE_LIB_PATH``
     - Specifies the HIP device library installation path.
     - Determined by ROCm directory structure

.. note::

   We recommend using the compiler options as the primary method for specifying these paths. While the environment variables ``ROCM_PATH``, ``HIP_PATH``, and ``HIP_DEVICE_LIB_PATH`` are supported, their use can lead to implicit dependencies that might cause issues in the long run. Use them with caution.


Predefined Macros
=================

.. list-table::
   :header-rows: 1

   * - Macro
     - Description
   * - ``__CLANG_RDC__``
     - Defined when Clang is compiling code in Relocatable Device Code (RDC) mode. RDC, enabled with the ``-fgpu-rdc`` compiler option, is necessary for linking device code across translation units.
   * - ``__HIP__``
     - Defined when compiling with HIP language support, indicating that the code targets the HIP environment.
   * - ``__HIPCC__``
     - Alias to ``__HIP__``.
   * - ``__HIP_DEVICE_COMPILE__``
     - Defined during device code compilation in Clang's separate compilation process for the host and each offloading GPU architecture.
   * - ``__HIP_MEMORY_SCOPE_SINGLETHREAD``
     - Represents single-thread memory scope in HIP (value is 1).
   * - ``__HIP_MEMORY_SCOPE_WAVEFRONT``
     - Represents wavefront memory scope in HIP (value is 2).
   * - ``__HIP_MEMORY_SCOPE_WORKGROUP``
     - Represents workgroup memory scope in HIP (value is 3).
   * - ``__HIP_MEMORY_SCOPE_AGENT``
     - Represents agent memory scope in HIP (value is 4).
   * - ``__HIP_MEMORY_SCOPE_SYSTEM``
     - Represents system-wide memory scope in HIP (value is 5).
   * - ``__HIP_NO_IMAGE_SUPPORT__``
     - Defined with a value of 1 when the target device lacks support for HIP image functions.
   * - ``__HIP_NO_IMAGE_SUPPORT``
     - Alias to ``__HIP_NO_IMAGE_SUPPORT__``. Deprecated.
   * - ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``
     - Defined when the GPU default stream is set to per-thread mode.
   * - ``HIP_API_PER_THREAD_DEFAULT_STREAM``
     - Alias to ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``. Deprecated.

Note that some architecture-specific :doc:`AMDGPU macros <AMDGPUSupport>` have
default values when used in HIP host compilation; for example,
``__AMDGCN_WAVEFRONT_SIZE__`` (deprecated) defaults to 64.
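
The memory scope macros are typically passed as the synchronization scope
argument of Clang's ``__hip_atomic_*`` builtins. A minimal sketch of device code
using agent (device-wide) scope:

.. code-block:: c++

   __device__ void increment_counter(int *counter) {
     // Relaxed atomic increment that is visible to all threads running on
     // the same agent (GPU device).
     __hip_atomic_fetch_add(counter, 1, __ATOMIC_RELAXED,
                            __HIP_MEMORY_SCOPE_AGENT);
   }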

Compilation Modes
=================

Each HIP source file contains intertwined device and host code. Depending on the compilation mode chosen via the compiler options ``-fno-gpu-rdc`` and ``-fgpu-rdc``, these portions of code are compiled differently.

Device Code Compilation
-----------------------

**``-fno-gpu-rdc`` Mode (default)**:

- Compiles to a self-contained, fully linked offloading device binary for each offloading device architecture.
- Device code within a Translation Unit (TU) cannot call functions located in another TU.

**``-fgpu-rdc`` Mode**:

- Compiles to bitcode for each GPU architecture.
- For each offloading device architecture, the bitcode from different TUs is linked together to create a single offloading device binary.
- Device code in one TU can call functions located in another TU.
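
For example, a program split across two translation units, where device code in
one TU calls a device function defined in the other, can be built in RDC mode as
follows (file names are illustrative):

.. code-block:: shell

   clang++ -c -fgpu-rdc --offload-arch=gfx906 kernels.hip -o kernels.o
   clang++ -c -fgpu-rdc --offload-arch=gfx906 utils.hip -o utils.o
   clang++ -fgpu-rdc --hip-link --offload-arch=gfx906 kernels.o utils.o -o app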

Host Code Compilation
---------------------

**Both Modes**:

- Compiles to a relocatable object for each TU.
- These relocatable objects are then linked together.
- Host code within a TU can call host functions and launch kernels from another TU.

Syntax Difference with CUDA
===========================

Clang's front end, used for both the CUDA and HIP programming models, shares the same parsing and semantic analysis mechanisms. This includes the resolution of overloads concerning device and host functions. Comprehensive documentation on the syntax differences between Clang and NVCC for CUDA is available at `Dialect Differences Between Clang and NVCC <https://llvm.org/docs/CompileCudaWithLLVM.html#dialect-differences-between-clang-and-nvcc>`_; note that these differences also apply to HIP code compilation.

Predefined Macros for Differentiation
-------------------------------------

To facilitate differentiation between HIP and CUDA code, as well as between device and host compilations within HIP, Clang defines specific macros:

- ``__HIP__`` : This macro is defined only when compiling HIP code. It can be used to conditionally compile code specific to HIP, enabling developers to write portable code that can be compiled for both CUDA and HIP.

- ``__HIP_DEVICE_COMPILE__`` : Defined exclusively during HIP device compilation, this macro allows for conditional compilation of device-specific code. It provides a mechanism to segregate device and host code, ensuring that each can be optimized for its respective execution environment.
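
A minimal sketch of using these macros to select code paths:

.. code-block:: c++

   #if defined(__HIP_DEVICE_COMPILE__)
     // Active only while Clang generates device code for an offload target.
   #elif defined(__HIP__)
     // Active during the host pass of a HIP compilation.
   #else
     // Active for non-HIP compilations (e.g. CUDA or plain C++).
   #endif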

Function Pointers Support
=========================

Support for function pointers varies with the usage mode in Clang with HIP. The following table provides an overview of the support status across different use cases and modes.

.. list-table:: Function Pointers Support Overview
   :widths: 25 25 25
   :header-rows: 1

   * - Use Case
     - ``-fno-gpu-rdc`` Mode (default)
     - ``-fgpu-rdc`` Mode
   * - Defined and used in the same TU
     - Supported
     - Supported
   * - Defined in one TU and used in another TU
     - Not Supported
     - Supported

In the ``-fno-gpu-rdc`` mode, the compiler calculates the resource usage of kernels based only on functions present within the same TU. This mode does not support the use of function pointers defined in a different TU, since that could lead to incorrect resource usage calculations, resulting in undefined behavior.

On the other hand, the ``-fgpu-rdc`` mode allows the definition and use of function pointers across different TUs, as resource usage calculations can accommodate functions from disparate TUs.
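
As a sketch of the cross-TU case (file names illustrative), the following is
only valid when both files are compiled with ``-fgpu-rdc``:

.. code-block:: c++

   // decl.hip -- defines a device function and a device function pointer to it.
   __device__ int square(int x) { return x * x; }
   __device__ int (*op)(int) = square;

   // use.hip -- calls through the function pointer defined in the other TU.
   extern __device__ int (*op)(int);

   __global__ void kernel(int *out) {
     *out = op(4); // supported only in -fgpu-rdc mode
   }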

Virtual Function Support
========================

In Clang with HIP, support for calling virtual functions of an object in device or host code is contingent on where the object is constructed.

- **Constructed in Device Code**: Virtual functions of an object can be called in device code on a specific offloading device if the object is constructed in device code on an offloading device with the same architecture.
- **Constructed in Host Code**: Virtual functions of an object can be called in host code if the object is constructed in host code.

In other scenarios, calling virtual functions is not allowed.

Explanation
-----------

An object constructed on the device side contains a pointer to the device-side virtual function table, which is not accessible in host code, and vice versa. Thus, invoking virtual functions from a context different from the one in which the object was constructed is disallowed, because the appropriate virtual table cannot be accessed. The virtual function tables of offloading devices with different architectures also differ, so invoking virtual functions from an offloading device with a different architecture than the one on which the object was constructed is likewise disallowed.

Example Usage
-------------

.. code-block:: c++

   class Base {
   public:
      __device__ virtual void virtualFunction() {
         // Base virtual function implementation
      }
   };

   class Derived : public Base {
   public:
      __device__ void virtualFunction() override {
         // Derived virtual function implementation
      }
   };

   __global__ void kernel() {
      Derived obj;
      Base* basePtr = &obj;
      basePtr->virtualFunction(); // Allowed since obj is constructed in device code
   }

C++ Standard Parallelism Offload Support: Compiler And Runtime
==============================================================

Introduction
============

This section describes the implementation of support for offloading the
execution of standard C++ algorithms to accelerators that can be targeted via
HIP. Furthermore, it enumerates restrictions on user-defined code, as well as
the interactions with runtimes.

Algorithm Offload: What, Why, Where
===================================

C++17 introduced overloads
`for most algorithms in the standard library <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0024r2.html>`_
which allow the user to specify a desired
`execution policy <https://en.cppreference.com/w/cpp/algorithm#Execution_policies>`_.
The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t>`_
maps relatively well to the execution model of AMD GPUs. This, coupled with the
availability and maturity of GPU accelerated algorithm libraries that
implement most / all corresponding algorithms in the standard library
(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`__), makes
it feasible to provide seamless accelerator offload for supported algorithms,
when an accelerated version exists. Thus, it becomes possible to easily access
the computational resources of an AMD accelerator, via a well specified,
familiar, algorithmic interface, without having to delve into low-level hardware
specific details. Putting it all together:

- **What**: standard library algorithms, when invoked with the
  ``parallel_unsequenced_policy``
- **Why**: democratise AMDGPU accelerator programming, without loss of user
  familiarity
- **Where**: only AMDGPU accelerators targeted by Clang/LLVM via HIP

Small Example
=============

Given the following C++ code:

.. code-block:: C++

   bool has_the_answer(const std::vector<int>& v) {
     return std::find(std::execution::par_unseq, std::cbegin(v), std::cend(v), 42) != std::cend(v);
   }

if Clang is invoked with the ``--hipstdpar --offload-arch=foo`` flags, the call
to ``find`` will be offloaded to an accelerator that is part of the ``foo``
target family. If either ``foo`` or its runtime environment do not support
transparent on-demand paging (such as e.g. that provided in Linux via
`HMM <https://docs.kernel.org/mm/hmm.html>`_), it is necessary to also include
the ``--hipstdpar-interpose-alloc`` flag. If the accelerator specific algorithm
library used for ``foo`` doesn't have an implementation of a particular
algorithm, execution seamlessly falls back to the host CPU. It is legal to
specify multiple ``--offload-arch``\s. All the flags we introduce, as well as a
thorough view of various restrictions and their implementation, will be provided
below.

Implementation - General View
=============================

We built Algorithm Offload support atop the pre-existing HIP
infrastructure. More specifically, when one requests offload via ``--hipstdpar``,
compilation is switched to HIP compilation, as if ``-x hip`` was specified.
Similarly, linking is also switched to HIP linking, as if ``--hip-link`` was
specified. Note that these are implicit, and one should not assume that any
interop with HIP specific language constructs is available, e.g. ``__device__``
annotations are neither necessary nor guaranteed to work.

Since there are no language restriction mechanisms in place, it is necessary to
relax HIP language specific semantic checks performed by the FE; they would
otherwise identify valid, offloadable code as invalid HIP code. Given that we
know that the user intended only for certain algorithms to be offloaded, and
encoded this by specifying the ``parallel_unsequenced_policy``, we rely on a
pass over IR to clean up any and all code that was not "meant" for offload. If
requested, allocation interposition is also handled via a separate pass over IR.

To interface with the client HIP runtime, and to forward offloaded algorithm
invocations to the corresponding accelerator specific library implementation, an
implementation detail forwarding header is implicitly included by the driver,
when compiling with ``--hipstdpar``. In what follows, we will delve into each
component that contributes to implementing Algorithm Offload support.

Implementation - Driver
=======================

We augment the ``clang`` driver with the following flags:

- ``--hipstdpar`` enables algorithm offload, which depending on phase, has the
  following effects:

  - when compiling:

    - ``-x hip`` gets prepended to enable HIP support;
    - the ``ROCmToolchain`` component checks for the ``hipstdpar_lib.hpp``
      forwarding header,
      `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_ and
      `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ in
      their canonical locations, which can be overridden via the flags described
      below; if all are found, the forwarding header gets implicitly included,
      otherwise an error listing the missing component is generated;
    - the ``LangOpts.HIPStdPar`` member is set.

  - when linking:

    - ``--hip-link`` and ``-frtlib-add-rpath`` get appended to enable HIP
      support.

- ``--hipstdpar-interpose-alloc`` enables the interposition of standard
  allocation / deallocation functions with accelerator aware equivalents; the
  ``LangOpts.HIPStdParInterposeAlloc`` member is set;
- ``--hipstdpar-path=`` specifies a non-canonical path for the forwarding
  header; it must point to the folder where the header is located and not to the
  header itself;
- ``--hipstdpar-thrust-path=`` specifies a non-canonical path for
  `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_; it
  must point to the folder where the library is installed / built under a
  ``/thrust`` subfolder;
- ``--hipstdpar-prim-path=`` specifies a non-canonical path for
  `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_; it must
  point to the folder where the library is installed / built under a
  ``/rocprim`` subfolder.

The `--offload-arch <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
flag can be used to specify the accelerator for which offload code is to be
generated.
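
Putting it together, a typical invocation might look as follows (the target
architecture and the forwarding header path are illustrative):

.. code-block:: shell

   clang++ --hipstdpar --offload-arch=gfx906 \
     --hipstdpar-path=/path/to/hipstdpar-header \
     sample.cpp -o sample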

Implementation - Front-End
==========================

When ``LangOpts.HIPStdPar`` is set, we relax some of the HIP language specific
``Sema`` checks to account for the fact that we want to consume pure unannotated
C++ code:

1. ``__device__`` / ``__host__ __device__`` functions (which would originate in
   the accelerator specific algorithm library) are allowed to call implicitly
   ``__host__`` functions;
2. ``__global__`` functions (which would originate in the accelerator specific
   algorithm library) are allowed to call implicitly ``__host__`` functions;
3. resolving ``__builtin`` availability is deferred, because it is possible that
   a ``__builtin`` that is unavailable on the target accelerator is not
   reachable from any offloaded algorithm, and thus will be safely removed in
   the middle-end;
4. ASM parsing / checking is deferred, because it is possible that an ASM block
   that e.g. uses some constraints that are incompatible with the target
   accelerator is not reachable from any offloaded algorithm, and thus will be
   safely removed in the middle-end.

``CodeGen`` is similarly relaxed, with implicitly ``__host__`` functions being
emitted as well.

Implementation - Middle-End
===========================

We add two ``opt`` passes:

1. ``HipStdParAcceleratorCodeSelectionPass``

   - For all kernels in a ``Module``, compute reachability, where a function
     ``F`` is reachable from a kernel ``K`` if and only if there exists a direct
     call-chain rooted in ``K`` that includes ``F``;
   - Remove all functions that are not reachable from kernels;
   - This pass is only run when compiling for the accelerator.

The first pass assumes that the only code that the user intended to offload was
that which was directly or transitively invocable as part of an algorithm
execution. It also assumes that an accelerator aware algorithm implementation
would rely on accelerator specific special functions (kernels), and that these
effectively constitute the only roots for accelerator execution graphs. Both of
these assumptions are based on observing how widespread accelerators,
such as GPUs, work.

2. ``HipStdParAllocationInterpositionPass``

   - Iterate through all functions in a ``Module``, and replace standard
     allocation / deallocation functions with accelerator-aware equivalents,
     based on a pre-established table; the list of functions that can be
     interposed is available
     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`__;
   - This is only run when compiling for the host.

The second pass is optional.

Implementation - Forwarding Header
==================================

The forwarding header implements two pieces of functionality:

1. It forwards algorithms to a target accelerator, which is done by relying on
   C++ language rules around overloading:

   - overloads taking an explicit argument of type
     ``parallel_unsequenced_policy`` are introduced into the ``std`` namespace;
   - these get preferentially selected over the primary template;
   - the body forwards to the equivalent algorithm from the accelerator specific
     library.

2. It provides allocation / deallocation functions that are equivalent to the
   standard ones, but obtain memory by invoking
   `hipMallocManaged <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html#gab8cfa0e292193fa37e0cc2e4911fa90a>`_
   and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.
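
As an illustrative sketch (not the actual header), forwarding a single algorithm
might look like this, with rocThrust providing the accelerated implementation:

.. code-block:: c++

   #include <execution>
   #include <thrust/execution_policy.h>
   #include <thrust/find.h>

   namespace std {
     // Preferentially selected over the primary template whenever the
     // parallel_unsequenced_policy argument is passed explicitly.
     template <typename I, typename T>
     inline I find(execution::parallel_unsequenced_policy, I first, I last,
                   const T &value) {
       return ::thrust::find(::thrust::device, first, last, value);
     }
   }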

Predefined Macros
=================

.. list-table::
   :header-rows: 1

   * - Macro
     - Description
   * - ``__HIPSTDPAR__``
     - Defined when Clang is compiling code in algorithm offload mode, enabled
       with the ``--hipstdpar`` compiler option.
   * - ``__HIPSTDPAR_INTERPOSE_ALLOC__``
     - Defined only when compiling in algorithm offload mode, when the user
       enables interposition mode with the ``--hipstdpar-interpose-alloc``
       compiler option, indicating that all dynamic memory allocation /
       deallocation functions should be replaced with accelerator aware
       variants.
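
A minimal sketch of using these macros to guard offload-specific code paths:

.. code-block:: c++

   #if defined(__HIPSTDPAR__) && !defined(__HIPSTDPAR_INTERPOSE_ALLOC__)
     // Algorithm offload is enabled and system-allocated memory is assumed
     // to be directly accessible to the accelerator (HMM Mode).
   #endif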

Restrictions
============

We define two modes in which runtime execution can occur:

1. **HMM Mode** - this assumes that the
   `HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
   is used to provide transparent on-demand paging, i.e. memory obtained from a
   system / OS allocator such as via a call to ``malloc`` or ``operator new`` is
   directly accessible to the accelerator and it follows the C++ memory model;
2. **Interposition Mode** - this is a fallback mode for cases where transparent
   on-demand paging is unavailable (e.g. in the Windows OS), which means that
   memory must be allocated via an accelerator aware mechanism, and system
   allocated memory is inaccessible for the accelerator.

The following restrictions imposed on user code apply to both modes:

1. Pointers to function, and all associated features, such as e.g. dynamic
   polymorphism, cannot be used (directly or transitively) by the user provided
   callable passed to an algorithm invocation;
2. Global / namespace scope / ``static`` / ``thread`` storage duration variables
   cannot be used (directly or transitively) in name by the user provided
   callable;

   - When executing in **HMM Mode** they can be used in address e.g.:

     .. code-block:: C++

        namespace { int foo = 42; }

        bool never(const std::vector<int>& v) {
          return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v), [](auto&& x) {
            return x == foo;
          });
        }

        bool only_in_hmm_mode(const std::vector<int>& v) {
          return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
                             [p = &foo](auto&& x) { return x == *p; });
        }

3. Only algorithms that are invoked with the ``parallel_unsequenced_policy`` are
   candidates for offload;
4. Only algorithms that are invoked with iterator arguments that model
   `random_access_iterator <https://en.cppreference.com/w/cpp/iterator/random_access_iterator>`_
   are candidates for offload;
5. `Exceptions <https://en.cppreference.com/w/cpp/language/exceptions>`_ cannot
   be used by the user provided callable;
6. Dynamic memory allocation (e.g. ``operator new``) cannot be used by the user
   provided callable;
7. Selective offload is not possible, i.e. it is not possible to indicate that
   only some algorithms invoked with the ``parallel_unsequenced_policy`` are to
   be executed on the accelerator.
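
As a short illustration of restrictions 3 and 4, only the first invocation below
is a candidate for offload:

.. code-block:: c++

   void double_all(std::vector<int>& v, std::vector<int>& w) {
     // Candidate: parallel_unsequenced_policy (restriction 3) with
     // random access iterators (restriction 4).
     std::for_each(std::execution::par_unseq, std::begin(v), std::end(v),
                   [](int& x) { x *= 2; });
     // Not a candidate: the parallel_policy executes on the host.
     std::for_each(std::execution::par, std::begin(w), std::end(w),
                   [](int& x) { x *= 2; });
   }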

In addition to the above, using **Interposition Mode** imposes the following
additional restrictions:

1. All code that is expected to interoperate has to be recompiled with the
   ``--hipstdpar-interpose-alloc`` flag, i.e. it is not safe to compose libraries
   that have been independently compiled;
2. Automatic storage duration (i.e. stack allocated) variables cannot be used
   (directly or transitively) by the user provided callable, e.g.

   .. code-block:: c++

      bool never(const std::vector<int>& v, int n) {
        return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
                           [p = &n](auto&& x) { return x == *p; });
      }

Current Support
===============

At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,
when the `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack is used, on the
Linux operating system. Support is summarised in the following table:

.. list-table::
   :header-rows: 1

   * - `Processor <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
     - HMM Mode
     - Interposition Mode
   * - GCN GFX9 (Vega)
     - YES
     - YES
   * - GCN GFX10.1 (RDNA 1)
     - *NO*
     - YES
   * - GCN GFX10.3 (RDNA 2)
     - *NO*
     - YES
   * - GCN GFX11 (RDNA 3)
     - *NO*
     - YES
   * - GCN GFX12 (RDNA 4)
     - *NO*
     - YES

The minimum Linux kernel version for running in HMM mode is 6.4.

The forwarding header can be obtained from
`its GitHub repository <https://github.com/ROCmSoftwarePlatform/roc-stdpar>`_.
It will be packaged with a future `ROCm <https://rocm.docs.amd.com/en/latest/>`_
release. Because accelerated algorithms are provided via
`rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_, a
transitive dependency on
`rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ exists. Both
can be obtained either by installing their associated components of the
`ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack, or from their respective
repositories. The list of algorithms that can be offloaded is available
`here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#algorithm-support-status>`_.

HIP Specific Elements
---------------------

1. There is no defined interop with the
   `HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
   whilst things like using ``__device__`` annotations might accidentally "work",
   they are not guaranteed to, and thus cannot be relied upon by user code;

   - A consequence of the above is that both bitcode linking and linking
     relocatable object files will "work", but this is neither guaranteed to
     keep working nor actively tested at the moment; this restriction might be
     relaxed in the future.

2. Combining explicit HIP, CUDA or OpenMP Offload compilation with
   ``--hipstdpar`` based offloading is not allowed or supported in any way.
3. There is no way to target different accelerators via a standard algorithm
   invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
   an unsafe (per the point above) way of achieving this is to spawn new threads
   and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
   interface, e.g.:

   .. code-block:: c++

      int accelerator_0 = ...;
      int accelerator_1 = ...;

      bool multiple_accelerators(const std::vector<int>& u, const std::vector<int>& v) {
        std::atomic<unsigned int> r{0u};

        std::thread t0{[&]() {
          hipSetDevice(accelerator_0);

          r += std::count(std::execution::par_unseq, std::cbegin(u), std::cend(u), 42);
        }};
        std::thread t1{[&]() {
          hipSetDevice(accelerator_1);

          r += std::count(std::execution::par_unseq, std::cbegin(v), std::cend(v), 314152);
        }};

        t0.join();
        t1.join();

        return r;
      }

   Note that this is a temporary, unsafe workaround for a deficiency in the C++
   Standard.

Open Questions / Future Developments
====================================

1. The restriction on the use of global / namespace scope / ``static`` /
   ``thread`` storage duration variables in offloaded algorithms will be lifted
   in the future, when running in **HMM Mode**;
2. The restriction on the use of dynamic memory allocation in offloaded
   algorithms will be lifted in the future;
3. The restriction on the use of pointers to function, and associated features
   such as dynamic polymorphism, might be lifted in the future, when running in
   **HMM Mode**;
4. Offload support might be extended to cases where the ``parallel_policy`` is
   used for some or all targets.

SPIR-V Support on HIPAMD ToolChain
==================================

The HIPAMD ToolChain supports targeting
`AMDGCN Flavoured SPIR-V <https://llvm.org/docs/SPIRVUsage.html#target-triples>`_.
The support for SPIR-V in the ROCm and HIPAMD ToolChain is under active
development.

Compilation Process
-------------------

When compiling HIP programs with the intent of utilizing SPIR-V, the process
diverges from the traditional compilation flow:

Using ``--offload-arch=amdgcnspirv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Target Triple**: The ``--offload-arch=amdgcnspirv`` flag instructs the
  compiler to use the target triple ``spirv64-amd-amdhsa``. This approach
  generates generic AMDGCN SPIR-V, which retains architecture specific elements
  without hardcoding them, thus allowing for optimal target specific code to be
  generated at run time, when the concrete target is known.

- **LLVM IR Translation**: The program is compiled to LLVM Intermediate
  Representation (IR), which is subsequently translated into SPIR-V. In the
  future, this translation step will be replaced by direct SPIR-V emission via
  the SPIR-V Back-end.

- **Clang Offload Bundler**: The resulting SPIR-V is embedded in the Clang
  offload bundler with the bundle ID ``hip-spirv64-amd-amdhsa--amdgcnspirv``.
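
For example, a single-step compile and link targeting AMDGCN Flavoured SPIR-V:

.. code-block:: shell

   clang++ --offload-arch=amdgcnspirv -xhip sample.cpp -o sample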

Architecture Specific Macros
----------------------------

None of the architecture specific :doc:`AMDGPU macros <AMDGPUSupport>` are
defined when targeting SPIR-V. An alternative, more flexible mechanism to enable
per-target / per-feature code selection will be added in the future.
721