=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPU/AMDGPUAsmGFX11
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa``     Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================
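
As an illustration of how the components in the tables above combine, the
following Python sketch joins an architecture, vendor, and OS into a triple
string and rejects values the tables do not list. The helper name
``make_triple`` and its validation rules are assumptions made for this example
and are not part of LLVM or Clang. The environment component is left empty
because only the default is listed.

```python
# Illustrative only: values copied from the architecture, vendor, and OS
# tables in this section. make_triple is a hypothetical helper, not an LLVM API.
ARCHITECTURES = ("r600", "amdgcn")
VENDORS = ("amd", "mesa")
OPERATING_SYSTEMS = ("", "amdhsa", "amdpal", "mesa3d")  # "" is the unknown OS


def make_triple(arch, vendor="amd", os_name=""):
    """Return '<Architecture>-<Vendor>-<OS>' with an empty environment."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unlisted architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unlisted vendor: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unlisted OS: {os_name!r}")
    # The vendor table lists mesa only for the mesa3d OS.
    if vendor == "mesa" and os_name != "mesa3d":
        raise ValueError("vendor 'mesa' is listed only with OS 'mesa3d'")
    return f"{arch}-{vendor}-{os_name}"


print(make_triple("amdgcn", "amd", "amdhsa"))
print(make_triple("amdgcn", "mesa", "mesa3d"))
```

The first call yields ``amdgcn-amd-amdhsa``, which is the string that would
follow ``-target`` on the Clang command line.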

.. _amdgpu-processors:

Processors
----------

Use the Clang option ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:

* ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
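
The rule above can be sketched in Python as follows. The helper
``clang_target_flags`` is hypothetical, and the small processor map contains
only a few rows taken from the AMDGPU Processors table in this section. The
sketch checks only the single exception listed above; the OS Support column of
the table carries the per-processor detail and is not modelled here.

```python
# Illustrative only: a handful of rows from the AMDGPU Processors table,
# mapping each processor name to its target triple architecture.
PROCESSOR_ARCH = {
    "r600": "r600",
    "cayman": "r600",
    "gfx803": "amdgcn",
    "gfx90a": "amdgcn",
    "gfx942": "amdgcn",
}


def clang_target_flags(processor, os_name, vendor="amd"):
    """Build -target/-mcpu flags; clang_target_flags is a hypothetical helper."""
    arch = PROCESSOR_ARCH[processor]
    # The one exception stated above: amdhsa is not supported on r600.
    if arch == "r600" and os_name == "amdhsa":
        raise ValueError(f"{processor}: amdhsa is not supported on r600")
    return ["-target", f"{arch}-{vendor}-{os_name}", f"-mcpu={processor}"]


# Only prints the command line; it does not invoke Clang.
print(" ".join(["clang"] + clang_target_flags("gfx90a", "amdhsa")))
```

For ``gfx90a`` on ``amdhsa`` this prints
``clang -target amdgcn-amd-amdhsa -mcpu=gfx90a``, while requesting ``amdhsa``
for ``cayman`` raises an error.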


  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_ [AMD-GCN-GFX940-GFX942-CDNA3]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack             flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - AMD Instinct MI210 Accelerator
                                                    - tgsplit           flat                          - AMD Instinct MI250 Accelerator
                                                    - xnack             scratch                       - AMD Instinct MI250X Accelerator
                                                    - kernarg preload - Packed
                                                      (except MI210)    work-item
                                                                        IDs

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                    - kernarg preload - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     ``gfx941``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                    - kernarg preload - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     ``gfx942``                  ``amdgcn``   dGPU  - sramecc         - Architected                   - AMD Instinct MI300X
                                                    - tgsplit           flat                          - AMD Instinct MI300A
                                                    - xnack             scratch
                                                    - kernarg preload - Packed
                                                                        work-item
                                                                        IDs

     ``gfx950``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                    - kernarg preload - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.
408
409     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
410     -----------------------------------------------------------------------------------------------------------------------
411     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
412                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
413                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
414                                                                                                      - Radeon Pro 5600M
415     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
416                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
417                                                    - xnack             flat          - *pal-amdpal*
418                                                                        scratch
419     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
420                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
421                                                    - xnack             scratch       - *pal-amdpal*
422     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
423                                                    - wavefrontsize64   flat          - *pal-amdhsa*
424                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::
425
426                                                                                                        Add product
427                                                                                                        names.
428
429     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
430     -----------------------------------------------------------------------------------------------------------------------
431     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
432                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
433                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
434                                                                                                      - Radeon PRO W6800
435                                                                                                      - Radeon PRO V620
436     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
437                                                    - wavefrontsize64   flat          - *pal-amdhsa*
438                                                                        scratch       - *pal-amdpal*
439     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
440                                                    - wavefrontsize64   flat          - *pal-amdhsa*
441                                                                        scratch       - *pal-amdpal*  .. TODO::
442
443                                                                                                        Add product
444                                                                                                        names.
445
446     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
447                                                    - wavefrontsize64   flat
448                                                                        scratch                       .. TODO::
449
450                                                                                                        Add product
451                                                                                                        names.
452     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
453                                                    - wavefrontsize64   flat
454                                                                        scratch                       .. TODO::
455
456                                                                                                        Add product
457                                                                                                        names.
458
459     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
460                                                    - wavefrontsize64   flat
461                                                                        scratch                       .. TODO::
462                                                                                                        Add product
463                                                                                                        names.
464
465     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
466                                                    - wavefrontsize64   flat
467                                                                        scratch                       .. TODO::
468
469                                                                                                        Add product
470                                                                                                        names.
471
472     **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
473     -----------------------------------------------------------------------------------------------------------------------
474     ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  - Radeon PRO W7900 Dual Slot
475                                                    - wavefrontsize64   flat                          - Radeon PRO W7900
476                                                                        scratch                       - Radeon PRO W7800
477                                                                      - Packed                        - Radeon RX 7900 XTX
478                                                                        work-item                     - Radeon RX 7900 XT
479                                                                        IDs                           - Radeon RX 7900 GRE
480
481     ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
482                                                    - wavefrontsize64   flat
483                                                                        scratch                       .. TODO::
484                                                                      - Packed
485                                                                        work-item                       Add product
486                                                                        IDs                             names.
487
488     ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
489                                                    - wavefrontsize64   flat
490                                                                        scratch                       .. TODO::
491                                                                      - Packed
492                                                                        work-item                       Add product
493                                                                        IDs                             names.
494
495     ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
496                                                    - wavefrontsize64   flat
497                                                                        scratch                       .. TODO::
498                                                                      - Packed
499                                                                        work-item                       Add product
500                                                                        IDs                             names.
501
502     **GCN GFX11 (RDNA 3.5)** [AMD-GCN-GFX11-RDNA3.5]_
503     -----------------------------------------------------------------------------------------------------------------------
504     ``gfx1150``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
505                                                    - wavefrontsize64   flat
506                                                                        scratch                       .. TODO::
507                                                                      - Packed
508                                                                        work-item                       Add product
509                                                                        IDs                             names.
510
511     ``gfx1151``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
512                                                    - wavefrontsize64   flat
513                                                                        scratch                       .. TODO::
514                                                                      - Packed
515                                                                        work-item                       Add product
516                                                                        IDs                             names.
517
518     ``gfx1152``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
519                                                    - wavefrontsize64   flat
520                                                                        scratch                       .. TODO::
521                                                                      - Packed
522                                                                        work-item                       Add product
523                                                                        IDs                             names.
524
525     ``gfx1153``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
526                                                    - wavefrontsize64   flat
527                                                                        scratch                       .. TODO::
528                                                                      - Packed
529                                                                        work-item                       Add product
530                                                                        IDs                             names.
531
532     ``gfx1200``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
533                                                    - wavefrontsize64   flat
534                                                                        scratch                       .. TODO::
535                                                                      - Packed
536                                                                        work-item                       Add product
537                                                                        IDs                             names.
538
539     ``gfx1201``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
540                                                    - wavefrontsize64   flat
541                                                                        scratch                       .. TODO::
542                                                                      - Packed
543                                                                        work-item                       Add product
544                                                                        IDs                             names.
545
546     =========== =============== ============ ===== ================= =============== =============== ======================

Generic processors allow execution of a single code object on any of the
processors that it supports. Such code objects may not perform as well as
those for the non-generic processors.

Generic processors are only available on code object V6 and above (see
:ref:`amdgpu-elf-code-object`).

Generic processor code objects are versioned. See
:ref:`amdgpu-generic-processor-versioning` for more information on how
versioning works.

555  .. table:: AMDGPU Generic Processors
556     :name: amdgpu-generic-processor-table
557
558     ==================== ============== ================= ================== ================= =================================
559     Processor             Target        Supported         Target Features    Target Properties Target Restrictions
560                           Triple        Processors        Supported
561                           Architecture
562
563     ==================== ============== ================= ================== ================= =================================
564     ``gfx9-generic``     ``amdgcn``     - ``gfx900``      - xnack            - Absolute flat   - ``v_mad_mix`` instructions
565                                         - ``gfx902``                           scratch           are not available on
566                                         - ``gfx904``                                             ``gfx900``, ``gfx902``,
567                                         - ``gfx906``                                             ``gfx909``, ``gfx90c``
568                                         - ``gfx909``                                           - ``v_fma_mix`` instructions
569                                         - ``gfx90c``                                             are not available on ``gfx904``
570                                                                                                - sramecc is not available on
571                                                                                                  ``gfx906``
572                                                                                                - The following instructions
573                                                                                                  are not available on ``gfx906``:
574
575                                                                                                  - ``v_fmac_f32``
576                                                                                                  - ``v_xnor_b32``
577                                                                                                  - ``v_dot4_i32_i8``
578                                                                                                  - ``v_dot8_i32_i4``
579                                                                                                  - ``v_dot2_i32_i16``
580                                                                                                  - ``v_dot2_u32_u16``
581                                                                                                  - ``v_dot4_u32_u8``
582                                                                                                  - ``v_dot8_u32_u4``
583                                                                                                  - ``v_dot2_f32_f16``
584
585
586     ``gfx9-4-generic``   ``amdgcn``     - ``gfx940``      - xnack            - Absolute flat   FP8 and BF8 instructions,
587                                         - ``gfx941``      - sramecc            scratch         FP8 and BF8 conversion instructions,
588                                         - ``gfx942``                                           as well as instructions with XF32 format support
589                                         - ``gfx950``                                           are not available.
590
591
592     ``gfx10-1-generic``  ``amdgcn``     - ``gfx1010``     - xnack            - Absolute flat   - The following instructions are
593                                         - ``gfx1011``     - wavefrontsize64    scratch           not available on ``gfx1011``
594                                         - ``gfx1012``     - cumode                               and ``gfx1012``
595                                         - ``gfx1013``
596                                                                                                  - ``v_dot4_i32_i8``
597                                                                                                  - ``v_dot8_i32_i4``
598                                                                                                  - ``v_dot2_i32_i16``
599                                                                                                  - ``v_dot2_u32_u16``
600                                                                                                  - ``v_dot2c_f32_f16``
601                                                                                                  - ``v_dot4c_i32_i8``
602                                                                                                  - ``v_dot4_u32_u8``
603                                                                                                  - ``v_dot8_u32_u4``
604                                                                                                  - ``v_dot2_f32_f16``
605
606                                                                                                - BVH Ray Tracing instructions
607                                                                                                  are not available on
608                                                                                                  ``gfx1013``
609
610
611     ``gfx10-3-generic``  ``amdgcn``     - ``gfx1030``     - wavefrontsize64  - Absolute flat   No restrictions.
612                                         - ``gfx1031``     - cumode             scratch
613                                         - ``gfx1032``
614                                         - ``gfx1033``
615                                         - ``gfx1034``
616                                         - ``gfx1035``
617                                         - ``gfx1036``
618
619
620     ``gfx11-generic``    ``amdgcn``     - ``gfx1100``     - wavefrontsize64  - Architected     Various codegen pessimizations
621                                         - ``gfx1101``     - cumode             flat scratch    are applied to work around some
622                                         - ``gfx1102``                        - Packed          hazards specific to some targets
623                                         - ``gfx1103``                          work-item       within this family.
624                                         - ``gfx1150``                          IDs
625                                         - ``gfx1151``
626                                         - ``gfx1152``
627                                         - ``gfx1153``                                          Not all VGPRs can be used on:
628
629                                                                                                - ``gfx1100``
630                                                                                                - ``gfx1101``
631                                                                                                - ``gfx1151``
632
633                                                                                                SALU floating point instructions
634                                                                                                are not available on:
635
636                                                                                                - ``gfx1150``
637                                                                                                - ``gfx1151``
638                                                                                                - ``gfx1152``
639                                                                                                - ``gfx1153``
640
641                                                                                                SGPRs are not supported for src1
642                                                                                                in dpp instructions for:
643
644                                                                                                - ``gfx1150``
645                                                                                                - ``gfx1151``
646                                                                                                - ``gfx1152``
647                                                                                                - ``gfx1153``
648
649
650     ``gfx12-generic``    ``amdgcn``     - ``gfx1200``     - wavefrontsize64  - Architected     No restrictions.
651                                         - ``gfx1201``     - cumode             flat scratch
652                                                                              - Packed
653                                                                                work-item
654                                                                                IDs
655     ==================== ============== ================= ================== ================= =================================

.. _amdgpu-generic-processor-versioning:

Generic Processor Versioning
----------------------------

Code objects for generic processors (see :ref:`amdgpu-generic-processor-table`)
carry a version number between 1 and 255 (see
:ref:`amdgpu-elf-header-e_flags-table-v6-onwards`). The version of non-generic
code objects is always set to 0.

Adding a new supported processor to a generic target may require changing the
code generated for that target so that it continues to execute on the
previously supported processors as well as on the new one. When this happens,
the generic code object version number is incremented at the same time as the
generic target is updated.

Each supported processor of a generic target is mapped to the version in which
it was introduced. A generic code object can execute on a supported processor
if the version of the code object being loaded is greater than or equal to the
version in which the processor was added to the generic target.

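
As an illustration of the rule above, the following Python sketch checks
whether a generic code object can execute on a given processor. The
``INTRODUCED_IN`` mapping, its version numbers, and the function name are
hypothetical and chosen only for this example; they do not reflect the values
used by any real generic target.

```python
# Hypothetical mapping from each supported processor of one generic target to
# the generic code object version in which it was added (illustrative only).
INTRODUCED_IN = {
    "gfx1100": 1,
    "gfx1101": 1,
    "gfx1150": 2,  # pretend this processor was added in a later revision
}


def can_execute(code_object_version: int, processor: str) -> bool:
    """Return True if a generic code object may run on `processor`."""
    if not 1 <= code_object_version <= 255:
        # Version 0 is reserved for non-generic code objects.
        raise ValueError("generic code object versions range from 1 to 255")
    introduced = INTRODUCED_IN.get(processor)
    if introduced is None:
        return False  # not a supported processor of this generic target
    # The code object must be at least as new as the processor's addition.
    return code_object_version >= introduced
```

With this assumed mapping, a version 1 code object would be rejected on
``gfx1150`` because that processor was only added in version 2, while a
version 2 code object would be accepted on all three processors.
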
.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or reduced performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processors`.

Each target feature is controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled,
                                                  native WGP wavefront execution mode is used;
                                                  when enabled, CU wavefront execution mode is
                                                  used (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled, the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled,
                                                  native wavefront size 32 is used; when enabled,
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 or above without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 or above without specifying XNACK replay,
                                                  on a device that does not have XNACK replay
                                                  enabled will execute correctly but may be less
                                                  performant than code generated for XNACK replay
                                                  disabled.
     =============== ============================ ==================================================

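
The loading rules for ``sramecc`` and ``xnack`` described in the table above
can be summarized by the following Python sketch. The function names and the
``"on"``/``"off"``/``"any"`` strings are assumptions made for illustration and
are not part of any runtime API. The sketch models only the matching rule and
ignores the performance considerations described for XNACK replay.

```python
def default_setting(code_object_version: int) -> str:
    """Setting assumed when the target ID does not mention the feature."""
    # Code object V2 to V3 targets a process with the feature enabled;
    # code object V4 and above can run with either process setting.
    return "on" if code_object_version <= 3 else "any"


def can_load(code_setting: str, process_enabled: bool) -> bool:
    """Return True if the code object setting matches the process setting."""
    if code_setting == "any":
        return True  # compatible with either process setting
    if code_setting == "on":
        return process_enabled
    if code_setting == "off":
        return not process_enabled
    raise ValueError(f"unknown setting: {code_setting!r}")
```

For example, under these assumptions a code object V4 or above built without
specifying ``xnack`` has the ``"any"`` setting and loads whether or not XNACK
replay is enabled in the process.
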
.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

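
A target ID is written as the processor name followed by ``:``-separated
target feature components, each ending in ``+`` or ``-`` (for example
``gfx908:xnack+``). The following Python sketch shows one way the ordering and
uniqueness rules above could be applied to such a string. The function name is
hypothetical, and the sketch does not translate alternative processor names to
their primary names or check that the processor supports each feature.

```python
def canonicalize_target_id(target_id: str) -> str:
    """Return the target ID with its features in alphabetic order."""
    processor, *features = target_id.split(":")
    if not processor:
        raise ValueError("missing processor name")
    settings = {}
    for feature in features:
        # Each component is a feature name followed by "+" or "-".
        if len(feature) < 2 or feature[-1] not in "+-":
            raise ValueError(f"malformed target feature: {feature!r}")
        name, sign = feature[:-1], feature[-1]
        if name in settings:
            raise ValueError(f"target feature given more than once: {name}")
        settings[name] = sign
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)
```

For instance, the non-canonical ``gfx908:xnack-:sramecc+`` would be rewritten
as ``gfx908:sramecc+:xnack-``, while a string that names ``xnack`` twice would
be rejected.
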
802.. _amdgpu-target-id-v2-v3:
803
804Code Object V2 to V3 Target ID
805~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
806
807The target ID syntax for code object V2 to V3 is the same as defined in `Clang
808Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
809when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
810directive and the bundle entry ID. In those cases it has the following BNF
811syntax:
812
813.. code::
814
815  <target-id> ::= <processor> ( "+" <target-feature> )*
816
817Where a target feature is omitted if *Off* and present if *On* or *Any*.
818
819.. note::
820
821  The code object V2 to V3 cannot represent *Any* and treats it the same as
822  *On*.
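
The mapping from feature settings to this syntax can be sketched as follows.
The function name ``format_v2_v3_target_id`` is hypothetical, and emitting the
features in the order they are given is an assumption, since the grammar above
does not specify an order.

.. code:: python

  def format_v2_v3_target_id(processor, feature_settings):
      """Build a code object V2 to V3 target ID string.

      feature_settings maps a target feature name to "On", "Off" or "Any".
      """
      parts = [processor]
      for name, setting in feature_settings.items():
          if setting not in ("On", "Off", "Any"):
              raise ValueError(f"unknown setting for {name!r}: {setting!r}")
          # "Off" features are omitted. "Any" cannot be represented and is
          # therefore emitted exactly like "On".
          if setting != "Off":
              parts.append(name)
      return "+".join(parts)

With these assumptions, a processor whose only feature is set to *Any*
produces the same string as one whose feature is set to *On*.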
823
824.. _amdgpu-embedding-bundled-objects:
825
826Embedding Bundled Code Objects
827------------------------------
828
829AMDGPU supports the HIP and OpenMP languages that perform code object embedding
830as described in `Clang Offload Bundler
831<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
832
833.. note::
834
835  The target ID syntax used for code object V2 to V3 for a bundle entry ID
836  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
837
838.. _amdgpu-address-spaces:
839
840Address Spaces
841--------------
842
843The AMDGPU architecture supports a number of memory address spaces. The address
844space names use the OpenCL standard names, with some additions.
845
846The AMDGPU address spaces correspond to target architecture specific LLVM
847address space numbers used in LLVM IR.
848
849The AMDGPU address spaces are described in
850:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
851supported for the ``amdgcn`` target.
852
853  .. table:: AMDGPU Address Spaces
854     :name: amdgpu-address-spaces-table
855
856     ===================================== =============== =========== ================ ======= ============================
857     ..                                                                                         64-Bit Process Address Space
858     ------------------------------------- --------------- ----------- ---------------- ------------------------------------
859     Address Space Name                    LLVM IR Address HSA Segment Hardware         Address NULL Value
860                                           Space Number    Name        Name             Size
861     ===================================== =============== =========== ================ ======= ============================
862     Generic                               0               flat        flat             64      0x0000000000000000
863     Global                                1               global      global           64      0x0000000000000000
864     Region                                2               N/A         GDS              32      *not implemented for AMDHSA*
865     Local                                 3               group       LDS              32      0xFFFFFFFF
866     Constant                              4               constant    *same as global* 64      0x0000000000000000
867     Private                               5               private     scratch          32      0xFFFFFFFF
868     Constant 32-bit                       6               *TODO*                               0x00000000
869     Buffer Fat Pointer                    7               N/A         N/A              160     0
870     Buffer Resource                       8               N/A         V#               128     0x00000000000000000000000000000000
871     Buffer Strided Pointer (experimental) 9               *TODO*
872     Streamout Registers                   128             N/A         GS_REGS
873     ===================================== =============== =========== ================ ======= ============================
874
875**Generic**
876  The generic address space is supported unless the *Target Properties* column
877  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
878  space*.
879
880  The generic address space uses the hardware flat address support for two fixed
881  ranges of virtual addresses (the private and local apertures), that are
882  outside the range of addressable global memory, to map from a flat address to
883  a private or local address. This uses FLAT instructions that can take a flat
884  address and access global, private (scratch), and group (LDS) memory depending
885  on whether the address is within one of the aperture ranges.
886
887  Flat access to scratch requires hardware aperture setup and setup in the
888  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
889  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
890  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
891
892  To convert between a private or group address space address (termed a segment
893  address) and a flat address the base address of the corresponding aperture
894  can be used. For GFX7-GFX8 these are available in the
895  :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
896  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
897  GFX9-GFX11 the aperture base addresses are directly available as inline
898  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
899  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
900  aligned to 2^32 which makes it easier to convert from flat to segment or
901  segment to flat.
902
903  A global address space address has the same value when used as a flat address
904  so no conversion is needed.
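
  The conversions described above can be sketched in Python for the 64-bit
  address mode. This is a minimal illustration under stated assumptions: the
  function names are hypothetical, the aperture base is supplied by the caller,
  and the translation between the different NULL values of the address spaces
  is not handled.

  .. code:: python

    APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode

    def segment_to_flat(aperture_base, segment_address):
        """Convert a 32-bit private or local address to a flat address."""
        assert aperture_base % APERTURE_SIZE == 0, "base must be 2^32 aligned"
        assert 0 <= segment_address < APERTURE_SIZE
        # The aligned base has zero low bits, so OR and ADD are equivalent.
        return aperture_base | segment_address

    def flat_to_segment(aperture_base, flat_address):
        """Convert a flat address inside the aperture to a segment address."""
        if flat_address // APERTURE_SIZE != aperture_base // APERTURE_SIZE:
            raise ValueError("flat address is outside the aperture")
        # The segment address is the low 32 bits of the flat address.
        return flat_address & (APERTURE_SIZE - 1)

  Because the base is aligned to the aperture size, both directions reduce to
  simple bit operations on the upper and lower 32 bits.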
905
906**Global and Constant**
907  The global and constant address spaces both use global virtual addresses,
908  which are the same virtual address space used by the CPU. However, some
909  virtual addresses may only be accessible to the CPU, some only accessible
910  by the GPU, and some by both.
911
912  Using the constant address space indicates that the data will not change
913  during the execution of the kernel. This allows scalar read instructions to
914  be used. As the constant address space can only be modified on the host
915  side, a generic pointer loaded from the constant address space can safely be
916  assumed to be a global pointer since only the device global memory is visible
917  and managed on the host side. The vector and scalar L1 caches are invalidated
918  of volatile data before each kernel dispatch execution to allow constant
919  memory to change values between kernel dispatches.
920
921**Region**
922  The region address space uses the hardware Global Data Store (GDS). All
923  wavefronts executing on the same device will access the same memory for any
924  given region address. However, the same region address accessed by wavefronts
925  executing on different devices will access different memory. It is higher
926  performance than global memory. It is allocated by the runtime. The data
927  store (DS) instructions can be used to access it.
928
929**Local**
930  The local address space uses the hardware Local Data Store (LDS) which is
931  automatically allocated when the hardware creates the wavefronts of a
932  work-group, and freed when all the wavefronts of a work-group have
933  terminated. All wavefronts belonging to the same work-group will access the
934  same memory for any given local address. However, the same local address
935  accessed by wavefronts belonging to different work-groups will access
936  different memory. It is higher performance than global memory. The data store
937  (DS) instructions can be used to access it.
938
939**Private**
940  The private address space uses the hardware scratch memory support which
941  automatically allocates memory when it creates a wavefront and frees it when
942  a wavefront terminates. The memory accessed by a lane of a wavefront for any
943  given private address will be different to the memory accessed by another lane
944  of the same or different wavefront for the same private address.
945
946  If a kernel dispatch uses scratch, then the hardware allocates memory from a
947  pool of backing memory allocated by the runtime for each wavefront. The lanes
948  of the wavefront access this using dword (4 byte) interleaving. The mapping
949  used from private address to backing memory address is:
950
951    ``wavefront-scratch-base +
952    ((private-address / 4) * wavefront-size * 4) +
953    (wavefront-lane-id * 4) + (private-address % 4)``
954
955  If each lane of a wavefront accesses the same private address, the
956  interleaving results in adjacent dwords being accessed and hence requires
957  fewer cache lines to be fetched.
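
  The mapping above can be written out as a small Python function. This is a
  sketch for illustration only: the function name is hypothetical and the
  default wavefront size of 64 is an assumption.

  .. code:: python

    def scratch_backing_address(wavefront_scratch_base, private_address,
                                lane_id, wavefront_size=64):
        """Map a lane's private address to a backing memory address."""
        # Split the private address into a dword index and a byte offset.
        dword_index, byte_in_dword = divmod(private_address, 4)
        return (wavefront_scratch_base
                + dword_index * wavefront_size * 4
                + lane_id * 4
                + byte_in_dword)

  Evaluating it for consecutive lanes with the same private address yields
  backing addresses that are 4 bytes apart, which is the adjacent dword
  pattern described above.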
958
959  There are different ways that the wavefront scratch base address is
960  determined by a wavefront (see
961  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
962
963  Scratch memory can be accessed in an interleaved manner using buffer
964  instructions with the scratch buffer descriptor and per wavefront scratch
965  offset, by the scratch instructions, or by flat instructions. Multi-dword
966  access is not supported except by flat and scratch instructions in
967  GFX9-GFX11.
968
969  Code that manipulates the stack values in other lanes of a wavefront,
970  such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets
971  that reach other lanes or by explicitly constructing the scratch buffer descriptor,
972  triggers undefined behavior when it modifies the scratch values of other lanes.
973  The compiler may assume that such modifications do not occur.
974  When using code object V5, ``LIBOMPTARGET_STACK_SIZE`` may be used to provide the
975  private segment size in bytes, for cases where a dynamic stack is used.
976
977**Constant 32-bit**
978  *TODO*
979
980**Buffer Fat Pointer**
981  The buffer fat pointer is an experimental address space that is currently
982  unsupported in the backend. It exposes a non-integral pointer that is in
983  the future intended to support the modelling of 128-bit buffer descriptors
984  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
985  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
986  model the buffer descriptors used heavily in graphics workloads targeting
987  the backend.
988
989  The buffer descriptor used to construct a buffer fat pointer must be *raw*:
990  the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
991  must be off, and the extent must be measured in bytes. (On subtargets where
992  bounds checking may be disabled, buffer fat pointers may choose to enable
993  it or not).
994
995**Buffer Resource**
996  The buffer resource pointer, in address space 8, is the newer form
997  for representing buffer descriptors in AMDGPU IR, replacing their
998  previous representation as `<4 x i32>`. It is a non-integral pointer
999  that represents a 128-bit buffer descriptor resource (`V#`).
1000
1001  Since, in general, a buffer resource supports complex addressing modes that cannot
1002  be easily represented in LLVM (such as implicit swizzled access to structured
1003  buffers), it is **illegal** to perform non-trivial address computations, such as
1004  ``getelementptr`` operations, on buffer resources. They may be passed to
1005  AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
1006
1007  Casting a buffer resource to a buffer fat pointer is permitted and adds an offset
1008  of 0.
1009
1010  Buffer resources can be created from 64-bit pointers (which should be either
1011  generic or global) using the `llvm.amdgcn.make.buffer.rsrc` intrinsic, which
1012  takes the pointer, which becomes the base of the resource,
1013  the 16-bit stride (and swizzle control) field stored in bits `63:48` of a `V#`,
1014  the 32-bit NumRecords/extent field (bits `95:64`), and the 32-bit flags field
1015  (bits `127:96`). The specific interpretation of these fields varies by the
1016  target architecture and is detailed in the ISA descriptions.
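
  The field positions listed above can be illustrated by packing the four
  operands into a single 128-bit integer. This is a sketch, not the lowering
  performed by the backend. The function name is hypothetical, and placing the
  base pointer in bits ``47:0`` is an assumption inferred from the stride field
  starting at bit 48.

  .. code:: python

    def pack_buffer_resource(base, stride, num_records, flags):
        """Pack the four operands into a 128-bit V# value (a Python int)."""
        if not 0 <= base < (1 << 48):
            raise ValueError("base is assumed to fit in bits 47:0")
        if not 0 <= stride < (1 << 16):
            raise ValueError("stride field is 16 bits")
        if not 0 <= num_records < (1 << 32):
            raise ValueError("NumRecords field is 32 bits")
        if not 0 <= flags < (1 << 32):
            raise ValueError("flags field is 32 bits")
        # stride: bits 63:48, NumRecords: bits 95:64, flags: bits 127:96.
        return base | (stride << 48) | (num_records << 64) | (flags << 96)

  The meaning of the individual stride and flags bits is target specific and
  is not modelled here.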
1017
1018**Buffer Strided Pointer**
1019  The buffer strided pointer is an experimental address space. It represents
1020  a 128-bit buffer descriptor and a 32-bit offset, like the **Buffer Fat
1021  Pointer**. Additionally, it contains an index into the buffer, which
1022  allows the direct addressing of structured elements. These components appear
1023  in that order, i.e., the descriptor comes first, then the 32-bit offset
1024  followed by the 32-bit index.
1025
1026  The bits in the buffer descriptor must meet the following requirements:
1027  the stride is the size of a structured element, the "add tid" flag must be 0,
1028  and the swizzle enable bits must be off.
1029
1030**Streamout Registers**
1031  Dedicated registers used by the GS NGG Streamout Instructions. The register
1032  file is modelled as a memory in a distinct address space because it is indexed
1033  by an address-like offset in place of named registers, and because register
1034  accesses affect LGKMcnt. This is an internal address space used only by the
1035  compiler. Do not use this address space for IR pointers.
1036
1037.. _amdgpu-memory-scopes:
1038
1039Memory Scopes
1040-------------
1041
1042This section describes the LLVM memory synchronization scopes supported by the AMDGPU
1043backend memory model when the target triple OS is ``amdhsa`` (see
1044:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
1045
1046The memory model supported is based on the HSA memory model [HSA]_ which is
1047based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
1048relation is transitive over the synchronizes-with relation independent of scope
1049and synchronizes-with allows the memory scope instances to be inclusive (see
1050table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
1051
1052This is different to the OpenCL [OpenCL]_ memory model which does not have scope
1053inclusion and requires the memory scopes to exactly match. However, this
1054is conservatively correct for OpenCL.
1055
1056  .. table:: AMDHSA LLVM Sync Scopes
1057     :name: amdgpu-amdhsa-llvm-sync-scopes-table
1058
1059     ======================= ===================================================
1060     LLVM Sync Scope         Description
1061     ======================= ===================================================
1062     *none*                  The default: ``system``.
1063
1064                             Synchronizes with, and participates in modification
1065                             and seq_cst total orderings with, other operations
1066                             (except image operations) for all address spaces
1067                             (except private, or generic that accesses private)
1068                             provided the other operation's sync scope is:
1069
1070                             - ``system``.
1071                             - ``agent`` and executed by a thread on the same
1072                               agent.
1073                             - ``workgroup`` and executed by a thread in the
1074                               same work-group.
1075                             - ``wavefront`` and executed by a thread in the
1076                               same wavefront.
1077
1078     ``agent``               Synchronizes with, and participates in modification
1079                             and seq_cst total orderings with, other operations
1080                             (except image operations) for all address spaces
1081                             (except private, or generic that accesses private)
1082                             provided the other operation's sync scope is:
1083
1084                             - ``system`` or ``agent`` and executed by a thread
1085                               on the same agent.
1086                             - ``workgroup`` and executed by a thread in the
1087                               same work-group.
1088                             - ``wavefront`` and executed by a thread in the
1089                               same wavefront.
1090
1091     ``workgroup``           Synchronizes with, and participates in modification
1092                             and seq_cst total orderings with, other operations
1093                             (except image operations) for all address spaces
1094                             (except private, or generic that accesses private)
1095                             provided the other operation's sync scope is:
1096
1097                             - ``system``, ``agent`` or ``workgroup`` and
1098                               executed by a thread in the same work-group.
1099                             - ``wavefront`` and executed by a thread in the
1100                               same wavefront.
1101
1102     ``wavefront``           Synchronizes with, and participates in modification
1103                             and seq_cst total orderings with, other operations
1104                             (except image operations) for all address spaces
1105                             (except private, or generic that accesses private)
1106                             provided the other operation's sync scope is:
1107
1108                             - ``system``, ``agent``, ``workgroup`` or
1109                               ``wavefront`` and executed by a thread in the
1110                               same wavefront.
1111
1112     ``singlethread``        Only synchronizes with and participates in
1113                             modification and seq_cst total orderings with,
1114                             other operations (except image operations) running
1115                             in the same thread for all address spaces (for
1116                             example, in signal handlers).
1117
1118     ``one-as``              Same as ``system`` but only synchronizes with other
1119                             operations within the same address space.
1120
1121     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
1122                             operations within the same address space.
1123
1124     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
1125                             other operations within the same address space.
1126
1127     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
1128                             other operations within the same address space.
1129
1130     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
1131                             other operations within the same address space.
1132     ======================= ===================================================
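
The scope inclusion rule in the table above can be sketched in Python. This is
a simplified model under stated assumptions: the ``Thread`` class and the
function names are hypothetical, identifiers are interpreted hierarchically (a
work-group ID is only compared within the same agent, and so on), and the
``*-one-as`` variants, image operations, and the private address space
exclusion are not modelled.

.. code:: python

  from dataclasses import dataclass

  # Base sync scopes ordered from narrowest to widest.
  SCOPE_RANK = {"singlethread": 0, "wavefront": 1, "workgroup": 2,
                "agent": 3, "system": 4}

  @dataclass(frozen=True)
  class Thread:
      agent: int
      workgroup: int
      wavefront: int
      lane: int

  def narrowest_common_scope(a, b):
      """Return the narrowest scope whose instance contains both threads."""
      if a.agent != b.agent:
          return "system"
      if a.workgroup != b.workgroup:
          return "agent"
      if a.wavefront != b.wavefront:
          return "workgroup"
      if a.lane != b.lane:
          return "wavefront"
      return "singlethread"

  def may_synchronize(scope_a, thread_a, scope_b, thread_b):
      """Scope inclusion: each operation's scope must cover the other thread."""
      # An empty string stands for no sync scope, which defaults to system.
      needed = SCOPE_RANK[narrowest_common_scope(thread_a, thread_b)]
      return (SCOPE_RANK[scope_a or "system"] >= needed
              and SCOPE_RANK[scope_b or "system"] >= needed)

In this model, the two scopes do not need to match exactly. It is enough that
both are at least as wide as the narrowest scope instance containing the two
threads.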
1133
1134LLVM IR Intrinsics
1135------------------
1136
1137The AMDGPU backend implements the following LLVM IR intrinsics.
1138
1139*This section is WIP.*
1140
1141.. table:: AMDGPU LLVM IR Intrinsics
1142  :name: amdgpu-llvm-ir-intrinsics-table
1143
1144  ==============================================   ==========================================================
1145  LLVM Intrinsic                                   Description
1146  ==============================================   ==========================================================
1147  llvm.amdgcn.sqrt                                 Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16
1148                                                   (on targets with half support). Performs sqrt function.
1149
1150  llvm.amdgcn.log                                  Provides direct access to v_log_f32 and v_log_f16
1151                                                   (on targets with half support). Performs log2 function.
1152
1153  llvm.amdgcn.exp2                                 Provides direct access to v_exp_f32 and v_exp_f16
1154                                                   (on targets with half support). Performs exp2 function.
1155
1156  :ref:`llvm.frexp <int_frexp>`                    Implemented for half, float and double.
1157
1158  :ref:`llvm.log2 <int_log2>`                      Implemented for float and half (and vectors of float or
1159                                                   half). Not implemented for double. Hardware provides
1160                                                   1ULP accuracy for float, and 0.51ULP for half. Float
1161                                                   instruction does not natively support denormal
1162                                                   inputs.
1163
1164  :ref:`llvm.sqrt <int_sqrt>`                      Implemented for double, float and half (and vectors).
1165
1166  :ref:`llvm.log <int_log>`                        Implemented for float and half (and vectors).
1167
1168  :ref:`llvm.exp <int_exp>`                        Implemented for float and half (and vectors).
1169
1170  :ref:`llvm.log10 <int_log10>`                    Implemented for float and half (and vectors).
1171
1172  :ref:`llvm.exp2 <int_exp2>`                      Implemented for float and half (and vectors of float or
1173                                                   half). Not implemented for double. Hardware provides
1174                                                   1ULP accuracy for float, and 0.51ULP for half. Float
1175                                                   instruction does not natively support denormal
1176                                                   inputs.
1177
1178  :ref:`llvm.stacksave.p5 <int_stacksave>`         Implemented, must use the alloca address space.
1179  :ref:`llvm.stackrestore.p5 <int_stackrestore>`   Implemented, must use the alloca address space.
1180
1181  :ref:`llvm.get.fpmode.i32 <int_get_fpmode>`      The natural floating-point mode type is i32. This
1182                                                   is implemented by extracting relevant bits out of the MODE
1183                                                   register with s_getreg_b32. The first 10 bits are the
1184                                                   core floating-point mode. Bits 12:18 are the exception
1185                                                   mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
1186                                                   relevant to floating-point instructions are 0s.
1187
1188  :ref:`llvm.get.rounding<int_get_rounding>`       AMDGPU supports two separately controllable rounding
1189                                                   modes depending on the floating-point type. One
1190                                                   controls float, and the other controls both double and
1191                                                   half operations. If both modes are the same, returns
1192                                                   one of the standard return values. If the modes are
1193                                                   different, returns one of :ref:`12 extended values
1194                                                   <amdgpu-rounding-mode-enumeration-values-table>`
1195                                                   describing the two modes.
1196
1197                                                   To nearest, ties away from zero is not a supported
1198                                                   mode. The raw rounding mode values in the MODE
1199                                                   register do not exactly match the FLT_ROUNDS values,
1200                                                   so a conversion is performed.
1201
1202  :ref:`llvm.set.rounding<int_set_rounding>`       Input value expected to be one of the valid results
1203                                                   from '``llvm.get.rounding``'. Rounding mode is
1204                                                   undefined if not passed a valid input. This should be
1205                                                   a wave uniform value. In case of a divergent input
1206                                                   value, the first active lane's value will be used.
1207
1208  :ref:`llvm.get.fpenv<int_get_fpenv>`             Returns the current value of the AMDGPU floating point environment.
1209                                                   This stores information related to the current rounding mode,
1210                                                   denormalization mode, enabled traps, and floating point exceptions.
1211                                                   The format is a 64-bit concatenation of the MODE and TRAPSTS registers.
1212
1213  :ref:`llvm.set.fpenv<int_set_fpenv>`             Sets the floating point environment to the specified state.
1214
1215  llvm.amdgcn.readfirstlane                        Provides direct access to v_readfirstlane_b32. Returns the value in
1216                                                   the lowest active lane of the input operand. Currently implemented
1217                                                   for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>,
1218                                                   i64, double, pointers, multiples of the 32-bit vectors.
1219
1220  llvm.amdgcn.readlane                             Provides direct access to v_readlane_b32. Returns the value in the
1221                                                   specified lane of the first input operand. The second operand specifies
1222                                                   the lane to read from. Currently implemented for i16, i32, float, half,
1223                                                   bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers,
1224                                                   multiples of the 32-bit vectors.
1225
1226  llvm.amdgcn.writelane                            Provides direct access to v_writelane_b32. Writes value in the first input
1227                                                   operand to the specified lane of divergent output. The second operand
1228                                                   specifies the lane to write. Currently implemented for i16, i32, float,
1229                                                   half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers,
1230                                                   multiples of the 32-bit vectors.
1231
1232  llvm.amdgcn.wave.reduce.umin                     Performs an arithmetic unsigned min reduction on the unsigned values
1233                                                   provided by each lane in the wavefront.
1234                                                   Intrinsic takes a hint for reduction strategy using second operand
1235                                                   0: Target default preference,
1236                                                   1: `Iterative strategy`, and
1237                                                   2: `DPP`.
1238                                                   If target does not support the DPP operations (e.g. gfx6/7),
1239                                                   reduction will be performed using default iterative strategy.
1240                                                   Intrinsic is currently only implemented for i32.
1241
1242  llvm.amdgcn.wave.reduce.umax                     Performs an arithmetic unsigned max reduction on the unsigned values
1243                                                   provided by each lane in the wavefront.
1244                                                   Intrinsic takes a hint for reduction strategy using second operand
1245                                                   0: Target default preference,
1246                                                   1: `Iterative strategy`, and
1247                                                   2: `DPP`.
1248                                                   If target does not support the DPP operations (e.g. gfx6/7),
1249                                                   reduction will be performed using default iterative strategy.
1250                                                   Intrinsic is currently only implemented for i32.
1251
1252  llvm.amdgcn.permlane16                           Provides direct access to v_permlane16_b32. Performs arbitrary gather-style
1253                                                   operation within a row (16 contiguous lanes) of the second input operand.
1254                                                   The third and fourth inputs must be scalar values. These are combined into
1255                                                   a single 64-bit value representing lane selects used to swizzle within each
1256                                                   row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>,
1257                                                   <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
1258
1259  llvm.amdgcn.permlanex16                          Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style
1260                                                   operation across two rows of the second input operand (each row is 16 contiguous
1261                                                   lanes). The third and fourth inputs must be scalar values. These are combined
1262                                                   into a single 64-bit value representing lane selects used to swizzle within each
1263                                                   row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>,
1264                                                   <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
1265
1266  llvm.amdgcn.permlane64                           Provides direct access to v_permlane64_b32. Performs a specific permutation across
1267                                                   lanes of the input operand where the high half and low half of a wave64 are swapped.
1268                                                   Performs no operation in wave32 mode. Currently implemented for i16, i32, float, half,
1269                                                   bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the
1270                                                   32-bit vectors.
1271
1272  llvm.amdgcn.udot2                                Provides direct access to v_dot2_u32_u16 across targets which
1273                                                   support such instructions. This performs unsigned dot product
1274                                                   with two v2i16 operands, summed with the third i32 operand. The
1275                                                   i1 fourth operand is used to clamp the output.
1276
1277  llvm.amdgcn.udot4                                Provides direct access to v_dot4_u32_u8 across targets which
1278                                                   support such instructions. This performs unsigned dot product
1279                                                   with two i32 operands (holding a vector of 4 8bit values), summed
1280                                                   with the third i32 operand. The i1 fourth operand is used to clamp
1281                                                   the output.
1282
1283  llvm.amdgcn.udot8                                Provides direct access to v_dot8_u32_u4 across targets which
1284                                                   support such instructions. This performs unsigned dot product
1285                                                   with two i32 operands (holding a vector of 8 4bit values), summed
1286                                                   with the third i32 operand. The i1 fourth operand is used to clamp
1287                                                   the output.
1288
1289  llvm.amdgcn.sdot2                                Provides direct access to v_dot2_i32_i16 across targets which
1290                                                   support such instructions. This performs signed dot product
1291                                                   with two v2i16 operands, summed with the third i32 operand. The
1292                                                   i1 fourth operand is used to clamp the output.
1293                                                   When applicable (e.g. no clamping), this is lowered into
1294                                                   v_dot2c_i32_i16 for targets which support it.
1295
1296  llvm.amdgcn.sdot4                                Provides direct access to v_dot4_i32_i8 across targets which
1297                                                   support such instructions. This performs signed dot product
1298                                                   with two i32 operands (holding a vector of four 8-bit values), summed
1299                                                   with the third i32 operand. The i1 fourth operand is used to clamp
1300                                                   the output.
1301                                                   When applicable (i.e. no clamping / operand modifiers), this is lowered
1302                                                   into v_dot4c_i32_i8 for targets which support it.
1303                                                   RDNA3 does not offer v_dot4_i32_i8, but instead offers
1304                                                   v_dot4_i32_iu8 which has operands to hold the signedness of the
1305                                                   vector operands. Thus, this intrinsic lowers to the signed version
1306                                                   of this instruction for gfx11 targets.
1307
1308  llvm.amdgcn.sdot8                                Provides direct access to v_dot8_i32_i4 across targets which
1309                                                   support such instructions. This performs signed dot product
1310                                                   with two i32 operands (holding a vector of eight 4-bit values), summed
1311                                                   with the third i32 operand. The i1 fourth operand is used to clamp
1312                                                   the output.
1313                                                   When applicable (i.e. no clamping / operand modifiers), this is lowered
1314                                                   into v_dot8c_i32_i4 for targets which support it.
1315                                                   RDNA3 does not offer v_dot8_i32_i4, but instead offers
1316                                                   v_dot8_i32_iu4 which has operands to hold the signedness of the
1317                                                   vector operands. Thus, this intrinsic lowers to the signed version
1318                                                   of this instruction for gfx11 targets.
1319
1320  llvm.amdgcn.sudot4                               Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs
1321                                                   dot product with two i32 operands (holding a vector of four 8-bit values), summed
1322                                                   with the fifth i32 operand. The i1 sixth operand is used to clamp
1323                                                   the output. The i1s preceding the vector operands decide the signedness.
1324
1325  llvm.amdgcn.sudot8                               Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs
1326                                                   dot product with two i32 operands (holding a vector of eight 4-bit values), summed
1327                                                   with the fifth i32 operand. The i1 sixth operand is used to clamp
1328                                                   the output. The i1s preceding the vector operands decide the signedness.
1329
1330  llvm.amdgcn.sched.barrier                        Controls the types of instructions that may be allowed to cross the intrinsic
1331                                                   during instruction scheduling. The parameter is a mask for the instruction types
1332                                                   that can cross the intrinsic.
1333
1334                                                   - 0x0000: No instructions may be scheduled across sched_barrier.
1335                                                   - 0x0001: All non-memory, non-side-effect-producing instructions may be
1336                                                     scheduled across sched_barrier, *i.e.* allow ALU instructions to pass.
1337                                                   - 0x0002: VALU instructions may be scheduled across sched_barrier.
1338                                                   - 0x0004: SALU instructions may be scheduled across sched_barrier.
1339                                                   - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
1340                                                   - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
1341                                                   - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
1342                                                   - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
1343                                                   - 0x0080: All DS instructions may be scheduled across sched_barrier.
1344                                                   - 0x0100: All DS read instructions may be scheduled across sched_barrier.
1345                                                   - 0x0200: All DS write instructions may be scheduled across sched_barrier.
1346                                                   - 0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.
1347
1348  llvm.amdgcn.sched.group.barrier                  Creates schedule groups with specific properties to create custom scheduling
1349                                                   pipelines. The ordering between groups is enforced by the instruction scheduler.
1350                                                   The intrinsic applies to the code that precedes the intrinsic. The intrinsic
1351                                                   takes three values that control the behavior of the schedule groups.
1352
1353                                                   - Mask : Classify instruction groups using the llvm.amdgcn.sched.barrier mask values.
1354                                                   - Size : The number of instructions that are in the group.
1355                                                   - SyncID : Order is enforced between groups with matching values.
1356
1357                                                   The mask can include multiple instruction types. It is undefined behavior to set
1358                                                   values beyond the range of valid masks.
1359
1360                                                   Combining multiple sched_group_barrier intrinsics enables an ordering of specific
1361                                                   instruction types during instruction scheduling. For example, the following enforces
1362                                                   a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
1363                                                   instructions.
1364
1365                                                   |  ``// 1 VMEM read``
1366                                                   |  ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
1367                                                   |  ``// 1 VALU``
1368                                                   |  ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
1369                                                   |  ``// 5 MFMA``
1370                                                   |  ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
1371
1372  llvm.amdgcn.iglp.opt                             An **experimental** intrinsic for instruction group level parallelism. The intrinsic
1373                                                   implements predefined instruction scheduling orderings. The intrinsic applies to the
1374                                                   surrounding scheduling region. The intrinsic takes a value that specifies the
1375                                                   strategy. The compiler implements the following strategies.
1376
1377                                                   0. Interleave DS and MFMA instructions for small GEMM kernels.
1378                                                   1. Interleave DS and MFMA instructions for single wave small GEMM kernels.
1379                                                   2. Interleave TRANS and MFMA instructions, as well as their VALU and DS predecessors, for attention kernels.
1380                                                   3. Interleave TRANS and MFMA instructions, with no predecessor interleaving, for attention kernels.
1381
1382                                                   Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
1383                                                   cannot be combined with sched_barrier or sched_group_barrier.
1384
1385                                                   The iglp_opt strategy implementations are subject to change.
1386
1387  llvm.amdgcn.atomic.cond.sub.u32                  Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32
1388                                                   and ds_cond_sub_u32 based on address space on gfx12 targets. This
1389                                                   performs subtraction only if the memory value is greater than or
1390                                                   equal to the data value.
1391
1392  llvm.amdgcn.s.getpc                              Provides access to the s_getpc_b64 instruction, but with the return value
1393                                                   sign-extended from the width of the underlying PC hardware register even on
1394                                                   processors where the s_getpc_b64 instruction returns a zero-extended value.
1395
1396  llvm.amdgcn.ballot                               Returns a bitfield (i32 or i64) containing the result of its i1 argument
1397                                                   in all active lanes, and zero in all inactive lanes.
1398                                                   Provides a way to convert an i1 in LLVM IR to an i32 or i64 lane mask, the bitfield
1399                                                   used by hardware to control active lanes when used in the EXEC register.
1400                                                   For example, ballot(i1 true) returns the EXEC mask.
1401
1402  llvm.amdgcn.mfma.scale.f32.16x16x128.f8f6f4      Emit `v_mfma_scale_f32_16x16x128_f8f6f4` to set the scale factor. The
1403                                                   last 4 operands correspond to the scale inputs.
1404
1405                                                   - 2-bit byte index to use for each lane for matrix A
1406                                                   - Matrix A scale values
1407                                                   - 2-bit byte index to use for each lane for matrix B
1408                                                   - Matrix B scale values
1409
1410  llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4       Emit `v_mfma_scale_f32_32x32x64_f8f6f4`
1411
1412  llvm.amdgcn.permlane16.swap                      Provide direct access to `v_permlane16_swap_b32` instruction on supported targets.
1413                                                   Swaps the values across lanes of first 2 operands. Odd rows of the first operand are
1414                                                   swapped with even rows of the second operand (one row is 16 lanes).
1415                                                   Returns a pair for the swapped registers. The first element of the return corresponds
1416                                                   to the swapped element of the first argument.
1417
1418
1419  llvm.amdgcn.permlane32.swap                      Provide direct access to `v_permlane32_swap_b32` instruction on supported targets.
1420                                                   Swaps the values across lanes of first 2 operands. Rows 2 and 3 of the first operand are
1421                                                   swapped with rows 0 and 1 of the second operand (one row is 16 lanes).
1422                                                   Returns a pair for the swapped registers. The first element of the return
1423                                                   corresponds to the swapped element of the first argument.
1424
1425  llvm.amdgcn.mov.dpp                              The llvm.amdgcn.mov.dpp.`<type>` intrinsic represents the mov.dpp operation in AMDGPU.
1426                                                   This operation is being deprecated and can be replaced with llvm.amdgcn.update.dpp.
1427
1428  llvm.amdgcn.update.dpp                           The llvm.amdgcn.update.dpp.`<type>` intrinsic represents the update.dpp operation in AMDGPU.
1429                                                   It takes an old value, a source operand, a DPP control operand, a row mask, a bank mask, and a bound control.
1430                                                   Various data types are supported, including bf16, f16, f32, f64, i16, i32, i64, p0, p3, p5, v2f16, v2f32, v2i16, v2i32, v2p0, v3i32, v4i32, v8f16.
1431                                                   This operation is equivalent to a sequence of v_mov_b32 operations.
1432                                                   It is preferred over llvm.amdgcn.mov.dpp.`<type>` for future use.
1433                                                   `llvm.amdgcn.update.dpp.<type> <old> <src> <dpp_ctrl> <row_mask> <bank_mask> <bound_ctrl>`
1434                                                   Should be equivalent to:
1435                                                   - `v_mov_b32 <dest> <old>`
1436                                                   - `v_mov_b32 <dest> <src> <dpp_ctrl> <row_mask> <bank_mask> <bound_ctrl>`
1437
1438  ==============================================   ==========================================================
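
The packed dot-product intrinsics in the table above can be cross-checked
against a host-side reference model. The following C++ sketch models
``llvm.amdgcn.udot4`` and ``llvm.amdgcn.sdot4`` as described in the table. The
function names ``ref_udot4`` and ``ref_sdot4`` are hypothetical, and reading
"clamp" as saturation of the final sum to the 32-bit result range is an
assumption; consult the ISA documentation for the exact hardware behavior.

.. code-block:: c++

  #include <algorithm>
  #include <cstdint>
  #include <limits>

  // Hypothetical reference model of llvm.amdgcn.udot4: a and b each hold four
  // packed unsigned 8-bit lanes; the lane products are summed with c.
  inline uint32_t ref_udot4(uint32_t a, uint32_t b, uint32_t c, bool clamp) {
    uint64_t sum = c;
    for (int lane = 0; lane < 4; ++lane) {
      uint64_t x = (a >> (8 * lane)) & 0xFFu;
      uint64_t y = (b >> (8 * lane)) & 0xFFu;
      sum += x * y;
    }
    if (clamp) // Assumption: saturate to the unsigned 32-bit range.
      sum = std::min<uint64_t>(sum, std::numeric_limits<uint32_t>::max());
    return static_cast<uint32_t>(sum); // Wraps modulo 2^32 when not clamped.
  }

  // Hypothetical reference model of llvm.amdgcn.sdot4: same layout, but each
  // 8-bit lane is interpreted as a signed value.
  inline int32_t ref_sdot4(uint32_t a, uint32_t b, int32_t c, bool clamp) {
    int64_t sum = c;
    for (int lane = 0; lane < 4; ++lane) {
      int64_t x = static_cast<int8_t>((a >> (8 * lane)) & 0xFFu);
      int64_t y = static_cast<int8_t>((b >> (8 * lane)) & 0xFFu);
      sum += x * y;
    }
    if (clamp) // Assumption: saturate to the signed 32-bit range.
      sum = std::max<int64_t>(INT32_MIN, std::min<int64_t>(sum, INT32_MAX));
    return static_cast<int32_t>(static_cast<uint32_t>(sum));
  }

Under these assumptions, ``ref_udot4(0x01020304, 0x01010101, 10, false)``
evaluates to 20: the lane products 4 + 3 + 2 + 1 plus the accumulator.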
1439
1440.. TODO::
1441
1442   List AMDGPU intrinsics.
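
The ``llvm.amdgcn.sched.barrier`` mask bits listed in the intrinsics table can
be combined with bitwise OR. The following is a minimal C++ sketch. The
enumerator names and the ``isValidSchedMask`` helper are hypothetical; only
the numeric values come from the table.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical names; the values are the documented sched.barrier mask bits.
  enum SchedMask : uint32_t {
    SCHED_NONE       = 0x0000, // No instructions may cross.
    SCHED_ALU        = 0x0001, // Non-memory, non-side-effect instructions.
    SCHED_VALU       = 0x0002,
    SCHED_SALU       = 0x0004,
    SCHED_MFMA       = 0x0008, // MFMA/WMMA.
    SCHED_VMEM       = 0x0010, // All VMEM.
    SCHED_VMEM_READ  = 0x0020,
    SCHED_VMEM_WRITE = 0x0040,
    SCHED_DS         = 0x0080, // All DS.
    SCHED_DS_READ    = 0x0100,
    SCHED_DS_WRITE   = 0x0200,
    SCHED_TRANS      = 0x0400  // Transcendental instructions.
  };

  // Union of every bit documented in the table.
  constexpr uint32_t SCHED_ALL_BITS = 0x07FF;

  // Returns true if Mask only uses documented bits. The table notes that
  // values beyond the valid range are undefined behavior for
  // sched.group.barrier.
  constexpr bool isValidSchedMask(uint32_t Mask) {
    return (Mask & ~SCHED_ALL_BITS) == 0;
  }

With a compiler targeting amdgcn, such a constant could be passed to the
corresponding Clang builtin, for example
``__builtin_amdgcn_sched_barrier(SCHED_VALU | SCHED_SALU)``. The values 32, 2,
and 8 used in the sched_group_barrier example in the table correspond to
``SCHED_VMEM_READ``, ``SCHED_VALU``, and ``SCHED_MFMA``.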
1443
1444.. _amdgpu_metadata:
1445
1446LLVM IR Metadata
1447================
1448
1449The AMDGPU backend implements the following target custom LLVM IR
1450metadata.
1451
1452.. _amdgpu_last_use:
1453
1454'``amdgpu.last.use``' Metadata
1455------------------------------
1456
1457Sets the TH_LOAD_LU temporal hint on load instructions that support it.
1458Takes priority over the nontemporal hint (TH_LOAD_NT). This takes no
1459arguments.
1460
1461.. code-block:: llvm
1462
1463  %val = load i32, ptr %in, align 4, !amdgpu.last.use !{}
1464
1465'``amdgpu.no.remote.memory``' Metadata
1466---------------------------------------------
1467
1468Asserts a memory operation does not access bytes in host memory, or
1469remote connected peer device memory (the address must be device
1470local). This is intended for use with :ref:`atomicrmw <i_atomicrmw>`
1471and other atomic instructions. This is required to emit a native
1472hardware instruction for some :ref:`system scope
1473<amdgpu-memory-scopes>` atomic operations on some subtargets. For most
1474integer atomic operations, this is a sufficient restriction to emit a
1475native atomic instruction.
1476
1477An :ref:`atomicrmw <i_atomicrmw>` without metadata will be treated
1478conservatively as required to preserve the operation behavior in all
1479cases. This will typically be used in conjunction with
1480:ref:`\!amdgpu.no.fine.grained.memory<amdgpu_no_fine_grained_memory>`.
1481
1482
1483.. code-block:: llvm
1484
1485  ; Indicates the atomic does not access fine-grained memory, or
1486  ; remote device memory.
1487  %old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0
1488
1489  ; Indicates the atomic does not access peer device memory.
1490  %old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.remote.memory !0
1491
1492  !0 = !{}
1493
1494.. _amdgpu_no_fine_grained_memory:
1495
1496'``amdgpu.no.fine.grained.memory``' Metadata
1497-------------------------------------------------
1498
1499Asserts a memory access does not access bytes in fine-grained
1500allocated memory. This is intended for use with
1501:ref:`atomicrmw <i_atomicrmw>` and other atomic instructions. This is
1502required to emit a native hardware instruction for some :ref:`system
1503scope <amdgpu-memory-scopes>` atomic operations on some subtargets. An
1504:ref:`atomicrmw <i_atomicrmw>` without metadata will be treated
1505conservatively as required to preserve the operation behavior in all
1506cases. This will typically be used in conjunction with
1507:ref:`\!amdgpu.no.remote.memory<amdgpu_no_remote_memory_access>`.
1508
1509.. code-block:: llvm
1510
1511  ; Indicates the access does not access fine-grained memory, or
1512  ; remote device memory.
1513  %old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0
1514
1515  ; Indicates the access does not access fine-grained memory
1516  %old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.fine.grained.memory !0
1517
1518  !0 = !{}
1519
1520.. _amdgpu_no_remote_memory_access:
1521
1522'``amdgpu.ignore.denormal.mode``' Metadata
1523------------------------------------------
1524
1525For use with :ref:`atomicrmw <i_atomicrmw>` floating-point
1526operations. Indicates the handling of denormal inputs and results is
1527insignificant and may be inconsistent with the expected floating-point
1528mode. This is necessary to emit a native atomic instruction on some
1529targets for some address spaces where float denormals are
1530unconditionally flushed. This is typically used in conjunction with
1531:ref:`\!amdgpu.no.remote.memory<amdgpu_no_remote_memory_access>`
1532and
1533:ref:`\!amdgpu.no.fine.grained.memory<amdgpu_no_fine_grained_memory>`.
1534
1535
1536.. code-block:: llvm
1537
1538  %res0 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0
1539  %res1 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0
1540
1541  !0 = !{}
1542
1543
1544LLVM IR Attributes
1545==================
1546
1547The AMDGPU backend supports the following LLVM IR attributes.
1548
1549  .. table:: AMDGPU LLVM IR Attributes
1550     :name: amdgpu-llvm-ir-attributes-table
1551
1552     ======================================= ==========================================================
1553     LLVM Attribute                          Description
1554     ======================================= ==========================================================
1555     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
1556                                             will be specified when the kernel is dispatched. Generated
1557                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
1558                                             The IR implied default value is 1,1024. Clang may emit this attribute
1559                                             with more restrictive bounds depending on language defaults.
1560                                             If the actual block or workgroup size exceeds the limit at any point during
1561                                             the execution, the behavior is undefined. For example, even if there is
1562                                             only one active thread but the thread local id exceeds the limit, the
1563                                             behavior is undefined.
1564
1565     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
1566                                             argument block size for the implicit arguments. This
1567                                             varies by OS and language (for OpenCL see
1568                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
1569     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
1570                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
1571     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
1572                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
1573     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
1574                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
1575                                             CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
1576                                             and the backend may not be able to satisfy the request. If
1577                                             the specified range is incompatible with the function's
1578                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
1579                                             implied by the workgroup size take precedence.
1580
1581     "amdgpu-ieee" true/false.               GFX6-GFX11 Only
1582                                             Specify whether the function expects the IEEE field of the
1583                                             mode register to be set on entry. Overrides the default for
1584                                             the calling convention.
1585     "amdgpu-dx10-clamp" true/false.         GFX6-GFX11 Only
1586                                             Specify whether the function expects the DX10_CLAMP field of
1587                                             the mode register to be set on entry. Overrides the default
1588                                             for the calling convention.
1589
1590     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
1591                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
1592                                             attribute, or reached through a call site marked with this attribute, and
1593                                             that intrinsic is called, the behavior of the program is undefined. (Whole-program
1594                                             undefined behavior is used here because, for example, the absence of a required workitem
1595                                             ID in the preloaded register set can mean that all other preloaded registers
1596                                             are earlier than the compilation assumed they would be.) The backend can
1597                                             generally infer this during code generation, so typically there is no
1598                                             benefit to frontends marking functions with this.
1599
1600     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
1601                                             llvm.amdgcn.workitem.id.y intrinsic.
1602
1603     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
1604                                             llvm.amdgcn.workitem.id.z intrinsic.
1605
1606     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
1607                                             llvm.amdgcn.workgroup.id.x intrinsic.
1608
1609     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
1610                                             llvm.amdgcn.workgroup.id.y intrinsic.
1611
1612     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
1613                                             llvm.amdgcn.workgroup.id.z intrinsic.
1614
1615     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
1616                                             llvm.amdgcn.dispatch.ptr intrinsic.
1617
1618     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
1619                                             llvm.amdgcn.implicitarg.ptr intrinsic.
1620
1621     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
1622                                             llvm.amdgcn.dispatch.id intrinsic.
1623
1624     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
1625                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1626                                             attributes, the queue pointer may be required in situations where the
1627                                             intrinsic call does not directly appear in the program. Some subtargets
1628                                             require the queue pointer to handle some addrspacecasts, as well
1629                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
1630                                             llvm.debug intrinsics.
1631
1632     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1633                                             kernel argument that holds the pointer to the hostcall buffer. If this
1634                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1635
1636     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1637                                             kernel argument that holds the pointer to an initialized memory buffer
1638                                             that conforms to the requirements of the malloc/free device library V1
1639                                             version implementation. If this attribute is absent, then the
1640                                             amdgpu-no-implicitarg-ptr is also removed.
1641
1642     "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1643                                             kernel argument that holds the multigrid synchronization pointer. If this
1644                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1645
1646     "amdgpu-no-default-queue"               Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1647                                             kernel argument that holds the default queue pointer. If this
1648                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1649
1650     "amdgpu-no-completion-action"           Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1651                                             kernel argument that holds the completion action pointer. If this
1652                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1653
1654     "amdgpu-lds-size"="min[,max]"           Min is the minimum number of bytes that will be allocated in the Local
1655                                             Data Store at address zero. Variables are allocated within this frame
1656                                             using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
1657                                             pass. Optional max is the maximum number of bytes that will be allocated.
1658                                             Note that min==max indicates that no further variables can be added to
1659                                             the frame. This is an internal detail of how LDS variables are lowered;
1660                                             language front ends should not set this attribute.
1661
1662     "amdgpu-gds-size"                       Bytes expected to be allocated at the start of GDS memory at entry.
1663
1664     "amdgpu-git-ptr-high"                   The hard-wired high half of the address of the global information table
1665                                             for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since
1666                                             current hardware only allows a 16-bit value.
1667
1668     "amdgpu-32bit-address-high-bits"        Assumed high 32 bits for 32-bit address spaces which are really truncated
1669                                             64-bit addresses (i.e., addrspace(6))
1670
1671     "amdgpu-color-export"                   Indicates shader exports color information if set to 1.
1672                                             Defaults to 1 for :ref:`amdgpu_ps <amdgpu-cc>`, and 0 for other calling
1673                                             conventions. Determines the necessity and type of null exports when a shader
1674                                             terminates early by killing lanes.
1675
1676     "amdgpu-depth-export"                   Indicates shader exports depth information if set to 1. Determines the
1677                                             necessity and type of null exports when a shader terminates early by killing
1678                                             lanes. A depth-only shader will export to depth channel when no null export
1679                                             target is available (GFX11+).
1680
1681     "InitialPSInputAddr"                    Set the initial value of the `spi_ps_input_addr` register for
1682                                             :ref:`amdgpu_ps <amdgpu-cc>` shaders. Any bits enabled by this value will
1683                                             be enabled in the final register value.
1684
1685     "amdgpu-wave-priority-threshold"        VALU instruction count threshold for adjusting wave priority. If exceeded,
1686                                             temporarily raise the wave priority at the start of the shader function
1687                                             until its last VMEM instructions to allow younger waves to issue their VMEM
1688                                             instructions as well.
1689
1690     "amdgpu-memory-bound"                   Set internally by backend
1691
1692     "amdgpu-wave-limiter"                   Set internally by backend
1693
1694     "amdgpu-unroll-threshold"               Set base cost threshold preference for loop unrolling within this function,
1695                                             default is 300. Actual threshold may be varied by per-loop metadata or
1696                                             reduced by heuristics.
1697
1698     "amdgpu-max-num-workgroups"="x,y,z"     Specify the maximum number of work groups for the kernel dispatch in the
1699                                             X, Y, and Z dimensions. Each number must be >= 1. Generated by the
1700                                             ``amdgpu_max_num_work_groups`` CLANG attribute [CLANG-ATTR]_. Clang only
1701                                             emits this attribute when all three numbers are >= 1.
1702
1703     "amdgpu-no-agpr"                        Indicates the function will not require allocating AGPRs. This is only
1704                                             relevant on subtargets with AGPRs. The behavior is undefined if a
1705                                             function which requires AGPRs is reached through any function marked
1706                                             with this attribute.
1707
1708     "amdgpu-hidden-argument"                This attribute is used internally by the backend to mark function arguments
1709                                             as hidden. Hidden arguments are managed by the compiler and are not part of
1710                                             the explicit arguments supplied by the user.
1711
1712     ======================================= ==========================================================
1713
1714Calling Conventions
1715===================
1716
1717The AMDGPU backend supports the following calling conventions:
1718
1719  .. table:: AMDGPU Calling Conventions
1720     :name: amdgpu-cc
1721
1722     =============================== ==========================================================
1723     Calling Convention              Description
1724     =============================== ==========================================================
1725     ``ccc``                         The C calling convention. Used by default.
1726                                     See :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`
1727                                     for more details.
1728
1729     ``fastcc``                      The fast calling convention. Mostly the same as ``ccc``.
1730
1731     ``coldcc``                      The cold calling convention. Mostly the same as ``ccc``.
1732
1733     ``amdgpu_cs``                   Used for Mesa/AMDPAL compute shaders.
1734                                     ..TODO::
1735                                     Describe.
1736
1737     ``amdgpu_cs_chain``             Similar to ``amdgpu_cs``, with differences described below.
1738
1739                                     Functions with this calling convention cannot be called directly. They must
1740                                     instead be launched via the ``llvm.amdgcn.cs.chain`` intrinsic.
1741
1742                                     Arguments are passed in SGPRs, starting at s0, if they have the ``inreg``
1743                                     attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs
1744                                     than available in the subtarget is not allowed.  On subtargets that use
1745                                     a scratch buffer descriptor (as opposed to ``scratch_{load,store}_*`` instructions),
1746                                     the scratch buffer descriptor is passed in s[48:51]. This limits the
1747                                     SGPR / ``inreg`` arguments to the equivalent of 48 dwords; using more
1748                                     than that is not allowed.
1749
1750                                     The return type must be void.
1751                                     Varargs, sret, byval, byref, inalloca, and preallocated are not supported.
1752
1753                                     Values in scalar registers as well as v0-v7 are not preserved. Values in
1754                                     VGPRs starting at v8 are not preserved for the active lanes, but must be
1755                                     saved by the callee for inactive lanes when using WWM (a notable exception is
1756                                     when the ``llvm.amdgcn.init.whole.wave`` intrinsic is used in the function - in this
1757                                     case the backend assumes that there are no inactive lanes upon entry; any inactive
1758                                     lanes that need to be preserved must be explicitly present in the IR).
1759
1760                                     Wave scratch is "empty" at function boundaries. There is no stack pointer input
1761                                     or output value, but functions are free to use scratch starting from an initial
1762                                     stack pointer. Calls to ``amdgpu_gfx`` functions are allowed and behave like they
1763                                     do in ``amdgpu_cs`` functions.
1764
1765                                     All counters (``lgkmcnt``, ``vmcnt``, ``storecnt``, etc.) are presumed in an
1766                                     unknown state at function entry.
1767
1768                                     A function may have multiple exits (e.g. one chain exit and one plain ``ret void``
1769                                     for when the wave ends), but all ``llvm.amdgcn.cs.chain`` exits must be in
1770                                     uniform control flow.
1771
1772     ``amdgpu_cs_chain_preserve``    Same as ``amdgpu_cs_chain``, but active lanes for VGPRs starting at v8 are preserved.
1773                                     Calls to ``amdgpu_gfx`` functions are not allowed, and any calls to ``llvm.amdgcn.cs.chain``
1774                                     must not pass more VGPR arguments than the caller's VGPR function parameters.
1775
1776     ``amdgpu_es``                   Used for AMDPAL shader stage before geometry shader if geometry is in
1777                                     use. So either the domain (= tessellation evaluation) shader if
1778                                     tessellation is in use, or otherwise the vertex shader.
1779                                     ..TODO::
1780                                     Describe.
1781
1782     ``amdgpu_gfx``                  Used for AMD graphics targets. Functions with this calling convention
1783                                     cannot be used as entry points.
1784                                     ..TODO::
1785                                     Describe.
1786
1787     ``amdgpu_gs``                   Used for Mesa/AMDPAL geometry shaders.
1788                                     ..TODO::
1789                                     Describe.
1790
1791     ``amdgpu_hs``                   Used for Mesa/AMDPAL hull shaders (= tessellation control shaders).
1792                                     ..TODO::
1793                                     Describe.
1794
1795     ``amdgpu_kernel``               See :ref:`amdgpu-amdhsa-function-call-convention-kernel-functions`
1796
1797     ``amdgpu_ls``                   Used for AMDPAL vertex shader if tessellation is in use.
1798                                     ..TODO::
1799                                     Describe.
1800
1801     ``amdgpu_ps``                   Used for Mesa/AMDPAL pixel shaders.
1802                                     ..TODO::
1803                                     Describe.
1804
1805     ``amdgpu_vs``                   Used for Mesa/AMDPAL last shader stage before rasterization (vertex
1806                                     shader if tessellation and geometry are not in use, or otherwise
1807                                     copy shader if one is needed).
1808                                     ..TODO::
1809                                     Describe.
1810
1811     =============================== ==========================================================
1812
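The ``amdgpu_cs_chain`` argument rules in the table above can be modeled with a
short Python sketch. It is illustrative only and is not the backend's
implementation: the ``Arg`` class, the ``assign_chain_args`` helper, and the
per-argument ``dwords`` size are hypothetical names introduced here. The sketch
assumes arguments are packed consecutively, and it ignores the subtarget's
total register budget and any splitting or alignment the backend may apply.

.. code-block:: python

   from dataclasses import dataclass

   FIRST_CHAIN_VGPR = 8          # non-inreg arguments start at v8
   SCRATCH_DESC_FIRST_SGPR = 48  # s[48:51] holds the scratch buffer descriptor


   @dataclass
   class Arg:
       name: str
       dwords: int          # assumed size of the argument in 32-bit registers
       inreg: bool = False  # True if the IR argument carries ``inreg``


   def assign_chain_args(args, uses_scratch_descriptor=True):
       """Map each argument name to (register file, first index, last index)."""
       next_sgpr, next_vgpr = 0, FIRST_CHAIN_VGPR
       assignment = {}
       for arg in args:
           if arg.inreg:
               first, next_sgpr = next_sgpr, next_sgpr + arg.dwords
               # With a scratch descriptor in s[48:51], only s0-s47 are usable.
               if uses_scratch_descriptor and next_sgpr > SCRATCH_DESC_FIRST_SGPR:
                   raise ValueError("inreg arguments exceed 48 dwords")
               assignment[arg.name] = ("s", first, next_sgpr - 1)
           else:
               first, next_vgpr = next_vgpr, next_vgpr + arg.dwords
               assignment[arg.name] = ("v", first, next_vgpr - 1)
       return assignment

Under these assumptions, a two-dword ``inreg`` argument followed by a one-dword
plain argument would land in ``s[0:1]`` and ``v8`` respectively.
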
1813AMDGPU MCExpr
1814-------------
1815
1816As part of the AMDGPU MC layer, AMDGPU provides the following target specific
1817``MCExpr``\s.
1818
1819  .. table:: AMDGPU MCExpr types:
1820     :name: amdgpu-mcexpr-table
1821
1822     =================== ================= ========================================================
1823     MCExpr              Operands          Return value
1824     =================== ================= ========================================================
1825     ``max(arg, ...)``   1 or more         Variadic signed operation that returns the maximum
1826                                           value of all its arguments.
1827
1828     ``or(arg, ...)``    1 or more         Variadic signed operation that returns the bitwise-or
1829                                           result of all its arguments.
1830
1831     =================== ================= ========================================================
1832
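As a rough model of the two operations in the table above, the following
Python sketch evaluates them over already-resolved integer operands. The
function names ``mcexpr_max`` and ``mcexpr_or`` are hypothetical and are not
part of the MC layer. The sketch assumes Python's unbounded signed integers
are an acceptable stand-in for the assembler's value type.

.. code-block:: python

   def mcexpr_max(*args):
       """Maximum of one or more signed integer operands."""
       if not args:
           raise ValueError("max requires at least one operand")
       return max(args)


   def mcexpr_or(*args):
       """Bitwise-or of one or more signed integer operands."""
       if not args:
           raise ValueError("or requires at least one operand")
       result = 0
       for value in args:
           result |= value
       return result
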
1833Function Resource Usage
1834-----------------------
1835
1836A function's resource usage depends on each of its callees' resource usage. The
1837expressions used to denote resource usage reflect this by propagating each
1838callee's equivalent expressions. Said expressions are emitted as symbols by the
1839compiler when compiling to either assembly or object format and should not be
1840overwritten or redefined.
1841
1842The following describes all emitted function resource usage symbols:
1843
1844  .. table:: Function Resource Usage:
1845     :name: function-usage-table
1846
1847     ===================================== ========= ========================================= ===============================================================================
1848     Symbol                                Type      Description                               Example
1849     ===================================== ========= ========================================= ===============================================================================
1850     <function_name>.num_vgpr              Integer   Number of VGPRs used by <function_name>,  .set foo.num_vgpr, max(32, bar.num_vgpr, baz.num_vgpr)
1851                                                     worst case of itself and its callees'
1852                                                     VGPR use
1853     <function_name>.num_agpr              Integer   Number of AGPRs used by <function_name>,  .set foo.num_agpr, max(35, bar.num_agpr)
1854                                                     worst case of itself and its callees'
1855                                                     AGPR use
1856     <function_name>.numbered_sgpr         Integer   Number of SGPRs used by <function_name>,  .set foo.numbered_sgpr, 21
1857                                                     worst case of itself and its callees'
1858                                                     SGPR use (without any of the implicitly
1859                                                     used SGPRs)
1860     <function_name>.private_seg_size      Integer   Total stack size required for             .set foo.private_seg_size, 16+max(bar.private_seg_size, baz.private_seg_size)
1861                                                     <function_name>, expression is the
1862                                                     locally used stack size + the worst case
1863                                                     callee
1864     <function_name>.uses_vcc              Bool      Whether <function_name>, or any of its    .set foo.uses_vcc, or(0, bar.uses_vcc)
1865                                                     callees, uses vcc
1866     <function_name>.uses_flat_scratch     Bool      Whether <function_name>, or any of its    .set foo.uses_flat_scratch, 1
1867                                                     callees, uses flat scratch or not
1868     <function_name>.has_dyn_sized_stack   Bool      Whether <function_name>, or any of its    .set foo.has_dyn_sized_stack, 1
1869                                                     callees, has a dynamically sized stack
1870     <function_name>.has_recursion         Bool      Whether <function_name>, or any of its    .set foo.has_recursion, 0
1871                                                     callees, contains recursion
1872     <function_name>.has_indirect_call     Bool      Whether <function_name>, or any of its    .set foo.has_indirect_call, max(0, bar.has_indirect_call)
1873                                                     callees, contains an indirect call
1874     ===================================== ========= ========================================= ===============================================================================
1875
1876Furthermore, three symbols are additionally emitted describing the compilation
1877unit's worst case (i.e., maxima) ``num_vgpr``, ``num_agpr``, and
1878``numbered_sgpr`` which may be referenced and used by the aforementioned
1879symbolic expressions. These three symbols are ``amdgcn.max_num_vgpr``,
1880``amdgcn.max_num_agpr``, and ``amdgcn.max_num_sgpr``.
1881
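To illustrate how the symbolic expressions combine a function's own usage with
that of its callees, the following Python sketch resolves three of the symbols
for a small call graph. It is a simplified model, not the compiler's
algorithm: the ``Func`` class, the ``resolve`` helper, and all numbers are
hypothetical. It assumes an acyclic call graph with only direct calls.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class Func:
       num_vgpr: int = 0          # VGPRs used by the function body itself
       private_seg_size: int = 0  # stack bytes used by the function body itself
       uses_vcc: bool = False
       callees: list = field(default_factory=list)


   def resolve(name, funcs):
       """Evaluate the resource usage expressions of ``name`` recursively."""
       func = funcs[name]
       sub = [resolve(callee, funcs) for callee in func.callees]
       return {
           # max(own, callee.num_vgpr, ...)
           "num_vgpr": max([func.num_vgpr] + [s["num_vgpr"] for s in sub]),
           # own + max(callee.private_seg_size, ...)
           "private_seg_size": func.private_seg_size
           + max([0] + [s["private_seg_size"] for s in sub]),
           # or(own, callee.uses_vcc, ...)
           "uses_vcc": int(func.uses_vcc or any(s["uses_vcc"] for s in sub)),
       }


   funcs = {
       "foo": Func(num_vgpr=32, private_seg_size=16, callees=["bar", "baz"]),
       "bar": Func(num_vgpr=40, private_seg_size=8, uses_vcc=True),
       "baz": Func(num_vgpr=24, private_seg_size=64),
   }
   foo_usage = resolve("foo", funcs)

In this made-up example ``foo`` itself needs 32 VGPRs and 16 bytes of stack,
but the resolved values are dominated by its callees: the VGPR count comes
from ``bar`` and the stack size adds the larger requirement of ``baz``.
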
1882.. _amdgpu-elf-code-object:
1883
1884ELF Code Object
1885===============
1886
1887The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1888can be linked by ``lld`` to produce a standard ELF shared code object which can
1889be loaded and executed on an AMDGPU target.
1890
1891.. _amdgpu-elf-header:
1892
1893Header
1894------
1895
1896The AMDGPU backend uses the following ELF header:
1897
1898  .. table:: AMDGPU ELF Header
1899     :name: amdgpu-elf-header-table
1900
1901     ========================== ===============================
1902     Field                      Value
1903     ========================== ===============================
1904     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
1905     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
1906     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
1907                                - ``ELFOSABI_AMDGPU_HSA``
1908                                - ``ELFOSABI_AMDGPU_PAL``
1909                                - ``ELFOSABI_AMDGPU_MESA3D``
1910     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1911                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
1912                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
1913                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
1914                                - ``ELFABIVERSION_AMDGPU_HSA_V6``
1915                                - ``ELFABIVERSION_AMDGPU_PAL``
1916                                - ``ELFABIVERSION_AMDGPU_MESA3D``
1917     ``e_type``                 - ``ET_REL``
1918                                - ``ET_DYN``
1919     ``e_machine``              ``EM_AMDGPU``
1920     ``e_entry``                0
1921     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1922                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
1923                                :ref:`amdgpu-elf-header-e_flags-table-v4-v5`,
1924                                and :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`
1925     ========================== ===============================
1926
1927..
1928
1929  .. table:: AMDGPU ELF Header Enumeration Values
1930     :name: amdgpu-elf-header-enumeration-values-table
1931
1932     =============================== =====
1933     Name                            Value
1934     =============================== =====
1935     ``EM_AMDGPU``                   224
1936     ``ELFOSABI_NONE``               0
1937     ``ELFOSABI_AMDGPU_HSA``         64
1938     ``ELFOSABI_AMDGPU_PAL``         65
1939     ``ELFOSABI_AMDGPU_MESA3D``      66
1940     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1941     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1942     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1943     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1944     ``ELFABIVERSION_AMDGPU_HSA_V6`` 4
1945     ``ELFABIVERSION_AMDGPU_PAL``    0
1946     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1947     =============================== =====
1948
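The header fields and enumeration values in the tables above can be read back
from a code object with a few lines of Python. This is a minimal sketch, not a
full ELF parser: the function name ``describe_amdgpu_elf_header`` and the
returned dictionary keys are hypothetical. The field offsets are those of the
standard ELF32 and ELF64 headers, and ``ET_REL`` (1) and ``ET_DYN`` (3) are the
standard ELF values.

.. code-block:: python

   import struct

   EM_AMDGPU = 224
   ELFCLASS32, ELFCLASS64, ELFDATA2LSB = 1, 2, 1
   OSABI_NAMES = {
       0: "ELFOSABI_NONE",
       64: "ELFOSABI_AMDGPU_HSA",
       65: "ELFOSABI_AMDGPU_PAL",
       66: "ELFOSABI_AMDGPU_MESA3D",
   }
   HSA_CODE_OBJECT = {0: "V2", 1: "V3", 2: "V4", 3: "V5", 4: "V6"}
   E_TYPE_NAMES = {1: "ET_REL", 3: "ET_DYN"}


   def describe_amdgpu_elf_header(data):
       """Summarize the AMDGPU-relevant ELF header fields of ``data`` (bytes)."""
       if data[:4] != b"\x7fELF":
           raise ValueError("not an ELF file")
       ei_class, ei_data, ei_osabi, ei_abiversion = data[4], data[5], data[7], data[8]
       if ei_data != ELFDATA2LSB:
           raise ValueError("AMDGPU code objects are little-endian")
       e_type, e_machine = struct.unpack_from("<HH", data, 16)
       if e_machine != EM_AMDGPU:
           raise ValueError("e_machine is not EM_AMDGPU")
       # e_flags follows e_entry, e_phoff and e_shoff, whose width depends
       # on the ELF class (8 bytes each for ELFCLASS64, 4 for ELFCLASS32).
       flags_offset = 48 if ei_class == ELFCLASS64 else 36
       (e_flags,) = struct.unpack_from("<I", data, flags_offset)
       osabi = OSABI_NAMES.get(ei_osabi, "unknown")
       is_hsa = osabi == "ELFOSABI_AMDGPU_HSA"
       return {
           "class": "ELFCLASS64" if ei_class == ELFCLASS64 else "ELFCLASS32",
           "osabi": osabi,
           # The ABI version only selects a code object version for amdhsa.
           "code_object": HSA_CODE_OBJECT.get(ei_abiversion) if is_hsa else None,
           "e_type": E_TYPE_NAMES.get(e_type, "other"),
           "e_flags": e_flags,
       }

Passing the first 64 bytes of a code object file is sufficient, since both
header layouts fit within that size.
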
1949``e_ident[EI_CLASS]``
1950  The ELF class is:
1951
1952  * ``ELFCLASS32`` for ``r600`` architecture.
1953
1954  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1955    process address space applications.
1956
1957``e_ident[EI_DATA]``
1958  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1959
1960``e_ident[EI_OSABI]``
1961  One of the following AMDGPU target architecture specific OS ABIs
1962  (see :ref:`amdgpu-os`):
1963
1964  * ``ELFOSABI_NONE`` for *unknown* OS.
1965
1966  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1967
1968  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1969
1970  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
1971
1972``e_ident[EI_ABIVERSION]``
1973  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1974  object conforms:
1975
1976  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1977    runtime ABI for code object V2. Can no longer be emitted by this version of LLVM.
1978
1979  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1980    runtime ABI for code object V3. Can no longer be emitted by this version of LLVM.
1981
1982  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1983    runtime ABI for code object V4. Specify using the Clang option
1984    ``-mcode-object-version=4``.
1985
1986  * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1987    runtime ABI for code object V5. Specify using the Clang option
1988    ``-mcode-object-version=5``. This is the default code object
1989    version if not specified.
1990
1991  * ``ELFABIVERSION_AMDGPU_HSA_V6`` is used to specify the version of AMD HSA
1992    runtime ABI for code object V6. Specify using the Clang option
1993    ``-mcode-object-version=6``.
1994
1995  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1996    runtime ABI.
1997
1998  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1999    3D runtime ABI.
2000
2001``e_type``
2002  Can be one of the following values:
2003
2004
2005  ``ET_REL``
2006    The type produced by the AMDGPU backend compiler as it is a relocatable code
2007    object.
2008
2009  ``ET_DYN``
2010    The type produced by the linker as it is a shared code object.
2011
2012  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
2013
2014``e_machine``
2015  The value ``EM_AMDGPU`` is used for the machine for all processors supported
2016  by the ``r600`` and ``amdgcn`` architectures (see
2017  :ref:`amdgpu-processor-table`). The specific processor is specified in the
2018  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
2019  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
2020  ``e_flags`` for code object V3 and above (see
2021  :ref:`amdgpu-elf-header-e_flags-table-v3`,
2022  :ref:`amdgpu-elf-header-e_flags-table-v4-v5` and
2023  :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`).
2024
2025``e_entry``
2026  The entry point is 0 as the entry points for individual kernels must be
2027  selected in order to invoke them through AQL packets.
2028
2029``e_flags``
2030  The AMDGPU backend uses the following ELF header flags:
2031
2032  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
2033     :name: amdgpu-elf-header-e_flags-v2-table
2034
2035     ===================================== ===== =============================
2036     Name                                  Value Description
2037     ===================================== ===== =============================
2038     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
2039                                                 target feature is
2040                                                 enabled for all code
2041                                                 contained in the code object.
2042                                                 If the processor
2043                                                 does not support the
2044                                                 ``xnack`` target
2045                                                 feature then must
2046                                                 be 0.
2047                                                 See
2048                                                 :ref:`amdgpu-target-features`.
2049     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
2050                                                 handler is enabled for all
2051                                                 code contained in the code
2052                                                 object. If the processor
2053                                                 does not support a trap
2054                                                 handler then must be 0.
2055                                                 See
2056                                                 :ref:`amdgpu-target-features`.
2057     ===================================== ===== =============================
2058
2059  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
2060     :name: amdgpu-elf-header-e_flags-table-v3
2061
2062     ================================= ===== =============================
2063     Name                              Value Description
2064     ================================= ===== =============================
2065     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
2066                                             mask for
2067                                             ``EF_AMDGPU_MACH_xxx`` values
2068                                             defined in
2069                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
2070     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
2071                                             target feature is
2072                                             enabled for all code
2073                                             contained in the code object.
2074                                             If the processor
2075                                             does not support the
2076                                             ``xnack`` target
2077                                             feature then must
2078                                             be 0.
2079                                             See
2080                                             :ref:`amdgpu-target-features`.
2081     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
2082                                             target feature is
2083                                             enabled for all code
2084                                             contained in the code object.
2085                                             If the processor
2086                                             does not support the
2087                                             ``sramecc`` target
2088                                             feature then must
2089                                             be 0.
2090                                             See
2091                                             :ref:`amdgpu-target-features`.
2092     ================================= ===== =============================
2093
2094  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and V5
2095     :name: amdgpu-elf-header-e_flags-table-v4-v5
2096
2097     ============================================ ===== ===================================
2098     Name                                         Value Description
2099     ============================================ ===== ===================================
2100     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
2101                                                        mask for
2102                                                        ``EF_AMDGPU_MACH_xxx`` values
2103                                                        defined in
2104                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
2105     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
2106                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
2107                                                        values.
2108     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
2109     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
2110     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
2111     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
2112     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
2113                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
2114                                                        values.
2115     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
2116     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
2117     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
2118     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
2119     ============================================ ===== ===================================
2120
2121  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V6 and After
2122     :name: amdgpu-elf-header-e_flags-table-v6-onwards
2123
2124     ============================================ ========== =========================================
2125     Name                                         Value      Description
2126     ============================================ ========== =========================================
2127     ``EF_AMDGPU_MACH``                           0x0ff      AMDGPU processor selection
2128                                                             mask for
2129                                                             ``EF_AMDGPU_MACH_xxx`` values
2130                                                             defined in
2131                                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
2132     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300      XNACK selection mask for
2133                                                             ``EF_AMDGPU_FEATURE_XNACK_*_V4``
2134                                                             values.
2135     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000      XNACK unsupported.
2136     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100      XNACK can have any value.
2137     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200      XNACK disabled.
2138     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300      XNACK enabled.
2139     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00      SRAMECC selection mask for
2140                                                             ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
2141                                                             values.
2142     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000      SRAMECC unsupported.
2143     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400      SRAMECC can have any value.
2144     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800      SRAMECC disabled.
2145     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00      SRAMECC enabled.
2146     ``EF_AMDGPU_GENERIC_VERSION_V``              0xff000000 Generic code object version selection
2147                                                             mask. This is a value between 1 and 255,
2148                                                             stored in the most significant byte
2149                                                             of ``e_flags``.
2150                                                             See :ref:`amdgpu-generic-processor-versioning`.
2151     ============================================ ========== =========================================
2152
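  The bit fields in the V4 to V6 tables above can be separated with simple
  mask and shift operations. The following Python sketch is illustrative only:
  the constant names, the ``decode_e_flags`` helper, and the sample value are
  hypothetical, while the masks and field encodings are taken from the tables.

  .. code-block:: python

     MACH_MASK = 0x0FF
     XNACK_MASK, XNACK_SHIFT = 0x300, 8
     SRAMECC_MASK, SRAMECC_SHIFT = 0xC00, 10
     GENERIC_VERSION_MASK, GENERIC_VERSION_SHIFT = 0xFF000000, 24

     # Both feature fields use the same 2-bit encoding.
     FEATURE_SETTING = ["unsupported", "any", "off", "on"]


     def decode_e_flags(e_flags):
         """Split a code object V4+ ``e_flags`` value into its fields."""
         return {
             "mach": e_flags & MACH_MASK,
             "xnack": FEATURE_SETTING[(e_flags & XNACK_MASK) >> XNACK_SHIFT],
             "sramecc": FEATURE_SETTING[(e_flags & SRAMECC_MASK) >> SRAMECC_SHIFT],
             # Only meaningful for code object V6 and later.
             "generic_version": (e_flags & GENERIC_VERSION_MASK)
             >> GENERIC_VERSION_SHIFT,
         }


     # Made-up value: mach 0x051, XNACK on, SRAMECC unsupported, version 1.
     fields = decode_e_flags(0x01000351)

  The ``mach`` field can then be looked up among the ``EF_AMDGPU_MACH_xxx``
  values.
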
2153  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
2154     :name: amdgpu-ef-amdgpu-mach-table
2155
2156     ========================================== ========== =============================
2157     Name                                       Value      Description (see
2158                                                           :ref:`amdgpu-processor-table`)
2159     ========================================== ========== =============================
2160     ``EF_AMDGPU_MACH_NONE``                    0x000      *not specified*
2161     ``EF_AMDGPU_MACH_R600_R600``               0x001      ``r600``
2162     ``EF_AMDGPU_MACH_R600_R630``               0x002      ``r630``
2163     ``EF_AMDGPU_MACH_R600_RS880``              0x003      ``rs880``
2164     ``EF_AMDGPU_MACH_R600_RV670``              0x004      ``rv670``
2165     ``EF_AMDGPU_MACH_R600_RV710``              0x005      ``rv710``
2166     ``EF_AMDGPU_MACH_R600_RV730``              0x006      ``rv730``
2167     ``EF_AMDGPU_MACH_R600_RV770``              0x007      ``rv770``
2168     ``EF_AMDGPU_MACH_R600_CEDAR``              0x008      ``cedar``
2169     ``EF_AMDGPU_MACH_R600_CYPRESS``            0x009      ``cypress``
2170     ``EF_AMDGPU_MACH_R600_JUNIPER``            0x00a      ``juniper``
2171     ``EF_AMDGPU_MACH_R600_REDWOOD``            0x00b      ``redwood``
2172     ``EF_AMDGPU_MACH_R600_SUMO``               0x00c      ``sumo``
2173     ``EF_AMDGPU_MACH_R600_BARTS``              0x00d      ``barts``
2174     ``EF_AMDGPU_MACH_R600_CAICOS``             0x00e      ``caicos``
2175     ``EF_AMDGPU_MACH_R600_CAYMAN``             0x00f      ``cayman``
2176     ``EF_AMDGPU_MACH_R600_TURKS``              0x010      ``turks``
2177     *reserved*                                 0x011 -    Reserved for ``r600``
2178                                                0x01f      architecture processors.
2179     ``EF_AMDGPU_MACH_AMDGCN_GFX600``           0x020      ``gfx600``
2180     ``EF_AMDGPU_MACH_AMDGCN_GFX601``           0x021      ``gfx601``
2181     ``EF_AMDGPU_MACH_AMDGCN_GFX700``           0x022      ``gfx700``
2182     ``EF_AMDGPU_MACH_AMDGCN_GFX701``           0x023      ``gfx701``
2183     ``EF_AMDGPU_MACH_AMDGCN_GFX702``           0x024      ``gfx702``
2184     ``EF_AMDGPU_MACH_AMDGCN_GFX703``           0x025      ``gfx703``
2185     ``EF_AMDGPU_MACH_AMDGCN_GFX704``           0x026      ``gfx704``
2186     *reserved*                                 0x027      Reserved.
2187     ``EF_AMDGPU_MACH_AMDGCN_GFX801``           0x028      ``gfx801``
2188     ``EF_AMDGPU_MACH_AMDGCN_GFX802``           0x029      ``gfx802``
2189     ``EF_AMDGPU_MACH_AMDGCN_GFX803``           0x02a      ``gfx803``
2190     ``EF_AMDGPU_MACH_AMDGCN_GFX810``           0x02b      ``gfx810``
2191     ``EF_AMDGPU_MACH_AMDGCN_GFX900``           0x02c      ``gfx900``
2192     ``EF_AMDGPU_MACH_AMDGCN_GFX902``           0x02d      ``gfx902``
2193     ``EF_AMDGPU_MACH_AMDGCN_GFX904``           0x02e      ``gfx904``
2194     ``EF_AMDGPU_MACH_AMDGCN_GFX906``           0x02f      ``gfx906``
2195     ``EF_AMDGPU_MACH_AMDGCN_GFX908``           0x030      ``gfx908``
2196     ``EF_AMDGPU_MACH_AMDGCN_GFX909``           0x031      ``gfx909``
2197     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``           0x032      ``gfx90c``
2198     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``          0x033      ``gfx1010``
2199     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``          0x034      ``gfx1011``
2200     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``          0x035      ``gfx1012``
2201     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``          0x036      ``gfx1030``
2202     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``          0x037      ``gfx1031``
2203     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``          0x038      ``gfx1032``
2204     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``          0x039      ``gfx1033``
2205     ``EF_AMDGPU_MACH_AMDGCN_GFX602``           0x03a      ``gfx602``
2206     ``EF_AMDGPU_MACH_AMDGCN_GFX705``           0x03b      ``gfx705``
2207     ``EF_AMDGPU_MACH_AMDGCN_GFX805``           0x03c      ``gfx805``
2208     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``          0x03d      ``gfx1035``
2209     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``          0x03e      ``gfx1034``
2210     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``           0x03f      ``gfx90a``
2211     ``EF_AMDGPU_MACH_AMDGCN_GFX940``           0x040      ``gfx940``
2212     ``EF_AMDGPU_MACH_AMDGCN_GFX1100``          0x041      ``gfx1100``
2213     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``          0x042      ``gfx1013``
2214     ``EF_AMDGPU_MACH_AMDGCN_GFX1150``          0x043      ``gfx1150``
2215     ``EF_AMDGPU_MACH_AMDGCN_GFX1103``          0x044      ``gfx1103``
2216     ``EF_AMDGPU_MACH_AMDGCN_GFX1036``          0x045      ``gfx1036``
2217     ``EF_AMDGPU_MACH_AMDGCN_GFX1101``          0x046      ``gfx1101``
2218     ``EF_AMDGPU_MACH_AMDGCN_GFX1102``          0x047      ``gfx1102``
2219     ``EF_AMDGPU_MACH_AMDGCN_GFX1200``          0x048      ``gfx1200``
2220     *reserved*                                 0x049      Reserved.
2221     ``EF_AMDGPU_MACH_AMDGCN_GFX1151``          0x04a      ``gfx1151``
2222     ``EF_AMDGPU_MACH_AMDGCN_GFX941``           0x04b      ``gfx941``
2223     ``EF_AMDGPU_MACH_AMDGCN_GFX942``           0x04c      ``gfx942``
2224     *reserved*                                 0x04d      Reserved.
2225     ``EF_AMDGPU_MACH_AMDGCN_GFX1201``          0x04e      ``gfx1201``
2226     ``EF_AMDGPU_MACH_AMDGCN_GFX950``           0x04f      ``gfx950``
2227     *reserved*                                 0x050      Reserved.
2228     ``EF_AMDGPU_MACH_AMDGCN_GFX9_GENERIC``     0x051      ``gfx9-generic``
2229     ``EF_AMDGPU_MACH_AMDGCN_GFX10_1_GENERIC``  0x052      ``gfx10-1-generic``
2230     ``EF_AMDGPU_MACH_AMDGCN_GFX10_3_GENERIC``  0x053      ``gfx10-3-generic``
2231     ``EF_AMDGPU_MACH_AMDGCN_GFX11_GENERIC``    0x054      ``gfx11-generic``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1152``          0x055      ``gfx1152``
2233     *reserved*                                 0x056      Reserved.
2234     *reserved*                                 0x057      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1153``          0x058      ``gfx1153``
2236     ``EF_AMDGPU_MACH_AMDGCN_GFX12_GENERIC``    0x059      ``gfx12-generic``
2237     ``EF_AMDGPU_MACH_AMDGCN_GFX9_4_GENERIC``   0x05f      ``gfx9-4-generic``
2238     ========================================== ========== =============================
2239
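As an illustration of how a consumer might use the machine values listed above,
the following sketch maps an ``e_flags`` word to a processor name. It is a
minimal example, not the loader's implementation: the helper and table names
are hypothetical, only a handful of rows are copied from the table, and it
assumes the machine field occupies the low 8 bits of ``e_flags`` (the
``EF_AMDGPU_MACH`` mask, taken here to be ``0x0ff``).

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Assumption: the machine field is the low 8 bits of e_flags. */
  #define EF_AMDGPU_MACH 0x0ffu

  struct amdgpu_mach_name {
    uint32_t value;
    const char *name;
  };

  /* A few rows copied from the machine table; deliberately not exhaustive. */
  static const struct amdgpu_mach_name amdgpu_mach_names[] = {
    {0x02c, "gfx900"},  {0x02f, "gfx906"}, {0x03f, "gfx90a"},
    {0x041, "gfx1100"}, {0x04c, "gfx942"}, {0x051, "gfx9-generic"},
  };

  /* Returns the processor name for e_flags, or NULL if the value is reserved
     or simply not present in the abbreviated table above. */
  static const char *amdgpu_mach_to_name(uint32_t e_flags) {
    uint32_t mach = e_flags & EF_AMDGPU_MACH;
    size_t count = sizeof(amdgpu_mach_names) / sizeof(amdgpu_mach_names[0]);
    for (size_t i = 0; i < count; ++i)
      if (amdgpu_mach_names[i].value == mach)
        return amdgpu_mach_names[i].name;
    return NULL;
  }

Bits above the machine field are masked off before the lookup, so feature bits
in ``e_flags`` do not affect the result.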
2240Sections
2241--------
2242
2243An AMDGPU target ELF code object has the standard ELF sections which include:
2244
  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
     ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================
2268
2269These sections have their standard meanings (see [ELF]_) and are only generated
2270if needed.
2271
2272``.debug``\ *\**
2273  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
2274  information on the DWARF produced by the AMDGPU backend.
2275
2276``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
2277  The standard sections used by a dynamic loader.
2278
2279``.note``
2280  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
2281  backend.
2282
``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code objects' ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.
2293
``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.
2298
2299.. _amdgpu-note-records:
2300
2301Note Records
2302------------
2303
2304The AMDGPU backend code object contains ELF note records in the ``.note``
2305section. The set of generated notes and their semantics depend on the code
2306object version; see :ref:`amdgpu-note-records-v2` and
2307:ref:`amdgpu-note-records-v3-onwards`.
2308
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is
4-byte aligned. In addition, minimal zero-byte padding must be generated after
the ``desc`` field to ensure its size is a multiple of 4 bytes. The
``sh_addralign`` field of the ``.note`` section must be at least 4 to indicate
at least 4-byte alignment.
2315
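The padding rules above can be sketched as follows. This is a minimal
illustration rather than code taken from LLVM: the struct and helper names are
hypothetical, and the header layout is the generic three-word ELF note header
(``namesz``, ``descsz``, ``type``).

.. code:: c

  #include <stdint.h>

  /* Generic ELF note header: three 32-bit words followed by name and desc. */
  struct elf_note_header {
    uint32_t n_namesz; /* size of name, including the NUL terminator */
    uint32_t n_descsz; /* size of desc, excluding padding */
    uint32_t n_type;
  };

  /* Round n up to the next multiple of 4. */
  static uint32_t note_align4(uint32_t n) { return (n + 3u) & ~3u; }

  /* Byte offset of desc from the start of the note record. */
  static uint32_t note_desc_offset(const struct elf_note_header *h) {
    return (uint32_t)sizeof(*h) + note_align4(h->n_namesz);
  }

  /* Total size of the record, including the padding after name and desc. */
  static uint32_t note_record_size(const struct elf_note_header *h) {
    return note_desc_offset(h) + note_align4(h->n_descsz);
  }

For a note named "AMDGPU" (7 bytes with the NUL), one byte of padding follows
the name, so ``desc`` starts 20 bytes into the record.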
2316.. _amdgpu-note-records-v2:
2317
2318Code Object V2 Note Records
2319~~~~~~~~~~~~~~~~~~~~~~~~~~~
2320
2321.. warning::
2322  Code object V2 generation is no longer supported by this version of LLVM.
2323
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section when compiling for code object V2.
2326
2327The note record vendor field is "AMD".
2328
2329Additional note records may be present, but any which are not documented here
2330are deprecated and should not be used.
2331
2332  .. table:: AMDGPU Code Object V2 ELF Note Records
2333     :name: amdgpu-elf-note-records-v2-table
2334
2335     ===== ===================================== ======================================
2336     Name  Type                                  Description
2337     ===== ===================================== ======================================
2338     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
2339     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
2340                                                 Finalizer and not the LLVM compiler.
2341     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
2342     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
2343                                                 YAML [YAML]_ textual format.
2344     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
2345     ===== ===================================== ======================================
2346
2347..
2348
2349  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
2350     :name: amdgpu-elf-note-record-enumeration-values-v2-table
2351
2352     ===================================== =====
2353     Name                                  Value
2354     ===================================== =====
2355     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
2356     ``NT_AMD_HSA_HSAIL``                  2
2357     ``NT_AMD_HSA_ISA_VERSION``            3
2358     *reserved*                            4-9
2359     ``NT_AMD_HSA_METADATA``               10
2360     ``NT_AMD_HSA_ISA_NAME``               11
2361     ===================================== =====
2362
2363``NT_AMD_HSA_CODE_OBJECT_VERSION``
2364  Specifies the code object version number. The description field has the
2365  following layout:
2366
  .. code:: c

    struct amdgpu_hsa_note_code_object_version_s {
      uint32_t major_version;
      uint32_t minor_version;
    };
2373
2374  The ``major_version`` has a value less than or equal to 2.
2375
2376``NT_AMD_HSA_HSAIL``
2377  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
2378  field has the following layout:
2379
  .. code:: c

    struct amdgpu_hsa_note_hsail_s {
      uint32_t hsail_major_version;
      uint32_t hsail_minor_version;
      uint8_t profile;
      uint8_t machine_model;
      uint8_t default_float_round;
    };
2389
``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following
  layout:

  .. code:: c

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      char vendor_and_architecture_name[1];
    };

  ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.
2410
2411  This note record is used by the HSA runtime loader.
2412
2413  Code object V2 only supports a limited number of processors and has fixed
2414  settings for target features. See
2415  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
2416  processors and the corresponding target ID. In the table the note record ISA
2417  name is a concatenation of the vendor name, architecture name, major, minor,
2418  and stepping separated by a ":".
2419
2420  The target ID column shows the processor name and fixed target features used
2421  by the LLVM compiler. The LLVM compiler does not generate a
2422  ``NT_AMD_HSA_HSAIL`` note record.
2423
  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
2430
2431``NT_AMD_HSA_ISA_NAME``
2432  Specifies the target ISA name as a non-NUL terminated string.
2433
2434  This note record is not used by the HSA runtime loader.
2435
2436  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
2437  V2's limited support of processors and fixed settings for target features.
2438
  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
2444
2445``NT_AMD_HSA_METADATA``
2446  Specifies extensible metadata associated with the code objects executed on HSA
2447  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
2448  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
2449  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
2450  metadata string.
2451
2452  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
2453     :name: amdgpu-elf-note-record-supported_processors-v2-table
2454
2455     ===================== ==========================
2456     Note Record ISA Name  Target ID
2457     ===================== ==========================
2458     ``AMD:AMDGPU:6:0:0``  ``gfx600``
2459     ``AMD:AMDGPU:6:0:1``  ``gfx601``
2460     ``AMD:AMDGPU:6:0:2``  ``gfx602``
2461     ``AMD:AMDGPU:7:0:0``  ``gfx700``
2462     ``AMD:AMDGPU:7:0:1``  ``gfx701``
2463     ``AMD:AMDGPU:7:0:2``  ``gfx702``
2464     ``AMD:AMDGPU:7:0:3``  ``gfx703``
2465     ``AMD:AMDGPU:7:0:4``  ``gfx704``
2466     ``AMD:AMDGPU:7:0:5``  ``gfx705``
2467     ``AMD:AMDGPU:8:0:0``  ``gfx802``
2468     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
2469     ``AMD:AMDGPU:8:0:2``  ``gfx802``
2470     ``AMD:AMDGPU:8:0:3``  ``gfx803``
2471     ``AMD:AMDGPU:8:0:4``  ``gfx803``
2472     ``AMD:AMDGPU:8:0:5``  ``gfx805``
2473     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
2474     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
2475     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
2476     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
2477     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
2478     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
2479     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
2480     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
2481     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
2482     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
2483     ===================== ==========================
2484
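The ``NT_AMD_HSA_ISA_VERSION`` layout and the note record ISA name format
described above can be combined in a short sketch. The function name is
hypothetical, and the sketch assumes that the description bytes use the host
byte order and that the struct fields sit at their natural offsets (0, 2, 4, 8,
12 and 16) with no extra padding.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Builds "vendor:architecture:major:minor:stepping" from the raw desc
     bytes of an NT_AMD_HSA_ISA_VERSION note. Returns the snprintf result,
     which is the length needed excluding the NUL. */
  static int amdgpu_isa_note_name(const unsigned char *desc, char *out,
                                  size_t out_size) {
    uint16_t vendor_name_size;
    uint32_t major, minor, stepping;

    /* memcpy avoids unaligned loads from the note buffer. */
    memcpy(&vendor_name_size, desc + 0, sizeof vendor_name_size);
    memcpy(&major, desc + 4, sizeof major);
    memcpy(&minor, desc + 8, sizeof minor);
    memcpy(&stepping, desc + 12, sizeof stepping);

    /* The two NUL terminated strings are stored back to back. */
    const char *vendor = (const char *)desc + 16;
    const char *architecture = vendor + vendor_name_size;

    return snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, architecture,
                    (unsigned)major, (unsigned)minor, (unsigned)stepping);
  }

The resulting string can then be looked up in the supported processors table
to find the target ID.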
2485.. _amdgpu-note-records-v3-onwards:
2486
2487Code Object V3 and Above Note Records
2488~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2489
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section when compiling for code object V3 and above.
2492
2493The note record vendor field is "AMDGPU".
2494
2495Additional note records may be present, but any which are not documented here
2496are deprecated and should not be used.
2497
2498  .. table:: AMDGPU Code Object V3 and Above ELF Note Records
2499     :name: amdgpu-elf-note-records-table-v3-onwards
2500
2501     ======== ============================== ======================================
2502     Name     Type                           Description
2503     ======== ============================== ======================================
2504     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
2505                                             binary format.
2506     "AMDGPU" ``NT_AMDGPU_KFD_CORE_STATE``   Snapshot of runtime, agent and queues
2507                                             state for use in core dump.  See
2508                                             :ref:`amdgpu_corefile_note`.
2509     ======== ============================== ======================================
2510
2511..
2512
2513  .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
2514     :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
2515
2516     ============================== =====
2517     Name                           Value
2518     ============================== =====
2519     *reserved*                     0-31
2520     ``NT_AMDGPU_METADATA``         32
2521     ``NT_AMDGPU_KFD_CORE_STATE``   33
2522     ============================== =====
2523
2524``NT_AMDGPU_METADATA``
2525  Specifies extensible metadata associated with an AMDGPU code object. It is
2526  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
2527  :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2528  :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2529  :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
2530  ``amdhsa`` OS.
2531
2532.. _amdgpu-symbols:
2533
2534Symbols
2535-------
2536
2537Symbols include the following:
2538
2539  .. table:: AMDGPU ELF Symbols
2540     :name: amdgpu-elf-symbols-table
2541
2542     ===================== ================== ================ ==================
2543     Name                  Type               Section          Description
2544     ===================== ================== ================ ==================
2545     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
2546                                              - ``.rodata``
2547                                              - ``.bss``
2548     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
2549     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
2550     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
2551     ===================== ================== ================ ==================
2552
Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor-specific section index ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.
2567
2568  .. TODO::
2569
2570     Add description of linked shared object symbols. Seems undefined symbols
2571     are marked as STT_NOTYPE.
2572
2573Kernel descriptor
2574  Every HSA kernel has an associated kernel descriptor. It is the address of the
2575  kernel descriptor that is used in the AQL dispatch packet used to invoke the
2576  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
2577  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
2578
2579Kernel entry point
2580  Every HSA kernel also has a symbol for its machine code entry point.
2581
2582.. _amdgpu-relocation-records:
2583
2584Relocation Records
2585------------------
2586
2587The AMDGPU backend generates ``Elf64_Rela`` relocation records for
2588AMDHSA or ``Elf64_Rel`` relocation records for Mesa/AMDPAL. Supported
2589relocatable fields are:
2590
2591``word32``
2592  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
2593  alignment. These values use the same byte order as other word values in the
2594  AMDGPU architecture.
2595
2596``word64``
2597  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
2598  alignment. These values use the same byte order as other word values in the
2599  AMDGPU architecture.
2600
The following notations are used for specifying relocation calculations:
2602
2603**A**
2604  Represents the addend used to compute the value of the relocatable field. If
2605  the addend field is smaller than 64 bits then it is zero-extended to 64 bits
2606  for use in the calculations below. (In practice this only affects ``_HI``
2607  relocation types on Mesa/AMDPAL, where the addend comes from the 32-bit field
2608  but the result of the calculation depends on the high part of the full 64-bit
2609  address.)
2610
2611**G**
2612  Represents the offset into the global offset table at which the relocation
2613  entry's symbol will reside during execution.
2614
2615**GOT**
2616  Represents the address of the global offset table.
2617
**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).
2621
2622**S**
2623  Represents the value of the symbol whose index resides in the relocation
2624  entry. Relocations not using this must specify a symbol index of
2625  ``STN_UNDEF``.
2626
2627**B**
2628  Represents the base address of a loaded executable or shared object which is
2629  the difference between the ELF address and the actual load address.
2630  Relocations using this are only valid in executable or shared objects.
2631
2632The following relocation types are supported:
2633
2634  .. table:: AMDGPU ELF Relocation Records
2635     :name: amdgpu-elf-relocation-records-table
2636
2637     ========================== ======= =====  ==========  ==============================
2638     Relocation Type            Kind    Value  Field       Calculation
2639     ========================== ======= =====  ==========  ==============================
2640     ``R_AMDGPU_NONE``                  0      *none*      *none*
2641     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
2642                                Dynamic
2643     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
2644                                Dynamic
2645     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
2646                                Dynamic
2647     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
2648     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
2649     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
2650                                Dynamic
2651     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
2652     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
2653     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
2654     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
2655     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
2656     *reserved*                         12
2657     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
2658     ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
2659     ========================== ======= =====  ==========  ==============================
2660
2661``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
2662the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
2663
2664There is no current OS loader support for 32-bit programs and so
2665``R_AMDGPU_ABS32`` is not used.
2666
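The calculations in the relocation table can be written out directly. The
sketch below is illustrative only: the function names are hypothetical, all
operands are treated as 64-bit values, and ``R_AMDGPU_REL16`` is assumed to
produce a signed result, which the table does not state explicitly.

.. code:: c

  #include <stdint.h>

  /* S, A, P, G and GOT have the meanings given in the notation list. */
  static uint32_t r_amdgpu_abs32_lo(uint64_t S, uint64_t A) {
    return (uint32_t)((S + A) & 0xFFFFFFFFu);
  }

  static uint32_t r_amdgpu_abs32_hi(uint64_t S, uint64_t A) {
    return (uint32_t)((S + A) >> 32);
  }

  static uint32_t r_amdgpu_rel32_lo(uint64_t S, uint64_t A, uint64_t P) {
    return (uint32_t)((S + A - P) & 0xFFFFFFFFu);
  }

  static uint32_t r_amdgpu_rel32_hi(uint64_t S, uint64_t A, uint64_t P) {
    return (uint32_t)((S + A - P) >> 32);
  }

  static uint32_t r_amdgpu_gotpcrel32_lo(uint64_t G, uint64_t GOT, uint64_t A,
                                         uint64_t P) {
    return (uint32_t)((G + GOT + A - P) & 0xFFFFFFFFu);
  }

  /* Assumed signed: a dword count relative to the end of the 4-byte
     instruction being relocated. */
  static int16_t r_amdgpu_rel16(uint64_t S, uint64_t A, uint64_t P) {
    int64_t delta = (int64_t)(S + A - P) - 4;
    return (int16_t)(delta / 4);
  }

The ``_LO`` and ``_HI`` variants are used in pairs so that two 32-bit fields
together hold the full 64-bit result.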
2667.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
2668
2669Loaded Code Object Path Uniform Resource Identifier (URI)
2670---------------------------------------------------------
2671
The AMD GPU code object loader represents the path of the ELF shared object from
which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded, relocated form of the ELF
shared object.  Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.
2677
The loaded code object path URI syntax is defined by the following BNF grammar:

.. code::

  code_object_uri ::= file_uri | memory_uri
  file_uri        ::= "file://" file_path [ range_specifier ]
  memory_uri      ::= "memory://" process_id range_specifier
  range_specifier ::= [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::= URI_ENCODED_OS_FILE_PATH
  process_id      ::= DECIMAL_NUMBER
  number          ::= HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
2689
2690**number**
2691  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
2692  and octal values by "0".
2693
**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
  the path are separated by "/".
2699
2700**offset**
2701  Is a 0-based byte offset to the start of the code object.  For a file URI, it
2702  is from the start of the file specified by the ``file_path``, and if omitted
2703  defaults to 0. For a memory URI, it is the memory address and is required.
2704
2705**size**
2706  Is the number of bytes in the code object.  For a file URI, if omitted it
2707  defaults to the size of the file.  It is required for a memory URI.
2708
2709**process_id**
2710  Is the identity of the process owning the memory.  For Linux it is the C
2711  unsigned integral decimal literal for the process ID (PID).
2712
2713For example:
2714
.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000
2720
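The ``file_path`` encoding rule can be sketched as below. This is a minimal
example under stated assumptions, not the loader's code: the function names
are hypothetical, and the input is assumed to be a NUL terminated UTF-8 path
that is encoded byte by byte.

.. code:: c

  #include <stddef.h>

  /* Returns 1 if c matches [a-zA-Z0-9/_.~-] and may appear unencoded. */
  static int uri_unreserved(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '/' || c == '_' || c == '.' ||
           c == '~' || c == '-';
  }

  /* Writes the URI encoded form of path into out.
     Returns 0 on success, or -1 if out is too small. */
  static int uri_encode_path(const char *path, char *out, size_t out_size) {
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)path; *p; ++p) {
      size_t need = uri_unreserved(*p) ? 1 : 3;
      if (n + need >= out_size) /* keep one byte for the NUL */
        return -1;
      if (need == 1) {
        out[n++] = (char)*p;
      } else {
        out[n++] = '%';
        out[n++] = hex[*p >> 4];
        out[n++] = hex[*p & 0xF];
      }
    }
    if (n >= out_size)
      return -1;
    out[n] = '\0';
    return 0;
  }

The encoded path can then be appended to ``file://``, optionally followed by a
range specifier, to form the complete URI.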
2721.. _amdgpu-dwarf-debug-information:
2722
2723DWARF Debug Information
2724=======================
2725
2726.. warning::
2727
2728   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
2729   is not currently fully implemented and is subject to change.
2730
2731AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
2732:ref:`amdgpu-elf-code-object`) which contain information that maps the code
2733object executable code and data to the source language constructs. It can be
2734used by tools such as debuggers and profilers. It uses features defined in
2735:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
2736DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
2737
2738This section defines the AMDGPU target architecture specific DWARF mappings.
2739
2740.. _amdgpu-dwarf-register-identifier:
2741
2742Register Identifier
2743-------------------
2744
2745This section defines the AMDGPU target architecture register numbers used in
2746DWARF operation expressions (see DWARF Version 5 section 2.5 and
2747:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
2748instructions (see DWARF Version 5 section 6.4 and
2749:ref:`amdgpu-dwarf-call-frame-information`).
2750
2751A single code object can contain code for kernels that have different wavefront
2752sizes. The vector registers and some scalar registers are based on the wavefront
2753size. AMDGPU defines distinct DWARF registers for each wavefront size. This
2754simplifies the consumer of the DWARF so that each register has a fixed size,
2755rather than being dynamic according to the wavefront size mode. Similarly,
2756distinct DWARF registers are defined for those registers that vary in size
2757according to the process address size. This allows a consumer to treat a
2758specific AMDGPU processor as a single architecture regardless of how it is
2759configured at run time. The compiler explicitly specifies the DWARF registers
2760that match the mode in which the code it is generating will be executed.
2761
2762DWARF registers are encoded as numbers, which are mapped to architecture
2763registers. The mapping for AMDGPU is defined in
2764:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
2765mapping.
2766
2767.. table:: AMDGPU DWARF Register Mapping
2768   :name: amdgpu-dwarf-register-mapping-table
2769
2770   ============== ================= ======== ==================================
2771   DWARF Register AMDGPU Register   Bit Size Description
2772   ============== ================= ======== ==================================
2773   0              PC_32             32       Program Counter (PC) when
2774                                             executing in a 32-bit process
2775                                             address space. Used in the CFI to
2776                                             describe the PC of the calling
2777                                             frame.
2778   1              EXEC_MASK_32      32       Execution Mask Register when
2779                                             executing in wavefront 32 mode.
2780   2-15           *Reserved*                 *Reserved for highly accessed
2781                                             registers using DWARF shortcut.*
2782   16             PC_64             64       Program Counter (PC) when
2783                                             executing in a 64-bit process
2784                                             address space. Used in the CFI to
2785                                             describe the PC of the calling
2786                                             frame.
2787   17             EXEC_MASK_64      64       Execution Mask Register when
2788                                             executing in wavefront 64 mode.
2789   18-31          *Reserved*                 *Reserved for highly accessed
2790                                             registers using DWARF shortcut.*
2791   32-95          SGPR0-SGPR63      32       Scalar General Purpose
2792                                             Registers.
2793   96-127         *Reserved*                 *Reserved for frequently accessed
2794                                             registers using DWARF 1-byte ULEB.*
2795   128            STATUS            32       Status Register.
2796   129-511        *Reserved*                 *Reserved for future Scalar
2797                                             Architectural Registers.*
2798   512            VCC_32            32       Vector Condition Code Register
2799                                             when executing in wavefront 32
2800                                             mode.
2801   513-767        *Reserved*                 *Reserved for future Vector
2802                                             Architectural Registers when
2803                                             executing in wavefront 32 mode.*
2804   768            VCC_64            64       Vector Condition Code Register
2805                                             when executing in wavefront 64
2806                                             mode.
2807   769-1023       *Reserved*                 *Reserved for future Vector
2808                                             Architectural Registers when
2809                                             executing in wavefront 64 mode.*
2810   1024-1087      *Reserved*                 *Reserved for padding.*
2811   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
2812   1130-1535      *Reserved*                 *Reserved for future Scalar
2813                                             General Purpose Registers.*
2814   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
2815                                             when executing in wavefront 32
2816                                             mode.
2817   1792-2047      *Reserved*                 *Reserved for future Vector
2818                                             General Purpose Registers when
2819                                             executing in wavefront 32 mode.*
2820   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
2821                                             when executing in wavefront 32
2822                                             mode.
2823   2304-2559      *Reserved*                 *Reserved for future Vector
2824                                             Accumulation Registers when
2825                                             executing in wavefront 32 mode.*
2826   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
2827                                             when executing in wavefront 64
2828                                             mode.
2829   2816-3071      *Reserved*                 *Reserved for future Vector
2830                                             General Purpose Registers when
2831                                             executing in wavefront 64 mode.*
2832   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
2833                                             when executing in wavefront 64
2834                                             mode.
2835   3328-3583      *Reserved*                 *Reserved for future Vector
2836                                             Accumulation Registers when
2837                                             executing in wavefront 64 mode.*
2838   ============== ================= ======== ==================================
2839
The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
2848
2849If the wavefront size is 32 lanes then the wavefront 32 mode register
2850definitions are used. If the wavefront size is 64 lanes then the wavefront 64
2851mode register definitions are used. Some AMDGPU targets support executing in
2852both wavefront 32 and wavefront 64 mode. The register definitions corresponding
2853to the wavefront mode of the generated code will be used.
2854
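As an illustrative sketch only, the layout described above can be modeled in Python. The helper names are hypothetical and are not part of LLVM. The register number bases are taken from the register mapping table, and the wavefront 32 VGPR base of 1536 comes from the part of that table immediately preceding the reserved range 1792-2047.

```python
# Illustrative model only; these names are hypothetical and not part of LLVM.
DWORD_BYTES = 4

# (register file, wavefront size) -> DWARF register number of register 0,
# as listed in the register mapping table.
REGISTER_BASE = {
    ("VGPR", 32): 1536,
    ("AGPR", 32): 2048,
    ("VGPR", 64): 2560,
    ("AGPR", 64): 3072,
}


def dwarf_register_number(kind: str, index: int, wavefront_size: int) -> int:
    """Return the DWARF register number for VGPR/AGPR `index`."""
    if not 0 <= index <= 255:
        raise ValueError("register index must be in the range 0-255")
    return REGISTER_BASE[(kind, wavefront_size)] + index


def register_size_bits(wavefront_size: int) -> int:
    """A vector register holds one 32-bit dword per lane."""
    return wavefront_size * 32


def lane_byte_offset(lane: int, wavefront_size: int) -> int:
    """Byte offset of the dword for `lane`; lane 0 is least significant."""
    if not 0 <= lane < wavefront_size:
        raise ValueError("lane is outside the wavefront")
    return lane * DWORD_BYTES
```

A location expression that selects the current lane's part of a vector register conceptually applies an offset like ``lane_byte_offset`` to the lane pushed by ``DW_OP_LLVM_push_lane``.
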
2855If code is generated to execute in a 32-bit process address space, then the
285632-bit process address space register definitions are used. If code is generated
2857to execute in a 64-bit process address space, then the 64-bit process address
2858space register definitions are used. The ``amdgcn`` target only supports the
285964-bit process address space.
2860
2861.. _amdgpu-dwarf-memory-space-identifier:
2862
2863Memory Space Identifier
2864-----------------------
2865
2866The DWARF memory space represents the source language memory space. See DWARF
2867Version 5 section 2.12 which is updated by the *DWARF Extensions For
2868Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
2869
2870The DWARF memory space mapping used for AMDGPU is defined in
2871:ref:`amdgpu-dwarf-memory-space-mapping-table`.
2872
2873.. table:: AMDGPU DWARF Memory Space Mapping
2874   :name: amdgpu-dwarf-memory-space-mapping-table
2875
2876   =========================== ====== =================
2877   DWARF                              AMDGPU
2878   ---------------------------------- -----------------
2879   Memory Space Name           Value  Memory Space
2880   =========================== ====== =================
2881   ``DW_MSPACE_LLVM_none``     0x0000 Generic (Flat)
2882   ``DW_MSPACE_LLVM_global``   0x0001 Global
2883   ``DW_MSPACE_LLVM_constant`` 0x0002 Global
2884   ``DW_MSPACE_LLVM_group``    0x0003 Local (group/LDS)
2885   ``DW_MSPACE_LLVM_private``  0x0004 Private (Scratch)
2886   ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
2887   =========================== ====== =================
2888
2889The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
2890Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
2891
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. It is
available for the AMD extension that provides access to the hardware GDS memory,
which is scratchpad memory allocated per device.
2895
2896For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
2897default memory space of ``DW_MSPACE_LLVM_none`` is used.
2898
2899See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
2900mapping of DWARF memory spaces to DWARF address spaces, including address size
2901and NULL value.
2902
2903.. _amdgpu-dwarf-address-space-identifier:
2904
2905Address Space Identifier
2906------------------------
2907
2908DWARF address spaces correspond to target architecture specific linear
2909addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2910For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2911
2912The DWARF address space mapping used for AMDGPU is defined in
2913:ref:`amdgpu-dwarf-address-space-mapping-table`.
2914
2915.. table:: AMDGPU DWARF Address Space Mapping
2916   :name: amdgpu-dwarf-address-space-mapping-table
2917
2918   ======================================= ===== ======= ======== ===================== =======================
2919   DWARF                                                          AMDGPU                Notes
2920   --------------------------------------- ----- ---------------- --------------------- -----------------------
2921   Address Space Name                      Value Address Bit Size LLVM IR Address Space
2922   --------------------------------------- ----- ------- -------- --------------------- -----------------------
2923   ..                                            64-bit  32-bit
2924                                                 process process
2925                                                 address address
2926                                                 space   space
2927   ======================================= ===== ======= ======== ===================== =======================
2928   ``DW_ASPACE_LLVM_none``                 0x00  64      32       Global                *default address space*
2929   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
2930   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
2931   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
2932   *Reserved*                              0x04
2933   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch)     *focused lane*
2934   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch)     *unswizzled wavefront*
2935   ======================================= ===== ======= ======== ===================== =======================
2936
2937See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2938spaces including address size and NULL value.
2939
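For tools that consume this mapping, the table can be captured as a small lookup. The following Python is a minimal sketch that transcribes the values from the table above. The class and function names are hypothetical and do not correspond to an LLVM API.

```python
from enum import IntEnum


class DwAspace(IntEnum):
    """DWARF address space values transcribed from the mapping table."""
    DW_ASPACE_LLVM_none = 0x00
    DW_ASPACE_AMDGPU_generic = 0x01
    DW_ASPACE_AMDGPU_region = 0x02
    DW_ASPACE_AMDGPU_local = 0x03
    # 0x04 is reserved.
    DW_ASPACE_AMDGPU_private_lane = 0x05
    DW_ASPACE_AMDGPU_private_wave = 0x06


# Address spaces whose address size follows the process address space size.
_PROCESS_SIZED = {DwAspace.DW_ASPACE_LLVM_none,
                  DwAspace.DW_ASPACE_AMDGPU_generic}


def address_bit_size(aspace: DwAspace, process_bits: int) -> int:
    """Return the address size in bits for a 32-bit or 64-bit process."""
    if process_bits not in (32, 64):
        raise ValueError("process address space must be 32 or 64 bits")
    return process_bits if aspace in _PROCESS_SIZED else 32
```

For the ``amdgcn`` target only the 64-bit process column applies.
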
2940The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2941address space used in DWARF operations that do not specify an address space. It
2942therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2943related operations can refer to addresses in the program code.
2944
2945The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2946specify the flat address space. If the address corresponds to an address in the
2947local address space, then it corresponds to the wavefront that is executing the
2948focused thread of execution. If the address corresponds to an address in the
2949private address space, then it corresponds to the lane that is executing the
2950focused thread of execution for languages that are implemented using a SIMD or
2951SIMT execution model.
2952
2953.. note::
2954
2955  CUDA-like languages such as HIP that do not have address spaces in the
2956  language type system, but do allow variables to be allocated in different
2957  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2958  address space in the DWARF expression operations as the default address space
2959  is the global address space.
2960
2961The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2962specify the local address space corresponding to the wavefront that is executing
2963the focused thread of execution.
2964
2965The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2966to specify the private address space corresponding to the lane that is executing
2967the focused thread of execution for languages that are implemented using a SIMD
2968or SIMT execution model.
2969
2970The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2971to specify the unswizzled private address space corresponding to the wavefront
2972that is executing the focused thread of execution. The wavefront view of private
2973memory is the per wavefront unswizzled backing memory layout defined in
2974:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2975location for the backing memory of the wavefront (namely the address is not
2976offset by ``wavefront-scratch-base``). The following formula can be used to
2977convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2978``DW_ASPACE_AMDGPU_private_wave`` address:
2979
2980::
2981
2982  private-address-wavefront =
2983    ((private-address-lane / 4) * wavefront-size * 4) +
2984    (wavefront-lane-id * 4) + (private-address-lane % 4)
2985
2986If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
2987of the dwords for each lane starting with lane 0 is required, then this
2988simplifies to:
2989
2990::
2991
2992  private-address-wavefront =
2993    private-address-lane * wavefront-size
2994
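The two formulas above can be checked against each other with a short Python sketch. The function name is hypothetical. The arithmetic is exactly the general formula, and the assertion in the usage note exercises the dword-aligned, lane 0 simplification.

```python
def private_lane_to_wave(private_address_lane: int,
                         wavefront_lane_id: int,
                         wavefront_size: int) -> int:
    """Convert a private lane address to an unswizzled private wave address."""
    if not 0 <= wavefront_lane_id < wavefront_size:
        raise ValueError("lane id is outside the wavefront")
    # Split the lane address into a dword index and a byte within that dword.
    dword_index, byte_in_dword = divmod(private_address_lane, 4)
    return (dword_index * wavefront_size * 4
            + wavefront_lane_id * 4
            + byte_in_dword)
```

For a dword-aligned address and lane 0 the result equals ``private-address-lane * wavefront-size``, for example ``private_lane_to_wave(8, 0, 64) == 8 * 64``.
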
2995A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2996complete spilled vector register back into a complete vector register in the
2997CFI. The frame pointer can be a private lane address which is dword aligned,
2998which can be shifted to multiply by the wavefront size, and then used to form a
2999private wavefront address that gives a location for a contiguous set of dwords,
3000one per lane, where the vector register dwords are spilled. The compiler knows
3001the wavefront size since it generates the code. Note that the type of the
3002address may have to be converted as the size of a
3003``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
3004``DW_ASPACE_AMDGPU_private_wave`` address.
3005
3006.. _amdgpu-dwarf-lane-identifier:
3007
Lane Identifier
3009---------------
3010
DWARF lane identifiers specify a target architecture lane position for hardware
3012that executes in a SIMD or SIMT manner, and on which a source language maps its
3013threads of execution onto those lanes. The DWARF lane identifier is pushed by
3014the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
3015section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
3016section :ref:`amdgpu-dwarf-operation-expressions`.
3017
3018For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
3019wavefront. It is numbered from 0 to the wavefront size minus 1.
3020
3021Operation Expressions
3022---------------------
3023
3024DWARF expressions are used to compute program values and the locations of
3025program objects. See DWARF Version 5 section 2.5 and
3026:ref:`amdgpu-dwarf-operation-expressions`.
3027
3028DWARF location descriptions describe how to access storage which includes memory
3029and registers. When accessing storage on AMDGPU, bytes are ordered with least
3030significant bytes first, and bits are ordered within bytes with least
3031significant bits first.
3032
3033For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
3034unwinding vector registers that are spilled under the execution mask to memory:
3035the zero-single location description is the vector register, and the one-single
3036location description is the spilled memory location description. The
3037``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
3038memory location description.
3039
3040In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
3041``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
3042controlled by the execution mask. An undefined location description together
3043with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
3044to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
3045
3046.. _amdgpu-dwarf-base-type-conversions:
3047
3048Base Type Conversions
3049---------------------
3050
3051For AMDGPU expressions, ``DW_OP_convert`` may be used to convert between
3052``DW_ATE_address``-encoded base types in different address spaces.
3053
3054Conversions are defined as in :ref:`amdgpu-address-spaces` when all relevant
3055conditions described there are met, and otherwise result in an evaluation
3056error.
3057
3058.. note::
3059
3060  For a target which does not support a particular address space, converting to
3061  or from that address space is always an evaluation error.
3062
3063  For targets which support the generic address space, converting from
3064  ``DW_ASPACE_AMDGPU_generic`` to ``DW_ASPACE_LLVM_none`` is defined when the
3065  generic address is in the global address space. The conversion requires no
3066  change to the literal value of the address.
3067
3068  Converting from ``DW_ASPACE_AMDGPU_generic`` to any of
3069  ``DW_ASPACE_AMDGPU_local``, ``DW_ASPACE_AMDGPU_private_wave`` or
3070  ``DW_ASPACE_AMDGPU_private_lane`` is defined when the relevant hardware
3071  support is present, any required hardware setup has been completed, and the
3072  generic address is in the corresponding address space. Conversion to
3073  ``DW_ASPACE_AMDGPU_private_lane`` additionally requires the context to
3074  include the active lane.
3075
3076Debugger Information Entry Attributes
3077-------------------------------------
3078
3079This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
3081which are updated by *DWARF Extensions For Heterogeneous Debugging* section
3082:ref:`amdgpu-dwarf-low-level-information` and
3083:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
3084
3085.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
3086
3087``DW_AT_LLVM_lane_pc``
3088~~~~~~~~~~~~~~~~~~~~~~
3089
3090For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
3091location of the separate lanes of a SIMT thread.
3092
3093If the lane is an active lane then this will be the same as the current program
3094location.
3095
3096If the lane is inactive, but was active on entry to the subprogram, then this is
3097the program location in the subprogram at which execution of the lane is
conceptually positioned.
3099
3100If the lane was not active on entry to the subprogram, then this will be the
3101undefined location. A client debugger can check if the lane is part of a valid
3102work-group by checking that the lane is in the range of the associated
3103work-group within the grid, accounting for partial work-groups. If it is not,
3104then the debugger can omit any information for the lane. Otherwise, the debugger
3105may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
3106calling subprogram until it finds a non-undefined location. Conceptually the
3107lane only has the call frames that it has a non-undefined
3108``DW_AT_LLVM_lane_pc``.
3109
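The debugger-side procedure described above can be sketched in Python. This is a simplified model under stated assumptions, not a debugger implementation: each call frame is represented only by its evaluated ``DW_AT_LLVM_lane_pc`` vector, ``None`` stands for the undefined location, and the work-group range check is supplied by the caller as a boolean. All names are hypothetical.

```python
from typing import List, Optional, Sequence

# One evaluated DW_AT_LLVM_lane_pc per frame: a program location per lane,
# with None standing for the undefined location.
LanePcVector = Sequence[Optional[int]]


def innermost_frame_for_lane(frames: List[LanePcVector],
                             lane: int,
                             lane_in_work_group: bool) -> Optional[int]:
    """Return the index of the innermost frame the lane conceptually has.

    `frames` is ordered innermost first. Returns None when the debugger
    should omit information for the lane.
    """
    if not lane_in_work_group:
        # The lane is outside the (possibly partial) work-group.
        return None
    for index, lane_pcs in enumerate(frames):
        if lane_pcs[lane] is not None:
            return index
    return None
```

With ``frames = [[0x40, None], [0x10, 0x14]]``, lane 0 is reported in frame 0, while lane 1 was not active on entry to the innermost subprogram and is reported in frame 1.
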
3110The following example illustrates how the AMDGPU backend can generate a DWARF
3111location list expression for the nested ``IF/THEN/ELSE`` structures of the
3112following subprogram pseudo code for a target with 64 lanes per wavefront.
3113
3114.. code::
3115  :number-lines:
3116
3117  SUBPROGRAM X
3118  BEGIN
3119    a;
3120    IF (c1) THEN
3121      b;
3122      IF (c2) THEN
3123        c;
3124      ELSE
3125        d;
3126      ENDIF
3127      e;
3128    ELSE
3129      f;
3130    ENDIF
3131    g;
3132  END
3133
3134The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
3135execution mask (``EXEC``) to linearize the control flow. The condition is
3136evaluated to make a mask of the lanes for which the condition evaluates to true.
3137First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
3138logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated current ``EXEC`` mask with the ``EXEC`` mask saved at the start
of the region. After the ``IF/THEN/ELSE``
3141region the ``EXEC`` mask is restored to the value it had at the beginning of the
3142region. This is shown below. Other approaches are possible, but the basic
3143concept is the same.
3144
3145.. code::
3146  :number-lines:
3147
3148  $lex_start:
3149    a;
3150    %1 = EXEC
3151    %2 = c1
3152  $lex_1_start:
3153    EXEC = %1 & %2
  $lex_1_then:
3155      b;
3156      %3 = EXEC
3157      %4 = c2
3158  $lex_1_1_start:
3159      EXEC = %3 & %4
3160  $lex_1_1_then:
3161        c;
3162      EXEC = ~EXEC & %3
3163  $lex_1_1_else:
3164        d;
3165      EXEC = %3
3166  $lex_1_1_end:
3167      e;
3168    EXEC = ~EXEC & %1
3169  $lex_1_else:
3170      f;
3171    EXEC = %1
3172  $lex_1_end:
3173    g;
3174  $lex_end:
3175
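To make the mask manipulation concrete, the following Python sketch replays the linearized control flow above and records the ``EXEC`` mask under which each statement runs. It is an illustration only: masks are plain integers with one bit per lane, and the function name is hypothetical.

```python
from typing import Dict


def linearized_exec_masks(entry_exec: int, c1: int, c2: int) -> Dict[str, int]:
    """Return the EXEC mask in effect for statements a-g.

    `c1` and `c2` are the masks of lanes for which each condition is true.
    """
    masks = {}
    exec_mask = entry_exec
    masks["a"] = exec_mask
    saved_1 = exec_mask               # %1 = EXEC
    exec_mask = saved_1 & c1          # outer THEN region
    masks["b"] = exec_mask
    saved_3 = exec_mask               # %3 = EXEC
    exec_mask = saved_3 & c2          # inner THEN region
    masks["c"] = exec_mask
    exec_mask = ~exec_mask & saved_3  # inner ELSE region
    masks["d"] = exec_mask
    exec_mask = saved_3               # restore after inner IF/THEN/ELSE
    masks["e"] = exec_mask
    exec_mask = ~exec_mask & saved_1  # outer ELSE region
    masks["f"] = exec_mask
    exec_mask = saved_1               # restore after outer IF/THEN/ELSE
    masks["g"] = exec_mask
    return masks
```

For a four-lane example with ``entry_exec = 0b1111``, ``c1 = 0b0011`` and ``c2 = 0b0101``, statement ``c`` runs with mask ``0b0001``, ``d`` with ``0b0010``, and ``f`` with ``0b1100``.
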
3176To create the DWARF location list expression that defines the location
3177description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
3178pseudo instruction can be used to annotate the linearized control flow. This can
3179be done by defining an artificial variable for the lane PC. The DWARF location
3180list expression created for it is used as the value of the
3181``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
3182
3183A DWARF procedure is defined for each well nested structured control flow region
3184which provides the conceptual lane program location for a lane if it is not
3185active (namely it is divergent). The DWARF operation expression for each region
3186conceptually inherits the value of the immediately enclosing region and modifies
3187it according to the semantics of the region.
3188
3189For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
3190the region for the ``THEN`` region since it is executed first. For the ``ELSE``
3191region the divergent program location is at the end of the ``IF/THEN/ELSE``
3192region since the ``THEN`` region has completed.
3193
3194The lane PC artificial variable is assigned at each region transition. It uses
3195the immediately enclosing region's DWARF procedure to compute the program
3196location for each lane assuming they are divergent, and then modifies the result
3197by inserting the current program location for each lane that the ``EXEC`` mask
3198indicates is active.
3199
3200By having separate DWARF procedures for each region, they can be reused to
3201define the value for any nested region. This reduces the total size of the DWARF
3202operation expressions.
3203
3204The following provides an example using pseudo LLVM MIR.
3205
3206.. code::
3207  :number-lines:
3208
3209  $lex_start:
3210    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
3211      DW_AT_name = "__uint64";
3212      DW_AT_byte_size = 8;
3213      DW_AT_encoding = DW_ATE_unsigned;
3214    ];
3215    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
3216      DW_AT_name = "__active_lane_pc";
3217      DW_AT_location = [
3218        DW_OP_regx PC;
3219        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
3221        DW_OP_LLVM_select_bit_piece 64, 64;
3222      ];
3223    ];
3224    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
3225      DW_AT_name = "__divergent_lane_pc";
3226      DW_AT_location = [
3227        DW_OP_LLVM_undefined;
3228        DW_OP_LLVM_extend 64, 64;
3229      ];
3230    ];
3231    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3232      DW_OP_call_ref %__divergent_lane_pc;
3233      DW_OP_call_ref %__active_lane_pc;
3234    ];
3235    a;
3236    %1 = EXEC;
3237    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
3238    %2 = c1;
3239  $lex_1_start:
3240    EXEC = %1 & %2;
3241  $lex_1_then:
3242      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
3243        DW_AT_name = "__divergent_lane_pc_1_then";
3244        DW_AT_location = DIExpression[
3245          DW_OP_call_ref %__divergent_lane_pc;
3246          DW_OP_addrx &lex_1_start;
3247          DW_OP_stack_value;
3248          DW_OP_LLVM_extend 64, 64;
3249          DW_OP_call_ref %__lex_1_save_exec;
3250          DW_OP_deref_type 64, %__uint_64;
3251          DW_OP_LLVM_select_bit_piece 64, 64;
3252        ];
3253      ];
3254      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3255        DW_OP_call_ref %__divergent_lane_pc_1_then;
3256        DW_OP_call_ref %__active_lane_pc;
3257      ];
3258      b;
3259      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
3261      %4 = c2;
3262  $lex_1_1_start:
3263      EXEC = %3 & %4;
3264  $lex_1_1_then:
3265        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
3266          DW_AT_name = "__divergent_lane_pc_1_1_then";
3267          DW_AT_location = DIExpression[
3268            DW_OP_call_ref %__divergent_lane_pc_1_then;
3269            DW_OP_addrx &lex_1_1_start;
3270            DW_OP_stack_value;
3271            DW_OP_LLVM_extend 64, 64;
3272            DW_OP_call_ref %__lex_1_1_save_exec;
3273            DW_OP_deref_type 64, %__uint_64;
3274            DW_OP_LLVM_select_bit_piece 64, 64;
3275          ];
3276        ];
3277        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3278          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
3279          DW_OP_call_ref %__active_lane_pc;
3280        ];
3281        c;
3282      EXEC = ~EXEC & %3;
3283  $lex_1_1_else:
3284        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
3285          DW_AT_name = "__divergent_lane_pc_1_1_else";
3286          DW_AT_location = DIExpression[
3287            DW_OP_call_ref %__divergent_lane_pc_1_then;
3288            DW_OP_addrx &lex_1_1_end;
3289            DW_OP_stack_value;
3290            DW_OP_LLVM_extend 64, 64;
3291            DW_OP_call_ref %__lex_1_1_save_exec;
3292            DW_OP_deref_type 64, %__uint_64;
3293            DW_OP_LLVM_select_bit_piece 64, 64;
3294          ];
3295        ];
3296        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3297          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
3298          DW_OP_call_ref %__active_lane_pc;
3299        ];
3300        d;
3301      EXEC = %3;
3302  $lex_1_1_end:
3303      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
3305        DW_OP_call_ref %__active_lane_pc;
3306      ];
3307      e;
3308    EXEC = ~EXEC & %1;
3309  $lex_1_else:
3310      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
3311        DW_AT_name = "__divergent_lane_pc_1_else";
3312        DW_AT_location = DIExpression[
3313          DW_OP_call_ref %__divergent_lane_pc;
3314          DW_OP_addrx &lex_1_end;
3315          DW_OP_stack_value;
3316          DW_OP_LLVM_extend 64, 64;
3317          DW_OP_call_ref %__lex_1_save_exec;
3318          DW_OP_deref_type 64, %__uint_64;
3319          DW_OP_LLVM_select_bit_piece 64, 64;
3320        ];
3321      ];
3322      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3323        DW_OP_call_ref %__divergent_lane_pc_1_else;
3324        DW_OP_call_ref %__active_lane_pc;
3325      ];
3326      f;
3327    EXEC = %1;
3328  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3330      DW_OP_call_ref %__divergent_lane_pc;
3331      DW_OP_call_ref %__active_lane_pc;
3332    ];
3333    g;
3334  $lex_end:
3335
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
3338
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
3344
3345The DWARF procedures for each region use the values of the saved execution mask
3346artificial variables to only update the lanes that are active on entry to the
3347region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
3350
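The way the per-region DWARF procedures compose can be modeled in Python. This is a deliberately simplified sketch: a lane PC vector is a list with one entry per lane, ``None`` stands for the undefined location description, and ``select_bit_piece`` only mimics the per-lane selection performed by ``DW_OP_LLVM_select_bit_piece`` for this use. The function names are hypothetical.

```python
from typing import List, Optional

LanePcs = List[Optional[int]]


def extend(value: Optional[int], lanes: int) -> LanePcs:
    """Model DW_OP_LLVM_extend: replicate one value for every lane."""
    return [value] * lanes


def select_bit_piece(zero: LanePcs, one: LanePcs, mask: int) -> LanePcs:
    """Per lane, take `one` where the mask bit is set, otherwise `zero`."""
    return [one[i] if (mask >> i) & 1 else zero[i] for i in range(len(zero))]


def region_divergent_lane_pc(enclosing: LanePcs, region_pc: int,
                             saved_exec: int) -> LanePcs:
    """Lanes active on entry to the region get the region's divergent
    program location; other lanes keep the enclosing region's value."""
    return select_bit_piece(enclosing, extend(region_pc, len(enclosing)),
                            saved_exec)


def lane_pc(divergent: LanePcs, current_pc: int, exec_mask: int) -> LanePcs:
    """Overlay the current program location on the active lanes."""
    return select_bit_piece(divergent, extend(current_pc, len(divergent)),
                            exec_mask)
```

For example, with four lanes, a saved mask of ``0b0111`` on entry to the outer ``THEN`` region at location ``0x100``, and a current ``EXEC`` of ``0b0011`` at location ``0x120``, the result is ``[0x120, 0x120, 0x100, None]``: lane 2 is divergent at the region start and lane 3 was never active in the subprogram.
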
3351Other structured control flow regions can be handled similarly. For example,
3352loops would set the divergent program location for the region at the end of the
3353loop. Any lanes active will be in the loop, and any lanes not active must have
3354exited the loop.
3355
3356An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
3357``IF/THEN/ELSE`` regions.
3358
3359The DWARF procedures can use the active lane artificial variable described in
3360:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
3361``EXEC`` mask in order to support whole or quad wavefront mode.
3362
3363.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
3364
3365``DW_AT_LLVM_active_lane``
3366~~~~~~~~~~~~~~~~~~~~~~~~~~
3367
3368The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
3369entry is used to specify the lanes that are conceptually active for a SIMT
3370thread.
3371
3372The execution mask may be modified to implement whole or quad wavefront mode
3373operations. For example, all lanes may need to temporarily be made active to
3374execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
3375update it to enable the necessary lanes, perform the operations, and then
3376restore the ``EXEC`` mask from the saved value. While executing the whole
3377wavefront region, the conceptual execution mask is the saved value, not the
3378``EXEC`` value.
3379
3380This is handled by defining an artificial variable for the active lane mask. The
3381active lane mask artificial variable would be the actual ``EXEC`` mask for
3382normal regions, and the saved execution mask for regions where the mask is
3383temporarily updated. The location list expression created for this artificial
3384variable is used to define the value of the ``DW_AT_LLVM_active_lane``
3385attribute.
3386
3387``DW_AT_LLVM_augmentation``
3388~~~~~~~~~~~~~~~~~~~~~~~~~~~
3389
3390For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
3391debugger information entry has the following value for the augmentation string:
3392
3393::
3394
3395  [amdgpu:v0.0]
3396
3397The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
3398extensions used in the DWARF of the compilation unit. The version number
3399conforms to [SEMVER]_.
3400
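A consumer might extract the version from the augmentation string as sketched below in Python. The function name is hypothetical, and the pattern assumes only the ``[amdgpu:vX.Y]`` form shown above. It searches rather than matches the whole string on the assumption that other bracketed contributions may also be present.

```python
import re
from typing import Optional, Tuple

# Matches the "[amdgpu:vX.Y]" form shown above.
_AMDGPU_AUGMENTATION = re.compile(r"\[amdgpu:v(\d+)\.(\d+)\]")


def amdgpu_augmentation_version(augmentation: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for the AMDGPU extensions, or None if absent."""
    match = _AMDGPU_AUGMENTATION.search(augmentation)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Comparing the returned tuple against the versions a consumer supports then follows the usual [SEMVER]_ rules.
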
3401Call Frame Information
3402----------------------
3403
3404DWARF Call Frame Information (CFI) describes how a consumer can virtually
3405*unwind* call frames in a running process or core dump. See DWARF Version 5
3406section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
3407
3408For AMDGPU, the Common Information Entry (CIE) fields have the following values:
3409
34101.  ``augmentation`` string contains the following null-terminated UTF-8 string:
3411
3412    ::
3413
3414      [amd:v0.0]
3415
3416    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or the FDEs that use it. The version number
3418    conforms to [SEMVER]_.
3419
34202.  ``address_size`` for the ``Global`` address space is defined in
3421    :ref:`amdgpu-dwarf-address-space-identifier`.
3422
34233.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
3424
34254.  ``code_alignment_factor`` is 4 bytes.
3426
3427    .. TODO::
3428
3429       Add to :ref:`amdgpu-processor-table` table.
3430
34315.  ``data_alignment_factor`` is 4 bytes.
3432
3433    .. TODO::
3434
3435       Add to :ref:`amdgpu-processor-table` table.
3436
34376.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
3438    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
3439
7.  ``initial_instructions``: Since a subprogram X with fewer registers can be
    called from a subprogram Y that has more allocated, X will not change any of
    the extra registers as it cannot access them. Therefore, the default rule
    for all columns is ``same value``.
3444
3445For AMDGPU the register number follows the numbering defined in
3446:ref:`amdgpu-dwarf-register-identifier`.
3447
3448For AMDGPU the instructions are variable size. A consumer can subtract 1 from
3449the return address to get the address of a byte within the call site
3450instructions. See DWARF Version 5 section 6.4.4.
3451
3452Accelerated Access
3453------------------
3454
3455See DWARF Version 5 section 6.1.
3456
3457Lookup By Name Section Header
3458~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3459
3460See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
3461
For AMDGPU the lookup by name section header table fields have the following
values:
3463
3464``augmentation_string_size`` (uword)
3465
3466  Set to the length of the ``augmentation_string`` value which is always a
3467  multiple of 4.
3468
3469``augmentation_string`` (sequence of UTF-8 characters)
3470
3471  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
3472
3473  ::
3474
3475    [amdgpu:v0.0]
3476
3477  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
3478  extensions used in the DWARF of this index. The version number conforms to
3479  [SEMVER]_.
3480
3481  .. note::
3482
3483    This is different to the DWARF Version 5 definition that requires the first
3484    4 characters to be the vendor ID. But this is consistent with the other
3485    augmentation strings and does allow multiple vendor contributions. However,
3486    backwards compatibility may be more desirable.
3487
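The padding rule for the two header fields can be illustrated with a short Python sketch. The function name is hypothetical, and the code only shows how the string and its size relate. It does not produce a complete section header.

```python
from typing import Tuple


def encode_augmentation_string(text: str = "[amdgpu:v0.0]") -> Tuple[int, bytes]:
    """Return (augmentation_string_size, augmentation_string bytes).

    The string is UTF-8 encoded and null padded to a multiple of 4 bytes.
    """
    raw = text.encode("utf-8")
    padded_size = (len(raw) + 3) // 4 * 4  # round up to a multiple of 4
    return padded_size, raw.ljust(padded_size, b"\0")
```

The 13-character string ``[amdgpu:v0.0]`` is padded with three null bytes, giving an ``augmentation_string_size`` of 16.
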
3488Lookup By Address Section Header
3489~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3490
3491See DWARF Version 5 section 6.1.2.
3492
For AMDGPU the lookup by address section header table fields have the following
values:
3494
3495``address_size`` (ubyte)
3496
3497  Match the address size for the ``Global`` address space defined in
3498  :ref:`amdgpu-dwarf-address-space-identifier`.
3499
3500``segment_selector_size`` (ubyte)
3501
3502  AMDGPU does not use a segment selector so this is 0. The entries in the
3503  ``.debug_aranges`` do not have a segment selector.
3504
3505Line Number Information
3506-----------------------
3507
3508See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
3509
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
3511The instruction set must be obtained from the ELF file header ``e_flags`` field
3512in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
3513<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
3514
3515.. TODO::
3516
3517  Should the ``isa`` state machine register be used to indicate if the code is
3518  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
3519
3520For AMDGPU the line number program header fields have the following values (see
3521DWARF Version 5 section 6.2.4):
3522
3523``address_size`` (ubyte)
3524  Matches the address size for the ``Global`` address space defined in
3525  :ref:`amdgpu-dwarf-address-space-identifier`.
3526
3527``segment_selector_size`` (ubyte)
3528  AMDGPU does not use a segment selector so this is 0.
3529
3530``minimum_instruction_length`` (ubyte)
3531  For GFX9-GFX11 this is 4.
3532
3533``maximum_operations_per_instruction`` (ubyte)
3534  For GFX9-GFX11 this is 1.
3535
3536Source text for online-compiled programs (for example, those compiled by the
3537OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
3538See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
3539Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
3540<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
3541
3542The Clang option used to control source embedding in AMDGPU is defined in
3543:ref:`amdgpu-clang-debug-options-table`.
3544
3545  .. table:: AMDGPU Clang Debug Options
3546     :name: amdgpu-clang-debug-options-table
3547
3548     ==================== ==================================================
3549     Debug Flag           Description
3550     ==================== ==================================================
3551     -g[no-]embed-source  Enable/disable embedding source text in DWARF
3552                          debug sections. Useful for environments where
3553                          source cannot be written to disk, such as
3554                          when performing online compilation.
3555     ==================== ==================================================
3556
3557For example:
3558
3559``-gembed-source``
3560  Enable the embedded source.
3561
3562``-gno-embed-source``
3563  Disable the embedded source.
3564
356532-Bit and 64-Bit DWARF Formats
3566-------------------------------
3567
3568See DWARF Version 5 section 7.4 and
3569:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
3570
3571For AMDGPU:
3572
3573* For the ``amdgcn`` target architecture only the 64-bit process address space
3574  is supported.
3575
3576* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
3577  the 32-bit DWARF format.
3578
3579Unit Headers
3580------------
3581
3582For AMDGPU the following values apply for each of the unit headers described in
3583DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
3584
3585``address_size`` (ubyte)
3586  Matches the address size for the ``Global`` address space defined in
3587  :ref:`amdgpu-dwarf-address-space-identifier`.
3588
3589.. _amdgpu-code-conventions:
3590
3591Code Conventions
3592================
3593
3594This section provides code conventions used for each supported target triple OS
3595(see :ref:`amdgpu-target-triples`).
3596
3597AMDHSA
3598------
3599
3600This section provides code conventions used when the target triple OS is
3601``amdhsa`` (see :ref:`amdgpu-target-triples`).
3602
3603.. _amdgpu-amdhsa-code-object-metadata:
3604
3605Code Object Metadata
3606~~~~~~~~~~~~~~~~~~~~
3607
3608The code object metadata specifies extensible metadata associated with the code
3609objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
3610encoding and semantics of this metadata depend on the code object version; see
3611:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
3612:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
3613:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
3614:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
3615
3616Code object metadata is specified in a note record (see
3617:ref:`amdgpu-note-records`) and is required when the target triple OS is
3618``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
3619information necessary to support the HSA compatible runtime kernel queries, for
3620example, the segment sizes needed in a dispatch packet. In addition, a
3621high-level language runtime may require other information to be included. For
3622example, the AMD OpenCL runtime records kernel argument information.
3623
3624.. _amdgpu-amdhsa-code-object-metadata-v2:
3625
3626Code Object V2 Metadata
3627+++++++++++++++++++++++
3628
3629.. warning::
3630  Code object V2 generation is no longer supported by this version of LLVM.
3631
3632Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
3633(see :ref:`amdgpu-note-records-v2`).
3634
3635The metadata is specified as a YAML formatted string (see [YAML]_ and
3636:doc:`YamlIO`).
3637
3638.. TODO::
3639
3640  Is the string null terminated? It probably should not if YAML allows it to
3641  contain null characters, otherwise it should be.
3642
3643The metadata is represented as a single YAML document consisting of the mapping
3644defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
3645referenced tables.
3646
3647For boolean values, the string values of ``false`` and ``true`` are used for
3648false and true respectively.
3649
3650Additional information can be added to the mappings. To avoid conflicts, any
3651non-AMD key names should be prefixed by "*vendor-name*.".
3652
3653  .. table:: AMDHSA Code Object V2 Metadata Map
3654     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
3655
3656     ========== ============== ========= =======================================
3657     String Key Value Type     Required? Description
3658     ========== ============== ========= =======================================
3659     "Version"  sequence of    Required  - The first integer is the major
3660                2 integers                 version. Currently 1.
3661                                         - The second integer is the minor
3662                                           version. Currently 0.
3663     "Printf"   sequence of              Each string is encoded information
3664                strings                  about a printf function call. The
3665                                         encoded information is organized as
3666                                         fields separated by colon (':'):
3667
3668                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3669
3670                                         where:
3671
3672                                         ``ID``
3673                                           A 32-bit integer as a unique id for
3674                                           each printf function call
3675
3676                                         ``N``
3677                                           A 32-bit integer equal to the number
3678                                           of arguments of printf function call
3679                                           minus 1
3680
3681                                         ``S[i]`` (where i = 0, 1, ... , N-1)
3682                                           32-bit integers for the size in bytes
3683                                           of the i-th FormatString argument of
3684                                           the printf function call
3685
3686                                         FormatString
3687                                           The format string passed to the
3688                                           printf function call.
3689     "Kernels"  sequence of    Required  Sequence of the mappings for each
3690                mapping                  kernel in the code object. See
3691                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
3692                                         for the definition of the mapping.
3693     ========== ============== ========= =======================================
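
The ``Printf`` string layout described above can be decoded with a short
helper. The following Python is a minimal sketch, not the parser used by any
runtime: the name ``parse_printf_entry`` is hypothetical, and it assumes the
integer fields are written in decimal.

.. code-block:: python

  def parse_printf_entry(entry):
      """Split one "Printf" string into (id, arg_sizes, format_string)."""
      printf_id, count, rest = entry.split(":", 2)
      n = int(count)
      # Split off exactly N size fields so that any ':' inside the
      # format string is preserved.
      fields = rest.split(":", n)
      if len(fields) != n + 1:
          raise ValueError("expected %d size fields" % n)
      sizes = [int(size) for size in fields[:n]]
      return int(printf_id), sizes, fields[n]

Splitting off exactly ``N`` size fields keeps any colon that appears inside the
format string intact.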
3694
3695..
3696
3697  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
3698     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
3699
3700     ================= ============== ========= ================================
3701     String Key        Value Type     Required? Description
3702     ================= ============== ========= ================================
3703     "Name"            string         Required  Source name of the kernel.
3704     "SymbolName"      string         Required  Name of the kernel
3705                                                descriptor ELF symbol.
3706     "Language"        string                   Source language of the kernel.
3707                                                Values include:
3708
3709                                                - "OpenCL C"
3710                                                - "OpenCL C++"
3711                                                - "HCC"
3712                                                - "OpenMP"
3713
3714     "LanguageVersion" sequence of              - The first integer is the major
3715                       2 integers                 version.
3716                                                - The second integer is the
3717                                                  minor version.
3718     "Attrs"           mapping                  Mapping of kernel attributes.
3719                                                See
3720                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
3721                                                for the mapping definition.
3722     "Args"            sequence of              Sequence of mappings of the
3723                       mapping                  kernel arguments. See
3724                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
3725                                                for the definition of the mapping.
3726     "CodeProps"       mapping                  Mapping of properties related to
3727                                                the kernel code. See
3728                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
3729                                                for the mapping definition.
3730     ================= ============== ========= ================================
3731
3732..
3733
3734  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
3735     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
3736
3737     =================== ============== ========= ==============================
3738     String Key          Value Type     Required? Description
3739     =================== ============== ========= ==============================
3740     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
3741                         3 integers               must be >=1 and the dispatch
3742                                                  work-group size X, Y, Z must
3743                                                  correspond to the specified
3744                                                  values. Defaults to 0, 0, 0.
3745
3746                                                  Corresponds to the OpenCL
3747                                                  ``reqd_work_group_size``
3748                                                  attribute.
3749     "WorkGroupSizeHint" sequence of              The dispatch work-group size
3750                         3 integers               X, Y, Z is likely to be the
3751                                                  specified values.
3752
3753                                                  Corresponds to the OpenCL
3754                                                  ``work_group_size_hint``
3755                                                  attribute.
3756     "VecTypeHint"       string                   The name of a scalar or vector
3757                                                  type.
3758
3759                                                  Corresponds to the OpenCL
3760                                                  ``vec_type_hint`` attribute.
3761
3762     "RuntimeHandle"     string                   The external symbol name
3763                                                  associated with a kernel.
3764                                                  OpenCL runtime allocates a
3765                                                  global buffer for the symbol
3766                                                  and saves the kernel's address
3767                                                  to it, which is used for
3768                                                  device side enqueueing. Only
3769                                                  available for device side
3770                                                  enqueued kernels.
3771     =================== ============== ========= ==============================
3772
3773..
3774
3775  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
3776     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
3777
3778     ================= ============== ========= ================================
3779     String Key        Value Type     Required? Description
3780     ================= ============== ========= ================================
3781     "Name"            string                   Kernel argument name.
3782     "TypeName"        string                   Kernel argument type name.
3783     "Size"            integer        Required  Kernel argument size in bytes.
3784     "Align"           integer        Required  Kernel argument alignment in
3785                                                bytes. Must be a power of two.
3786     "ValueKind"       string         Required  Kernel argument kind that
3787                                                specifies how to set up the
3788                                                corresponding argument.
3789                                                Values include:
3790
3791                                                "ByValue"
3792                                                  The argument is copied
3793                                                  directly into the kernarg.
3794
3795                                                "GlobalBuffer"
3796                                                  A global address space pointer
3797                                                  to the buffer data is passed
3798                                                  in the kernarg.
3799
3800                                                "DynamicSharedPointer"
3801                                                  A group address space pointer
3802                                                  to dynamically allocated LDS
3803                                                  is passed in the kernarg.
3804
3805                                                "Sampler"
3806                                                  A global address space
3807                                                  pointer to a S# is passed in
3808                                                  the kernarg.
3809
3810                                                "Image"
3811                                                  A global address space
3812                                                  pointer to a T# is passed in
3813                                                  the kernarg.
3814
3815                                                "Pipe"
3816                                                  A global address space pointer
3817                                                  to an OpenCL pipe is passed in
3818                                                  the kernarg.
3819
3820                                                "Queue"
3821                                                  A global address space pointer
3822                                                  to an OpenCL device enqueue
3823                                                  queue is passed in the
3824                                                  kernarg.
3825
3826                                                "HiddenGlobalOffsetX"
3827                                                  The OpenCL grid dispatch
3828                                                  global offset for the X
3829                                                  dimension is passed in the
3830                                                  kernarg.
3831
3832                                                "HiddenGlobalOffsetY"
3833                                                  The OpenCL grid dispatch
3834                                                  global offset for the Y
3835                                                  dimension is passed in the
3836                                                  kernarg.
3837
3838                                                "HiddenGlobalOffsetZ"
3839                                                  The OpenCL grid dispatch
3840                                                  global offset for the Z
3841                                                  dimension is passed in the
3842                                                  kernarg.
3843
3844                                                "HiddenNone"
3845                                                  An argument that is not used
3846                                                  by the kernel. Space needs to
3847                                                  be left for it, but it does
3848                                                  not need to be set up.
3849
3850                                                "HiddenPrintfBuffer"
3851                                                  A global address space pointer
3852                                                  to the runtime printf buffer
3853                                                  is passed in the kernarg.
3854                                                  Mutually exclusive with
3855                                                  "HiddenHostcallBuffer".
3856
3857                                                "HiddenHostcallBuffer"
3858                                                  A global address space pointer
3859                                                  to the runtime hostcall buffer
3860                                                  is passed in the kernarg.
3861                                                  Mutually exclusive with
3862                                                  "HiddenPrintfBuffer".
3863
3864                                                "HiddenDefaultQueue"
3865                                                  A global address space pointer
3866                                                  to the OpenCL device enqueue
3867                                                  queue that should be used by
3868                                                  the kernel by default is
3869                                                  passed in the kernarg.
3870
3871                                                "HiddenCompletionAction"
3872                                                  A global address space pointer
3873                                                  to help link enqueued kernels into
3874                                                  the ancestor tree for determining
3875                                                  when the parent kernel has finished.
3876
3877                                                "HiddenMultiGridSyncArg"
3878                                                  A global address space pointer for
3879                                                  multi-grid synchronization is
3880                                                  passed in the kernarg.
3881
3882     "ValueType"       string                   Unused and deprecated. This should no longer
3883                                                be emitted, but is accepted for compatibility.
3884
3885
3886     "PointeeAlign"    integer                  Alignment in bytes of pointee
3887                                                type for pointer type kernel
3888                                                argument. Must be a power
3889                                                of 2. Only present if
3890                                                "ValueKind" is
3891                                                "DynamicSharedPointer".
3892     "AddrSpaceQual"   string                   Kernel argument address space
3893                                                qualifier. Only present if
3894                                                "ValueKind" is "GlobalBuffer" or
3895                                                "DynamicSharedPointer". Values
3896                                                are:
3897
3898                                                - "Private"
3899                                                - "Global"
3900                                                - "Constant"
3901                                                - "Local"
3902                                                - "Generic"
3903                                                - "Region"
3904
3905                                                .. TODO::
3906
3907                                                   Is GlobalBuffer only Global
3908                                                   or Constant? Is
3909                                                   DynamicSharedPointer always
3910                                                   Local? Can HCC allow Generic?
3911                                                   How can Private or Region
3912                                                   ever happen?
3913
3914     "AccQual"         string                   Kernel argument access
3915                                                qualifier. Only present if
3916                                                "ValueKind" is "Image" or
3917                                                "Pipe". Values
3918                                                are:
3919
3920                                                - "ReadOnly"
3921                                                - "WriteOnly"
3922                                                - "ReadWrite"
3923
3924                                                .. TODO::
3925
3926                                                   Does this apply to
3927                                                   GlobalBuffer?
3928
3929     "ActualAccQual"   string                   The actual memory accesses
3930                                                performed by the kernel on the
3931                                                kernel argument. Only present if
3932                                                "ValueKind" is "GlobalBuffer",
3933                                                "Image", or "Pipe". This may be
3934                                                more restrictive than indicated
3935                                                by "AccQual" to reflect what the
3936                                                kernel actually does. If not
3937                                                present then the runtime must
3938                                                assume what is implied by
3939                                                "AccQual" and "IsConst". Values
3940                                                are:
3941
3942                                                - "ReadOnly"
3943                                                - "WriteOnly"
3944                                                - "ReadWrite"
3945
3946     "IsConst"         boolean                  Indicates if the kernel argument
3947                                                is const qualified. Only present
3948                                                if "ValueKind" is
3949                                                "GlobalBuffer".
3950
3951     "IsRestrict"      boolean                  Indicates if the kernel argument
3952                                                is restrict qualified. Only
3953                                                present if "ValueKind" is
3954                                                "GlobalBuffer".
3955
3956     "IsVolatile"      boolean                  Indicates if the kernel argument
3957                                                is volatile qualified. Only
3958                                                present if "ValueKind" is
3959                                                "GlobalBuffer".
3960
3961     "IsPipe"          boolean                  Indicates if the kernel argument
3962                                                is pipe qualified. Only present
3963                                                if "ValueKind" is "Pipe".
3964
3965                                                .. TODO::
3966
3967                                                   Can GlobalBuffer be pipe
3968                                                   qualified?
3969
3970     ================= ============== ========= ================================
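
The required keys and the "Only present if" notes in the kernel argument table
above can be checked mechanically. The following Python is a hedged sketch of
such a check over an already-parsed argument mapping; ``check_kernel_arg`` and
the two lookup tables are hypothetical names, and the check covers only the
rules stated in the table.

.. code-block:: python

  REQUIRED_ARG_KEYS = ("Size", "Align", "ValueKind")

  # Optional keys and the "ValueKind" values they may accompany, taken
  # from the "Only present if" notes in the table above.
  CONDITIONAL_ARG_KEYS = {
      "PointeeAlign": {"DynamicSharedPointer"},
      "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
      "AccQual": {"Image", "Pipe"},
      "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
      "IsConst": {"GlobalBuffer"},
      "IsRestrict": {"GlobalBuffer"},
      "IsVolatile": {"GlobalBuffer"},
      "IsPipe": {"Pipe"},
  }


  def is_power_of_two(value):
      return isinstance(value, int) and value > 0 and (value & (value - 1)) == 0


  def check_kernel_arg(arg):
      """Return a list of problems found in one kernel argument mapping."""
      problems = []
      for key in REQUIRED_ARG_KEYS:
          if key not in arg:
              problems.append("missing required key " + key)
      for key in ("Align", "PointeeAlign"):
          if key in arg and not is_power_of_two(arg[key]):
              problems.append(key + " must be a power of two")
      kind = arg.get("ValueKind")
      for key, kinds in CONDITIONAL_ARG_KEYS.items():
          if key in arg and kind not in kinds:
              problems.append("%s is not expected for ValueKind %s" % (key, kind))
      return problems

An empty result means that none of the rules encoded here were violated; it does
not imply that a runtime would accept the metadata.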
3971
3972..
3973
3974  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3975     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3976
3977     ============================ ============== ========= =====================
3978     String Key                   Value Type     Required? Description
3979     ============================ ============== ========= =====================
3980     "KernargSegmentSize"         integer        Required  The size in bytes of
3981                                                           the kernarg segment
3982                                                           that holds the values
3983                                                           of the arguments to
3984                                                           the kernel.
3985     "GroupSegmentFixedSize"      integer        Required  The amount of group
3986                                                           segment memory
3987                                                           required by a
3988                                                           work-group in
3989                                                           bytes. This does not
3990                                                           include any
3991                                                           dynamically allocated
3992                                                           group segment memory
3993                                                           that may be added
3994                                                           when the kernel is
3995                                                           dispatched.
3996     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
3997                                                           private address space
3998                                                           memory required for a
3999                                                           work-item in
4000                                                           bytes. If the kernel
4001                                                           uses a dynamic call
4002                                                           stack then additional
4003                                                           space must be added
4004                                                           to this value for the
4005                                                           call stack.
4006     "KernargSegmentAlign"        integer        Required  The maximum byte
4007                                                           alignment of
4008                                                           arguments in the
4009                                                           kernarg segment. Must
4010                                                           be a power of 2.
4011     "WavefrontSize"              integer        Required  Wavefront size. Must
4012                                                           be a power of 2.
4013     "NumSGPRs"                   integer        Required  Number of scalar
4014                                                           registers used by a
4015                                                           wavefront for
4016                                                           GFX6-GFX11. This
4017                                                           includes the special
4018                                                           SGPRs for VCC, Flat
4019                                                           Scratch (GFX7-GFX10)
4020                                                           and XNACK (for
4021                                                           GFX8-GFX10). It does
4022                                                           not include the 16
4023                                                           SGPRs added if a trap
4024                                                           handler is
4025                                                           enabled. It is not
4026                                                           rounded up to the
4027                                                           allocation
4028                                                           granularity.
4029     "NumVGPRs"                   integer        Required  Number of vector
4030                                                           registers used by
4031                                                           each work-item for
4032                                                           GFX6-GFX11.
4033     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
4034                                                           work-group size
4035                                                           supported by the
4036                                                           kernel in work-items.
4037                                                           Must be >=1 and
4038                                                           consistent with
4039                                                           ReqdWorkGroupSize if
4040                                                           not 0, 0, 0.
4041     "NumSpilledSGPRs"            integer                  Number of stores from
4042                                                           a scalar register to
4043                                                           a register allocator
4044                                                           created spill
4045                                                           location.
4046     "NumSpilledVGPRs"            integer                  Number of stores from
4047                                                           a vector register to
4048                                                           a register allocator
4049                                                           created spill
4050                                                           location.
4051     ============================ ============== ========= =====================
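
The constraints in the kernel code properties table above can be sketched as a
small check. The Python below is illustrative only: ``check_code_props`` is a
hypothetical name, and reading "consistent with ReqdWorkGroupSize" as "the
product of the required sizes does not exceed ``MaxFlatWorkGroupSize``" is an
assumption rather than something the table states.

.. code-block:: python

  REQUIRED_CODE_PROPS = (
      "KernargSegmentSize",
      "GroupSegmentFixedSize",
      "PrivateSegmentFixedSize",
      "KernargSegmentAlign",
      "WavefrontSize",
      "NumSGPRs",
      "NumVGPRs",
      "MaxFlatWorkGroupSize",
  )


  def check_code_props(props, reqd_work_group_size=(0, 0, 0)):
      """Return a list of problems found in a "CodeProps" mapping."""
      problems = ["missing required key " + key
                  for key in REQUIRED_CODE_PROPS if key not in props]
      for key in ("KernargSegmentAlign", "WavefrontSize"):
          value = props.get(key)
          if value is not None and (value < 1 or (value & (value - 1)) != 0):
              problems.append(key + " must be a power of 2")
      max_size = props.get("MaxFlatWorkGroupSize")
      x, y, z = reqd_work_group_size
      if max_size is not None:
          if max_size < 1:
              problems.append("MaxFlatWorkGroupSize must be >= 1")
          # Assumption: "consistent" is read as the required work-group
          # size fitting within the maximum flat work-group size.
          elif (x, y, z) != (0, 0, 0) and x * y * z > max_size:
              problems.append("ReqdWorkGroupSize exceeds MaxFlatWorkGroupSize")
      return problems

The ``reqd_work_group_size`` parameter stands in for the ``ReqdWorkGroupSize``
kernel attribute, which defaults to 0, 0, 0 when absent.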
4052
4053.. _amdgpu-amdhsa-code-object-metadata-v3:
4054
4055Code Object V3 Metadata
4056+++++++++++++++++++++++
4057
4058.. warning::
4059  Code object V3 generation is no longer supported by this version of LLVM.
4060
4061Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
4062record (see :ref:`amdgpu-note-records-v3-onwards`).
4063
4064The metadata is represented as Message Pack formatted binary data (see
4065[MsgPack]_). The top level is a Message Pack map that includes the
4066keys defined in table
4067:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
4068tables.
4069
4070Additional information can be added to the maps. To avoid conflicts,
4071any key names should be prefixed by "*vendor-name*." where
4072``vendor-name`` can be the name of the vendor and specific vendor
4073tool that generates the information. The prefix is abbreviated to
4074simply "." when it appears within a map that has been added by the
4075same *vendor-name*.
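
The abbreviation rule can be pictured as a simple string expansion. The Python
below is a sketch under stated assumptions: ``expand_key`` is a hypothetical
helper, and the caller is assumed to know which *vendor-name* added the
enclosing map.

.. code-block:: python

  def expand_key(key, vendor_name):
      """Return the full key name for a possibly abbreviated key.

      A key that starts with "." inside a map added by vendor_name is
      treated as shorthand for "<vendor_name>.<rest of key>".
      """
      if key.startswith("."):
          return vendor_name + key
      return key

For example, ``expand_key(".name", "amdhsa")`` returns ``"amdhsa.name"``, while
a key that already carries a prefix is returned unchanged.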
4076
4077  .. table:: AMDHSA Code Object V3 Metadata Map
4078     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
4079
4080     ================= ============== ========= =======================================
4081     String Key        Value Type     Required? Description
4082     ================= ============== ========= =======================================
4083     "amdhsa.version"  sequence of    Required  - The first integer is the major
4084                       2 integers                 version. Currently 1.
4085                                                - The second integer is the minor
4086                                                  version. Currently 0.
4087     "amdhsa.printf"   sequence of              Each string is encoded information
4088                       strings                  about a printf function call. The
4089                                                encoded information is organized as
4090                                                fields separated by colon (':'):
4091
4092                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
4093
4094                                                where:
4095
4096                                                ``ID``
4097                                                  A 32-bit integer as a unique id for
4098                                                  each printf function call
4099
4100                                                ``N``
4101                                                  A 32-bit integer equal to the number
4102                                                  of arguments of printf function call
4103                                                  minus 1
4104
4105                                                ``S[i]`` (where i = 0, 1, ... , N-1)
4106                                                  32-bit integers for the size in bytes
4107                                                  of the i-th FormatString argument of
4108                                                  the printf function call
4109
4110                                                FormatString
4111                                                  The format string passed to the
4112                                                  printf function call.
4113     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
4114                       map                      kernel in the code object. See
4115                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
4116                                                for the definition of the keys included
4117                                                in that map.
4118     ================= ============== ========= =======================================
4119
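The field layout of an "amdhsa.printf" string can be sketched in Python as
follows. This is a minimal illustration, not the parser used by any runtime:
the function name ``parse_printf_entry`` and the sample string are made up, and
the sketch assumes the string is well formed. Because the format string may
itself contain ':' characters, only the first ``N + 2`` colons are treated as
field separators.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split one "amdhsa.printf" string into (id, sizes, format_string)."""
       id_text, n_text, rest = entry.split(":", 2)
       n = int(n_text)
       # Split off exactly N size fields; whatever remains is the format
       # string, which is kept intact even if it contains ':' characters.
       fields = rest.split(":", n)
       if len(fields) != n + 1:
           raise ValueError("expected %d size fields in %r" % (n, entry))
       sizes = [int(field) for field in fields[:n]]
       return int(id_text), sizes, fields[n]

   # Hypothetical entry for a call such as printf("x=%d y=%ld", a, b).
   print(parse_printf_entry("1:2:4:8:x=%d y=%ld"))

For the sample entry the sketch yields the ID ``1``, the argument sizes
``[4, 8]``, and the format string ``x=%d y=%ld``.
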
4120..
4121
4122  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
4123     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
4124
4125     =================================== ============== ========= ================================
4126     String Key                          Value Type     Required? Description
4127     =================================== ============== ========= ================================
4128     ".name"                             string         Required  Source name of the kernel.
4129     ".symbol"                           string         Required  Name of the kernel
4130                                                                  descriptor ELF symbol.
4131     ".language"                         string                   Source language of the kernel.
4132                                                                  Values include:
4133
4134                                                                  - "OpenCL C"
4135                                                                  - "OpenCL C++"
4136                                                                  - "HCC"
4137                                                                  - "HIP"
4138                                                                  - "OpenMP"
4139                                                                  - "Assembler"
4140
4141     ".language_version"                 sequence of              - The first integer is the major
4142                                         2 integers                 version.
4143                                                                  - The second integer is the
4144                                                                    minor version.
4145     ".args"                             sequence of              Sequence of maps of the
4146                                         map                      kernel arguments. See
4147                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
4148                                                                  for the definition of the keys
4149                                                                  included in that map.
4150     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
4151                                         3 integers               must be >=1 and the dispatch
4152                                                                  work-group size X, Y, Z must
4153                                                                  correspond to the specified
4154                                                                  values. Defaults to 0, 0, 0.
4155
4156                                                                  Corresponds to the OpenCL
4157                                                                  ``reqd_work_group_size``
4158                                                                  attribute.
4159     ".workgroup_size_hint"              sequence of              The dispatch work-group size
4160                                         3 integers               X, Y, Z is likely to be the
4161                                                                  specified values.
4162
4163                                                                  Corresponds to the OpenCL
4164                                                                  ``work_group_size_hint``
4165                                                                  attribute.
4166     ".vec_type_hint"                    string                   The name of a scalar or vector
4167                                                                  type.
4168
4169                                                                  Corresponds to the OpenCL
4170                                                                  ``vec_type_hint`` attribute.
4171
4172     ".device_enqueue_symbol"            string                   The external symbol name
4173                                                                  associated with a kernel.
4174                                                                  OpenCL runtime allocates a
4175                                                                  global buffer for the symbol
4176                                                                  and saves the kernel's address
4177                                                                  to it, which is used for
4178                                                                  device side enqueueing. Only
4179                                                                  available for device side
4180                                                                  enqueued kernels.
4181     ".kernarg_segment_size"             integer        Required  The size in bytes of
4182                                                                  the kernarg segment
4183                                                                  that holds the values
4184                                                                  of the arguments to
4185                                                                  the kernel.
4186     ".group_segment_fixed_size"         integer        Required  The amount of group
4187                                                                  segment memory
4188                                                                  required by a
4189                                                                  work-group in
4190                                                                  bytes. This does not
4191                                                                  include any
4192                                                                  dynamically allocated
4193                                                                  group segment memory
4194                                                                  that may be added
4195                                                                  when the kernel is
4196                                                                  dispatched.
4197     ".private_segment_fixed_size"       integer        Required  The amount of fixed
4198                                                                  private address space
4199                                                                  memory required for a
4200                                                                  work-item in
4201                                                                  bytes. If the kernel
4202                                                                  uses a dynamic call
4203                                                                  stack then additional
4204                                                                  space must be added
4205                                                                  to this value for the
4206                                                                  call stack.
4207     ".kernarg_segment_align"            integer        Required  The maximum byte
4208                                                                  alignment of
4209                                                                  arguments in the
4210                                                                  kernarg segment. Must
4211                                                                  be a power of 2.
4212     ".wavefront_size"                   integer        Required  Wavefront size. Must
4213                                                                  be a power of 2.
4214     ".sgpr_count"                       integer        Required  Number of scalar
4215                                                                  registers required by a
4216                                                                  wavefront for
4217                                                                  GFX6-GFX9. A register
4218                                                                  is required if it is
4219                                                                  used explicitly, or
4220                                                                  if a higher numbered
4221                                                                  register is used
4222                                                                  explicitly. This
4223                                                                  includes the special
4224                                                                  SGPRs for VCC, Flat
4225                                                                  Scratch (GFX7-GFX9)
4226                                                                  and XNACK (for
4227                                                                  GFX8-GFX9). It does
4228                                                                  not include the 16
4229                                                                  SGPRs added if a trap
4230                                                                  handler is
4231                                                                  enabled. It is not
4232                                                                  rounded up to the
4233                                                                  allocation
4234                                                                  granularity.
4235     ".vgpr_count"                       integer        Required  Number of vector
4236                                                                  registers required by
4237                                                                  each work-item for
4238                                                                  GFX6-GFX9. A register
4239                                                                  is required if it is
4240                                                                  used explicitly, or
4241                                                                  if a higher numbered
4242                                                                  register is used
4243                                                                  explicitly.
4244     ".agpr_count"                       integer        Required  Number of accumulator
4245                                                                  registers required by
4246                                                                  each work-item for
4247                                                                  GFX90A, GFX908.
4248     ".max_flat_workgroup_size"          integer        Required  Maximum flat
4249                                                                  work-group size
4250                                                                  supported by the
4251                                                                  kernel in work-items.
4252                                                                  Must be >=1 and
4253                                                                  consistent with
4254                                                                  ".reqd_workgroup_size" if
4255                                                                  not 0, 0, 0.
4256     ".sgpr_spill_count"                 integer                  Number of stores from
4257                                                                  a scalar register to
4258                                                                  a register allocator
4259                                                                  created spill
4260                                                                  location.
4261     ".vgpr_spill_count"                 integer                  Number of stores from
4262                                                                  a vector register to
4263                                                                  a register allocator
4264                                                                  created spill
4265                                                                  location.
4266     ".kind"                             string                   The kind of the kernel
4267                                                                  with the following
4268                                                                  values:
4269
4270                                                                  "normal"
4271                                                                    Regular kernels.
4272
4273                                                                  "init"
4274                                                                    These kernels must be
4275                                                                    invoked after loading
4276                                                                    the containing code
4277                                                                    object and must
4278                                                                    complete before any
4279                                                                    normal and fini
4280                                                                    kernels in the same
4281                                                                    code object are
4282                                                                    invoked.
4283
4284                                                                  "fini"
4285                                                                    These kernels must be
4286                                                                    invoked before
4287                                                                    unloading the
4288                                                                    containing code object
4289                                                                    and after all init and
4290                                                                    normal kernels in the
4291                                                                    same code object have
4292                                                                    been invoked and
4293                                                                    completed.
4294
4295                                                                  If omitted, "normal" is
4296                                                                  assumed.
4297     ".max_num_work_groups_{x,y,z}"      integer                  The max number of
4298                                                                  launched work-groups
4299                                                                  in the X, Y, and Z
4300                                                                  dimensions. Each number
4301                                                                  must be >=1.
4302     =================================== ============== ========= ================================
4303
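Some of the constraints in the kernel metadata map can be checked mechanically.
The Python sketch below is only an illustration: the function names and the
sample kernel values are hypothetical, the list of required keys simply mirrors
the "Required?" column of the table, and reading "consistent" as "the product
of the required work-group dimensions does not exceed the flat maximum" is an
assumption.

.. code-block:: python

   # Keys marked "Required" in the kernel metadata map table.
   REQUIRED_KEYS = (
       ".name", ".symbol", ".kernarg_segment_size",
       ".group_segment_fixed_size", ".private_segment_fixed_size",
       ".kernarg_segment_align", ".wavefront_size", ".sgpr_count",
       ".vgpr_count", ".agpr_count", ".max_flat_workgroup_size",
   )

   def is_power_of_two(value):
       return value >= 1 and (value & (value - 1)) == 0

   def check_kernel(kernel):
       """Return a list of problems found in one kernel metadata map."""
       problems = ["missing " + k for k in REQUIRED_KEYS if k not in kernel]
       for key in (".kernarg_segment_align", ".wavefront_size"):
           if key in kernel and not is_power_of_two(kernel[key]):
               problems.append(key + " must be a power of 2")
       flat = kernel.get(".max_flat_workgroup_size")
       if flat is not None and flat < 1:
           problems.append(".max_flat_workgroup_size must be >= 1")
       reqd = list(kernel.get(".reqd_workgroup_size", [0, 0, 0]))
       if reqd != [0, 0, 0]:
           if min(reqd) < 1:
               problems.append(".reqd_workgroup_size values must be >= 1")
           # Assumption: "consistent" means X * Y * Z fits in the maximum.
           elif flat is not None and reqd[0] * reqd[1] * reqd[2] > flat:
               problems.append(
                   ".reqd_workgroup_size exceeds .max_flat_workgroup_size")
       return problems

   # Hypothetical kernel map; all values are made up for the example.
   kernel = {
       ".name": "vec_add", ".symbol": "vec_add.kd",
       ".kernarg_segment_size": 24, ".group_segment_fixed_size": 0,
       ".private_segment_fixed_size": 0, ".kernarg_segment_align": 8,
       ".wavefront_size": 64, ".sgpr_count": 16, ".vgpr_count": 8,
       ".agpr_count": 0, ".max_flat_workgroup_size": 256,
       ".reqd_workgroup_size": [16, 16, 2],
   }
   print(check_kernel(kernel))

The sample kernel requests 16 x 16 x 2 = 512 work-items while allowing at most
256, so the sketch reports one problem for it.
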
4304..
4305
4306  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
4307     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
4308
4309     ====================== ============== ========= ================================
4310     String Key             Value Type     Required? Description
4311     ====================== ============== ========= ================================
4312     ".name"                string                   Kernel argument name.
4313     ".type_name"           string                   Kernel argument type name.
4314     ".size"                integer        Required  Kernel argument size in bytes.
4315     ".offset"              integer        Required  Kernel argument offset in
4316                                                     bytes. The offset must be a
4317                                                     multiple of the alignment
4318                                                     required by the argument.
4319     ".value_kind"          string         Required  Kernel argument kind that
4320                                                     specifies how to set up the
4321                                                     corresponding argument.
4322                                                     Values include:
4323
4324                                                     "by_value"
4325                                                       The argument is copied
4326                                                       directly into the kernarg.
4327
4328                                                     "global_buffer"
4329                                                       A global address space pointer
4330                                                       to the buffer data is passed
4331                                                       in the kernarg.
4332
4333                                                     "dynamic_shared_pointer"
4334                                                       A group address space pointer
4335                                                       to dynamically allocated LDS
4336                                                       is passed in the kernarg.
4337
4338                                                     "sampler"
4339                                                       A global address space
4340                                                       pointer to a S# is passed in
4341                                                       the kernarg.
4342
4343                                                     "image"
4344                                                       A global address space
4345                                                       pointer to a T# is passed in
4346                                                       the kernarg.
4347
4348                                                     "pipe"
4349                                                       A global address space pointer
4350                                                       to an OpenCL pipe is passed in
4351                                                       the kernarg.
4352
4353                                                     "queue"
4354                                                       A global address space pointer
4355                                                       to an OpenCL device enqueue
4356                                                       queue is passed in the
4357                                                       kernarg.
4358
4359                                                     "hidden_global_offset_x"
4360                                                       The OpenCL grid dispatch
4361                                                       global offset for the X
4362                                                       dimension is passed in the
4363                                                       kernarg.
4364
4365                                                     "hidden_global_offset_y"
4366                                                       The OpenCL grid dispatch
4367                                                       global offset for the Y
4368                                                       dimension is passed in the
4369                                                       kernarg.
4370
4371                                                     "hidden_global_offset_z"
4372                                                       The OpenCL grid dispatch
4373                                                       global offset for the Z
4374                                                       dimension is passed in the
4375                                                       kernarg.
4376
4377                                                     "hidden_none"
4378                                                       An argument that is not used
4379                                                       by the kernel. Space needs to
4380                                                       be left for it, but it does
4381                                                       not need to be set up.
4382
4383                                                     "hidden_printf_buffer"
4384                                                       A global address space pointer
4385                                                       to the runtime printf buffer
4386                                                       is passed in kernarg. Mutually
4387                                                       exclusive with
4388                                                       "hidden_hostcall_buffer"
4389                                                       before Code Object V5.
4390
4391                                                     "hidden_hostcall_buffer"
4392                                                       A global address space pointer
4393                                                       to the runtime hostcall buffer
4394                                                       is passed in kernarg. Mutually
4395                                                       exclusive with
4396                                                       "hidden_printf_buffer"
4397                                                       before Code Object V5.
4398
4399                                                     "hidden_default_queue"
4400                                                       A global address space pointer
4401                                                       to the OpenCL device enqueue
4402                                                       queue that should be used by
4403                                                       the kernel by default is
4404                                                       passed in the kernarg.
4405
4406                                                     "hidden_completion_action"
4407                                                       A global address space pointer
4408                                                       to help link enqueued kernels into
4409                                                       the ancestor tree for determining
4410                                                       when the parent kernel has finished.
4411
4412                                                     "hidden_multigrid_sync_arg"
4413                                                       A global address space pointer for
4414                                                       multi-grid synchronization is
4415                                                       passed in the kernarg.
4416
4417     ".value_type"          string                    Unused and deprecated. This should no longer
4418                                                      be emitted, but is accepted for compatibility.
4419
4420     ".pointee_align"       integer                  Alignment in bytes of pointee
4421                                                     type for pointer type kernel
4422                                                     argument. Must be a power
4423                                                     of 2. Only present if
4424                                                     ".value_kind" is
4425                                                     "dynamic_shared_pointer".
4426     ".address_space"       string                   Kernel argument address space
4427                                                     qualifier. Only present if
4428                                                     ".value_kind" is "global_buffer" or
4429                                                     "dynamic_shared_pointer". Values
4430                                                     are:
4431
4432                                                     - "private"
4433                                                     - "global"
4434                                                     - "constant"
4435                                                     - "local"
4436                                                     - "generic"
4437                                                     - "region"
4438
4439                                                     .. TODO::
4440
4441                                                        Is "global_buffer" only "global"
4442                                                        or "constant"? Is
4443                                                        "dynamic_shared_pointer" always
4444                                                        "local"? Can HCC allow "generic"?
4445                                                        How can "private" or "region"
4446                                                        ever happen?
4447
4448     ".access"              string                   Kernel argument access
4449                                                     qualifier. Only present if
4450                                                     ".value_kind" is "image" or
4451                                                     "pipe". Values
4452                                                     are:
4453
4454                                                     - "read_only"
4455                                                     - "write_only"
4456                                                     - "read_write"
4457
4458                                                     .. TODO::
4459
4460                                                        Does this apply to
4461                                                        "global_buffer"?
4462
4463     ".actual_access"       string                   The actual memory accesses
4464                                                     performed by the kernel on the
4465                                                     kernel argument. Only present if
4466                                                     ".value_kind" is "global_buffer",
4467                                                     "image", or "pipe". This may be
4468                                                     more restrictive than indicated
4469                                                     by ".access" to reflect what the
4470                                                     kernel actually does. If not
4471                                                     present then the runtime must
4472                                                     assume what is implied by
4473                                                     ".access" and ".is_const".
4474                                                     Values are:
4475
4476                                                     - "read_only"
4477                                                     - "write_only"
4478                                                     - "read_write"
4479
4480     ".is_const"            boolean                  Indicates if the kernel argument
4481                                                     is const qualified. Only present
4482                                                     if ".value_kind" is
4483                                                     "global_buffer".
4484
4485     ".is_restrict"         boolean                  Indicates if the kernel argument
4486                                                     is restrict qualified. Only
4487                                                     present if ".value_kind" is
4488                                                     "global_buffer".
4489
4490     ".is_volatile"         boolean                  Indicates if the kernel argument
4491                                                     is volatile qualified. Only
4492                                                     present if ".value_kind" is
4493                                                     "global_buffer".
4494
4495     ".is_pipe"             boolean                  Indicates if the kernel argument
4496                                                     is pipe qualified. Only present
4497                                                     if ".value_kind" is "pipe".
4498
4499                                                     .. TODO::
4500
4501                                                        Can "global_buffer" be pipe
4502                                                        qualified?
4503
4504     ====================== ============== ========= ================================
4505
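The rule that ".offset" must be a multiple of the argument's required alignment
can be illustrated with a small Python sketch. The argument maps do not record
an alignment, so the sizes and alignments below are supplied by the caller and
are hypothetical. Packing each argument at the next suitably aligned offset is
an assumption made for the example, not a statement of how any compiler lays
out the kernarg segment.

.. code-block:: python

   def layout_args(args):
       """Assign ".offset" to (name, value_kind, size, align) tuples."""
       offset = 0
       max_align = 1
       maps = []
       for name, value_kind, size, align in args:
           # Round up so that ".offset" is a multiple of the alignment.
           offset = (offset + align - 1) // align * align
           maps.append({".name": name, ".size": size, ".offset": offset,
                        ".value_kind": value_kind})
           offset += size
           max_align = max(max_align, align)
       return maps, max_align

   # Hypothetical arguments: a 4-byte value, an 8-byte pointer, a 1-byte flag.
   arg_maps, kernarg_align = layout_args([
       ("n", "by_value", 4, 4),
       ("data", "global_buffer", 8, 8),
       ("flag", "by_value", 1, 1),
   ])
   for arg in arg_maps:
       print(arg[".name"], arg[".offset"])

With these inputs the arguments land at offsets 0, 8, and 16, and the largest
alignment seen is 8, which is the kind of value ".kernarg_segment_align"
describes.
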
4506.. _amdgpu-amdhsa-code-object-metadata-v4:
4507
4508Code Object V4 Metadata
4509+++++++++++++++++++++++
4510
4511.. warning::
4512  Code object V4 is not the default code object version emitted by this version
4513  of LLVM.
4514
4515Code object V4 metadata is the same as
4516:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
4517defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
4518
4519  .. table:: AMDHSA Code Object V4 Metadata Map Changes
4520     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
4521
4522     ================= ============== ========= =======================================
4523     String Key        Value Type     Required? Description
4524     ================= ============== ========= =======================================
4525     "amdhsa.version"  sequence of    Required  - The first integer is the major
4526                       2 integers                 version. Currently 1.
4527                                                - The second integer is the minor
4528                                                  version. Currently 1.
4529     "amdhsa.target"   string         Required  The target name of the code using the syntax:
4530
4531                                                .. code::
4532
4533                                                  <target-triple> [ "-" <target-id> ]
4534
4535                                                A canonical target ID must be
4536                                                used. See :ref:`amdgpu-target-triples`
4537                                                and :ref:`amdgpu-target-id`.
4538     ================= ============== ========= =======================================
4539
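As a rough illustration of the ``amdhsa.target`` syntax above, the following
Python sketch assembles the two V4 map entries. The helper name
``v4_metadata_map`` and the example triple and target ID are assumptions chosen
for illustration; the sketch does not check that the target ID is canonical.

.. code:: python

  def v4_metadata_map(target_triple, target_id=None):
      """Return the code object V4 changes to the metadata map (sketch)."""
      target = target_triple
      if target_id is not None:
          # <target-triple> [ "-" <target-id> ]
          target = target_triple + "-" + target_id
      return {
          "amdhsa.version": [1, 1],  # major 1, minor 1 for code object V4
          "amdhsa.target": target,
      }


  # Hypothetical values: a triple with an empty environment component.
  print(v4_metadata_map("amdgcn-amd-amdhsa-", "gfx90a:sramecc+:xnack-"))
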
4540.. _amdgpu-amdhsa-code-object-metadata-v5:
4541
4542Code Object V5 Metadata
4543+++++++++++++++++++++++
4544
4545Code object V5 metadata is the same as
4546:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
4547:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
4548:ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
4549:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
4550
4551  .. table:: AMDHSA Code Object V5 Metadata Map Changes
4552     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
4553
4554     ================= ============== ========= =======================================
4555     String Key        Value Type     Required? Description
4556     ================= ============== ========= =======================================
4557     "amdhsa.version"  sequence of    Required  - The first integer is the major
4558                       2 integers                 version. Currently 1.
4559                                                - The second integer is the minor
4560                                                  version. Currently 2.
4561     ================= ============== ========= =======================================
4562
4563..
4564
4565  .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
4566     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
4567
4568     ============================= ============= ========== =======================================
4569     String Key                    Value Type     Required? Description
4570     ============================= ============= ========== =======================================
4571     ".uses_dynamic_stack"         boolean                  Indicates if the generated machine code
4572                                                            is using a dynamically sized stack.
4573     ".workgroup_processor_mode"   boolean                  (GFX10+) Controls ENABLE_WGP_MODE in
4574                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4575     ============================= ============= ========== =======================================
4576
4577..
4578
4579  .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
4580     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
4581
4582     =========================== ============== ========= ==============================
4583     String Key                  Value Type     Required? Description
4584     =========================== ============== ========= ==============================
4585     ".uniform_work_group_size"  integer                  Indicates if the kernel
4586                                                          requires that each dimension
4587                                                          of global size is a multiple
4588                                                          of corresponding dimension of
4589                                                          work-group size. Value of 1
4590                                                          implies true and value of 0
4591                                                          implies false. Metadata is
4592                                                          only emitted when value is 1.
4593     =========================== ============== ========= ==============================
4594
4595..
4596
4597..
4598
4599  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
4600     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
4601
4602     ====================== ============== ========= ================================
4603     String Key             Value Type     Required? Description
4604     ====================== ============== ========= ================================
4605     ".value_kind"          string         Required  Kernel argument kind that
4606                                                     specifies how to set up the
4607                                                     corresponding argument.
4608                                                     Values include:
4609                                                     the same as code object V3 metadata
4610                                                     (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
4611                                                     with the following additions:
4612
4613                                                     "hidden_block_count_x"
4614                                                       The grid dispatch work-group count for the X dimension
4615                                                       is passed in the kernarg. Some languages, such as OpenCL,
4616                                                       support a last work-group in each dimension being partial.
4617                                                       This count only includes the non-partial work-group count.
4618                                                       This is not the same as the value in the AQL dispatch packet,
4619                                                       which has the grid size in work-items.
4620
4621                                                     "hidden_block_count_y"
4622                                                       The grid dispatch work-group count for the Y dimension
4623                                                       is passed in the kernarg. Some languages, such as OpenCL,
4624                                                       support a last work-group in each dimension being partial.
4625                                                       This count only includes the non-partial work-group count.
4626                                                       This is not the same as the value in the AQL dispatch packet,
4627                                                       which has the grid size in work-items. If the grid dimensionality
4628                                                       is 1, then this value must be 1.
4629
4630                                                     "hidden_block_count_z"
4631                                                       The grid dispatch work-group count for the Z dimension
4632                                                       is passed in the kernarg. Some languages, such as OpenCL,
4633                                                       support a last work-group in each dimension being partial.
4634                                                       This count only includes the non-partial work-group count.
4635                                                       This is not the same as the value in the AQL dispatch packet,
4636                                                       which has the grid size in work-items. If the grid dimensionality
4637                                                       is 1 or 2, then this value must be 1.
4638
4639                                                     "hidden_group_size_x"
4640                                                       The grid dispatch work-group size for the X dimension is
4641                                                       passed in the kernarg. This size only applies to the
4642                                                       non-partial work-groups. This is the same value as the AQL
4643                                                       dispatch packet work-group size.
4644
4645                                                     "hidden_group_size_y"
4646                                                       The grid dispatch work-group size for the Y dimension is
4647                                                       passed in the kernarg. This size only applies to the
4648                                                       non-partial work-groups. This is the same value as the AQL
4649                                                       dispatch packet work-group size. If the grid dimensionality
4650                                                       is 1, then this value must be 1.
4651
4652                                                     "hidden_group_size_z"
4653                                                       The grid dispatch work-group size for the Z dimension is
4654                                                       passed in the kernarg. This size only applies to the
4655                                                       non-partial work-groups. This is the same value as the AQL
4656                                                       dispatch packet work-group size. If the grid dimensionality
4657                                                       is 1 or 2, then this value must be 1.
4658
4659                                                     "hidden_remainder_x"
4660                                                       The grid dispatch work-group size of the partial work-group
4661                                                       of the X dimension, if it exists. Must be zero if a partial
4662                                                       work-group does not exist in the X dimension.
4663
4664                                                     "hidden_remainder_y"
4665                                                       The grid dispatch work-group size of the partial work-group
4666                                                       of the Y dimension, if it exists. Must be zero if a partial
4667                                                       work-group does not exist in the Y dimension.
4668
4669                                                     "hidden_remainder_z"
4670                                                       The grid dispatch work-group size of the partial work-group
4671                                                       of the Z dimension, if it exists. Must be zero if a partial
4672                                                       work-group does not exist in the Z dimension.
4673
4674                                                     "hidden_grid_dims"
4675                                                       The grid dispatch dimensionality. This is the same value
4676                                                       as the AQL dispatch packet dimensionality. Must be a value
4677                                                       between 1 and 3.
4678
4679                                                     "hidden_heap_v1"
4680                                                       A global address space pointer to an initialized memory
4681                                                       buffer that conforms to the requirements of the malloc/free
4682                                                       device library V1 version implementation.
4683
4684                                                     "hidden_dynamic_lds_size"
4685                                                       Size of the dynamically allocated LDS memory is passed in the kernarg.
4686
4687                                                     "hidden_private_base"
4688                                                       The high 32 bits of the flat addressing private aperture base.
4689                                                       Only used by GFX8 to allow conversion between private segment
4690                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4691
4692                                                     "hidden_shared_base"
4693                                                       The high 32 bits of the flat addressing shared aperture base.
4694                                                       Only used by GFX8 to allow conversion between shared segment
4695                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4696
4697                                                     "hidden_queue_ptr"
4698                                                       A global memory address space pointer to the ROCm runtime
4699                                                       ``struct amd_queue_t`` structure for the HSA queue of the
4700                                                       associated dispatch AQL packet. It is only required for pre-GFX9
4701                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
4702
4703     ====================== ============== ========= ================================
4704
4705..
4706
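The relationship between the hidden block count, group size, and remainder
arguments can be illustrated with a short Python sketch. The function name
``hidden_dispatch_values`` and the convention of padding unused dimensions with
1 are assumptions made for illustration; the sketch is not taken from any
runtime implementation.

.. code:: python

  def hidden_dispatch_values(grid_size, workgroup_size):
      """Derive the V5 hidden dispatch values for one dispatch (sketch).

      grid_size: grid size in work-items per dimension (1 to 3 entries).
      workgroup_size: work-group size per dimension (same length).
      """
      dims = len(grid_size)
      if not 1 <= dims <= 3 or len(workgroup_size) != dims:
          raise ValueError("expected 1 to 3 matching dimensions")
      # Assumption: unused dimensions are padded with 1.
      grid = list(grid_size) + [1] * (3 - dims)
      group = list(workgroup_size) + [1] * (3 - dims)
      values = {"hidden_grid_dims": dims}
      for axis, g, w in zip("xyz", grid, group):
          values["hidden_block_count_" + axis] = g // w  # non-partial groups only
          values["hidden_group_size_" + axis] = w
          values["hidden_remainder_" + axis] = g % w     # 0 if no partial group
      return values


  # 1000 work-items in groups of 256: 3 full groups and a partial group of 232.
  print(hidden_dispatch_values((1000,), (256,)))
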
4707Kernel Dispatch
4708~~~~~~~~~~~~~~~
4709
4710The HSA architected queuing language (AQL) defines a user space memory interface
4711that can be used to control the dispatch of kernels, in an agent independent
4712way. An agent can have zero or more AQL queues created for it using an HSA
4713compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
4714are 64 bytes) can be placed. See the *HSA Platform System Architecture
4715Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
4716
4717The packet processor of a kernel agent is responsible for detecting and
4718dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
4719packet processor is implemented by the hardware command processor (CP),
4720asynchronous dispatch controller (ADC) and shader processor input controller
4721(SPI).
4722
4723An HSA compatible runtime can be used to allocate an AQL queue object. It uses
4724the kernel mode driver to initialize and register the AQL queue with CP.
4725
4726To dispatch a kernel the following actions are performed. This can occur in the
4727CPU host program, or from an HSA kernel executing on a GPU.
4728
47291. A pointer to an AQL queue for the kernel agent on which the kernel is to be
4730   executed is obtained.
47312. A pointer to the kernel descriptor (see
4732   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
4733   It must be for a kernel that is contained in a code object that was loaded
4734   by an HSA compatible runtime on the kernel agent with which the AQL queue is
4735   associated.
47363. Space is allocated for the kernel arguments using the HSA compatible runtime
4737   allocator for a memory region with the kernarg property for the kernel agent
4738   that will execute the kernel. It must be at least 16-byte aligned.
47394. Kernel argument values are assigned to the kernel argument memory
4740   allocation. The layout is defined in the *HSA Programmer's Language
4741   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
4742   kernel argument memory in the same way constant memory is accessed. (Note
4743   that the HSA specification allows an implementation to copy the kernel
4744   argument contents to another location that is accessed by the kernel.)
47455. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
4746   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
4747   for the packet. The packet must be set up, and the final write must use an
4748   atomic store release to set the packet kind to ensure the packet contents are
4749   visible to the kernel agent. AQL defines a doorbell signal mechanism to
4750   notify the kernel agent that the AQL queue has been updated. These rules, and
4751   the layout of the AQL queue and kernel dispatch packet, are defined in the *HSA
4752   System Architecture Specification* [HSA]_.
47536. A kernel dispatch packet includes information about the actual dispatch,
4754   such as grid and work-group size, together with information from the code
4755   object about the kernel, such as segment sizes. The HSA compatible runtime
4756   queries on the kernel symbol can be used to obtain the code object values
4757   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
47587. CP executes micro-code and is responsible for detecting and setting up the
4759   GPU to execute the wavefronts of a kernel dispatch.
47608. CP ensures that when a wavefront starts executing the kernel machine
4761   code, the scalar general purpose registers (SGPR) and vector general purpose
4762   registers (VGPR) are set up as required by the machine code. The required
4763   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
4764   register state is defined in
4765   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
47669. The prolog of the kernel machine code (see
4767   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
4768   before continuing executing the machine code that corresponds to the kernel.
476910. When the kernel dispatch has completed execution, CP signals the completion
4770    signal specified in the kernel dispatch packet if not 0.
4771
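Steps 5 and 6 can be sketched in Python as follows. The field order is an
assumption based on ``hsa_kernel_dispatch_packet_t`` in the HSA runtime header
and should be checked against the HSA specification; the helper name
``build_dispatch_packet`` is hypothetical. Python has no atomic store release,
so the final header write only illustrates the required ordering, and the
barrier and fence scope bits of the header are left at 0.

.. code:: python

  import struct

  # Assumed little-endian layout of the 64-byte kernel dispatch packet.
  PACKET = struct.Struct("<6H5I4Q")
  HSA_PACKET_TYPE_INVALID = 1
  HSA_PACKET_TYPE_KERNEL_DISPATCH = 2


  def build_dispatch_packet(grid_size, workgroup_size, private_segment_size,
                            group_segment_size, kernel_object, kernarg_address,
                            completion_signal=0):
      dims = len(grid_size)
      if not 1 <= dims <= 3 or len(workgroup_size) != dims:
          raise ValueError("expected 1 to 3 matching dimensions")
      if kernarg_address % 16 != 0:
          raise ValueError("kernarg memory must be at least 16-byte aligned")
      grid = list(grid_size) + [1] * (3 - dims)
      group = list(workgroup_size) + [1] * (3 - dims)
      # Fill in the packet body while the header still marks it invalid.
      packet = bytearray(PACKET.pack(
          HSA_PACKET_TYPE_INVALID,  # header (packet type in the low bits)
          dims,                     # setup (number of dimensions)
          group[0], group[1], group[2],
          0,                        # reserved
          grid[0], grid[1], grid[2],
          private_segment_size, group_segment_size,
          kernel_object, kernarg_address,
          0,                        # reserved
          completion_signal))
      # Final write: publish the packet kind (a store release on real hardware).
      struct.pack_into("<H", packet, 0, HSA_PACKET_TYPE_KERNEL_DISPATCH)
      return bytes(packet)


  packet = build_dispatch_packet((1024,), (256,), 0, 0, 0x1000, 0x2000)
  print(len(packet))
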
4772.. _amdgpu-amdhsa-memory-spaces:
4773
4774Memory Spaces
4775~~~~~~~~~~~~~
4776
4777The memory space properties are:
4778
4779  .. table:: AMDHSA Memory Spaces
4780     :name: amdgpu-amdhsa-memory-spaces-table
4781
4782     ================= =========== ======== ======= ==================
4783     Memory Space Name HSA Segment Hardware Address NULL Value
4784                       Name        Name     Size
4785     ================= =========== ======== ======= ==================
4786     Private           private     scratch  32      0x00000000
4787     Local             group       LDS      32      0xFFFFFFFF
4788     Global            global      global   64      0x0000000000000000
4789     Constant          constant    *same as 64      0x0000000000000000
4790                                   global*
4791     Generic           flat        flat     64      0x0000000000000000
4792     Region            N/A         GDS      32      *not implemented
4793                                                    for AMDHSA*
4794     ================= =========== ======== ======= ==================
4795
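The table can be captured as a small lookup for experimentation. The following
Python sketch is illustrative only; the dictionary layout and the ``is_null``
helper are assumptions, and the Region space is omitted because it is not
implemented for AMDHSA.

.. code:: python

  # Memory space name -> (address size in bits, NULL value), from the table.
  MEMORY_SPACES = {
      "private": (32, 0x00000000),
      "local": (32, 0xFFFFFFFF),
      "global": (64, 0x0000000000000000),
      "constant": (64, 0x0000000000000000),
      "generic": (64, 0x0000000000000000),
  }


  def is_null(space, address):
      """Return True if address is the NULL value of the given memory space."""
      size_bits, null_value = MEMORY_SPACES[space]
      if not 0 <= address < (1 << size_bits):
          raise ValueError("address does not fit in the address size")
      return address == null_value


  # Address 0 is NULL in the private space but a valid local address.
  print(is_null("private", 0), is_null("local", 0))
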
4796The global and constant memory spaces both use global virtual addresses, which
4797are the same virtual address space used by the CPU. However, some virtual
4798addresses may only be accessible to the CPU, some only accessible by the GPU,
4799and some by both.
4800
4801Using the constant memory space indicates that the data will not change during
4802the execution of the kernel. This allows scalar read instructions to be
4803used. The vector and scalar L1 caches are invalidated of volatile data before
4804each kernel dispatch execution to allow constant memory to change values between
4805kernel dispatches.
4806
4807The local memory space uses the hardware Local Data Store (LDS) which is
4808automatically allocated when the hardware creates work-groups of wavefronts, and
4809freed when all the wavefronts of a work-group have terminated. The data store
4810(DS) instructions can be used to access it.
4811
4812The private memory space uses the hardware scratch memory support. If the kernel
4813uses scratch, then the hardware allocates memory that is accessed using
4814wavefront lane dword (4 byte) interleaving. The mapping used from private
4815address to physical address is:
4816
4817  ``wavefront-scratch-base +
4818  (private-address * wavefront-size * 4) +
4819  (wavefront-lane-id * 4)``
4820
4821There are different ways that the wavefront scratch base address is determined
4822by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
4823memory can be accessed in an interleaved manner using buffer instructions with
4824the scratch buffer descriptor and per wavefront scratch offset, by the scratch
4825instructions, or by flat instructions. If each lane of a wavefront accesses the
4826same private address, the interleaving results in adjacent dwords being accessed
4827and hence requires fewer cache lines to be fetched. Multi-dword access is not
4828supported except by flat and scratch instructions in GFX9-GFX11.
4829
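The mapping above can be written out directly. The following Python sketch
transcribes the formula as stated; the function name and the default wavefront
size of 64 are assumptions for illustration.

.. code:: python

  def scratch_physical_address(wavefront_scratch_base, private_address,
                               wavefront_lane_id, wavefront_size=64):
      """Apply the private-to-physical mapping formula as written above."""
      return (wavefront_scratch_base
              + (private_address * wavefront_size * 4)
              + (wavefront_lane_id * 4))


  # Four lanes accessing the same private address (hypothetical base 0x1000).
  addresses = [scratch_physical_address(0x1000, 2, lane) for lane in range(4)]
  print([hex(a) for a in addresses])

For a fixed private address, consecutive lanes map to addresses 4 bytes apart,
which is the adjacency described above.
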
4830The generic address space uses the hardware flat address support available in
4831GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
4832local apertures), that are outside the range of addressable global memory, to
4833map from a flat address to a private or local address.
4834
4835FLAT instructions can take a flat address and access global, private (scratch)
4836and group (LDS) memory depending on if the address is within one of the
4837aperture ranges. Flat access to scratch requires hardware aperture setup and
4838setup in the kernel prologue (see
4839:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
4840hardware aperture setup and M0 (GFX7-GFX8) register setup (see
4841:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
4842
4843To convert between a segment address and a flat address, the base addresses of
4844the apertures can be used. For GFX7-GFX8 these are available in the
4845:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
4846the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
4847GFX9-GFX11 the aperture base addresses are directly available as inline constant
4848registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
4849address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
4850which makes it easier to convert from flat to segment or segment to flat.
4851
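Given apertures of 2^32 bytes with 2^32-aligned bases, the conversions reduce to
simple bit operations. The following Python sketch shows one way to express
them; the function names and the example aperture base are hypothetical, and
NULL pointers, whose values differ between memory spaces, are not handled.

.. code:: python

  APERTURE_SIZE = 1 << 32


  def segment_to_flat(aperture_base, segment_address):
      """Place a 32-bit segment address inside the given aperture."""
      if aperture_base % APERTURE_SIZE != 0:
          raise ValueError("aperture base must be aligned to 2^32")
      if not 0 <= segment_address < APERTURE_SIZE:
          raise ValueError("segment address must fit in 32 bits")
      return aperture_base | segment_address


  def is_in_aperture(aperture_base, flat_address):
      """Compare the high 32 bits of the flat address with the aperture base."""
      return (flat_address >> 32) == (aperture_base >> 32)


  def flat_to_segment(aperture_base, flat_address):
      """Recover the 32-bit segment address from a flat address."""
      if not is_in_aperture(aperture_base, flat_address):
          raise ValueError("flat address is outside the aperture")
      return flat_address & (APERTURE_SIZE - 1)


  # Hypothetical aperture base; real values are provided by the hardware.
  base = 0x2000 << 32
  flat = segment_to_flat(base, 0x1234)
  print(hex(flat), hex(flat_to_segment(base, flat)))
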
4852Image and Samplers
4853~~~~~~~~~~~~~~~~~~
4854
4855Image and sampler handles created by an HSA compatible runtime (see
4856:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
4857object respectively. In order to support the HSA ``query_sampler`` operations
4858two extra dwords are used to store the HSA BRIG enumeration values for the
4859queries that are not trivially deducible from the S# representation.
4860
4861HSA Signals
4862~~~~~~~~~~~
4863
4864HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
4865are 64-bit addresses of a structure allocated in memory accessible from both the
4866CPU and GPU. The structure is defined by the runtime and subject to change
4867between releases. For example, see [AMD-ROCm-github]_.
4868
4869.. _amdgpu-amdhsa-hsa-aql-queue:
4870
4871HSA AQL Queue
4872~~~~~~~~~~~~~
4873
4874The HSA AQL queue structure is defined by an HSA compatible runtime (see
4875:ref:`amdgpu-os`) and subject to change between releases. For example, see
4876[AMD-ROCm-github]_. For some processors it contains fields needed to implement
4877certain language features such as the flat address aperture bases. It also
4878contains fields used by CP such as managing the allocation of scratch memory.
4879
4880.. _amdgpu-amdhsa-kernel-descriptor:
4881
4882Kernel Descriptor
4883~~~~~~~~~~~~~~~~~
4884
4885A kernel descriptor consists of the information needed by CP to initiate the
4886execution of a kernel, including the entry point address of the machine code
4887that implements the kernel.
4888
4889Code Object V3 Kernel Descriptor
4890++++++++++++++++++++++++++++++++
4891
4892CP microcode requires the kernel descriptor to be allocated with 64-byte
4893alignment.
4894
4895The fields used by CP for code objects before V3 also match those specified in
4896:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4897
4898  .. table:: Code Object V3 Kernel Descriptor
4899     :name: amdgpu-amdhsa-kernel-descriptor-v3-table
4900
4901     ======= ======= =============================== ============================
4902     Bits    Size    Field Name                      Description
4903     ======= ======= =============================== ============================
4904     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
4905                                                     address space memory
4906                                                     required for a work-group
4907                                                     in bytes. This does not
4908                                                     include any dynamically
4909                                                     allocated local address
4910                                                     space memory that may be
4911                                                     added when the kernel is
4912                                                     dispatched.
4913     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
4914                                                     private address space
4915                                                     memory required for a
4916                                                     work-item in bytes.  When
4917                                                     this cannot be predicted,
4918                                                     code object v4 and older
4919                                                     sets this value to be
4920                                                     higher than the minimum
4921                                                     requirement.
4922     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
4923                                                     memory pointed to by the
4924                                                     AQL dispatch packet. The
4925                                                     kernarg memory is used to
4926                                                     pass arguments to the
4927                                                     kernel.
4928
4929                                                     * If the kernarg pointer in
4930                                                       the dispatch packet is NULL
4931                                                       then there are no kernel
4932                                                       arguments.
4933                                                     * If the kernarg pointer in
4934                                                       the dispatch packet is
4935                                                       not NULL and this value
4936                                                       is 0 then the kernarg
4937                                                       memory size is
4938                                                       unspecified.
4939                                                     * If the kernarg pointer in
4940                                                       the dispatch packet is
4941                                                       not NULL and this value
4942                                                       is not 0 then the value
4943                                                       specifies the kernarg
4944                                                       memory size in bytes. It
4945                                                       is recommended to provide
4946                                                       a value as it may be used
4947                                                       by CP to optimize making
4948                                                       the kernarg memory
4949                                                       visible to the kernel
4950                                                       code.
4951
4952     127:96  4 bytes                                 Reserved, must be 0.
4953     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
4954                                                     negative) from base
4955                                                     address of kernel
4956                                                     descriptor to kernel's
4957                                                     entry point instruction
4958                                                     which must be 256 byte
4959                                                     aligned.
4960     351:192 20                                      Reserved, must be 0.
4961             bytes
4962     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
4963                                                       Reserved, must be 0.
4964                                                     GFX90A, GFX940
4965                                                       Compute Shader (CS)
4966                                                       program settings used by
4967                                                       CP to set up
4968                                                       ``COMPUTE_PGM_RSRC3``
4969                                                       configuration
4970                                                       register. See
4971                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4972                                                     GFX10-GFX11
4973                                                       Compute Shader (CS)
4974                                                       program settings used by
4975                                                       CP to set up
4976                                                       ``COMPUTE_PGM_RSRC3``
4977                                                       configuration
4978                                                       register. See
4979                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
4980                                                     GFX12
4981                                                       Compute Shader (CS)
4982                                                       program settings used by
4983                                                       CP to set up
4984                                                       ``COMPUTE_PGM_RSRC3``
4985                                                       configuration
4986                                                       register. See
4987                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table`.
4988     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
4989                                                     program settings used by
4990                                                     CP to set up
4991                                                     ``COMPUTE_PGM_RSRC1``
4992                                                     configuration
4993                                                     register. See
4994                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
4995     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
4996                                                     program settings used by
4997                                                     CP to set up
4998                                                     ``COMPUTE_PGM_RSRC2``
4999                                                     configuration
5000                                                     register. See
5001                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
5002     454:448 7 bits  *See separate bits below.*      Enable the setup of the
5003                                                     SGPR user data registers
5004                                                     (see
5005                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5006
5007                                                     The total number of SGPR
5008                                                     user data registers
5009                                                     requested must not exceed
5010                                                     16 and must match the value in
5011                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
5012                                                     Any requests beyond 16
5013                                                     will be ignored.
5014     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
5015                     _BUFFER                         column of
5016                                                     :ref:`amdgpu-processor-table`
5017                                                     specifies *Architected flat
5018                                                     scratch* then not supported
5019                                                     and must be 0.
5020     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
5021     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
5022     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
5023     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
5024     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
5025                                                     column of
5026                                                     :ref:`amdgpu-processor-table`
5027                                                     specifies *Architected flat
5028                                                     scratch* then not supported
5029                                                     and must be 0.
5030     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
5031                     _SIZE
5032     457:455 3 bits                                  Reserved, must be 0.
5033     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
5034                                                       Reserved, must be 0.
5035                                                     GFX10-GFX11
5036                                                       - If 0 execute in
5037                                                         wavefront size 64 mode.
5038                                                       - If 1 execute in
5039                                                         native wavefront size
5040                                                         32 mode.
5041     459     1 bit   USES_DYNAMIC_STACK              Indicates if the generated
5042                                                     machine code is using a
5043                                                     dynamically sized stack.
5044                                                     This is only set in code
5045                                                     object v5 and later.
5046     463:460 4 bits                                  Reserved, must be 0.
5047     470:464 7 bits  KERNARG_PRELOAD_SPEC_LENGTH     GFX6-GFX9
5048                                                       - Reserved, must be 0.
5049                                                     GFX90A, GFX940
5050                                                       - The number of dwords from
5051                                                         the kernarg segment to preload
5052                                                         into User SGPRs before kernel
5053                                                         execution. (see
5054                                                         :ref:`amdgpu-amdhsa-kernarg-preload`).
5055     479:471 9 bits  KERNARG_PRELOAD_SPEC_OFFSET     GFX6-GFX9
5056                                                       - Reserved, must be 0.
5057                                                     GFX90A, GFX940
5058                                                       - An offset in dwords into the
5059                                                         kernarg segment to begin
5060                                                         preloading data into User
5061                                                         SGPRs. (see
5062                                                         :ref:`amdgpu-amdhsa-kernarg-preload`).
5063     511:480 4 bytes                                 Reserved, must be 0.
5064     512     **Total size 64 bytes.**
5065     ======= ====================================================================
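
As a rough illustration of the layout above, the following Python sketch
decodes the kernel code properties held in bits 479:448 of a 64-byte kernel
descriptor. It assumes the descriptor is stored little-endian, so that bit 448
is the least significant bit of the 32-bit word at byte offset 56. The names
``KERNEL_CODE_PROPERTY_BITS`` and ``decode_kernel_code_properties`` are
hypothetical and are not part of LLVM or of any runtime API.

.. code-block:: python

  import struct

  # Single-bit fields, given as bit positions relative to descriptor bit 448.
  # (Hypothetical helper table; field names are taken from the table above.)
  KERNEL_CODE_PROPERTY_BITS = {
      "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,  # bit 448
      "ENABLE_SGPR_DISPATCH_PTR": 1,            # bit 449
      "ENABLE_SGPR_QUEUE_PTR": 2,               # bit 450
      "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,     # bit 451
      "ENABLE_SGPR_DISPATCH_ID": 4,             # bit 452
      "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,       # bit 453
      "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,    # bit 454
      "ENABLE_WAVEFRONT_SIZE32": 10,            # bit 458
      "USES_DYNAMIC_STACK": 11,                 # bit 459
  }


  def decode_kernel_code_properties(descriptor: bytes) -> dict:
      """Decode bits 479:448 of a kernel descriptor into named fields."""
      if len(descriptor) != 64:
          raise ValueError("kernel descriptor must be exactly 64 bytes")
      # Assumption: little-endian storage, so bits 479:448 are bytes 56..59.
      (word,) = struct.unpack_from("<I", descriptor, 56)
      fields = {
          name: (word >> bit) & 1
          for name, bit in KERNEL_CODE_PROPERTY_BITS.items()
      }
      fields["KERNARG_PRELOAD_SPEC_LENGTH"] = (word >> 16) & 0x7F   # 470:464
      fields["KERNARG_PRELOAD_SPEC_OFFSET"] = (word >> 23) & 0x1FF  # 479:471
      return fields

Under the same assumption, ``COMPUTE_PGM_RSRC1`` (bits 415:384) and
``COMPUTE_PGM_RSRC2`` (bits 447:416) would be the 32-bit words at byte offsets
48 and 52 respectively.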
5066
5067..
5068
5069  .. table:: compute_pgm_rsrc1 for GFX6-GFX12
5070     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table
5071
5072     ======= ======= =============================== ===========================================================================
5073     Bits    Size    Field Name                      Description
5074     ======= ======= =============================== ===========================================================================
5075     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
5076                                                     blocks used by each work-item;
5077                                                     granularity is device
5078                                                     specific:
5079
5080                                                     GFX6-GFX9
5081                                                       - vgprs_used 0..256
5082                                                       - max(0, ceil(vgprs_used / 4) - 1)
5083                                                     GFX90A, GFX940
5084                                                       - vgprs_used 0..512
5085                                                       - vgprs_used = align(arch_vgprs, 4)
5086                                                                      + acc_vgprs
5087                                                       - max(0, ceil(vgprs_used / 8) - 1)
5088                                                     GFX10-GFX12 (wavefront size 64)
5089                                                       - vgprs_used 1..256
5090                                                       - max(0, ceil(vgprs_used / 4) - 1)
5091                                                     GFX10-GFX12 (wavefront size 32)
5092                                                       - vgprs_used 1..256
5093                                                       - max(0, ceil(vgprs_used / 8) - 1)
5094
5095                                                     Where vgprs_used is defined
5096                                                     as the highest VGPR number
5097                                                     explicitly referenced plus
5098                                                     one.
5099
5100                                                     Used by CP to set up
5101                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
5102
5103                                                     The
5104                                                     :ref:`amdgpu-assembler`
5105                                                     calculates this
5106                                                     automatically for the
5107                                                     selected processor from
5108                                                     values provided to the
5109                                                     ``.amdhsa_kernel`` directive
5110                                                     by the
5111                                                     ``.amdhsa_next_free_vgpr``
5112                                                     nested directive (see
5113                                                     :ref:`amdhsa-kernel-directives-table`).
5114     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
5115                                                     blocks used by a wavefront;
5116                                                     granularity is device
5117                                                     specific:
5118
5119                                                     GFX6-GFX8
5120                                                       - sgprs_used 0..112
5121                                                       - max(0, ceil(sgprs_used / 8) - 1)
5122                                                     GFX9
5123                                                       - sgprs_used 0..112
5124                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
5125                                                     GFX10-GFX12
5126                                                       Reserved, must be 0.
5127                                                       (128 SGPRs always
5128                                                       allocated.)
5129
5130                                                     Where sgprs_used is
5131                                                     defined as the highest
5132                                                     SGPR number explicitly
5133                                                     referenced plus one, plus
5134                                                     a target specific number
5135                                                     of additional special
5136                                                     SGPRs for VCC,
5137                                                     FLAT_SCRATCH (GFX7+) and
5138                                                     XNACK_MASK (GFX8+), and
5139                                                     any additional
5140                                                     target specific
5141                                                     limitations. It does not
5142                                                     include the 16 SGPRs added
5143                                                     if a trap handler is
5144                                                     enabled.
5145
5146                                                     The target specific
5147                                                     limitations and special
5148                                                     SGPR layout are defined in
5149                                                     the hardware
5150                                                     documentation, which can
5151                                                     be found in the
5152                                                     :ref:`amdgpu-processors`
5153                                                     table.
5154
5155                                                     Used by CP to set up
5156                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
5157
5158                                                     The
5159                                                     :ref:`amdgpu-assembler`
5160                                                     calculates this
5161                                                     automatically for the
5162                                                     selected processor from
5163                                                     values provided to the
5164                                                     ``.amdhsa_kernel`` directive
5165                                                     by the
5166                                                     ``.amdhsa_next_free_sgpr``
5167                                                     and ``.amdhsa_reserve_*``
5168                                                     nested directives (see
5169                                                     :ref:`amdhsa-kernel-directives-table`).
5170     11:10   2 bits  PRIORITY                        Must be 0.
5171
5172                                                     Start executing wavefront
5173                                                     at the specified priority.
5174
5175                                                     CP is responsible for
5176                                                     filling in
5177                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
5178     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
5179                                                     with specified rounding
5180                                                     mode for single
5181                                                     precision (32-bit)
5182                                                     floating point
5183                                                     operations.
5184
5185                                                     Floating point rounding
5186                                                     mode values are defined in
5187                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
5188
5189                                                     Used by CP to set up
5190                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
5191     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
5192                                                     with specified rounding
5193                                                     mode for half/double
5194                                                     precision (16-bit and
5195                                                     64-bit) floating point
5196                                                     operations.
5197
5198                                                     Floating point rounding
5199                                                     mode values are defined in
5200                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
5201
5202                                                     Used by CP to set up
5203                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
5204     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
5205                                                     with specified denorm mode
5206                                                     for single
5207                                                     precision (32-bit)
5208                                                     floating point
5209                                                     operations.
5210
5211                                                     Floating point denorm mode
5212                                                     values are defined in
5213                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
5214
5215                                                     Used by CP to set up
5216                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
5217     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
5218                                                     with specified denorm mode
5219                                                     for half/double
5220                                                     precision (16-bit and
5221                                                     64-bit) floating point
5222                                                     operations.
5223
5224                                                     Floating point denorm mode
5225                                                     values are defined in
5226                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
5227
5228                                                     Used by CP to set up
5229                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
5230     20      1 bit   PRIV                            Must be 0.
5231
5232                                                     Start executing wavefront
5233                                                     in privilege trap handler
5234                                                     mode.
5235
5236                                                     CP is responsible for
5237                                                     filling in
5238                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
5239     21      1 bit   ENABLE_DX10_CLAMP               GFX9-GFX11
5240                                                       Wavefront starts execution
5241                                                       with DX10 clamp mode
5242                                                       enabled. Used by the vector
5243                                                       ALU to force DX10 style
5244                                                       treatment of NaNs (when
5245                                                       set, clamp NaN to zero,
5246                                                       otherwise pass NaN
5247                                                       through).
5248
5249                                                       Used by CP to set up
5250                                                       ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
5251                     WG_RR_EN                        GFX12
5252                                                       If 1, wavefronts are scheduled
5253                                                       in a round-robin fashion with
5254                                                       respect to the other wavefronts
5255                                                       of the SIMD. Otherwise, wavefronts
5256                                                       are scheduled in oldest age order.
5257
5258                                                       CP is responsible for filling in
5259                                                       ``COMPUTE_PGM_RSRC1.WG_RR_EN``.
5260     22      1 bit   DEBUG_MODE                      Must be 0.
5261
5262                                                     Start executing wavefront
5263                                                     in single step mode.
5264
5265                                                     CP is responsible for
5266                                                     filling in
5267                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
5268     23      1 bit   ENABLE_IEEE_MODE                GFX9-GFX11
5269                                                       Wavefront starts execution
5270                                                       with IEEE mode
5271                                                       enabled. Floating point
5272                                                       opcodes that support
5273                                                       exception flag gathering
5274                                                       will quiet and propagate
5275                                                       signaling-NaN inputs per
5276                                                       IEEE 754-2008. Min_dx10 and
5277                                                       max_dx10 become IEEE
5278                                                       754-2008 compliant due to
5279                                                       signaling-NaN propagation
5280                                                       and quieting.
5281
5282                                                       Used by CP to set up
5283                                                       ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
5284                     DISABLE_PERF                    GFX12
5285                                                       Reserved. Must be 0.
5286     24      1 bit   BULKY                           Must be 0.
5287
5288                                                     Only one work-group allowed
5289                                                     to execute on a compute
5290                                                     unit.
5291
5292                                                     CP is responsible for
5293                                                     filling in
5294                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
5295     25      1 bit   CDBG_USER                       Must be 0.
5296
5297                                                     Flag that can be used to
5298                                                     control debugging code.
5299
5300                                                     CP is responsible for
5301                                                     filling in
5302                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
5303     26      1 bit   FP16_OVFL                       GFX6-GFX8
5304                                                       Reserved, must be 0.
5305                                                     GFX9-GFX12
5306                                                       Wavefront starts execution
5307                                                       with specified fp16 overflow
5308                                                       mode.
5309
5310                                                       - If 0, fp16 overflow generates
5311                                                         +/-INF values.
5312                                                       - If 1, fp16 overflow that is the
5313                                                         result of a +/-INF input value
5314                                                         or divide by 0 produces a +/-INF,
5315                                                         otherwise clamps computed
5316                                                         overflow to +/-MAX_FP16 as
5317                                                         appropriate.
5318
5319                                                       Used by CP to set up
5320                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
5321     28:27   2 bits                                  Reserved, must be 0.
5322     29      1 bit   WGP_MODE                        GFX6-GFX9
5323                                                       Reserved, must be 0.
5324                                                     GFX10-GFX12
5325                                                       - If 0 execute work-groups in
5326                                                         CU wavefront execution mode.
5327                                                       - If 1 execute work-groups in
5328                                                         WGP wavefront execution mode.
5329
5330                                                       See :ref:`amdgpu-amdhsa-memory-model`.
5331
5332                                                       Used by CP to set up
5333                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
5334     30      1 bit   MEM_ORDERED                     GFX6-GFX9
5335                                                       Reserved, must be 0.
5336                                                     GFX10-GFX12
5337                                                       Controls the behavior of the
5338                                                       s_waitcnt's vmcnt and vscnt
5339                                                       counters.
5340
5341                                                       - If 0 vmcnt reports completion
5342                                                         of load and atomic with return
5343                                                         out of order with sample
5344                                                         instructions, and the vscnt
5345                                                         reports the completion of
5346                                                         store and atomic without
5347                                                         return in order.
5348                                                       - If 1 vmcnt reports completion
5349                                                         of load, atomic with return
5350                                                         and sample instructions in
5351                                                         order, and the vscnt reports
5352                                                         the completion of store and
5353                                                         atomic without return in order.
5354
5355                                                       Used by CP to set up
5356                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
5357     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
5358                                                       Reserved, must be 0.
5359                                                     GFX10-GFX12
5360                                                       - If 0 execute SIMD wavefronts
5361                                                         using oldest first policy.
5362                                                       - If 1 execute SIMD wavefronts to
5363                                                         ensure wavefronts will make some
5364                                                         forward progress.
5365
5366                                                       Used by CP to set up
5367                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
5368     32      **Total size 4 bytes**
5369     ======= ===================================================================================================================
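
The granulated register count formulas in the table above can be sketched in
Python as follows. This is only an illustration of the arithmetic, not how the
assembler is implemented: the function names and the family strings (for
example ``"gfx6-gfx9"``) are hypothetical labels chosen to mirror the rows of
the table.

.. code-block:: python

  def ceil_div(value: int, granule: int) -> int:
      """Integer ceil(value / granule) for non-negative inputs."""
      return -(-value // granule)


  def align_up(value: int, alignment: int) -> int:
      """Round value up to the next multiple of alignment."""
      return ceil_div(value, alignment) * alignment


  def gfx90a_vgprs_used(arch_vgprs: int, acc_vgprs: int) -> int:
      """GFX90A/GFX940 row: vgprs_used = align(arch_vgprs, 4) + acc_vgprs."""
      return align_up(arch_vgprs, 4) + acc_vgprs


  def granulated_vgpr_count(family: str, vgprs_used: int,
                            wave32: bool = False) -> int:
      """GRANULATED_WORKITEM_VGPR_COUNT for a hypothetical family label."""
      if family == "gfx6-gfx9":
          granule = 4
      elif family == "gfx90a-gfx940":
          granule = 8
      elif family == "gfx10-gfx12":
          granule = 8 if wave32 else 4
      else:
          raise ValueError("unknown family label: " + family)
      return max(0, ceil_div(vgprs_used, granule) - 1)


  def granulated_sgpr_count(family: str, sgprs_used: int) -> int:
      """GRANULATED_WAVEFRONT_SGPR_COUNT for a hypothetical family label."""
      if family == "gfx6-gfx8":
          return max(0, ceil_div(sgprs_used, 8) - 1)
      if family == "gfx9":
          return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
      if family == "gfx10-gfx12":
          return 0  # Reserved, must be 0.
      raise ValueError("unknown family label: " + family)


  def pack_register_counts(vgpr_blocks: int, sgpr_blocks: int) -> int:
      """Place the two counts in bits 5:0 and 9:6 of compute_pgm_rsrc1."""
      return (vgpr_blocks & 0x3F) | ((sgpr_blocks & 0xF) << 6)

For instance, with ``vgprs_used = 24`` and ``sgprs_used = 40`` the GFX9
formulas give a VGPR count of 5 and an SGPR count of 4, which pack into the
low ten bits of ``compute_pgm_rsrc1`` as ``0x105``.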
5370
5371..
5372
5373  .. table:: compute_pgm_rsrc2 for GFX6-GFX12
5374     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table
5375
5376     ======= ======= =============================== ===========================================================================
5377     Bits    Size    Field Name                      Description
5378     ======= ======= =============================== ===========================================================================
5379     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
5380                                                       private segment.
5381                                                     * If the *Target Properties*
5382                                                       column of
5383                                                       :ref:`amdgpu-processor-table`
5384                                                       does not specify
5385                                                       *Architected flat
5386                                                       scratch* then enable the
5387                                                       setup of the SGPR
5388                                                       wavefront scratch offset
5389                                                       system register (see
5390                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5391                                                     * If the *Target Properties*
5392                                                       column of
5393                                                       :ref:`amdgpu-processor-table`
5394                                                       specifies *Architected
5395                                                       flat scratch* then enable
5396                                                       the setup of the
5397                                                       FLAT_SCRATCH register
5398                                                       pair (see
5399                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5400
5401                                                     Used by CP to set up
5402                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
5403     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
5404                                                     user data
5405                                                     registers requested. This
5406                                                     number must be greater than
5407                                                     or equal to the number of user
5408                                                     data registers enabled.
5409
5410                                                     Used by CP to set up
5411                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
5412     6       1 bit   ENABLE_TRAP_HANDLER             GFX6-GFX11
5413                                                       Must be 0.
5414
5415                                                       This bit represents
5416                                                       ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
5417                                                       which is set by the CP if
5418                                                       the runtime has installed a
5419                                                       trap handler.
5420                                                     GFX12
5421                                                       Reserved, must be 0.
5422     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
5423                                                     system SGPR register for
5424                                                     the work-group id in the X
5425                                                     dimension (see
5426                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5427
5428                                                     Used by CP to set up
5429                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
5430     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
5431                                                     system SGPR register for
5432                                                     the work-group id in the Y
5433                                                     dimension (see
5434                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5435
5436                                                     Used by CP to set up
5437                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
5438     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
5439                                                     system SGPR register for
5440                                                     the work-group id in the Z
5441                                                     dimension (see
5442                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5443
5444                                                     Used by CP to set up
5445                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
5446     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
5447                                                     system SGPR register for
5448                                                     work-group information (see
5449                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5450
5451                                                     Used by CP to set up
5452                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
5453     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
5454                                                     VGPR system registers used
5455                                                     for the work-item ID.
5456                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
5457                                                     defines the values.
5458
5459                                                     Used by CP to set up
5460                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
5461     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
5462
5463                                                     Wavefront starts execution
5464                                                     with address watch
5465                                                     exceptions enabled which
5466                                                     are generated when L1 has
5467                                                     witnessed a thread access
5468                                                     an *address of
5469                                                     interest*.
5470
5471                                                     CP is responsible for
5472                                                     filling in the address
5473                                                     watch bit in
5474                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
5475                                                     according to what the
5476                                                     runtime requests.
5477     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
5478
5479                                                     Wavefront starts execution
5480                                                     with memory violation
5481                                                     exceptions
5482                                                     enabled which are generated
5483                                                     when a memory violation has
5484                                                     occurred for this wavefront from
5485                                                     L1 or LDS
5486                                                     (write-to-read-only-memory,
5487                                                     mis-aligned atomic, LDS
5488                                                     address out of range,
5489                                                     illegal address, etc.).
5490
5491                                                     CP sets the memory
5492                                                     violation bit in
5493                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
5494                                                     according to what the
5495                                                     runtime requests.
5496     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
5497
5498                                                     CP uses the rounded value
5499                                                     from the dispatch packet,
5500                                                     not this value, as the
5501                                                     dispatch may contain
5502                                                     dynamically allocated group
5503                                                     segment memory. CP writes
5504                                                     directly to
5505                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
5506
5507                                                     Amount of group segment
5508                                                     (LDS) to allocate for each
5509                                                     work-group. Granularity is
5510                                                     device specific:
5511
5512                                                     GFX6
5513                                                       roundup(lds-size / (64 * 4))
5514                                                     GFX7-GFX11
5515                                                       roundup(lds-size / (128 * 4))
5516                                                     GFX950
5517                                                       roundup(lds-size / (320 * 4))
5518
5519     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
5520                     _INVALID_OPERATION              with specified exceptions
5521                                                     enabled.
5522
5523                                                     Used by CP to set up
5524                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
5525                                                     (set from bits 0..6).
5526
5527                                                     IEEE 754 FP Invalid
5528                                                     Operation
5529     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal (one or more
5530                     _SOURCE                         input operands is a
5531                                                     denormal number)
5532     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
5533                     _DIVISION_BY_ZERO               Zero
5534     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
5535                     _OVERFLOW
5536     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
5537                     _UNDERFLOW
5538     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
5539                     _INEXACT
5540     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
5541                     _ZERO                           (rcp_iflag_f32 instruction
5542                                                     only)
5543     31      1 bit   RESERVED                        Reserved, must be 0.
5544     32      **Total size 4 bytes.**
5545     ======= ===================================================================================================================
5546
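The granularity rule in the ``GRANULATED_LDS_SIZE`` description can be
illustrated with a small calculation. This is a minimal sketch and not part of
any LLVM or runtime API: the function name and the lookup table are
hypothetical, and ``roundup`` is assumed to mean rounding the quotient up to
the next integer.

.. code-block:: python

  # Granule size in dwords per target range, taken from the
  # GRANULATED_LDS_SIZE description above.
  LDS_GRANULE_DWORDS = {"GFX6": 64, "GFX7-GFX11": 128, "GFX950": 320}

  def granulated_lds_size(lds_size_bytes, target):
      """Return roundup(lds-size / (granule_dwords * 4)) for the target."""
      granule_bytes = LDS_GRANULE_DWORDS[target] * 4
      # Ceiling division (assumed meaning of "roundup").
      return (lds_size_bytes + granule_bytes - 1) // granule_bytes

Under these assumptions, a 1000 byte group segment on GFX7-GFX11 occupies two
512 byte granules.
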
5547..
5548
5549  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
5550     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
5551
5552     ======= ======= =============================== ===========================================================================
5553     Bits    Size    Field Name                      Description
5554     ======= ======= =============================== ===========================================================================
5555     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
5556                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
5557                                                     63 - accum-offset = 256.
5558     15:6    10                                      Reserved, must be 0.
5559             bits
5560     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
5561                                                       launched in the same CU.
5562                                                     - If 1 the waves of a work-group can be
5563                                                       launched in different CUs. The waves
5564                                                       cannot use S_BARRIER or LDS.
5565     31:17   15                                      Reserved, must be 0.
5566             bits
5567     32      **Total size 4 bytes.**
5568     ======= ===================================================================================================================
5569
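The ``ACCUM_OFFSET`` encoding listed above (0 means 4, 1 means 8, up to 63
meaning 256) corresponds to ``field = accum-offset / 4 - 1``. The helpers below
are a hypothetical sketch of that mapping; their names are not taken from
LLVM.

.. code-block:: python

  def encode_accum_offset(accum_offset):
      """Convert an accum-offset in registers to the 6-bit field value."""
      if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
          raise ValueError("accum-offset must be a multiple of 4 in 4..256")
      return accum_offset // 4 - 1

  def decode_accum_offset(field):
      """Convert the 6-bit field value back to an accum-offset."""
      if not 0 <= field <= 63:
          raise ValueError("ACCUM_OFFSET is a 6-bit field")
      return (field + 1) * 4

The two helpers are inverses of each other over the valid range.
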
5570..
5571
5572  .. table:: compute_pgm_rsrc3 for GFX10-GFX11
5573     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
5574
5575     ======= ======= =============================== ===========================================================================
5576     Bits    Size    Field Name                      Description
5577     ======= ======= =============================== ===========================================================================
5578     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
5579                                                     wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
5580                                                     of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
5581                                                     not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
5582     9:4     6 bits  INST_PREF_SIZE                  GFX10
5583                                                       Reserved, must be 0.
5584                                                     GFX11
5585                                                       Number of instruction bytes to prefetch, starting at the kernel's entry
5586                                                       point instruction, before wavefront starts execution. The value is 0..63
5587                                                       with a granularity of 128 bytes.
5588     10      1 bit   TRAP_ON_START                   GFX10
5589                                                       Reserved, must be 0.
5590                                                     GFX11
5591                                                       Must be 0.
5592
5593                                                       If 1, wavefront starts execution by trapping into the trap handler.
5594
5595                                                       CP is responsible for filling in the trap on start bit in
5596                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
5597                                                       requests.
5598     11      1 bit   TRAP_ON_END                     GFX10
5599                                                       Reserved, must be 0.
5600                                                     GFX11
5601                                                       Must be 0.
5602
5603                                                       If 1, wavefront execution terminates by trapping into the trap handler.
5604
5605                                                       CP is responsible for filling in the trap on end bit in
5606                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
5607     30:12   19 bits                                 Reserved, must be 0.
5608     31      1 bit   IMAGE_OP                        GFX10
5609                                                       Reserved, must be 0.
5610                                                     GFX11
5611                                                       If 1, the kernel execution contains image instructions. If executed as
5612                                                       part of a graphics pipeline, image read instructions will stall waiting
5613                                                       for any necessary ``WAIT_SYNC`` fence to be performed in order to
5614                                                       indicate that earlier pipeline stages have completed writing to the
5615                                                       image.
5616
5617                                                       Not used for compute kernels that are not part of a graphics pipeline and
5618                                                       must be 0.
5619     32      **Total size 4 bytes.**
5620     ======= ===================================================================================================================
5621
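The constraint on ``SHARED_VGPR_COUNT`` can be written as a simple check. This
is a sketch only: the function and parameter names are hypothetical, and
``rsrc1_vgprs`` stands for the ``compute_pgm_rsrc1.vgprs`` value used in the
formula above.

.. code-block:: python

  def shared_vgpr_count_is_valid(rsrc1_vgprs, shared_vgpr_count,
                                 wavefront_size):
      """Check the SHARED_VGPR_COUNT rules stated in the table above."""
      if not 0 <= shared_vgpr_count <= 15:
          return False  # 4-bit field
      if wavefront_size == 32:
          return shared_vgpr_count == 0
      # Wavefront size 64: private plus shared VGPRs must not exceed 256.
      return (rsrc1_vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256

For wavefront size 64, a ``rsrc1_vgprs`` value of 31 (128 VGPRs) leaves room
for the maximum of 15 shared blocks (120 VGPRs).
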
5622..
5623
5624  .. table:: compute_pgm_rsrc3 for GFX12
5625     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table
5626
5627     ======= ======= =============================== ===========================================================================
5628     Bits    Size    Field Name                      Description
5629     ======= ======= =============================== ===========================================================================
5630     3:0     4 bits  RESERVED                        Reserved, must be 0.
5631     11:4    8 bits  INST_PREF_SIZE                  Number of instruction bytes to prefetch, starting at the kernel's entry
5632                                                     point instruction, before wavefront starts execution. The value is 0..255
5633                                                     with a granularity of 128 bytes.
5634     12      1 bit   RESERVED                        Reserved, must be 0.
5635     13      1 bit   GLG_EN                          If 1, group launch guarantee will be enabled for this dispatch.
5636     30:14   17 bits RESERVED                        Reserved, must be 0.
5637     31      1 bit   IMAGE_OP                        If 1, the kernel execution contains image instructions. If executed as
5638                                                     part of a graphics pipeline, image read instructions will stall waiting
5639                                                     for any necessary ``WAIT_SYNC`` fence to be performed in order to
5640                                                     indicate that earlier pipeline stages have completed writing to the
5641                                                     image.
5642
5643                                                     Not used for compute kernels that are not part of a graphics pipeline and
5644                                                     must be 0.
5645     32      **Total size 4 bytes.**
5646     ======= ===================================================================================================================
5647
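The GFX12 bit layout above can be illustrated by packing the three
non-reserved fields into a 32-bit word. The function below is a hypothetical
sketch, not an LLVM interface; ``inst_pref_size`` is the field value in
128 byte units as described in the table.

.. code-block:: python

  def pack_compute_pgm_rsrc3_gfx12(inst_pref_size, glg_en=False,
                                   image_op=False):
      """Pack the GFX12 compute_pgm_rsrc3 fields; reserved bits stay 0."""
      if not 0 <= inst_pref_size <= 255:
          raise ValueError("INST_PREF_SIZE is an 8-bit field")
      word = inst_pref_size << 4    # bits 11:4
      word |= int(glg_en) << 13     # bit 13
      word |= int(image_op) << 31   # bit 31
      return word

Reserved bits 3:0, 12 and 30:14 are never set by this sketch.
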
5648..
5649
5650  .. table:: Floating Point Rounding Mode Enumeration Values
5651     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
5652
5653     ====================================== ===== ==============================
5654     Enumeration Name                       Value Description
5655     ====================================== ===== ==============================
5656     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
5657     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
5658     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
5659     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
5660     ====================================== ===== ==============================
5661
5662
5663  .. table:: Extended FLT_ROUNDS Enumeration Values
5664     :name: amdgpu-rounding-mode-enumeration-values-table
5665
5666     +------------------------+---------------+-------------------+--------------------+----------+
5667     |                        | F32 NEAR_EVEN | F32 PLUS_INFINITY | F32 MINUS_INFINITY | F32 ZERO |
5668     +------------------------+---------------+-------------------+--------------------+----------+
5669     | F64/F16 NEAR_EVEN      |      1        |        11         |        14          |     17   |
5670     +------------------------+---------------+-------------------+--------------------+----------+
5671     | F64/F16 PLUS_INFINITY  |      8        |         2         |        15          |     18   |
5672     +------------------------+---------------+-------------------+--------------------+----------+
5673     | F64/F16 MINUS_INFINITY |      9        |        12         |         3          |     19   |
5674     +------------------------+---------------+-------------------+--------------------+----------+
5675     | F64/F16 ZERO           |     10        |        13         |        16          |     0    |
5676     +------------------------+---------------+-------------------+--------------------+----------+
5677
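The table above can be read as a two-dimensional lookup indexed by the two
rounding modes. The sketch below assumes that the row and column order follows
the values in
:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`
(0 = NEAR_EVEN, 1 = PLUS_INFINITY, 2 = MINUS_INFINITY, 3 = ZERO); the names
are hypothetical.

.. code-block:: python

  # Rows: F64/F16 rounding mode. Columns: F32 rounding mode.
  EXTENDED_FLT_ROUNDS = [
      [1, 11, 14, 17],   # F64/F16 NEAR_EVEN
      [8, 2, 15, 18],    # F64/F16 PLUS_INFINITY
      [9, 12, 3, 19],    # F64/F16 MINUS_INFINITY
      [10, 13, 16, 0],   # F64/F16 ZERO
  ]

  def extended_flt_rounds(f32_mode, f64_f16_mode):
      """Look up the extended FLT_ROUNDS value for a pair of modes."""
      return EXTENDED_FLT_ROUNDS[f64_f16_mode][f32_mode]

When both modes are equal, the lookup yields the values 1, 2, 3 and 0 found on
the diagonal of the table.
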
5678..
5679
5680  .. table:: Floating Point Denorm Mode Enumeration Values
5681     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
5682
5683     ====================================== ===== ====================================
5684     Enumeration Name                       Value Description
5685     ====================================== ===== ====================================
5686     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination Denorms
5687     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
5688     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
5689     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
5690     ====================================== ===== ====================================
5691
5692  Denormal flushing is sign respecting, i.e. the behavior expected by
5693  ``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
5694  ``"denormal-fp-math"="positive-zero"``.
5695
5696..
5697
5698  .. table:: System VGPR Work-Item ID Enumeration Values
5699     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
5700
5701     ======================================== ===== ============================
5702     Enumeration Name                         Value Description
5703     ======================================== ===== ============================
5704     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
5705                                                    ID.
5706     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
5707                                                    dimensions ID.
5708     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
5709                                                    dimensions ID.
5710     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
5711     ======================================== ===== ============================
5712
5713.. _amdgpu-amdhsa-initial-kernel-execution-state:
5714
5715Initial Kernel Execution State
5716~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5717
5718This section defines the register state that will be set up by the packet
5719processor prior to the start of execution of every wavefront. This is limited by
5720the constraints of the hardware controllers of CP/ADC/SPI.
5721
5722The order of the SGPR registers is defined, but the compiler can specify which
5723ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
5724fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5725for enabled registers are dense starting at SGPR0: the first enabled register is
5726SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
5727an SGPR number.
5728
5729The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
5730all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
5731using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
5732actually initialized. These are then immediately followed by the System SGPRs
5733that are set up by ADC/SPI and can have different values for each wavefront of
5734the grid dispatch.
5735
5736SGPR register initial state is defined in
5737:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
5738
5739  .. table:: SGPR Register Set Up Order
5740     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
5741
5742     ========== ========================== ====== ==============================
5743     SGPR Order Name                       Number Description
5744                (kernel descriptor enable  of
5745                field)                     SGPRs
5746     ========== ========================== ====== ==============================
5747     First      Private Segment Buffer     4      See
5748                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5749                _segment_buffer)
5750     then       Dispatch Ptr               2      64-bit address of AQL dispatch
5751                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
5752                                                  actually executing.
5753     then       Queue Ptr                  2      64-bit address of amd_queue_t
5754                (enable_sgpr_queue_ptr)           object for AQL queue on which
5755                                                  the dispatch packet was
5756                                                  queued.
5757     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
5758                (enable_sgpr_kernarg              segment. This is directly
5759                _segment_ptr)                     copied from the
5760                                                  kernarg_address in the kernel
5761                                                  dispatch packet.
5762
5763                                                  Having CP load it once avoids
5764                                                  loading it at the beginning of
5765                                                  every wavefront.
5766     then       Dispatch Id                2      64-bit Dispatch ID of the
5767                (enable_sgpr_dispatch_id)         dispatch packet being
5768                                                  executed.
5769     then       Flat Scratch Init          2      See
5770                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5771                _init)
5772     then       Private Segment Size       1      The 32-bit byte size of a
5773                (enable_sgpr_private              single work-item's memory
5774                _segment_size)                    allocation. This is the
5775                                                  value from the kernel
5776                                                  dispatch packet Private
5777                                                  Segment Byte Size rounded up
5778                                                  by CP to a multiple of
5779                                                  DWORD.
5780
5781                                                  Having CP load it once avoids
5782                                                  loading it at the beginning of
5783                                                  every wavefront.
5784
5785                                                  This is not used for
5786                                                  GFX7-GFX8 since it is the same
5787                                                  value as the second SGPR of
5788                                                  Flat Scratch Init. However, it
5789                                                  may be needed for GFX9-GFX11 which
5790                                                  changes the meaning of the
5791                                                  Flat Scratch Init value.
5792     then       Preloaded Kernargs         N/A    See
5793                (kernarg_preload_spec             :ref:`amdgpu-amdhsa-kernarg-preload`.
5794                _length)
5795     then       Work-Group Id X            1      32-bit work-group id in X
5796                (enable_sgpr_workgroup_id         dimension of grid for
5797                _X)                               wavefront.
5798     then       Work-Group Id Y            1      32-bit work-group id in Y
5799                (enable_sgpr_workgroup_id         dimension of grid for
5800                _Y)                               wavefront.
5801     then       Work-Group Id Z            1      32-bit work-group id in Z
5802                (enable_sgpr_workgroup_id         dimension of grid for
5803                _Z)                               wavefront.
5804     then       Work-Group Info            1      {first_wavefront, 14'b0000,
5805                (enable_sgpr_workgroup            ordered_append_term[10:0],
5806                _info)                            threadgroup_size_in_wavefronts[5:0]}
5807     then       Scratch Wavefront Offset   1      See
5808                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
5809                _segment_wavefront_offset)        and
5810                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5811     ========== ========================== ====== ==============================
5812
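The dense numbering described above can be sketched as a walk over the set-up
order table. This is an illustration only: the function name and data layout
are hypothetical, the register names are the ``enable_sgpr_*`` suffixes, and
the limit of 16 initialized User SGPRs is not modeled.

.. code-block:: python

  # (name, number of SGPRs) in set-up order, from the table above.
  USER_SGPR_ORDER = [
      ("private_segment_buffer", 4),
      ("dispatch_ptr", 2),
      ("queue_ptr", 2),
      ("kernarg_segment_ptr", 2),
      ("dispatch_id", 2),
      ("flat_scratch_init", 2),
      ("private_segment_size", 1),
  ]
  SYSTEM_SGPR_ORDER = [
      ("workgroup_id_x", 1),
      ("workgroup_id_y", 1),
      ("workgroup_id_z", 1),
      ("workgroup_info", 1),
      ("private_segment_wavefront_offset", 1),
  ]

  def assign_initial_sgprs(enabled, kernarg_preload_length=0):
      """Map each enabled register to its first SGPR number."""
      first_sgpr = {}
      next_sgpr = 0
      for name, count in USER_SGPR_ORDER:
          if name in enabled:
              first_sgpr[name] = next_sgpr
              next_sgpr += count
      # Preloaded kernargs follow the last enabled User SGPR.
      if kernarg_preload_length:
          first_sgpr["preloaded_kernargs"] = next_sgpr
          next_sgpr += kernarg_preload_length
      for name, count in SYSTEM_SGPR_ORDER:
          if name in enabled:
              first_sgpr[name] = next_sgpr
              next_sgpr += count
      return first_sgpr

For example, enabling only the Kernarg Segment Ptr and Work-Group Id X places
the pointer in SGPR0-SGPR1 and the work-group id in SGPR2.
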
5813The order of the VGPR registers is defined, but the compiler can specify which
5814ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
5815fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5816for enabled registers are dense starting at VGPR0: the first enabled register is
5817VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
5818VGPR number.
5819
5820There are different methods used for the VGPR initial state:
5821
5822* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
5823  specifies otherwise, a separate VGPR register is used per work-item ID. The
5824  VGPR register initial state for this method is defined in
5825  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
5826* If the *Target Properties* column of :ref:`amdgpu-processor-table`
5827  specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
5828  for all work-item IDs. The register layout for this method is defined in
5829  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
5830
5831  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
5832     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
5833
5834     ========== ========================== ====== ==============================
5835     VGPR Order Name                       Number Description
5836                (kernel descriptor enable  of
5837                field)                     VGPRs
5838     ========== ========================== ====== ==============================
5839     First      Work-Item Id X             1      32-bit work-item id in X
5840                (Always initialized)              dimension of work-group for
5841                                                  wavefront lane.
5842     then       Work-Item Id Y             1      32-bit work-item id in Y
5843                (enable_vgpr_workitem_id          dimension of work-group for
5844                > 0)                              wavefront lane.
5845     then       Work-Item Id Z             1      32-bit work-item id in Z
5846                (enable_vgpr_workitem_id          dimension of work-group for
5847                > 1)                              wavefront lane.
5848     ========== ========================== ====== ==============================
5849
5850..
5851
5852  .. table:: Register Layout for Packed Work-Item ID Method
5853     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
5854
5855     ======= ======= ================ =========================================
5856     Bits    Size    Field Name       Description
5857     ======= ======= ================ =========================================
5858     9:0     10 bits Work-Item Id X   Work-item id in X
5859                                      dimension of work-group for
5860                                      wavefront lane.
5861
5862                                      Always initialized.
5863
5864     19:10   10 bits Work-Item Id Y   Work-item id in Y
5865                                      dimension of work-group for
5866                                      wavefront lane.
5867
5868                                      Initialized if enable_vgpr_workitem_id >
5869                                      0, otherwise set to 0.
5870     29:20   10 bits Work-Item Id Z   Work-item id in Z
5871                                      dimension of work-group for
5872                                      wavefront lane.
5873
5874                                      Initialized if enable_vgpr_workitem_id >
5875                                      1, otherwise set to 0.
5876     31:30   2 bits                   Reserved, set to 0.
5877     ======= ======= ================ =========================================
5878
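With the packed method, each work-item ID is recovered from VGPR0 by shifting
and masking a 10-bit field. The following sketch shows the arithmetic on a
plain integer; the function name is hypothetical.

.. code-block:: python

  def unpack_work_item_ids(vgpr0):
      """Split a packed VGPR0 value into (x, y, z) work-item IDs."""
      mask = 0x3FF  # each ID is 10 bits wide
      x = vgpr0 & mask
      y = (vgpr0 >> 10) & mask
      z = (vgpr0 >> 20) & mask
      return x, y, z

Fields that were not enabled through ``enable_vgpr_workitem_id`` read back as 0
because the hardware sets them to 0.
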
5879The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
5880
58811. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
5882   registers.
58832. Work-group Id registers X, Y, Z are set by ADC which supports any
5884   combination including none.
58853. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
5886   its value cannot be included with the flat scratch init value which is per
5887   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
58884. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
5889   or (X, Y, Z).
58905. Flat Scratch register pair initialization is described in
5891   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5892
5893The global segment can be accessed either using buffer instructions (GFX6 which
5894has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
5895instructions (GFX9-GFX11).
5896
5897If buffer operations are used, then the compiler can generate a V# with the
5898following properties:
5899
5900* base address of 0
5901* no swizzle
5902* ATC: 1 if IOMMU present (such as APU)
5903* ptr64: 1
5904* MTYPE set to support memory coherence that matches the runtime (such as CC for
5905  APU and NC for dGPU).
5906
5907.. _amdgpu-amdhsa-kernarg-preload:
5908
5909Preloaded Kernel Arguments
5910++++++++++++++++++++++++++
5911
5912On hardware that supports this feature, kernel arguments can be preloaded into
5913User SGPRs, up to the maximum number of User SGPRs available. The allocation of
5914Preload SGPRs occurs directly after the last enabled non-kernarg preload User
5915SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5916
5917The preloaded data is copied from the kernarg segment; the amount of data is
5918determined by the value specified in the kernarg_preload_spec_length field of
5919the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
5920number of SGPRs receiving preloaded kernarg data corresponds with the value
5921given by kernarg_preload_spec_length. The preloading starts at the dword offset
5922within the kernarg segment, which is specified by the
5923kernarg_preload_spec_offset field.
5924
5925If the kernarg_preload_spec_length is non-zero, the CP firmware will add an
5926additional 256 bytes to the kernel_code_entry_byte_offset. This addition
5927facilitates the incorporation of a prologue to the kernel entry to handle cases
5928where code designed for kernarg preloading is executed on hardware equipped with
5929incompatible firmware. If hardware has compatible firmware, the 256 bytes at the
5930start of the kernel entry will be skipped.
5931
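The relationship between the two preload fields and the kernarg segment can be
sketched as follows. The helper names are hypothetical, and both fields are
assumed to be expressed in dwords, one dword per User SGPR, as described
above.

.. code-block:: python

  def preloaded_kernarg_byte_range(spec_offset, spec_length):
      """Return the [start, end) byte range of the preloaded kernarg data."""
      start = spec_offset * 4
      return start, start + spec_length * 4

  def effective_entry_byte_offset(kernel_code_entry_byte_offset, spec_length):
      """Entry offset used by firmware that supports kernarg preloading."""
      # Compatible firmware skips the 256 byte prologue when preloading.
      skip = 256 if spec_length != 0 else 0
      return kernel_code_entry_byte_offset + skip

With an offset of 2 dwords and a length of 4, bytes 8 to 23 of the kernarg
segment are loaded into four consecutive User SGPRs.
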
5932With code object V5 and later, hidden kernel arguments that are normally
5933accessed through the Implicit Argument Ptr may be preloaded into User SGPRs.
5934These arguments are added to the kernel function signature and are marked with
5935the attributes "inreg" and "amdgpu-hidden-argument" (see
5936:ref:`amdgpu-llvm-ir-attributes-table`).
5937
5938.. _amdgpu-amdhsa-kernel-prolog:
5939
5940Kernel Prolog
5941~~~~~~~~~~~~~
5942
5943The compiler performs initialization in the kernel prologue depending on the
5944target and information about things like stack usage in the kernel and called
5945functions. Some of this initialization requires the compiler to request certain
5946User and System SGPRs be present in the
5947:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
5948:ref:`amdgpu-amdhsa-kernel-descriptor`.
5949
5950.. _amdgpu-amdhsa-kernel-prolog-cfi:
5951
5952CFI
5953+++
5954
59551.  The CFI return address is undefined.
5956
59572.  The CFI CFA is defined using an expression which evaluates to a location
5958    description that comprises one memory location description for the
5959    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
5960
5961.. _amdgpu-amdhsa-kernel-prolog-m0:
5962
5963M0
5964++
5965
5966GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. Alternatively, M0 can be set to the maximum
  possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
  GFX7-GFX8).
5972GFX9-GFX11
5973  The M0 register is not used for range checking LDS accesses and so does not
5974  need to be initialized in the prolog.
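
The choice of M0 value can be summarized with a small sketch. The function name
and its parameters are hypothetical; the constants are the ones quoted above.

```python
def m0_prolog_value(gfx_major, total_lds_size=None):
    """Return a value suitable for M0 in the prolog, or None if M0 is unused.

    gfx_major: major GFX generation (6 through 11).
    total_lds_size: total LDS size from the dispatch packet, if the prolog
        chooses to load it; otherwise the per-target maximum is used.
    """
    if gfx_major < 6:
        raise ValueError("only GFX6 and later are covered by this sketch")
    if gfx_major >= 9:
        # M0 is not used for LDS range checking on GFX9-GFX11.
        return None
    if total_lds_size is not None:
        return total_lds_size
    return 0x7FFF if gfx_major == 6 else 0xFFFF
```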
5975
5976.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
5977
5978Stack Pointer
5979+++++++++++++
5980
5981If the kernel has function calls it must set up the ABI stack pointer described
5982in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
5983SGPR32 to the unswizzled scratch offset of the address past the last local
5984allocation.
5985
5986.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
5987
5988Frame Pointer
5989+++++++++++++
5990
5991If the kernel needs a frame pointer for the reasons defined in
5992``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
5993kernel prolog. If a frame pointer is not required then all uses of the frame
5994pointer are replaced with immediate ``0`` offsets.
5995
5996.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
5997
5998Flat Scratch
5999++++++++++++
6000
6001There are different methods used for initializing flat scratch:
6002
6003* If the *Target Properties* column of :ref:`amdgpu-processor-table`
6004  specifies *Does not support generic address space*:
6005
6006  Flat scratch is not supported and there is no flat scratch register pair.
6007
6008* If the *Target Properties* column of :ref:`amdgpu-processor-table`
6009  specifies *Offset flat scratch*:
6010
6011  If the kernel or any function it calls may use flat operations to access
6012  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
6013  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
6014  Scratch Wavefront Offset SGPR registers (see
6015  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
6016
6017  1. The low word of Flat Scratch Init is the 32-bit byte offset from
6018     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
6019     being managed by SPI for the queue executing the kernel dispatch. This is
6020     the same value used in the Scratch Segment Buffer V# base address.
6021
6022     CP obtains this from the runtime. (The Scratch Segment Buffer base address
6023     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
6024
6025     The prolog must add the value of Scratch Wavefront Offset to get the
6026     wavefront's byte scratch backing memory offset from
6027     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
6028
6029     The Scratch Wavefront Offset must also be used as an offset with Private
6030     segment address when using the Scratch Segment Buffer.
6031
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
6034
6035     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
6036     SGPRn is the highest numbered SGPR allocated to the wavefront).
6037     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
6038     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
6039     FLAT SCRATCH BASE in flat memory instructions that access the scratch
6040     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
6043
6044     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
6045     checks that the value in the kernel dispatch packet Private Segment Byte
6046     Size is not larger and requests the runtime to increase the queue's scratch
6047     size if necessary.
6048
6049     CP directly loads from the kernel dispatch packet Private Segment Byte Size
6050     field and rounds up to a multiple of DWORD. Having CP load it once avoids
6051     loading it at the beginning of every wavefront.
6052
6053     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
6054     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
6055     in flat memory instructions.
6056
6057* If the *Target Properties* column of :ref:`amdgpu-processor-table`
6058  specifies *Absolute flat scratch*:
6059
6060  If the kernel or any function it calls may use flat operations to access
6061  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
6062  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
6063  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
6064  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
6065
6066  The Flat Scratch Init is the 64-bit address of the base of scratch backing
6067  memory being managed by SPI for the queue executing the kernel dispatch.
6068
6069  CP obtains this from the runtime.
6070
6071  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
6072  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
6073  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
6074  memory instructions.
6075
6076  The Scratch Wavefront Offset must also be used as an offset with Private
6077  segment address when using the Scratch Segment Buffer (see
6078  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
6079
6080* If the *Target Properties* column of :ref:`amdgpu-processor-table`
6081  specifies *Architected flat scratch*:
6082
6083  If ENABLE_PRIVATE_SEGMENT is enabled in
6084  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table` then the FLAT_SCRATCH
6085  register pair will be initialized to the 64-bit address of the base of scratch
6086  backing memory being managed by SPI for the queue executing the kernel
6087  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
6088  flat scratch base in flat memory instructions.
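
The arithmetic performed by the prolog in the *Offset flat scratch* and
*Absolute flat scratch* cases can be sketched as follows. This models only the
values written to the register pair, not the instructions emitted by the
backend. The function names are hypothetical, and wrapping the sums to 32 and
64 bits is an assumption about how the SGPR arithmetic behaves.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, scratch_wave_offset):
    """Offset flat scratch: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

    init_lo: low word of Flat Scratch Init (byte offset of the queue's
        scratch backing memory).
    init_hi: second word of Flat Scratch Init (per work-item byte size).
    """
    wave_byte_offset = (init_lo + scratch_wave_offset) & MASK32
    flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
    flat_scratch_lo = init_hi                # used as FLAT SCRATCH SIZE
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(init_addr, scratch_wave_offset):
    """Absolute flat scratch: return the 64-bit base as (low, high) dwords."""
    base = (init_addr + scratch_wave_offset) & MASK64
    return base & MASK32, base >> 32
```

For example, ``offset_flat_scratch(0x10000, 64, 0x300)`` yields a size of 64
bytes and a base of ``0x103`` in 256-byte units.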
6089
6090.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
6091
6092Private Segment Buffer
6093++++++++++++++++++++++
6094
6095If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
6096*Architected flat scratch* then a Private Segment Buffer is not supported.
6097Instead the flat SCRATCH instructions are used.
6098
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by the
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
6104
The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
always selected for the kernel as follows:
6107
6108  - If it is known during instruction selection that there is stack usage,
6109    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
6110    optimizations are disabled (``-O0``), if stack objects already exist (for
6111    locals, etc.), or if there are any function calls.
6112
6113  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
6114    are reserved for the tentative scratch V#. These will be used if it is
6115    determined that spilling is needed.
6116
6117    - If no use is made of the tentative scratch V#, then it is unreserved,
6118      and the register count is determined ignoring it.
6119    - If use is made of the tentative scratch V#, then its register numbers
6120      are shifted to the first four-aligned SGPR index after the highest one
6121      allocated by the register allocator, and all uses are updated. The
6122      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.
6126
6127    .. note::
6128
6129      This approach of using a tentative scratch V# and shifting the register
6130      numbers if used avoids having to perform register allocation a second
6131      time if the tentative V# is eliminated. This is more efficient and
6132      avoids the problem that the second register allocation may perform
6133      spilling which will fail as there is no longer a scratch V#.
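
The shift applied to a tentative scratch V# that turned out to be used can be
sketched as follows. The function name is hypothetical, and the sketch does not
model the SGPR allocation bug case, in which no shift takes place.

```python
def shifted_scratch_vsharp(highest_allocated_sgpr):
    """Return the four SGPR indices of the shifted scratch V#.

    The V# moves to the first four-aligned SGPR index after the highest SGPR
    index handed out by the register allocator.
    """
    first = (highest_allocated_sgpr + 1 + 3) & ~3  # round up to a multiple of 4
    return list(range(first, first + 4))
```

With this placement the register count includes the V# immediately above the
registers that are actually in use, rather than at its tentative high-numbered
location.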
6134
6135When the kernel prolog code is being emitted it is known whether the scratch V#
6136described above is actually used. If it is, the prolog code must set it up by
6137copying the Private Segment Buffer to the scratch V# registers and then adding
6138the Private Segment Wavefront Offset to the queue base address in the V#. The
6139result is a V# with a base address pointing to the beginning of the wavefront
6140scratch backing memory.
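
The base address adjustment can be sketched as below. This is an illustration
under a stated assumption: that the V# base address occupies the low 48 bits of
the descriptor (all of dword 0 and the low 16 bits of dword 1). That layout is
not described in this section, and the function name is hypothetical.

```python
def add_wave_offset_to_vsharp(vsharp, wave_offset):
    """Return a copy of the 4-dword V# with wave_offset added to its base.

    vsharp: tuple of four 32-bit dwords copied from the Private Segment Buffer.
    wave_offset: Private Segment Wavefront Offset in bytes.
    """
    d0, d1, d2, d3 = vsharp
    # Assumed layout: base address is bits [47:0] of the descriptor.
    base = ((d1 & 0xFFFF) << 32) | d0
    base = (base + wave_offset) & 0xFFFFFFFFFFFF
    new_d0 = base & 0xFFFFFFFF
    # Preserve the upper 16 bits of dword 1, which are not part of the base.
    new_d1 = (d1 & 0xFFFF0000) | (base >> 32)
    return (new_d0, new_d1, d2, d3)
```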
6141
6142The Private Segment Buffer is always requested, but the Private Segment
6143Wavefront Offset is only requested if it is used (see
6144:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
6145
6146.. _amdgpu-amdhsa-memory-model:
6147
6148Memory Model
6149~~~~~~~~~~~~
6150
6151This section describes the mapping of the LLVM memory model onto AMDGPU machine
6152code (see :ref:`memmodel`).
6153
6154The AMDGPU backend supports the memory synchronization scopes specified in
6155:ref:`amdgpu-memory-scopes`.
6156
6157The code sequences used to implement the memory model specify the order of
6158instructions that a single thread must execute. The ``s_waitcnt`` and cache
6159management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
6160to other memory instructions executed by the same thread. This allows them to be
6161moved earlier or later which can allow them to be combined with other instances
6162of the same instruction, or hoisted/sunk out of loops to improve performance.
6163Only the instructions related to the memory model are given; additional
6164``s_waitcnt`` instructions are required to ensure registers are defined before
6165being used. These may be able to be combined with the memory model ``s_waitcnt``
6166instructions as described above.
6167
6168The AMDGPU backend supports the following memory models:
6169
6170  HSA Memory Model [HSA]_
6171    The HSA memory model uses a single happens-before relation for all address
6172    spaces (see :ref:`amdgpu-address-spaces`).
6173  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
6185
6186``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
6187operations.
6188
6189``buffer/global/flat_load/store/atomic`` instructions to global memory are
6190termed vector memory operations.
6191
6192Private address space uses ``buffer_load/store`` using the scratch V#
6193(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
6194is accessing the memory, atomic memory orderings are not meaningful, and all
6195accesses are treated as non-atomic.
6196
6197Constant address space uses ``buffer/global_load`` instructions (or equivalent
6198scalar memory instructions). Since the constant address space contents do not
6199change during the execution of a kernel dispatch it is not legal to perform
6200stores, and atomic memory orderings are not meaningful, and all accesses are
6201treated as non-atomic.
6202
6203A memory synchronization scope wider than work-group is not meaningful for the
6204group (LDS) address space and is treated as work-group.
6205
6206The memory model does not support the region address space which is treated as
6207non-atomic.
6208
6209Acquire memory ordering is not meaningful on store atomic instructions and is
6210treated as non-atomic.
6211
Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
6214
6215Acquire-release memory ordering is not meaningful on load or store atomic
6216instructions and is treated as acquire and release respectively.
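
The normalization rules above can be restated as a small sketch. The function
names and the string spellings are hypothetical; they mirror the LLVM ordering
and sync scope names used in the code sequence tables, and the sketch is not
the backend's actual implementation.

```python
def effective_ordering(instr, ordering):
    """Return the ordering actually implemented for an atomic instruction.

    instr: "load atomic", "store atomic", or "atomicrmw".
    ordering: LLVM memory ordering name such as "acquire" or "acq_rel".
    """
    if instr == "store atomic" and ordering == "acquire":
        return "non-atomic"
    if instr == "load atomic" and ordering == "release":
        return "non-atomic"
    if ordering == "acq_rel":
        if instr == "load atomic":
            return "acquire"
        if instr == "store atomic":
            return "release"
    return ordering


def effective_scope(address_space, scope):
    """Clamp scopes wider than work-group for the local (LDS) address space."""
    if address_space == "local" and scope in ("agent", "system"):
        return "workgroup"
    return scope
```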
6217
6218The memory order also adds the single thread optimization constraints defined in
6219table
6220:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
6221
6222  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
6223     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
6224
6225     ============ ==============================================================
6226     LLVM Memory  Optimization Constraints
6227     Ordering
6228     ============ ==============================================================
6229     unordered    *none*
6230     monotonic    *none*
6231     acquire      - If a load atomic/atomicrmw then no following load/load
6232                    atomic/store/store atomic/atomicrmw/fence instruction can be
6233                    moved before the acquire.
6234                  - If a fence then same as load atomic, plus no preceding
6235                    associated fence-paired-atomic can be moved after the fence.
6236     release      - If a store atomic/atomicrmw then no preceding load/load
6237                    atomic/store/store atomic/atomicrmw/fence instruction can be
6238                    moved after the release.
6239                  - If a fence then same as store atomic, plus no following
6240                    associated fence-paired-atomic can be moved before the
6241                    fence.
6242     acq_rel      Same constraints as both acquire and release.
6243     seq_cst      - If a load atomic then same constraints as acquire, plus no
6244                    preceding sequentially consistent load atomic/store
6245                    atomic/atomicrmw/fence instruction can be moved after the
6246                    seq_cst.
6247                  - If a store atomic then the same constraints as release, plus
6248                    no following sequentially consistent load atomic/store
6249                    atomic/atomicrmw/fence instruction can be moved before the
6250                    seq_cst.
6251                  - If an atomicrmw/fence then same constraints as acq_rel.
6252     ============ ==============================================================
6253
6254The code sequences used to implement the memory model are defined in the
6255following sections:
6256
6257* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
6258* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
6259* :ref:`amdgpu-amdhsa-memory-model-gfx942`
6260* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
6261* :ref:`amdgpu-amdhsa-memory-model-gfx12`
6262
6263.. _amdgpu-fence-as:
6264
6265Fence and Address Spaces
++++++++++++++++++++++++
6267
6268LLVM fences do not have address space information, thus, fence
6269codegen usually needs to conservatively synchronize all address spaces.
6270
6271In the case of OpenCL, where fences only need to synchronize
6272user-specified address spaces, this can result in extra unnecessary waits.
6273For instance, a fence that is supposed to only synchronize local memory will
6274also have to wait on all global memory operations, which is unnecessary.
6275
6276:doc:`Memory Model Relaxation Annotations <MemoryModelRelaxationAnnotations>` can
6277be used as an optimization hint for fences to solve this problem.
6278The AMDGPU backend recognizes the following tags on fences:
6279
6280- ``amdgpu-as:local`` - fence only the local address space
- ``amdgpu-as:global`` - fence only the global address space
6282
6283.. note::
6284
6285  As an optimization hint, those tags are not guaranteed to survive until
6286  code generation. Optimizations are free to drop the tags to allow for
6287  better code optimization, at the cost of synchronizing additional address
6288  spaces.
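
How the tags narrow a fence can be sketched as follows. The function name is
hypothetical, the tags are written in the ``prefix:suffix`` form used above,
and treating a fence that carries both tags as fencing both address spaces is
an assumption. A result of ``None`` stands for the conservative default of
synchronizing all address spaces.

```python
RECOGNIZED_TAGS = {"amdgpu-as:local", "amdgpu-as:global"}


def fenced_address_spaces(mmra_tags):
    """Return the set of address spaces a fence is limited to, or None.

    mmra_tags: iterable of "prefix:suffix" strings attached to the fence.
    None means no recognized tag is present, so all address spaces must be
    synchronized.
    """
    spaces = {
        tag.split(":", 1)[1] for tag in mmra_tags if tag in RECOGNIZED_TAGS
    }
    return spaces or None
```

Because the tags are only hints, a fence whose tags were dropped by an earlier
optimization falls back to the ``None`` case.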
6289
6290.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
6291
6292Memory Model GFX6-GFX9
6293++++++++++++++++++++++
6294
6295For GFX6-GFX9:
6296
6297* Each agent has multiple shader arrays (SA).
6298* Each SA has multiple compute units (CU).
6299* Each CU has multiple SIMDs that execute wavefronts.
6300* The wavefronts for a single work-group are executed in the same CU but may be
6301  executed by different SIMDs.
6302* Each CU has a single LDS memory shared by the wavefronts of the work-groups
6303  executing on it.
6304* All LDS operations of a CU are performed as wavefront wide operations in a
6305  global order and involve no caching. Completion is reported to a wavefront in
6306  execution order.
6307* The LDS memory has multiple request queues shared by the SIMDs of a
6308  CU. Therefore, the LDS operations performed by different wavefronts of a
6309  work-group can be reordered relative to each other, which can result in
6310  reordering the visibility of vector memory operations with respect to LDS
6311  operations of other wavefronts in the same work-group. A ``s_waitcnt
6312  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6313  vector memory operations between wavefronts of a work-group, but not between
6314  operations performed by the same wavefront.
6315* The vector memory operations are performed as wavefront wide operations and
6316  completion is reported to a wavefront in execution order. The exception is
6317  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
6318  vector memory order if they access LDS memory, and out of LDS operation order
6319  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
6326* The scalar memory operations access a scalar L1 cache shared by all wavefronts
6327  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6328  scalar operations are used in a restricted way so do not impact the memory
6329  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6330* The vector and scalar memory operations use an L2 cache shared by all CUs on
6331  the same agent.
6332* The L2 cache has independent channels to service disjoint ranges of virtual
6333  addresses.
6334* Each CU has a separate request queue per channel. Therefore, the vector and
6335  scalar memory operations performed by wavefronts executing in different
6336  work-groups (which may be executing on different CUs) of an agent can be
6337  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
6338  ensure synchronization between vector memory operations of different CUs. It
6339  ensures a previous vector memory operation has completed before executing a
6340  subsequent vector memory or LDS operation and so can be used to meet the
6341  requirements of acquire and release.
6342* The L2 cache can be kept coherent with other agents on some targets, or ranges
6343  of virtual addresses can be set up to bypass it to ensure system coherence.
6344
6345Scalar memory operations are only used to access memory that is proven to not
6346change during the execution of the kernel dispatch. This includes constant
6347address space and global address space for program scope ``const`` variables.
6348Therefore, the kernel machine code does not have to maintain the scalar cache to
6349ensure it is coherent with the vector caches. The scalar and vector caches are
6350invalidated between kernel dispatches by CP since constant address space data
6351may change between kernel dispatch executions. See
6352:ref:`amdgpu-amdhsa-memory-spaces`.
6353
6354The one exception is if scalar writes are used to spill SGPR registers. In this
6355case the AMDGPU backend ensures the memory location used to spill is never
6356accessed by vector memory operations at the same time. If scalar writes are used
6357then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6358return since the locations may be used for vector memory instructions by a
6359future wavefront that uses the same scratch area, or a function call that
6360creates a frame at the same address, respectively. There is no need for a
6361``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6362
6363For kernarg backing memory:
6364
6365* CP invalidates the L1 cache at the start of each kernel dispatch.
6366* On dGPU the kernarg backing memory is allocated in host memory accessed as
6367  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
6368  causes it to be treated as non-volatile and so is not invalidated by
6369  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
6372
6373Scratch backing memory (which is used for the private address space) is accessed
6374with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6375only accessed by a single thread, and is always write-before-read, there is
6376never a need to invalidate these entries from the L1 cache. Hence all cache
6377invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6378
6379The code sequences used to implement the memory model for GFX6-GFX9 are defined
6380in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
6381
6382  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
6383     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
6384
6385     ============ ============ ============== ========== ================================
6386     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6387                  Ordering     Sync Scope     Address    GFX6-GFX9
6388                                              Space
6389     ============ ============ ============== ========== ================================
6390     **Non-Atomic**
6391     ------------------------------------------------------------------------------------
6392     load         *none*       *none*         - global   - !volatile & !nontemporal
6393                                              - generic
6394                                              - private    1. buffer/global/flat_load
6395                                              - constant
6396                                                         - !volatile & nontemporal
6397
6398                                                           1. buffer/global/flat_load
6399                                                              glc=1 slc=1
6400
6401                                                         - volatile
6402
6403                                                           1. buffer/global/flat_load
6404                                                              glc=1
6405                                                           2. s_waitcnt vmcnt(0)
6406
6407                                                            - Must happen before
6408                                                              any following volatile
6409                                                              global/generic
6410                                                              load/store.
6411                                                            - Ensures that
6412                                                              volatile
6413                                                              operations to
6414                                                              different
6415                                                              addresses will not
6416                                                              be reordered by
6417                                                              hardware.
6418
6419     load         *none*       *none*         - local    1. ds_load
6420     store        *none*       *none*         - global   - !volatile & !nontemporal
6421                                              - generic
6422                                              - private    1. buffer/global/flat_store
6423                                              - constant
6424                                                         - !volatile & nontemporal
6425
6426                                                           1. buffer/global/flat_store
6427                                                              glc=1 slc=1
6428
6429                                                         - volatile
6430
6431                                                           1. buffer/global/flat_store
6432                                                           2. s_waitcnt vmcnt(0)
6433
6434                                                            - Must happen before
6435                                                              any following volatile
6436                                                              global/generic
6437                                                              load/store.
6438                                                            - Ensures that
6439                                                              volatile
6440                                                              operations to
6441                                                              different
6442                                                              addresses will not
6443                                                              be reordered by
6444                                                              hardware.
6445
6446     store        *none*       *none*         - local    1. ds_store
6447     **Unordered Atomic**
6448     ------------------------------------------------------------------------------------
6449     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6450     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6451     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6452     **Monotonic Atomic**
6453     ------------------------------------------------------------------------------------
6454     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
6455                               - wavefront    - local
6456                               - workgroup    - generic
6457     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6458                               - system       - generic     glc=1
6459     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6460                               - wavefront    - generic
6461                               - workgroup
6462                               - agent
6463                               - system
6464     store atomic monotonic    - singlethread - local    1. ds_store
6465                               - wavefront
6466                               - workgroup
6467     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6468                               - wavefront    - generic
6469                               - workgroup
6470                               - agent
6471                               - system
6472     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
6473                               - wavefront
6474                               - workgroup
6475     **Acquire Atomic**
6476     ------------------------------------------------------------------------------------
6477     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6478                               - wavefront    - local
6479                                              - generic
6480     load atomic  acquire      - workgroup    - global   1. buffer/global_load
6481     load atomic  acquire      - workgroup    - local    1. ds/flat_load
6482                                              - generic  2. s_waitcnt lgkmcnt(0)
6483
6484                                                           - If OpenCL, omit.
6485                                                           - Must happen before
6486                                                             any following
6487                                                             global/generic
6488                                                             load/load
6489                                                             atomic/store/store
6490                                                             atomic/atomicrmw.
6491                                                           - Ensures any
6492                                                             following global
6493                                                             data read is no
6494                                                             older than a local load
6495                                                             atomic value being
6496                                                             acquired.
6497
6498     load atomic  acquire      - agent        - global   1. buffer/global_load
6499                               - system                     glc=1
6500                                                         2. s_waitcnt vmcnt(0)
6501
6502                                                           - Must happen before
6503                                                             following
6504                                                             buffer_wbinvl1_vol.
6505                                                           - Ensures the load
6506                                                             has completed
6507                                                             before invalidating
6508                                                             the cache.
6509
6510                                                         3. buffer_wbinvl1_vol
6511
6512                                                           - Must happen before
6513                                                             any following
6514                                                             global/generic
6515                                                             load/load
6516                                                             atomic/atomicrmw.
6517                                                           - Ensures that
6518                                                             following
6519                                                             loads will not see
6520                                                             stale global data.
6521
6522     load atomic  acquire      - agent        - generic  1. flat_load glc=1
6523                               - system                  2. s_waitcnt vmcnt(0) &
6524                                                            lgkmcnt(0)
6525
6526                                                           - If OpenCL, omit
6527                                                             lgkmcnt(0).
6528                                                           - Must happen before
6529                                                             following
6530                                                             buffer_wbinvl1_vol.
6531                                                           - Ensures the flat_load
6532                                                             has completed
6533                                                             before invalidating
6534                                                             the cache.
6535
6536                                                         3. buffer_wbinvl1_vol
6537
6538                                                           - Must happen before
6539                                                             any following
6540                                                             global/generic
6541                                                             load/load
6542                                                             atomic/atomicrmw.
6543                                                           - Ensures that
6544                                                             following loads
6545                                                             will not see stale
6546                                                             global data.
6547
6548     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
6549                               - wavefront    - local
6550                                              - generic
6551     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6552     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
6553                                              - generic  2. s_waitcnt lgkmcnt(0)
6554
6555                                                           - If OpenCL, omit.
6556                                                           - Must happen before
6557                                                             any following
6558                                                             global/generic
6559                                                             load/load
6560                                                             atomic/store/store
6561                                                             atomic/atomicrmw.
6562                                                           - Ensures any
6563                                                             following global
6564                                                             data read is no
6565                                                             older than a local
6566                                                             atomicrmw value
6567                                                             being acquired.
6568
6569     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6570                               - system                  2. s_waitcnt vmcnt(0)
6571
6572                                                           - Must happen before
6573                                                             following
6574                                                             buffer_wbinvl1_vol.
6575                                                           - Ensures the
6576                                                             atomicrmw has
6577                                                             completed before
6578                                                             invalidating the
6579                                                             cache.
6580
6581                                                         3. buffer_wbinvl1_vol
6582
6583                                                           - Must happen before
6584                                                             any following
6585                                                             global/generic
6586                                                             load/load
6587                                                             atomic/atomicrmw.
6588                                                           - Ensures that
6589                                                             following loads
6590                                                             will not see stale
6591                                                             global data.
6592
6593     atomicrmw    acquire      - agent        - generic  1. flat_atomic
6594                               - system                  2. s_waitcnt vmcnt(0) &
6595                                                            lgkmcnt(0)
6596
6597                                                           - If OpenCL, omit
6598                                                             lgkmcnt(0).
6599                                                           - Must happen before
6600                                                             following
6601                                                             buffer_wbinvl1_vol.
6602                                                           - Ensures the
6603                                                             atomicrmw has
6604                                                             completed before
6605                                                             invalidating the
6606                                                             cache.
6607
6608                                                         3. buffer_wbinvl1_vol
6609
6610                                                           - Must happen before
6611                                                             any following
6612                                                             global/generic
6613                                                             load/load
6614                                                             atomic/atomicrmw.
6615                                                           - Ensures that
6616                                                             following loads
6617                                                             will not see stale
6618                                                             global data.
6619
6620     fence        acquire      - singlethread *none*     *none*
6621                               - wavefront
6622     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
6623
6624                                                           - If OpenCL and
6625                                                             address space is
6626                                                             not generic, omit.
6627                                                           - See :ref:`amdgpu-fence-as` for
6628                                                             more details on fencing specific
6629                                                             address spaces.
6630                                                           - Must happen after
6631                                                             any preceding
6632                                                             local/generic load
6633                                                             atomic/atomicrmw
6634                                                             with an equal or
6635                                                             wider sync scope
6636                                                             and memory ordering
6637                                                             stronger than
6638                                                             unordered (this is
6639                                                             termed the
6640                                                             fence-paired-atomic).
6641                                                           - Must happen before
6642                                                             any following
6643                                                             global/generic
6644                                                             load/load
6645                                                             atomic/store/store
6646                                                             atomic/atomicrmw.
6647                                                           - Ensures any
6648                                                             following global
6649                                                             data read is no
6650                                                             older than the
6651                                                             value read by the
6652                                                             fence-paired-atomic.
6653
6654     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6655                               - system                     vmcnt(0)
6656
6657                                                           - If OpenCL and
6658                                                             address space is
6659                                                             not generic, omit
6660                                                             lgkmcnt(0).
6661                                                           - See :ref:`amdgpu-fence-as` for
6662                                                             more details on fencing specific
6663                                                             address spaces.
6664                                                           - Could be split into
6665                                                             separate s_waitcnt
6666                                                             vmcnt(0) and
6667                                                             s_waitcnt
6668                                                             lgkmcnt(0) to allow
6669                                                             them to be
6670                                                             independently moved
6671                                                             according to the
6672                                                             following rules.
6673                                                           - s_waitcnt vmcnt(0)
6674                                                             must happen after
6675                                                             any preceding
6676                                                             global/generic load
6677                                                             atomic/atomicrmw
6678                                                             with an equal or
6679                                                             wider sync scope
6680                                                             and memory ordering
6681                                                             stronger than
6682                                                             unordered (this is
6683                                                             termed the
6684                                                             fence-paired-atomic).
6685                                                           - s_waitcnt lgkmcnt(0)
6686                                                             must happen after
6687                                                             any preceding
6688                                                             local/generic load
6689                                                             atomic/atomicrmw
6690                                                             with an equal or
6691                                                             wider sync scope
6692                                                             and memory ordering
6693                                                             stronger than
6694                                                             unordered (this is
6695                                                             termed the
6696                                                             fence-paired-atomic).
6697                                                           - Must happen before
6698                                                             the following
6699                                                             buffer_wbinvl1_vol.
6700                                                           - Ensures that the
6701                                                             fence-paired-atomic
6702                                                             has completed
6703                                                             before invalidating
6704                                                             the
6705                                                             cache. Therefore
6706                                                             any following
6707                                                             locations read must
6708                                                             be no older than
6709                                                             the value read by
6710                                                             the
6711                                                             fence-paired-atomic.
6712
6713                                                         2. buffer_wbinvl1_vol
6714
6715                                                           - Must happen before any
6716                                                             following global/generic
6717                                                             load/load
6718                                                             atomic/store/store
6719                                                             atomic/atomicrmw.
6720                                                           - Ensures that
6721                                                             following loads
6722                                                             will not see stale
6723                                                             global data.
6724
6725     **Release Atomic**
6726     ------------------------------------------------------------------------------------
6727     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
6728                               - wavefront    - local
6729                                              - generic
6730     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6731                                              - generic
6732                                                           - If OpenCL, omit.
6733                                                           - Must happen after
6734                                                             any preceding
6735                                                             local/generic
6736                                                             load/store/load
6737                                                             atomic/store
6738                                                             atomic/atomicrmw.
6739                                                           - Must happen before
6740                                                             the following
6741                                                             store.
6742                                                           - Ensures that all
6743                                                             memory operations
6744                                                             to local have
6745                                                             completed before
6746                                                             performing the
6747                                                             store that is being
6748                                                             released.
6749
6750                                                         2. buffer/global/flat_store
6751     store atomic release      - workgroup    - local    1. ds_store
6752     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6753                               - system       - generic     vmcnt(0)
6754
6755                                                           - If OpenCL and
6756                                                             address space is
6757                                                             not generic, omit
6758                                                             lgkmcnt(0).
6759                                                           - Could be split into
6760                                                             separate s_waitcnt
6761                                                             vmcnt(0) and
6762                                                             s_waitcnt
6763                                                             lgkmcnt(0) to allow
6764                                                             them to be
6765                                                             independently moved
6766                                                             according to the
6767                                                             following rules.
6768                                                           - s_waitcnt vmcnt(0)
6769                                                             must happen after
6770                                                             any preceding
6771                                                             global/generic
6772                                                             load/store/load
6773                                                             atomic/store
6774                                                             atomic/atomicrmw.
6775                                                           - s_waitcnt lgkmcnt(0)
6776                                                             must happen after
6777                                                             any preceding
6778                                                             local/generic
6779                                                             load/store/load
6780                                                             atomic/store
6781                                                             atomic/atomicrmw.
6782                                                           - Must happen before
6783                                                             the following
6784                                                             store.
6785                                                           - Ensures that all
6786                                                             memory operations
6787                                                             to global and local
6788                                                             have completed before
6789                                                             performing the
6790                                                             store that is being
6791                                                             released.
6792
6793                                                         2. buffer/global/flat_store
6794     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
6795                               - wavefront    - local
6796                                              - generic
6797     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6798                                              - generic
6799                                                           - If OpenCL, omit.
6800                                                           - Must happen after
6801                                                             any preceding
6802                                                             local/generic
6803                                                             load/store/load
6804                                                             atomic/store
6805                                                             atomic/atomicrmw.
6806                                                           - Must happen before
6807                                                             the following
6808                                                             atomicrmw.
6809                                                           - Ensures that all
6810                                                             memory operations
6811                                                             to local have
6812                                                             completed before
6813                                                             performing the
6814                                                             atomicrmw that is
6815                                                             being released.
6816
6817                                                         2. buffer/global/flat_atomic
6818     atomicrmw    release      - workgroup    - local    1. ds_atomic
6819     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6820                               - system       - generic     vmcnt(0)
6821
6822                                                           - If OpenCL, omit
6823                                                             lgkmcnt(0).
6824                                                           - Could be split into
6825                                                             separate s_waitcnt
6826                                                             vmcnt(0) and
6827                                                             s_waitcnt
6828                                                             lgkmcnt(0) to allow
6829                                                             them to be
6830                                                             independently moved
6831                                                             according to the
6832                                                             following rules.
6833                                                           - s_waitcnt vmcnt(0)
6834                                                             must happen after
6835                                                             any preceding
6836                                                             global/generic
6837                                                             load/store/load
6838                                                             atomic/store
6839                                                             atomic/atomicrmw.
6840                                                           - s_waitcnt lgkmcnt(0)
6841                                                             must happen after
6842                                                             any preceding
6843                                                             local/generic
6844                                                             load/store/load
6845                                                             atomic/store
6846                                                             atomic/atomicrmw.
6847                                                           - Must happen before
6848                                                             the following
6849                                                             atomicrmw.
6850                                                           - Ensures that all
6851                                                             memory operations
6852                                                             to global and local
6853                                                             have completed
6854                                                             before performing
6855                                                             the atomicrmw that
6856                                                             is being released.
6857
6858                                                         2. buffer/global/flat_atomic
6859     fence        release      - singlethread *none*     *none*
6860                               - wavefront
6861     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
6862
6863                                                           - If OpenCL and
6864                                                             address space is
6865                                                             not generic, omit.
6866                                                           - See :ref:`amdgpu-fence-as` for
6867                                                             more details on fencing specific
6868                                                             address spaces.
6869                                                           - Must happen after
6870                                                             any preceding
6871                                                             local/generic
6872                                                             load/load
6873                                                             atomic/store/store
6874                                                             atomic/atomicrmw.
6875                                                           - Must happen before
6876                                                             any following store
6877                                                             atomic/atomicrmw
6878                                                             with an equal or
6879                                                             wider sync scope
6880                                                             and memory ordering
6881                                                             stronger than
6882                                                             unordered (this is
6883                                                             termed the
6884                                                             fence-paired-atomic).
6885                                                           - Ensures that all
6886                                                             memory operations
6887                                                             to local have
6888                                                             completed before
6889                                                             performing the
6890                                                             following
6891                                                             fence-paired-atomic.
6892
6893     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6894                               - system                     vmcnt(0)
6895
6896                                                           - If OpenCL and
6897                                                             address space is
6898                                                             not generic, omit
6899                                                             lgkmcnt(0).
6900                                                           - If OpenCL and
6901                                                             address space is
6902                                                             local, omit
6903                                                             vmcnt(0).
6904                                                           - See :ref:`amdgpu-fence-as` for
6905                                                             more details on fencing specific
6906                                                             address spaces.
6907                                                           - Could be split into
6908                                                             separate s_waitcnt
6909                                                             vmcnt(0) and
6910                                                             s_waitcnt
6911                                                             lgkmcnt(0) to allow
6912                                                             them to be
6913                                                             independently moved
6914                                                             according to the
6915                                                             following rules.
6916                                                           - s_waitcnt vmcnt(0)
6917                                                             must happen after
6918                                                             any preceding
6919                                                             global/generic
6920                                                             load/store/load
6921                                                             atomic/store
6922                                                             atomic/atomicrmw.
6923                                                           - s_waitcnt lgkmcnt(0)
6924                                                             must happen after
6925                                                             any preceding
6926                                                             local/generic
6927                                                             load/store/load
6928                                                             atomic/store
6929                                                             atomic/atomicrmw.
6930                                                           - Must happen before
6931                                                             any following store
6932                                                             atomic/atomicrmw
6933                                                             with an equal or
6934                                                             wider sync scope
6935                                                             and memory ordering
6936                                                             stronger than
6937                                                             unordered (this is
6938                                                             termed the
6939                                                             fence-paired-atomic).
6940                                                           - Ensures that all
6941                                                             memory operations
6942                                                             have
6943                                                             completed before
6944                                                             performing the
6945                                                             following
6946                                                             fence-paired-atomic.
6947
6948     **Acquire-Release Atomic**
6949     ------------------------------------------------------------------------------------
6950     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
6951                               - wavefront    - local
6952                                              - generic
6953     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6954
6955                                                           - If OpenCL, omit.
6956                                                           - Must happen after
6957                                                             any preceding
6958                                                             local/generic
6959                                                             load/store/load
6960                                                             atomic/store
6961                                                             atomic/atomicrmw.
6962                                                           - Must happen before
6963                                                             the following
6964                                                             atomicrmw.
6965                                                           - Ensures that all
6966                                                             memory operations
6967                                                             to local have
6968                                                             completed before
6969                                                             performing the
6970                                                             atomicrmw that is
6971                                                             being released.
6972
6973                                                         2. buffer/global_atomic
6974
6975     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
6976                                                         2. s_waitcnt lgkmcnt(0)
6977
6978                                                           - If OpenCL, omit.
6979                                                           - Must happen before
6980                                                             any following
6981                                                             global/generic
6982                                                             load/load
6983                                                             atomic/store/store
6984                                                             atomic/atomicrmw.
6985                                                           - Ensures any
6986                                                             following global
6987                                                             data read is no
6988                                                             older than the local
6989                                                             atomicrmw value being
6990                                                             acquired.
6991
6992     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
6993
6994                                                           - If OpenCL, omit.
6995                                                           - Must happen after
6996                                                             any preceding
6997                                                             local/generic
6998                                                             load/store/load
6999                                                             atomic/store
7000                                                             atomic/atomicrmw.
7001                                                           - Must happen before
7002                                                             the following
7003                                                             atomicrmw.
7004                                                           - Ensures that all
7005                                                             memory operations
7006                                                             to local have
7007                                                             completed before
7008                                                             performing the
7009                                                             atomicrmw that is
7010                                                             being released.
7011
7012                                                         2. flat_atomic
7013                                                         3. s_waitcnt lgkmcnt(0)
7014
7015                                                           - If OpenCL, omit.
7016                                                           - Must happen before
7017                                                             any following
7018                                                             global/generic
7019                                                             load/load
7020                                                             atomic/store/store
7021                                                             atomic/atomicrmw.
7022                                                           - Ensures any
7023                                                             following global
7024                                                             data read is no
7025                                                             older than a local
7026                                                             atomicrmw value being
7027                                                             acquired.
7028
7029     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7030                               - system                     vmcnt(0)
7031
7032                                                           - If OpenCL, omit
7033                                                             lgkmcnt(0).
7034                                                           - Could be split into
7035                                                             separate s_waitcnt
7036                                                             vmcnt(0) and
7037                                                             s_waitcnt
7038                                                             lgkmcnt(0) to allow
7039                                                             them to be
7040                                                             independently moved
7041                                                             according to the
7042                                                             following rules.
7043                                                           - s_waitcnt vmcnt(0)
7044                                                             must happen after
7045                                                             any preceding
7046                                                             global/generic
7047                                                             load/store/load
7048                                                             atomic/store
7049                                                             atomic/atomicrmw.
7050                                                           - s_waitcnt lgkmcnt(0)
7051                                                             must happen after
7052                                                             any preceding
7053                                                             local/generic
7054                                                             load/store/load
7055                                                             atomic/store
7056                                                             atomic/atomicrmw.
7057                                                           - Must happen before
7058                                                             the following
7059                                                             atomicrmw.
7060                                                           - Ensures that all
7061                                                             memory operations
7062                                                             to global have
7063                                                             completed before
7064                                                             performing the
7065                                                             atomicrmw that is
7066                                                             being released.
7067
7068                                                         2. buffer/global_atomic
7069                                                         3. s_waitcnt vmcnt(0)
7070
7071                                                           - Must happen before
7072                                                             following
7073                                                             buffer_wbinvl1_vol.
7074                                                           - Ensures the
7075                                                             atomicrmw has
7076                                                             completed before
7077                                                             invalidating the
7078                                                             cache.
7079
7080                                                         4. buffer_wbinvl1_vol
7081
7082                                                           - Must happen before
7083                                                             any following
7084                                                             global/generic
7085                                                             load/load
7086                                                             atomic/atomicrmw.
7087                                                           - Ensures that
7088                                                             following loads
7089                                                             will not see stale
7090                                                             global data.
7091
7092     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
7093                               - system                     vmcnt(0)
7094
7095                                                           - If OpenCL, omit
7096                                                             lgkmcnt(0).
7097                                                           - Could be split into
7098                                                             separate s_waitcnt
7099                                                             vmcnt(0) and
7100                                                             s_waitcnt
7101                                                             lgkmcnt(0) to allow
7102                                                             them to be
7103                                                             independently moved
7104                                                             according to the
7105                                                             following rules.
7106                                                           - s_waitcnt vmcnt(0)
7107                                                             must happen after
7108                                                             any preceding
7109                                                             global/generic
7110                                                             load/store/load
7111                                                             atomic/store
7112                                                             atomic/atomicrmw.
7113                                                           - s_waitcnt lgkmcnt(0)
7114                                                             must happen after
7115                                                             any preceding
7116                                                             local/generic
7117                                                             load/store/load
7118                                                             atomic/store
7119                                                             atomic/atomicrmw.
7120                                                           - Must happen before
7121                                                             the following
7122                                                             atomicrmw.
7123                                                           - Ensures that all
7124                                                             memory operations
7125                                                             to global have
7126                                                             completed before
7127                                                             performing the
7128                                                             atomicrmw that is
7129                                                             being released.
7130
7131                                                         2. flat_atomic
7132                                                         3. s_waitcnt vmcnt(0) &
7133                                                            lgkmcnt(0)
7134
7135                                                           - If OpenCL, omit
7136                                                             lgkmcnt(0).
7137                                                           - Must happen before
7138                                                             following
7139                                                             buffer_wbinvl1_vol.
7140                                                           - Ensures the
7141                                                             atomicrmw has
7142                                                             completed before
7143                                                             invalidating the
7144                                                             cache.
7145
7146                                                         4. buffer_wbinvl1_vol
7147
7148                                                           - Must happen before
7149                                                             any following
7150                                                             global/generic
7151                                                             load/load
7152                                                             atomic/atomicrmw.
7153                                                           - Ensures that
7154                                                             following loads
7155                                                             will not see stale
7156                                                             global data.
7157
7158     fence        acq_rel      - singlethread *none*     *none*
7159                               - wavefront
7160     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
7161
7162                                                           - If OpenCL and
7163                                                             address space is
7164                                                             not generic, omit.
7165                                                           - However, since LLVM
7166                                                             currently has no
7167                                                             address space on
7168                                                             the fence, the
7169                                                             instruction must
7170                                                             conservatively
7171                                                             always be generated
7172                                                             (see comment for
7173                                                             previous fence).
7174                                                           - Must happen after
7175                                                             any preceding
7176                                                             local/generic
7177                                                             load/load
7178                                                             atomic/store/store
7179                                                             atomic/atomicrmw.
7180                                                           - Must happen before
7181                                                             any following
7182                                                             global/generic
7183                                                             load/load
7184                                                             atomic/store/store
7185                                                             atomic/atomicrmw.
7186                                                           - Ensures that all
7187                                                             memory operations
7188                                                             to local have
7189                                                             completed before
7190                                                             performing any
7191                                                             following global
7192                                                             memory operations.
7193                                                           - Ensures that the
7194                                                             preceding
7195                                                             local/generic load
7196                                                             atomic/atomicrmw
7197                                                             with an equal or
7198                                                             wider sync scope
7199                                                             and memory ordering
7200                                                             stronger than
7201                                                             unordered (this is
7202                                                             termed the
7203                                                             acquire-fence-paired-atomic)
7204                                                             has completed
7205                                                             before following
7206                                                             global memory
7207                                                             operations. This
7208                                                             satisfies the
7209                                                             requirements of
7210                                                             acquire.
7211                                                           - Ensures that all
7212                                                             previous memory
7213                                                             operations have
7214                                                             completed before a
7215                                                             following
7216                                                             local/generic store
7217                                                             atomic/atomicrmw
7218                                                             with an equal or
7219                                                             wider sync scope
7220                                                             and memory ordering
7221                                                             stronger than
7222                                                             unordered (this is
7223                                                             termed the
7224                                                             release-fence-paired-atomic).
7225                                                             This satisfies the
7226                                                             requirements of
7227                                                             release.
7228
7229     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7230                               - system                     vmcnt(0)
7231
7232                                                           - If OpenCL and
7233                                                             address space is
7234                                                             not generic, omit
7235                                                             lgkmcnt(0).
7236                                                           - See :ref:`amdgpu-fence-as` for
7237                                                             more details on fencing specific
7238                                                             address spaces.
7239                                                           - Could be split into
7240                                                             separate s_waitcnt
7241                                                             vmcnt(0) and
7242                                                             s_waitcnt
7243                                                             lgkmcnt(0) to allow
7244                                                             them to be
7245                                                             independently moved
7246                                                             according to the
7247                                                             following rules.
7248                                                           - s_waitcnt vmcnt(0)
7249                                                             must happen after
7250                                                             any preceding
7251                                                             global/generic
7252                                                             load/store/load
7253                                                             atomic/store
7254                                                             atomic/atomicrmw.
7255                                                           - s_waitcnt lgkmcnt(0)
7256                                                             must happen after
7257                                                             any preceding
7258                                                             local/generic
7259                                                             load/store/load
7260                                                             atomic/store
7261                                                             atomic/atomicrmw.
7262                                                           - Must happen before
7263                                                             the following
7264                                                             buffer_wbinvl1_vol.
7265                                                           - Ensures that the
7266                                                             preceding
7267                                                             global/local/generic
7268                                                             load
7269                                                             atomic/atomicrmw
7270                                                             with an equal or
7271                                                             wider sync scope
7272                                                             and memory ordering
7273                                                             stronger than
7274                                                             unordered (this is
7275                                                             termed the
7276                                                             acquire-fence-paired-atomic)
7277                                                             has completed
7278                                                             before invalidating
7279                                                             the cache. This
7280                                                             satisfies the
7281                                                             requirements of
7282                                                             acquire.
7283                                                           - Ensures that all
7284                                                             previous memory
7285                                                             operations have
7286                                                             completed before a
7287                                                             following
7288                                                             global/local/generic
7289                                                             store
7290                                                             atomic/atomicrmw
7291                                                             with an equal or
7292                                                             wider sync scope
7293                                                             and memory ordering
7294                                                             stronger than
7295                                                             unordered (this is
7296                                                             termed the
7297                                                             release-fence-paired-atomic).
7298                                                             This satisfies the
7299                                                             requirements of
7300                                                             release.
7301
7302                                                         2. buffer_wbinvl1_vol
7303
7304                                                           - Must happen before
7305                                                             any following
7306                                                             global/generic
7307                                                             load/load
7308                                                             atomic/store/store
7309                                                             atomic/atomicrmw.
7310                                                           - Ensures that
7311                                                             following loads
7312                                                             will not see stale
7313                                                             global data. This
7314                                                             satisfies the
7315                                                             requirements of
7316                                                             acquire.
7317
7318     **Sequentially Consistent Atomic**
7319     ------------------------------------------------------------------------------------
7320     load atomic  seq_cst      - singlethread - global   *Same as corresponding
7321                               - wavefront    - local    load atomic acquire,
7322                                              - generic  except must generate
7323                                                         all instructions even
7324                                                         for OpenCL.*
7325     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
7326                                              - generic
7327
7328                                                           - Must
7329                                                             happen after
7330                                                             preceding
7331                                                             local/generic load
7332                                                             atomic/store
7333                                                             atomic/atomicrmw
7334                                                             with memory
7335                                                             ordering of seq_cst
7336                                                             and with equal or
7337                                                             wider sync scope.
7338                                                             (Note that seq_cst
7339                                                             fences have their
7340                                                             own s_waitcnt
7341                                                             lgkmcnt(0) and so do
7342                                                             not need to be
7343                                                             considered.)
7344                                                           - Ensures any
7345                                                             preceding
7346                                                             sequentially
7347                                                             consistent local
7348                                                             memory instructions
7349                                                             have completed
7350                                                             before executing
7351                                                             this sequentially
7352                                                             consistent
7353                                                             instruction. This
7354                                                             prevents reordering
7355                                                             a seq_cst store
7356                                                             followed by a
7357                                                             seq_cst load. (Note
7358                                                             that seq_cst is
7359                                                             stronger than
7360                                                             acquire/release as
7361                                                             the reordering of
7362                                                             load acquire
7363                                                             followed by a store
7364                                                             release is
7365                                                             prevented by the
7366                                                             s_waitcnt of
7367                                                             the release, but
7368                                                             there is nothing
7369                                                             preventing a store
7370                                                             release followed by
7371                                                             load acquire from
7372                                                             completing out of
7373                                                             order. The s_waitcnt
7374                                                             could be placed after
7375                                                             seq_store or before
7376                                                             the seq_load. We
7377                                                             choose the load to
7378                                                             make the s_waitcnt be
7379                                                             as late as possible
7380                                                             so that the store
7381                                                             may have already
7382                                                             completed.)
7383
7384                                                         2. *Following
7385                                                            instructions same as
7386                                                            corresponding load
7387                                                            atomic acquire,
7388                                                            except must generate
7389                                                            all instructions even
7390                                                            for OpenCL.*
7391     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
7392                                                         load atomic acquire,
7393                                                         except must generate
7394                                                         all instructions even
7395                                                         for OpenCL.*
7396
7397     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7398                               - system       - generic     vmcnt(0)
7399
7400                                                           - Could be split into
7401                                                             separate s_waitcnt
7402                                                             vmcnt(0)
7403                                                             and s_waitcnt
7404                                                             lgkmcnt(0) to allow
7405                                                             them to be
7406                                                             independently moved
7407                                                             according to the
7408                                                             following rules.
7409                                                           - s_waitcnt lgkmcnt(0)
7410                                                             must happen after
7411                                                             preceding
7412                                                             global/generic load
7413                                                             atomic/store
7414                                                             atomic/atomicrmw
7415                                                             with memory
7416                                                             ordering of seq_cst
7417                                                             and with equal or
7418                                                             wider sync scope.
7419                                                             (Note that seq_cst
7420                                                             fences have their
7421                                                             own s_waitcnt
7422                                                             lgkmcnt(0) and so do
7423                                                             not need to be
7424                                                             considered.)
7425                                                           - s_waitcnt vmcnt(0)
7426                                                             must happen after
7427                                                             preceding
7428                                                             global/generic load
7429                                                             atomic/store
7430                                                             atomic/atomicrmw
7431                                                             with memory
7432                                                             ordering of seq_cst
7433                                                             and with equal or
7434                                                             wider sync scope.
7435                                                             (Note that seq_cst
7436                                                             fences have their
7437                                                             own s_waitcnt
7438                                                             vmcnt(0) and so do
7439                                                             not need to be
7440                                                             considered.)
7441                                                           - Ensures any
7442                                                             preceding
7443                                                             sequentially
7444                                                             consistent global
7445                                                             memory instructions
7446                                                             have completed
7447                                                             before executing
7448                                                             this sequentially
7449                                                             consistent
7450                                                             instruction. This
7451                                                             prevents reordering
7452                                                             a seq_cst store
7453                                                             followed by a
7454                                                             seq_cst load. (Note
7455                                                             that seq_cst is
7456                                                             stronger than
7457                                                             acquire/release as
7458                                                             the reordering of
7459                                                             load acquire
7460                                                             followed by a store
7461                                                             release is
7462                                                             prevented by the
7463                                                             s_waitcnt of
7464                                                             the release, but
7465                                                             there is nothing
7466                                                             preventing a store
7467                                                             release followed by
7468                                                             load acquire from
7469                                                             completing out of
7470                                                             order. The s_waitcnt
7471                                                             could be placed after
7472                                                             seq_store or before
7473                                                             the seq_load. We
7474                                                             choose the load to
7475                                                             make the s_waitcnt be
7476                                                             as late as possible
7477                                                             so that the store
7478                                                             may have already
7479                                                             completed.)
7480
7481                                                         2. *Following
7482                                                            instructions same as
7483                                                            corresponding load
7484                                                            atomic acquire,
7485                                                            except must generate
7486                                                            all instructions even
7487                                                            for OpenCL.*
7488     store atomic seq_cst      - singlethread - global   *Same as corresponding
7489                               - wavefront    - local    store atomic release,
7490                               - workgroup    - generic  except must generate
7491                               - agent                   all instructions even
7492                               - system                  for OpenCL.*
7493     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
7494                               - wavefront    - local    atomicrmw acq_rel,
7495                               - workgroup    - generic  except must generate
7496                               - agent                   all instructions even
7497                               - system                  for OpenCL.*
7498     fence        seq_cst      - singlethread *none*     *Same as corresponding
7499                               - wavefront               fence acq_rel,
7500                               - workgroup               except must generate
7501                               - agent                   all instructions even
7502                               - system                  for OpenCL.*
7503     ============ ============ ============== ========== ================================
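
The seq_cst load rows above justify the leading ``s_waitcnt`` by the need to
prevent a seq_cst store followed by a seq_cst load from completing out of
order. The following is a minimal illustrative sketch of that reasoning, not
part of the backend. It enumerates interleavings of a two-thread
store-buffering test. The names ``T0``, ``T1``, ``interleavings`` and
``all_outcomes`` are hypothetical, and "reordering" is modeled simply by
letting a thread's load run before its own earlier store.

```python
# Thread 0: store x = 1, then load y.  Thread 1: store y = 1, then load x.
T0 = [("st", "x"), ("ld", "y")]
T1 = [("st", "y"), ("ld", "x")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(t0, t1):
    """Return the set of (r0, r1) values the two loads can observe."""
    results = set()
    tagged0 = [(0,) + op for op in t0]
    tagged1 = [(1,) + op for op in t1]
    for order in interleavings(tagged0, tagged1):
        mem = {"x": 0, "y": 0}
        regs = {}
        for tid, kind, var in order:
            if kind == "st":
                mem[var] = 1
            else:
                regs[tid] = mem[var]
        results.add((regs[0], regs[1]))
    return results


def all_outcomes(allow_store_load_reorder):
    """Collect outcomes, optionally letting each load pass its earlier store."""
    v0 = [T0, T0[::-1]] if allow_store_load_reorder else [T0]
    v1 = [T1, T1[::-1]] if allow_store_load_reorder else [T1]
    results = set()
    for t0 in v0:
        for t1 in v1:
            results |= outcomes(t0, t1)
    return results


# Without reordering, at least one thread must observe the other's store.
print((0, 0) in all_outcomes(False))
# With store-then-load reordering, both loads can observe the initial value.
print((0, 0) in all_outcomes(True))
```

In this model the ``(0, 0)`` outcome appears only when a load is allowed to
pass the preceding store, which is the reordering the wait before the seq_cst
load is described as preventing.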
7504
7505.. _amdgpu-amdhsa-memory-model-gfx90a:
7506
7507Memory Model GFX90A
7508+++++++++++++++++++
7509
7510For GFX90A:
7511
7512* Each agent has multiple shader arrays (SA).
7513* Each SA has multiple compute units (CU).
7514* Each CU has multiple SIMDs that execute wavefronts.
7515* The wavefronts for a single work-group are executed in the same CU but may be
7516  executed by different SIMDs. The exception is tgsplit execution mode, in
7517  which the wavefronts may be executed by different SIMDs in different CUs.
7518* Each CU has a single LDS memory shared by the wavefronts of the work-groups
7519  executing on it. The exception is tgsplit execution mode, in which no LDS is
7520  allocated because wavefronts of the same work-group can be in different CUs.
7521* All LDS operations of a CU are performed as wavefront wide operations in a
7522  global order and involve no caching. Completion is reported to a wavefront in
7523  execution order.
7524* The LDS memory has multiple request queues shared by the SIMDs of a
7525  CU. Therefore, the LDS operations performed by different wavefronts of a
7526  work-group can be reordered relative to each other, which can result in
7527  reordering the visibility of vector memory operations with respect to LDS
7528  operations of other wavefronts in the same work-group. A ``s_waitcnt
7529  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
7530  vector memory operations between wavefronts of a work-group, but not between
7531  operations performed by the same wavefront.
7532* The vector memory operations are performed as wavefront wide operations and
7533  completion is reported to a wavefront in execution order. The exception is
7534  that ``flat_load/store/atomic`` instructions can report out of vector memory
7535  order if they access LDS memory, and out of LDS operation order if they access
7536  global memory.
7537* The vector memory operations access a single vector L1 cache shared by all
7538  SIMDs of a CU. Therefore:
7539
7540  * No special action is required for coherence between the lanes of a single
7541    wavefront.
7542
7543  * No special action is required for coherence between wavefronts in the same
7544    work-group since they execute on the same CU. The exception is tgsplit
7545    execution mode, in which wavefronts of the same work-group can be in
7546    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
7547    the following item.
7548
7549  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
7550    executing in different work-groups as they may be executing on different
7551    CUs.
7552
7553* The scalar memory operations access a scalar L1 cache shared by all wavefronts
7554  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
7555  scalar operations are used in a restricted way so do not impact the memory
7556  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
7557* The vector and scalar memory operations use an L2 cache shared by all CUs on
7558  the same agent.
7559
7560  * The L2 cache has independent channels to service disjoint ranges of virtual
7561    addresses.
7562  * Each CU has a separate request queue per channel. Therefore, the vector and
7563    scalar memory operations performed by wavefronts executing in different
7564    work-groups (which may be executing on different CUs), or the same
7565    work-group if executing in tgsplit mode, of an agent can be reordered
7566    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
7567    synchronization between vector memory operations of different CUs. It
7568    ensures a previous vector memory operation has completed before executing a
7569    subsequent vector memory or LDS operation and so can be used to meet the
7570    requirements of acquire and release.
7571  * The L2 cache of one agent can be kept coherent with other agents by:
7572    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
7573    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
7574    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
7575
7576    * Any local memory cache lines will be automatically invalidated by writes
7577      from CUs associated with other L2 caches, or writes from the CPU, due to
7578      the cache probe caused by coherent requests. Coherent requests are caused
7579      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
7580      XGMI, and by PCIe requests that are configured to be coherent requests.
7581    * XGMI accesses from the CPU to local memory may be cached on the CPU.
7582      Subsequent access from the GPU will automatically invalidate or writeback
7583      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
7584    * Since all work-groups on the same agent share the same L2, no L2
7585      invalidation or writeback is required for coherence.
7586    * To ensure coherence of local and remote memory writes of work-groups in
7587      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
7588      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
7589      (used for remote coarse grain memory). Note that MTYPE CC (used for local
7590      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
7591      remote fine grain memory) bypasses the L2, so both will never result in
7592      dirty L2 cache lines.
7593    * To ensure coherence of local and remote memory reads of work-groups in
7594      different agents a ``buffer_invl2`` is required. It will invalidate L2
7595      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
7596      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
7597      coarse grain memory) cause local reads to be invalidated by remote writes
7598      with the PTE C-bit so these cache lines are not invalidated. Note that
7599      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
7600      never result in L2 cache lines that need to be invalidated.
7601
7602  * PCIe access from the GPU to the CPU memory is kept coherent by using the
7603    MTYPE UC (uncached) which bypasses the L2.
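
The MTYPE rules in the list above can be summarized as a small lookup table.
This is only an illustrative sketch of the prose, not a hardware or driver
interface. The dictionary ``MTYPE`` and the helper names are hypothetical.

```python
# Per-MTYPE L2 behavior as described in the list above (illustrative only).
#   can_be_dirty: the L2 may hold dirty lines of this type, so buffer_wbl2
#                 has something to write back.
#   needs_invl2:  buffer_invl2 invalidates lines of this type.
MTYPE = {
    "RW": {"use": "local coarse grain", "can_be_dirty": True, "needs_invl2": False},
    "NC": {"use": "remote coarse grain", "can_be_dirty": True, "needs_invl2": True},
    "CC": {"use": "local fine grain", "can_be_dirty": False, "needs_invl2": False},
    "UC": {"use": "remote fine grain", "can_be_dirty": False, "needs_invl2": False},
}


def written_back_by_wbl2():
    """MTYPEs whose dirty L2 lines a buffer_wbl2 writes back."""
    return sorted(m for m, p in MTYPE.items() if p["can_be_dirty"])


def invalidated_by_invl2():
    """MTYPEs whose L2 lines a buffer_invl2 invalidates."""
    return sorted(m for m, p in MTYPE.items() if p["needs_invl2"])


print(written_back_by_wbl2())
print(invalidated_by_invl2())
```

MTYPE CC is write-through and MTYPE UC bypasses the L2, so neither leaves
dirty lines behind. MTYPE RW and MTYPE CC lines are kept coherent by probes,
so only MTYPE NC lines need an explicit invalidate.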
7604
7605Scalar memory operations are only used to access memory that is proven to not
7606change during the execution of the kernel dispatch. This includes constant
7607address space and global address space for program scope ``const`` variables.
7608Therefore, the kernel machine code does not have to maintain the scalar cache to
7609ensure it is coherent with the vector caches. The scalar and vector caches are
7610invalidated between kernel dispatches by CP since constant address space data
7611may change between kernel dispatch executions. See
7612:ref:`amdgpu-amdhsa-memory-spaces`.
7613
7614The one exception is if scalar writes are used to spill SGPR registers. In this
7615case the AMDGPU backend ensures the memory location used to spill is never
7616accessed by vector memory operations at the same time. If scalar writes are used
7617then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
7618return since the locations may be used for vector memory instructions by a
7619future wavefront that uses the same scratch area, or a function call that
7620creates a frame at the same address, respectively. There is no need for a
7621``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
7622
7623For kernarg backing memory:
7624
7625* CP invalidates the L1 cache at the start of each kernel dispatch.
7626* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
7627  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
7628  cache. This also causes it to be treated as non-volatile and so is not
7629  invalidated by ``*_vol``.
7630* On APU the kernarg backing memory is accessed as MTYPE CC (cache-coherent) and
7631  so the L2 cache will be coherent with the CPU and other agents.
7632
7633Scratch backing memory (which is used for the private address space) is accessed
7634with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
7635only accessed by a single thread, and is always write-before-read, there is
7636never a need to invalidate these entries from the L1 cache. Hence all cache
7637invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
7638
7639The code sequences used to implement the memory model for GFX90A are defined
7640in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
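
As a reading aid for the table, the following minimal sketch encodes only the
``load atomic acquire`` rows at ``workgroup`` scope, showing how tgsplit
execution mode and the OpenCL relaxation change the emitted sequence. It is
not the backend's implementation. The function name
``load_acquire_workgroup`` and its parameters are hypothetical.

```python
def load_acquire_workgroup(addr_space, tgsplit, opencl=False):
    """Return the GFX90A instruction sequence for a workgroup-scope
    load atomic acquire, following the rows of the table below."""
    if addr_space == "global":
        if not tgsplit:
            # glc=1, the wait and the invalidate are all omitted.
            return ["buffer/global_load"]
        return ["buffer/global_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_wbinvl1_vol"]
    if addr_space == "local":
        if tgsplit:
            raise ValueError("local address space cannot be used in tgsplit mode")
        seq = ["ds_load"]
        if not opencl:
            seq.append("s_waitcnt lgkmcnt(0)")
        return seq
    if addr_space == "generic":
        if tgsplit:
            return ["flat_load glc=1",
                    "s_waitcnt vmcnt(0)",
                    "buffer_wbinvl1_vol"]
        seq = ["flat_load"]
        if not opencl:
            seq.append("s_waitcnt lgkmcnt(0)")
        return seq
    raise ValueError("unsupported address space: " + addr_space)


print(load_acquire_workgroup("global", tgsplit=False))
print(load_acquire_workgroup("generic", tgsplit=True))
```

Outside tgsplit mode the wavefronts of a work-group share one vector L1
cache, which is why the global case reduces to a plain load.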
7641
7642  .. table:: AMDHSA Memory Model Code Sequences GFX90A
7643     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
7644
7645     ============ ============ ============== ========== ================================
7646     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
7647                  Ordering     Sync Scope     Address    GFX90A
7648                                              Space
7649     ============ ============ ============== ========== ================================
7650     **Non-Atomic**
7651     ------------------------------------------------------------------------------------
7652     load         *none*       *none*         - global   - !volatile & !nontemporal
7653                                              - generic
7654                                              - private    1. buffer/global/flat_load
7655                                              - constant
7656                                                         - !volatile & nontemporal
7657
7658                                                           1. buffer/global/flat_load
7659                                                              glc=1 slc=1
7660
7661                                                         - volatile
7662
7663                                                           1. buffer/global/flat_load
7664                                                              glc=1
7665                                                           2. s_waitcnt vmcnt(0)
7666
7667                                                            - Must happen before
7668                                                              any following volatile
7669                                                              global/generic
7670                                                              load/store.
7671                                                            - Ensures that
7672                                                              volatile
7673                                                              operations to
7674                                                              different
7675                                                              addresses will not
7676                                                              be reordered by
7677                                                              hardware.
7678
7679     load         *none*       *none*         - local    1. ds_load
7680     store        *none*       *none*         - global   - !volatile & !nontemporal
7681                                              - generic
7682                                              - private    1. buffer/global/flat_store
7683                                              - constant
7684                                                         - !volatile & nontemporal
7685
7686                                                           1. buffer/global/flat_store
7687                                                              glc=1 slc=1
7688
7689                                                         - volatile
7690
7691                                                           1. buffer/global/flat_store
7692                                                           2. s_waitcnt vmcnt(0)
7693
7694                                                            - Must happen before
7695                                                              any following volatile
7696                                                              global/generic
7697                                                              load/store.
7698                                                            - Ensures that
7699                                                              volatile
7700                                                              operations to
7701                                                              different
7702                                                              addresses will not
7703                                                              be reordered by
7704                                                              hardware.
7705
7706     store        *none*       *none*         - local    1. ds_store
7707     **Unordered Atomic**
7708     ------------------------------------------------------------------------------------
7709     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
7710     store atomic unordered    *any*          *any*      *Same as non-atomic*.
7711     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
7712     **Monotonic Atomic**
7713     ------------------------------------------------------------------------------------
7714     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
7715                               - wavefront    - generic
7716     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
7717                                              - generic     glc=1
7718
7719                                                           - If not TgSplit execution
7720                                                             mode, omit glc=1.
7721
7722     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
7723                               - wavefront               local address space cannot
7724                               - workgroup               be used.*
7725
7726                                                         1. ds_load
7727     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
7728                                              - generic     glc=1
7729     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
7730                                              - generic     glc=1
7731     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
7732                               - wavefront    - generic
7733                               - workgroup
7734                               - agent
7735     store atomic monotonic    - system       - global   1. buffer/global/flat_store
7736                                              - generic
7737     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
7738                               - wavefront               local address space cannot
7739                               - workgroup               be used.*
7740
7741                                                         1. ds_store
7742     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
7743                               - wavefront    - generic
7744                               - workgroup
7745                               - agent
7746     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
7747                                              - generic
7748     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
7749                               - wavefront               local address space cannot
7750                               - workgroup               be used.*
7751
7752                                                         1. ds_atomic
7753     **Acquire Atomic**
7754     ------------------------------------------------------------------------------------
7755     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
7756                               - wavefront    - local
7757                                              - generic
7758     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
7759
7760                                                           - If not TgSplit execution
7761                                                             mode, omit glc=1.
7762
7763                                                         2. s_waitcnt vmcnt(0)
7764
7765                                                           - If not TgSplit execution
7766                                                             mode, omit.
7767                                                           - Must happen before the
7768                                                             following buffer_wbinvl1_vol.
7769
7770                                                         3. buffer_wbinvl1_vol
7771
7772                                                           - If not TgSplit execution
7773                                                             mode, omit.
7774                                                           - Must happen before
7775                                                             any following
7776                                                             global/generic
7777                                                             load/load
7778                                                             atomic/store/store
7779                                                             atomic/atomicrmw.
7780                                                           - Ensures that
7781                                                             following
7782                                                             loads will not see
7783                                                             stale data.
7784
7785     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
7786                                                         local address space cannot
7787                                                         be used.*
7788
7789                                                         1. ds_load
7790                                                         2. s_waitcnt lgkmcnt(0)
7791
7792                                                           - If OpenCL, omit.
7793                                                           - Must happen before
7794                                                             any following
7795                                                             global/generic
7796                                                             load/load
7797                                                             atomic/store/store
7798                                                             atomic/atomicrmw.
7799                                                           - Ensures any
7800                                                             following global
7801                                                             data read is no
7802                                                             older than the local load
7803                                                             atomic value being
7804                                                             acquired.
7805
7806     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
7807
7808                                                           - If not TgSplit execution
7809                                                             mode, omit glc=1.
7810
7811                                                         2. s_waitcnt lgkm/vmcnt(0)
7812
7813                                                           - Use lgkmcnt(0) if not
7814                                                             TgSplit execution mode
7815                                                             and vmcnt(0) if TgSplit
7816                                                             execution mode.
7817                                                           - If OpenCL, omit lgkmcnt(0).
7818                                                           - Must happen before
7819                                                             the following
7820                                                             buffer_wbinvl1_vol and any
7821                                                             following global/generic
7822                                                             load/load
7823                                                             atomic/store/store
7824                                                             atomic/atomicrmw.
7825                                                           - Ensures any
7826                                                             following global
7827                                                             data read is no
7828                                                             older than a local load
7829                                                             atomic value being
7830                                                             acquired.
7831
7832                                                         3. buffer_wbinvl1_vol
7833
7834                                                           - If not TgSplit execution
7835                                                             mode, omit.
7836                                                           - Ensures that
7837                                                             following
7838                                                             loads will not see
7839                                                             stale data.
7840
7841     load atomic  acquire      - agent        - global   1. buffer/global_load
7842                                                            glc=1
7843                                                         2. s_waitcnt vmcnt(0)
7844
7845                                                           - Must happen before
7846                                                             following
7847                                                             buffer_wbinvl1_vol.
7848                                                           - Ensures the load
7849                                                             has completed
7850                                                             before invalidating
7851                                                             the cache.
7852
7853                                                         3. buffer_wbinvl1_vol
7854
7855                                                           - Must happen before
7856                                                             any following
7857                                                             global/generic
7858                                                             load/load
7859                                                             atomic/atomicrmw.
7860                                                           - Ensures that
7861                                                             following
7862                                                             loads will not see
7863                                                             stale global data.
7864
7865     load atomic  acquire      - system       - global   1. buffer/global/flat_load
7866                                                            glc=1
7867                                                         2. s_waitcnt vmcnt(0)
7868
7869                                                           - Must happen before
7870                                                             following buffer_invl2 and
7871                                                             buffer_wbinvl1_vol.
7872                                                           - Ensures the load
7873                                                             has completed
7874                                                             before invalidating
7875                                                             the caches.
7876
7877                                                         3. buffer_invl2;
7878                                                            buffer_wbinvl1_vol
7879
7880                                                           - Must happen before
7881                                                             any following
7882                                                             global/generic
7883                                                             load/load
7884                                                             atomic/atomicrmw.
7885                                                           - Ensures that
7886                                                             following
7887                                                             loads will not see
7888                                                             stale L1 global data,
7889                                                             nor see stale L2 MTYPE
7890                                                             NC global data.
7891                                                             MTYPE RW and CC memory will
7892                                                             never be stale in L2 due to
7893                                                             the memory probes.
7894
7895     load atomic  acquire      - agent        - generic  1. flat_load glc=1
7896                                                         2. s_waitcnt vmcnt(0) &
7897                                                            lgkmcnt(0)
7898
7899                                                           - If TgSplit execution mode,
7900                                                             omit lgkmcnt(0).
7901                                                           - If OpenCL, omit
7902                                                             lgkmcnt(0).
7903                                                           - Must happen before
7904                                                             following
7905                                                             buffer_wbinvl1_vol.
7906                                                           - Ensures the flat_load
7907                                                             has completed
7908                                                             before invalidating
7909                                                             the cache.
7910
7911                                                         3. buffer_wbinvl1_vol
7912
7913                                                           - Must happen before
7914                                                             any following
7915                                                             global/generic
7916                                                             load/load
7917                                                             atomic/atomicrmw.
7918                                                           - Ensures that
7919                                                             following loads
7920                                                             will not see stale
7921                                                             global data.
7922
7923     load atomic  acquire      - system       - generic  1. flat_load glc=1
7924                                                         2. s_waitcnt vmcnt(0) &
7925                                                            lgkmcnt(0)
7926
7927                                                           - If TgSplit execution mode,
7928                                                             omit lgkmcnt(0).
7929                                                           - If OpenCL, omit
7930                                                             lgkmcnt(0).
7931                                                           - Must happen before
7932                                                             following
7933                                                             buffer_invl2 and
7934                                                             buffer_wbinvl1_vol.
7935                                                           - Ensures the flat_load
7936                                                             has completed
7937                                                             before invalidating
7938                                                             the caches.
7939
7940                                                         3. buffer_invl2;
7941                                                            buffer_wbinvl1_vol
7942
7943                                                           - Must happen before
7944                                                             any following
7945                                                             global/generic
7946                                                             load/load
7947                                                             atomic/atomicrmw.
7948                                                           - Ensures that
7949                                                             following
7950                                                             loads will not see
7951                                                             stale L1 global data,
7952                                                             nor see stale L2 MTYPE
7953                                                             NC global data.
7954                                                             MTYPE RW and CC memory will
7955                                                             never be stale in L2 due to
7956                                                             the memory probes.
7957
7958     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
7959                               - wavefront    - generic
7960     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
7961                               - wavefront               local address space cannot
7962                                                         be used.*
7963
7964                                                         1. ds_atomic
7965     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
7966                                                         2. s_waitcnt vmcnt(0)
7967
7968                                                           - If not TgSplit execution
7969                                                             mode, omit.
7970                                                           - Must happen before the
7971                                                             following buffer_wbinvl1_vol.
7972                                                           - Ensures the atomicrmw
7973                                                             has completed
7974                                                             before invalidating
7975                                                             the cache.
7976
7977                                                         3. buffer_wbinvl1_vol
7978
7979                                                           - If not TgSplit execution
7980                                                             mode, omit.
7981                                                           - Must happen before
7982                                                             any following
7983                                                             global/generic
7984                                                             load/load
7985                                                             atomic/atomicrmw.
7986                                                           - Ensures that
7987                                                             following loads
7988                                                             will not see stale
7989                                                             global data.
7990
7991     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
7992                                                         local address space cannot
7993                                                         be used.*
7994
7995                                                         1. ds_atomic
7996                                                         2. s_waitcnt lgkmcnt(0)
7997
7998                                                           - If OpenCL, omit.
7999                                                           - Must happen before
8000                                                             any following
8001                                                             global/generic
8002                                                             load/load
8003                                                             atomic/store/store
8004                                                             atomic/atomicrmw.
8005                                                           - Ensures any
8006                                                             following global
8007                                                             data read is no
8008                                                             older than the local
8009                                                             atomicrmw value
8010                                                             being acquired.
8011
8012     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
8013                                                         2. s_waitcnt lgkm/vmcnt(0)
8014
8015                                                           - Use lgkmcnt(0) if not
8016                                                             TgSplit execution mode
8017                                                             and vmcnt(0) if TgSplit
8018                                                             execution mode.
8019                                                           - If OpenCL, omit lgkmcnt(0).
8020                                                           - Must happen before
8021                                                             the following
8022                                                             buffer_wbinvl1_vol and
8023                                                             any following
8024                                                             global/generic
8025                                                             load/load
8026                                                             atomic/store/store
8027                                                             atomic/atomicrmw.
8028                                                           - Ensures any
8029                                                             following global
8030                                                             data read is no
8031                                                             older than a local
8032                                                             atomicrmw value
8033                                                             being acquired.
8034
8035                                                         3. buffer_wbinvl1_vol
8036
8037                                                           - If not TgSplit execution
8038                                                             mode, omit.
8039                                                           - Ensures that
8040                                                             following
8041                                                             loads will not see
8042                                                             stale data.
8043
8044     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
8045                                                         2. s_waitcnt vmcnt(0)
8046
8047                                                           - Must happen before
8048                                                             following
8049                                                             buffer_wbinvl1_vol.
8050                                                           - Ensures the
8051                                                             atomicrmw has
8052                                                             completed before
8053                                                             invalidating the
8054                                                             cache.
8055
8056                                                         3. buffer_wbinvl1_vol
8057
8058                                                           - Must happen before
8059                                                             any following
8060                                                             global/generic
8061                                                             load/load
8062                                                             atomic/atomicrmw.
8063                                                           - Ensures that
8064                                                             following loads
8065                                                             will not see stale
8066                                                             global data.
8067
8068     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
8069                                                         2. s_waitcnt vmcnt(0)
8070
8071                                                           - Must happen before
8072                                                             following buffer_invl2 and
8073                                                             buffer_wbinvl1_vol.
8074                                                           - Ensures the
8075                                                             atomicrmw has
8076                                                             completed before
8077                                                             invalidating the
8078                                                             caches.
8079
8080                                                         3. buffer_invl2;
8081                                                            buffer_wbinvl1_vol
8082
8083                                                           - Must happen before
8084                                                             any following
8085                                                             global/generic
8086                                                             load/load
8087                                                             atomic/atomicrmw.
8088                                                           - Ensures that
8089                                                             following
8090                                                             loads will not see
8091                                                             stale L1 global data,
8092                                                             nor see stale L2 MTYPE
8093                                                             NC global data.
8094                                                             MTYPE RW and CC memory will
8095                                                             never be stale in L2 due to
8096                                                             the memory probes.
8097
8098     atomicrmw    acquire      - agent        - generic  1. flat_atomic
8099                                                         2. s_waitcnt vmcnt(0) &
8100                                                            lgkmcnt(0)
8101
8102                                                           - If TgSplit execution mode,
8103                                                             omit lgkmcnt(0).
8104                                                           - If OpenCL, omit
8105                                                             lgkmcnt(0).
8106                                                           - Must happen before
8107                                                             following
8108                                                             buffer_wbinvl1_vol.
8109                                                           - Ensures the
8110                                                             atomicrmw has
8111                                                             completed before
8112                                                             invalidating the
8113                                                             cache.
8114
8115                                                         3. buffer_wbinvl1_vol
8116
8117                                                           - Must happen before
8118                                                             any following
8119                                                             global/generic
8120                                                             load/load
8121                                                             atomic/atomicrmw.
8122                                                           - Ensures that
8123                                                             following loads
8124                                                             will not see stale
8125                                                             global data.
8126
8127     atomicrmw    acquire      - system       - generic  1. flat_atomic
8128                                                         2. s_waitcnt vmcnt(0) &
8129                                                            lgkmcnt(0)
8130
8131                                                           - If TgSplit execution mode,
8132                                                             omit lgkmcnt(0).
8133                                                           - If OpenCL, omit
8134                                                             lgkmcnt(0).
8135                                                           - Must happen before
8136                                                             following
8137                                                             buffer_invl2 and
8138                                                             buffer_wbinvl1_vol.
8139                                                           - Ensures the
8140                                                             atomicrmw has
8141                                                             completed before
8142                                                             invalidating the
8143                                                             caches.
8144
8145                                                         3. buffer_invl2;
8146                                                            buffer_wbinvl1_vol
8147
8148                                                           - Must happen before
8149                                                             any following
8150                                                             global/generic
8151                                                             load/load
8152                                                             atomic/atomicrmw.
8153                                                           - Ensures that
8154                                                             following
8155                                                             loads will not see
8156                                                             stale L1 global data,
8157                                                             nor see stale L2 MTYPE
8158                                                             NC global data.
8159                                                             MTYPE RW and CC memory will
8160                                                             never be stale in L2 due to
8161                                                             the memory probes.
8162
8163     fence        acquire      - singlethread *none*     *none*
8164                               - wavefront
8165     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8166
8167                                                           - Use lgkmcnt(0) if not
8168                                                             TgSplit execution mode
8169                                                             and vmcnt(0) if TgSplit
8170                                                             execution mode.
8171                                                           - If OpenCL and
8172                                                             address space is
8173                                                             not generic, omit
8174                                                             lgkmcnt(0).
8175                                                           - If OpenCL and
8176                                                             address space is
8177                                                             local, omit
8178                                                             vmcnt(0).
8179                                                           - See :ref:`amdgpu-fence-as` for
8180                                                             more details on fencing specific
8181                                                             address spaces.
8182                                                           - s_waitcnt vmcnt(0)
8183                                                             must happen after
8184                                                             any preceding
8185                                                             global/generic load
8186                                                             atomic/atomicrmw
8188                                                             with an equal or
8189                                                             wider sync scope
8190                                                             and memory ordering
8191                                                             stronger than
8192                                                             unordered (this is
8193                                                             termed the
8194                                                             fence-paired-atomic).
8195                                                           - s_waitcnt lgkmcnt(0)
8196                                                             must happen after
8197                                                             any preceding
8198                                                             local/generic load
8199                                                             atomic/atomicrmw
8200                                                             with an equal or
8201                                                             wider sync scope
8202                                                             and memory ordering
8203                                                             stronger than
8204                                                             unordered (this is
8205                                                             termed the
8206                                                             fence-paired-atomic).
8207                                                           - Must happen before
8208                                                             the following
8209                                                             buffer_wbinvl1_vol and
8210                                                             any following
8211                                                             global/generic
8212                                                             load/load
8213                                                             atomic/store/store
8214                                                             atomic/atomicrmw.
8215                                                           - Ensures any
8216                                                             following global
8217                                                             data read is no
8218                                                             older than the
8219                                                             value read by the
8220                                                             fence-paired-atomic.
8221
8222                                                         2. buffer_wbinvl1_vol
8223
8224                                                           - If not TgSplit execution
8225                                                             mode, omit.
8226                                                           - Ensures that
8227                                                             following
8228                                                             loads will not see
8229                                                             stale data.
8230
8231     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8232                                                            vmcnt(0)
8233
8234                                                           - If TgSplit execution mode,
8235                                                             omit lgkmcnt(0).
8236                                                           - If OpenCL and
8237                                                             address space is
8238                                                             not generic, omit
8239                                                             lgkmcnt(0).
8240                                                           - See :ref:`amdgpu-fence-as` for
8241                                                             more details on fencing specific
8242                                                             address spaces.
8243                                                           - Could be split into
8244                                                             separate s_waitcnt
8245                                                             vmcnt(0) and
8246                                                             s_waitcnt
8247                                                             lgkmcnt(0) to allow
8248                                                             them to be
8249                                                             independently moved
8250                                                             according to the
8251                                                             following rules.
8252                                                           - s_waitcnt vmcnt(0)
8253                                                             must happen after
8254                                                             any preceding
8255                                                             global/generic load
8256                                                             atomic/atomicrmw
8257                                                             with an equal or
8258                                                             wider sync scope
8259                                                             and memory ordering
8260                                                             stronger than
8261                                                             unordered (this is
8262                                                             termed the
8263                                                             fence-paired-atomic).
8264                                                           - s_waitcnt lgkmcnt(0)
8265                                                             must happen after
8266                                                             any preceding
8267                                                             local/generic load
8268                                                             atomic/atomicrmw
8269                                                             with an equal or
8270                                                             wider sync scope
8271                                                             and memory ordering
8272                                                             stronger than
8273                                                             unordered (this is
8274                                                             termed the
8275                                                             fence-paired-atomic).
8276                                                           - Must happen before
8277                                                             the following
8278                                                             buffer_wbinvl1_vol.
8279                                                           - Ensures that the
8280                                                             fence-paired-atomic
8281                                                             has completed
8282                                                             before invalidating
8283                                                             the
8284                                                             cache. Therefore
8285                                                             any following
8286                                                             locations read must
8287                                                             be no older than
8288                                                             the value read by
8289                                                             the
8290                                                             fence-paired-atomic.
8291
8292                                                         2. buffer_wbinvl1_vol
8293
8294                                                           - Must happen before any
8295                                                             following global/generic
8296                                                             load/load
8297                                                             atomic/store/store
8298                                                             atomic/atomicrmw.
8299                                                           - Ensures that
8300                                                             following loads
8301                                                             will not see stale
8302                                                             global data.
8303
8304     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
8305                                                            vmcnt(0)
8306
8307                                                           - If TgSplit execution mode,
8308                                                             omit lgkmcnt(0).
8309                                                           - If OpenCL and
8310                                                             address space is
8311                                                             not generic, omit
8312                                                             lgkmcnt(0).
8313                                                           - See :ref:`amdgpu-fence-as` for
8314                                                             more details on fencing specific
8315                                                             address spaces.
8316                                                           - Could be split into
8317                                                             separate s_waitcnt
8318                                                             vmcnt(0) and
8319                                                             s_waitcnt
8320                                                             lgkmcnt(0) to allow
8321                                                             them to be
8322                                                             independently moved
8323                                                             according to the
8324                                                             following rules.
8325                                                           - s_waitcnt vmcnt(0)
8326                                                             must happen after
8327                                                             any preceding
8328                                                             global/generic load
8329                                                             atomic/atomicrmw
8330                                                             with an equal or
8331                                                             wider sync scope
8332                                                             and memory ordering
8333                                                             stronger than
8334                                                             unordered (this is
8335                                                             termed the
8336                                                             fence-paired-atomic).
8337                                                           - s_waitcnt lgkmcnt(0)
8338                                                             must happen after
8339                                                             any preceding
8340                                                             local/generic load
8341                                                             atomic/atomicrmw
8342                                                             with an equal or
8343                                                             wider sync scope
8344                                                             and memory ordering
8345                                                             stronger than
8346                                                             unordered (this is
8347                                                             termed the
8348                                                             fence-paired-atomic).
8349                                                           - Must happen before
8350                                                             the following buffer_invl2 and
8351                                                             buffer_wbinvl1_vol.
8352                                                           - Ensures that the
8353                                                             fence-paired-atomic
8354                                                             has completed
8355                                                             before invalidating
8356                                                             the
8357                                                             cache. Therefore
8358                                                             any following
8359                                                             locations read must
8360                                                             be no older than
8361                                                             the value read by
8362                                                             the
8363                                                             fence-paired-atomic.
8364
8365                                                         2. buffer_invl2;
8366                                                            buffer_wbinvl1_vol
8367
8368                                                           - Must happen before any
8369                                                             following global/generic
8370                                                             load/load
8371                                                             atomic/store/store
8372                                                             atomic/atomicrmw.
8373                                                           - Ensures that
8374                                                             following
8375                                                             loads will not see
8376                                                             stale L1 global data,
8377                                                             nor see stale L2 MTYPE
8378                                                             NC global data.
8379                                                             MTYPE RW and CC memory will
8380                                                             never be stale in L2 due to
8381                                                             the memory probes.
8382     **Release Atomic**
8383     ------------------------------------------------------------------------------------
8384     store atomic release      - singlethread - global   1. buffer/global/flat_store
8385                               - wavefront    - generic
8386     store atomic release      - singlethread - local    *If TgSplit execution mode,
8387                               - wavefront               local address space cannot
8388                                                         be used.*
8389
8390                                                         1. ds_store
8391     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8392                                              - generic
8393                                                           - Use lgkmcnt(0) if not
8394                                                             TgSplit execution mode
8395                                                             and vmcnt(0) if TgSplit
8396                                                             execution mode.
8397                                                           - If OpenCL, omit lgkmcnt(0).
8398                                                           - s_waitcnt vmcnt(0)
8399                                                             must happen after
8400                                                             any preceding
8401                                                             global/generic load/store/
8402                                                             load atomic/store atomic/
8403                                                             atomicrmw.
8404                                                           - s_waitcnt lgkmcnt(0)
8405                                                             must happen after
8406                                                             any preceding
8407                                                             local/generic
8408                                                             load/store/load
8409                                                             atomic/store
8410                                                             atomic/atomicrmw.
8411                                                           - Must happen before
8412                                                             the following
8413                                                             store.
8414                                                           - Ensures that all
8415                                                             memory operations
8416                                                             have
8417                                                             completed before
8418                                                             performing the
8419                                                             store that is being
8420                                                             released.
8421
8422                                                         2. buffer/global/flat_store
8423     store atomic release      - workgroup    - local    *If TgSplit execution mode,
8424                                                         local address space cannot
8425                                                         be used.*
8426
8427                                                         1. ds_store
8428     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8429                                              - generic     vmcnt(0)
8430
8431                                                           - If TgSplit execution mode,
8432                                                             omit lgkmcnt(0).
8433                                                           - If OpenCL and
8434                                                             address space is
8435                                                             not generic, omit
8436                                                             lgkmcnt(0).
8437                                                           - Could be split into
8438                                                             separate s_waitcnt
8439                                                             vmcnt(0) and
8440                                                             s_waitcnt
8441                                                             lgkmcnt(0) to allow
8442                                                             them to be
8443                                                             independently moved
8444                                                             according to the
8445                                                             following rules.
8446                                                           - s_waitcnt vmcnt(0)
8447                                                             must happen after
8448                                                             any preceding
8449                                                             global/generic
8450                                                             load/store/load
8451                                                             atomic/store
8452                                                             atomic/atomicrmw.
8453                                                           - s_waitcnt lgkmcnt(0)
8454                                                             must happen after
8455                                                             any preceding
8456                                                             local/generic
8457                                                             load/store/load
8458                                                             atomic/store
8459                                                             atomic/atomicrmw.
8460                                                           - Must happen before
8461                                                             the following
8462                                                             store.
8463                                                           - Ensures that all
8464                                                             memory operations
8465                                                             have
8466                                                             completed before
8467                                                             performing the
8468                                                             store that is being
8469                                                             released.
8470
8471                                                         2. buffer/global/flat_store
8472     store atomic release      - system       - global   1. buffer_wbl2
8473                                              - generic
8474                                                           - Must happen before
8475                                                             the following s_waitcnt.
8476                                                           - Performs L2 writeback to
8477                                                             ensure previous
8478                                                             global/generic
8479                                                             store/atomicrmw are
8480                                                             visible at system scope.
8481
8482                                                         2. s_waitcnt lgkmcnt(0) &
8483                                                            vmcnt(0)
8484
8485                                                           - If TgSplit execution mode,
8486                                                             omit lgkmcnt(0).
8487                                                           - If OpenCL and
8488                                                             address space is
8489                                                             not generic, omit
8490                                                             lgkmcnt(0).
8491                                                           - Could be split into
8492                                                             separate s_waitcnt
8493                                                             vmcnt(0) and
8494                                                             s_waitcnt
8495                                                             lgkmcnt(0) to allow
8496                                                             them to be
8497                                                             independently moved
8498                                                             according to the
8499                                                             following rules.
8500                                                           - s_waitcnt vmcnt(0)
8501                                                             must happen after any
8502                                                             preceding
8503                                                             global/generic
8504                                                             load/store/load
8505                                                             atomic/store
8506                                                             atomic/atomicrmw.
8507                                                           - s_waitcnt lgkmcnt(0)
8508                                                             must happen after any
8509                                                             preceding
8510                                                             local/generic
8511                                                             load/store/load
8512                                                             atomic/store
8513                                                             atomic/atomicrmw.
8514                                                           - Must happen before
8515                                                             the following
8516                                                             store.
8517                                                           - Ensures that all
8518                                                             memory operations
8519                                                             and the L2
8520                                                             writeback have
8521                                                             completed before
8522                                                             performing the
8523                                                             store that is being
8524                                                             released.
8525
8526                                                         3. buffer/global/flat_store
8527     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
8528                               - wavefront    - generic
8529     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
8530                               - wavefront               local address space cannot
8531                                                         be used.*
8532
8533                                                         1. ds_atomic
8534     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8535                                              - generic
8536                                                           - Use lgkmcnt(0) if not
8537                                                             TgSplit execution mode
8538                                                             and vmcnt(0) if TgSplit
8539                                                             execution mode.
8540                                                           - If OpenCL, omit
8541                                                             lgkmcnt(0).
8542                                                           - s_waitcnt vmcnt(0)
8543                                                             must happen after
8544                                                             any preceding
8545                                                             global/generic load/store/
8546                                                             load atomic/store atomic/
8547                                                             atomicrmw.
8548                                                           - s_waitcnt lgkmcnt(0)
8549                                                             must happen after
8550                                                             any preceding
8551                                                             local/generic
8552                                                             load/store/load
8553                                                             atomic/store
8554                                                             atomic/atomicrmw.
8555                                                           - Must happen before
8556                                                             the following
8557                                                             atomicrmw.
8558                                                           - Ensures that all
8559                                                             memory operations
8560                                                             have
8561                                                             completed before
8562                                                             performing the
8563                                                             atomicrmw that is
8564                                                             being released.
8565
8566                                                         2. buffer/global/flat_atomic
8567     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
8568                                                         local address space cannot
8569                                                         be used.*
8570
8571                                                         1. ds_atomic
8572     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8573                                              - generic     vmcnt(0)
8574
8575                                                           - If TgSplit execution mode,
8576                                                             omit lgkmcnt(0).
8577                                                           - If OpenCL, omit
8578                                                             lgkmcnt(0).
8579                                                           - Could be split into
8580                                                             separate s_waitcnt
8581                                                             vmcnt(0) and
8582                                                             s_waitcnt
8583                                                             lgkmcnt(0) to allow
8584                                                             them to be
8585                                                             independently moved
8586                                                             according to the
8587                                                             following rules.
8588                                                           - s_waitcnt vmcnt(0)
8589                                                             must happen after
8590                                                             any preceding
8591                                                             global/generic
8592                                                             load/store/load
8593                                                             atomic/store
8594                                                             atomic/atomicrmw.
8595                                                           - s_waitcnt lgkmcnt(0)
8596                                                             must happen after
8597                                                             any preceding
8598                                                             local/generic
8599                                                             load/store/load
8600                                                             atomic/store
8601                                                             atomic/atomicrmw.
8602                                                           - Must happen before
8603                                                             the following
8604                                                             atomicrmw.
8605                                                           - Ensures that all
8606                                                             memory operations
8607                                                             to global and local
8608                                                             have completed
8609                                                             before performing
8610                                                             the atomicrmw that
8611                                                             is being released.
8612
8613                                                         2. buffer/global/flat_atomic
8614     atomicrmw    release      - system       - global   1. buffer_wbl2
8615                                              - generic
8616                                                           - Must happen before
8617                                                             the following s_waitcnt.
8618                                                           - Performs L2 writeback to
8619                                                             ensure previous
8620                                                             global/generic
8621                                                             store/atomicrmw are
8622                                                             visible at system scope.
8623
8624                                                         2. s_waitcnt lgkmcnt(0) &
8625                                                            vmcnt(0)
8626
8627                                                           - If TgSplit execution mode,
8628                                                             omit lgkmcnt(0).
8629                                                           - If OpenCL, omit
8630                                                             lgkmcnt(0).
8631                                                           - Could be split into
8632                                                             separate s_waitcnt
8633                                                             vmcnt(0) and
8634                                                             s_waitcnt
8635                                                             lgkmcnt(0) to allow
8636                                                             them to be
8637                                                             independently moved
8638                                                             according to the
8639                                                             following rules.
8640                                                           - s_waitcnt vmcnt(0)
8641                                                             must happen after
8642                                                             any preceding
8643                                                             global/generic
8644                                                             load/store/load
8645                                                             atomic/store
8646                                                             atomic/atomicrmw.
8647                                                           - s_waitcnt lgkmcnt(0)
8648                                                             must happen after
8649                                                             any preceding
8650                                                             local/generic
8651                                                             load/store/load
8652                                                             atomic/store
8653                                                             atomic/atomicrmw.
8654                                                           - Must happen before
8655                                                             the following
8656                                                             atomicrmw.
8657                                                           - Ensures that all
8658                                                             memory operations
8659                                                             and the L2
8660                                                             writeback have
8661                                                             completed before
8662                                                             performing the
8663                                                             atomicrmw that is
8664                                                             being released.
8665
8666                                                         3. buffer/global/flat_atomic
8667     fence        release      - singlethread *none*     *none*
8668                               - wavefront
8669     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8670
8671                                                           - Use lgkmcnt(0) if not
8672                                                             TgSplit execution mode
8673                                                             and vmcnt(0) if TgSplit
8674                                                             execution mode.
8675                                                           - If OpenCL and
8676                                                             address space is
8677                                                             not generic, omit
8678                                                             lgkmcnt(0).
8679                                                           - If OpenCL and
8680                                                             address space is
8681                                                             local, omit
8682                                                             vmcnt(0).
8683                                                           - See :ref:`amdgpu-fence-as` for
8684                                                             more details on fencing specific
8685                                                             address spaces.
8686                                                           - s_waitcnt vmcnt(0)
8687                                                             must happen after
8688                                                             any preceding
8689                                                             global/generic
8690                                                             load/store/
8691                                                             load atomic/store atomic/
8692                                                             atomicrmw.
8693                                                           - s_waitcnt lgkmcnt(0)
8694                                                             must happen after
8695                                                             any preceding
8696                                                             local/generic
8697                                                             load/load
8698                                                             atomic/store/store
8699                                                             atomic/atomicrmw.
8700                                                           - Must happen before
8701                                                             any following store
8702                                                             atomic/atomicrmw
8703                                                             with an equal or
8704                                                             wider sync scope
8705                                                             and memory ordering
8706                                                             stronger than
8707                                                             unordered (this is
8708                                                             termed the
8709                                                             fence-paired-atomic).
8710                                                           - Ensures that all
8711                                                             memory operations
8712                                                             have
8713                                                             completed before
8714                                                             performing the
8715                                                             following
8716                                                             fence-paired-atomic.
8717
8718     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8719                                                            vmcnt(0)
8720
8721                                                           - If TgSplit execution mode,
8722                                                             omit lgkmcnt(0).
8723                                                           - If OpenCL and
8724                                                             address space is
8725                                                             not generic, omit
8726                                                             lgkmcnt(0).
8727                                                           - If OpenCL and
8728                                                             address space is
8729                                                             local, omit
8730                                                             vmcnt(0).
8731                                                           - See :ref:`amdgpu-fence-as` for
8732                                                             more details on fencing specific
8733                                                             address spaces.
8734                                                           - Could be split into
8735                                                             separate s_waitcnt
8736                                                             vmcnt(0) and
8737                                                             s_waitcnt
8738                                                             lgkmcnt(0) to allow
8739                                                             them to be
8740                                                             independently moved
8741                                                             according to the
8742                                                             following rules.
8743                                                           - s_waitcnt vmcnt(0)
8744                                                             must happen after
8745                                                             any preceding
8746                                                             global/generic
8747                                                             load/store/load
8748                                                             atomic/store
8749                                                             atomic/atomicrmw.
8750                                                           - s_waitcnt lgkmcnt(0)
8751                                                             must happen after
8752                                                             any preceding
8753                                                             local/generic
8754                                                             load/store/load
8755                                                             atomic/store
8756                                                             atomic/atomicrmw.
8757                                                           - Must happen before
8758                                                             any following store
8759                                                             atomic/atomicrmw
8760                                                             with an equal or
8761                                                             wider sync scope
8762                                                             and memory ordering
8763                                                             stronger than
8764                                                             unordered (this is
8765                                                             termed the
8766                                                             fence-paired-atomic).
8767                                                           - Ensures that all
8768                                                             memory operations
8769                                                             have
8770                                                             completed before
8771                                                             performing the
8772                                                             following
8773                                                             fence-paired-atomic.
8774
8775     fence        release      - system       *none*     1. buffer_wbl2
8776
8777                                                           - If OpenCL and
8778                                                             address space is
8779                                                             local, omit.
8780                                                           - Must happen before
8781                                                             following s_waitcnt.
8782                                                           - Performs L2 writeback to
8783                                                             ensure previous
8784                                                             global/generic
8785                                                             store/atomicrmw are
8786                                                             visible at system scope.
8787
8788                                                         2. s_waitcnt lgkmcnt(0) &
8789                                                            vmcnt(0)
8790
8791                                                           - If TgSplit execution mode,
8792                                                             omit lgkmcnt(0).
8793                                                           - If OpenCL and
8794                                                             address space is
8795                                                             not generic, omit
8796                                                             lgkmcnt(0).
8797                                                           - If OpenCL and
8798                                                             address space is
8799                                                             local, omit
8800                                                             vmcnt(0).
8801                                                           - See :ref:`amdgpu-fence-as` for
8802                                                             more details on fencing specific
8803                                                             address spaces.
8804                                                           - Could be split into
8805                                                             separate s_waitcnt
8806                                                             vmcnt(0) and
8807                                                             s_waitcnt
8808                                                             lgkmcnt(0) to allow
8809                                                             them to be
8810                                                             independently moved
8811                                                             according to the
8812                                                             following rules.
8813                                                           - s_waitcnt vmcnt(0)
8814                                                             must happen after
8815                                                             any preceding
8816                                                             global/generic
8817                                                             load/store/load
8818                                                             atomic/store
8819                                                             atomic/atomicrmw.
8820                                                           - s_waitcnt lgkmcnt(0)
8821                                                             must happen after
8822                                                             any preceding
8823                                                             local/generic
8824                                                             load/store/load
8825                                                             atomic/store
8826                                                             atomic/atomicrmw.
8827                                                           - Must happen before
8828                                                             any following store
8829                                                             atomic/atomicrmw
8830                                                             with an equal or
8831                                                             wider sync scope
8832                                                             and memory ordering
8833                                                             stronger than
8834                                                             unordered (this is
8835                                                             termed the
8836                                                             fence-paired-atomic).
8837                                                           - Ensures that all
8838                                                             memory operations
8839                                                             have
8840                                                             completed before
8841                                                             performing the
8842                                                             following
8843                                                             fence-paired-atomic.
8844
8845     **Acquire-Release Atomic**
8846     ------------------------------------------------------------------------------------
8847     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
8848                               - wavefront    - generic
8849     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
8850                               - wavefront               local address space cannot
8851                                                         be used.*
8852
8853                                                         1. ds_atomic
8854     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8855
8856                                                           - Use lgkmcnt(0) if not
8857                                                             TgSplit execution mode
8858                                                             and vmcnt(0) if TgSplit
8859                                                             execution mode.
8860                                                           - If OpenCL, omit
8861                                                             lgkmcnt(0).
8862                                                           - Must happen after
8863                                                             any preceding
8864                                                             local/generic
8865                                                             load/store/load
8866                                                             atomic/store
8867                                                             atomic/atomicrmw.
8868                                                           - s_waitcnt vmcnt(0)
8869                                                             must happen after
8870                                                             any preceding
8871                                                             global/generic load/store/
8872                                                             load atomic/store atomic/
8873                                                             atomicrmw.
8874                                                           - s_waitcnt lgkmcnt(0)
8875                                                             must happen after
8876                                                             any preceding
8877                                                             local/generic
8878                                                             load/store/load
8879                                                             atomic/store
8880                                                             atomic/atomicrmw.
8881                                                           - Must happen before
8882                                                             the following
8883                                                             atomicrmw.
8884                                                           - Ensures that all
8885                                                             memory operations
8886                                                             have
8887                                                             completed before
8888                                                             performing the
8889                                                             atomicrmw that is
8890                                                             being released.
8891
8892                                                         2. buffer/global_atomic
8893                                                         3. s_waitcnt vmcnt(0)
8894
8895                                                           - If not TgSplit execution
8896                                                             mode, omit.
8897                                                           - Must happen before
8898                                                             the following
8899                                                             buffer_wbinvl1_vol.
8900                                                           - Ensures any
8901                                                             following global
8902                                                             data read is no
8903                                                             older than the
8904                                                             atomicrmw value
8905                                                             being acquired.
8906
8907                                                         4. buffer_wbinvl1_vol
8908
8909                                                           - If not TgSplit execution
8910                                                             mode, omit.
8911                                                           - Ensures that
8912                                                             following
8913                                                             loads will not see
8914                                                             stale data.
8915
8916     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
8917                                                         local address space cannot
8918                                                         be used.*
8919
8920                                                         1. ds_atomic
8921                                                         2. s_waitcnt lgkmcnt(0)
8922
8923                                                           - If OpenCL, omit.
8924                                                           - Must happen before
8925                                                             any following
8926                                                             global/generic
8927                                                             load/load
8928                                                             atomic/store/store
8929                                                             atomic/atomicrmw.
8930                                                           - Ensures any
8931                                                             following global
8932                                                             data read is no
8933                                                             older than the local
8934                                                             atomicrmw value being
8935                                                             acquired.
8936
8937     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
8938
8939                                                           - Use lgkmcnt(0) if not
8940                                                             TgSplit execution mode
8941                                                             and vmcnt(0) if TgSplit
8942                                                             execution mode.
8943                                                           - If OpenCL, omit
8944                                                             lgkmcnt(0).
8945                                                           - s_waitcnt vmcnt(0)
8946                                                             must happen after
8947                                                             any preceding
8948                                                             global/generic load/store/
8949                                                             load atomic/store atomic/
8950                                                             atomicrmw.
8951                                                           - s_waitcnt lgkmcnt(0)
8952                                                             must happen after
8953                                                             any preceding
8954                                                             local/generic
8955                                                             load/store/load
8956                                                             atomic/store
8957                                                             atomic/atomicrmw.
8958                                                           - Must happen before
8959                                                             the following
8960                                                             atomicrmw.
8961                                                           - Ensures that all
8962                                                             memory operations
8963                                                             have
8964                                                             completed before
8965                                                             performing the
8966                                                             atomicrmw that is
8967                                                             being released.
8968
8969                                                         2. flat_atomic
8970                                                         3. s_waitcnt lgkmcnt(0) &
8971                                                            vmcnt(0)
8972
8973                                                           - If not TgSplit execution
8974                                                             mode, omit vmcnt(0).
8975                                                           - If OpenCL, omit
8976                                                             lgkmcnt(0).
8977                                                           - Must happen before
8978                                                             the following
8979                                                             buffer_wbinvl1_vol and
8980                                                             any following
8981                                                             global/generic
8982                                                             load/load
8983                                                             atomic/store/store
8984                                                             atomic/atomicrmw.
8985                                                           - Ensures any
8986                                                             following global
8987                                                             data read is no
8988                                                             older than a local
8989                                                             atomicrmw value being
8990                                                             acquired.
8991
8992                                                         4. buffer_wbinvl1_vol
8993
8994                                                           - If not TgSplit execution
8995                                                             mode, omit.
8996                                                           - Ensures that
8997                                                             following
8998                                                             loads will not see
8999                                                             stale data.
9000
9001     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9002                                                            vmcnt(0)
9003
9004                                                           - If TgSplit execution mode,
9005                                                             omit lgkmcnt(0).
9006                                                           - If OpenCL, omit
9007                                                             lgkmcnt(0).
9008                                                           - Could be split into
9009                                                             separate s_waitcnt
9010                                                             vmcnt(0) and
9011                                                             s_waitcnt
9012                                                             lgkmcnt(0) to allow
9013                                                             them to be
9014                                                             independently moved
9015                                                             according to the
9016                                                             following rules.
9017                                                           - s_waitcnt vmcnt(0)
9018                                                             must happen after
9019                                                             any preceding
9020                                                             global/generic
9021                                                             load/store/load
9022                                                             atomic/store
9023                                                             atomic/atomicrmw.
9024                                                           - s_waitcnt lgkmcnt(0)
9025                                                             must happen after
9026                                                             any preceding
9027                                                             local/generic
9028                                                             load/store/load
9029                                                             atomic/store
9030                                                             atomic/atomicrmw.
9031                                                           - Must happen before
9032                                                             the following
9033                                                             atomicrmw.
9034                                                           - Ensures that all
9035                                                             memory operations
9036                                                             to global have
9037                                                             completed before
9038                                                             performing the
9039                                                             atomicrmw that is
9040                                                             being released.
9041
9042                                                         2. buffer/global_atomic
9043                                                         3. s_waitcnt vmcnt(0)
9044
9045                                                           - Must happen before
9046                                                             following
9047                                                             buffer_wbinvl1_vol.
9048                                                           - Ensures the
9049                                                             atomicrmw has
9050                                                             completed before
9051                                                             invalidating the
9052                                                             cache.
9053
9054                                                         4. buffer_wbinvl1_vol
9055
9056                                                           - Must happen before
9057                                                             any following
9058                                                             global/generic
9059                                                             load/load
9060                                                             atomic/atomicrmw.
9061                                                           - Ensures that
9062                                                             following loads
9063                                                             will not see stale
9064                                                             global data.
9065
9066     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
9067
9068                                                           - Must happen before
9069                                                             following s_waitcnt.
9070                                                           - Performs L2 writeback to
9071                                                             ensure previous
9072                                                             global/generic
9073                                                             store/atomicrmw are
9074                                                             visible at system scope.
9075
9076                                                         2. s_waitcnt lgkmcnt(0) &
9077                                                            vmcnt(0)
9078
9079                                                           - If TgSplit execution mode,
9080                                                             omit lgkmcnt(0).
9081                                                           - If OpenCL, omit
9082                                                             lgkmcnt(0).
9083                                                           - Could be split into
9084                                                             separate s_waitcnt
9085                                                             vmcnt(0) and
9086                                                             s_waitcnt
9087                                                             lgkmcnt(0) to allow
9088                                                             them to be
9089                                                             independently moved
9090                                                             according to the
9091                                                             following rules.
9092                                                           - s_waitcnt vmcnt(0)
9093                                                             must happen after
9094                                                             any preceding
9095                                                             global/generic
9096                                                             load/store/load
9097                                                             atomic/store
9098                                                             atomic/atomicrmw.
9099                                                           - s_waitcnt lgkmcnt(0)
9100                                                             must happen after
9101                                                             any preceding
9102                                                             local/generic
9103                                                             load/store/load
9104                                                             atomic/store
9105                                                             atomic/atomicrmw.
9106                                                           - Must happen before
9107                                                             the following
9108                                                             atomicrmw.
9109                                                           - Ensures that all
9110                                                             memory operations
9111                                                             to global and L2 writeback
9112                                                             have completed before
9113                                                             performing the
9114                                                             atomicrmw that is
9115                                                             being released.
9116
9117                                                         3. buffer/global_atomic
9118                                                         4. s_waitcnt vmcnt(0)
9119
9120                                                           - Must happen before
9121                                                             following buffer_invl2 and
9122                                                             buffer_wbinvl1_vol.
9123                                                           - Ensures the
9124                                                             atomicrmw has
9125                                                             completed before
9126                                                             invalidating the
9127                                                             caches.
9128
9129                                                         5. buffer_invl2;
9130                                                            buffer_wbinvl1_vol
9131
9132                                                           - Must happen before
9133                                                             any following
9134                                                             global/generic
9135                                                             load/load
9136                                                             atomic/atomicrmw.
9137                                                           - Ensures that
9138                                                             following
9139                                                             loads will not see
9140                                                             stale L1 global data,
9141                                                             nor see stale L2 MTYPE
9142                                                             NC global data.
9143                                                             MTYPE RW and CC memory will
9144                                                             never be stale in L2 due to
9145                                                             the memory probes.
9146
9147     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
9148                                                            vmcnt(0)
9149
9150                                                           - If TgSplit execution mode,
9151                                                             omit lgkmcnt(0).
9152                                                           - If OpenCL, omit
9153                                                             lgkmcnt(0).
9154                                                           - Could be split into
9155                                                             separate s_waitcnt
9156                                                             vmcnt(0) and
9157                                                             s_waitcnt
9158                                                             lgkmcnt(0) to allow
9159                                                             them to be
9160                                                             independently moved
9161                                                             according to the
9162                                                             following rules.
9163                                                           - s_waitcnt vmcnt(0)
9164                                                             must happen after
9165                                                             any preceding
9166                                                             global/generic
9167                                                             load/store/load
9168                                                             atomic/store
9169                                                             atomic/atomicrmw.
9170                                                           - s_waitcnt lgkmcnt(0)
9171                                                             must happen after
9172                                                             any preceding
9173                                                             local/generic
9174                                                             load/store/load
9175                                                             atomic/store
9176                                                             atomic/atomicrmw.
9177                                                           - Must happen before
9178                                                             the following
9179                                                             atomicrmw.
9180                                                           - Ensures that all
9181                                                             memory operations
9182                                                             to global have
9183                                                             completed before
9184                                                             performing the
9185                                                             atomicrmw that is
9186                                                             being released.
9187
9188                                                         2. flat_atomic
9189                                                         3. s_waitcnt vmcnt(0) &
9190                                                            lgkmcnt(0)
9191
9192                                                           - If TgSplit execution mode,
9193                                                             omit lgkmcnt(0).
9194                                                           - If OpenCL, omit
9195                                                             lgkmcnt(0).
9196                                                           - Must happen before
9197                                                             following
9198                                                             buffer_wbinvl1_vol.
9199                                                           - Ensures the
9200                                                             atomicrmw has
9201                                                             completed before
9202                                                             invalidating the
9203                                                             cache.
9204
9205                                                         4. buffer_wbinvl1_vol
9206
9207                                                           - Must happen before
9208                                                             any following
9209                                                             global/generic
9210                                                             load/load
9211                                                             atomic/atomicrmw.
9212                                                           - Ensures that
9213                                                             following loads
9214                                                             will not see stale
9215                                                             global data.
9216
9217     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
9218
9219                                                           - Must happen before
9220                                                             following s_waitcnt.
9221                                                           - Performs L2 writeback to
9222                                                             ensure previous
9223                                                             global/generic
9224                                                             store/atomicrmw are
9225                                                             visible at system scope.
9226
9227                                                         2. s_waitcnt lgkmcnt(0) &
9228                                                            vmcnt(0)
9229
9230                                                           - If TgSplit execution mode,
9231                                                             omit lgkmcnt(0).
9232                                                           - If OpenCL, omit
9233                                                             lgkmcnt(0).
9234                                                           - Could be split into
9235                                                             separate s_waitcnt
9236                                                             vmcnt(0) and
9237                                                             s_waitcnt
9238                                                             lgkmcnt(0) to allow
9239                                                             them to be
9240                                                             independently moved
9241                                                             according to the
9242                                                             following rules.
9243                                                           - s_waitcnt vmcnt(0)
9244                                                             must happen after
9245                                                             any preceding
9246                                                             global/generic
9247                                                             load/store/load
9248                                                             atomic/store
9249                                                             atomic/atomicrmw.
9250                                                           - s_waitcnt lgkmcnt(0)
9251                                                             must happen after
9252                                                             any preceding
9253                                                             local/generic
9254                                                             load/store/load
9255                                                             atomic/store
9256                                                             atomic/atomicrmw.
9257                                                           - Must happen before
9258                                                             the following
9259                                                             atomicrmw.
9260                                                           - Ensures that all
9261                                                             memory operations
9262                                                             to global and L2 writeback
9263                                                             have completed before
9264                                                             performing the
9265                                                             atomicrmw that is
9266                                                             being released.
9267
9268                                                         3. flat_atomic
9269                                                         4. s_waitcnt vmcnt(0) &
9270                                                            lgkmcnt(0)
9271
9272                                                           - If TgSplit execution mode,
9273                                                             omit lgkmcnt(0).
9274                                                           - If OpenCL, omit
9275                                                             lgkmcnt(0).
9276                                                           - Must happen before
9277                                                             following buffer_invl2 and
9278                                                             buffer_wbinvl1_vol.
9279                                                           - Ensures the
9280                                                             atomicrmw has
9281                                                             completed before
9282                                                             invalidating the
9283                                                             caches.
9284
9285                                                         5. buffer_invl2;
9286                                                            buffer_wbinvl1_vol
9287
9288                                                           - Must happen before
9289                                                             any following
9290                                                             global/generic
9291                                                             load/load
9292                                                             atomic/atomicrmw.
9293                                                           - Ensures that
9294                                                             following
9295                                                             loads will not see
9296                                                             stale L1 global data,
9297                                                             nor see stale L2 MTYPE
9298                                                             NC global data.
9299                                                             MTYPE RW and CC memory will
9300                                                             never be stale in L2 due to
9301                                                             the memory probes.
9302
9303     fence        acq_rel      - singlethread *none*     *none*
9304                               - wavefront
9305     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9306
9307                                                           - Use lgkmcnt(0) if not
9308                                                             TgSplit execution mode
9309                                                             and vmcnt(0) if TgSplit
9310                                                             execution mode.
9311                                                           - If OpenCL and
9312                                                             address space is
9313                                                             not generic, omit
9314                                                             lgkmcnt(0).
9315                                                           - If OpenCL and
9316                                                             address space is
9317                                                             local, omit
9318                                                             vmcnt(0).
9319                                                           - However,
9320                                                             since LLVM
9321                                                             currently has no
9322                                                             address space on
9323                                                             the fence, it must
9324                                                             conservatively
9325                                                             always be generated
9326                                                             (see comment for
9327                                                             previous fence).
9328                                                           - s_waitcnt vmcnt(0)
9329                                                             must happen after
9330                                                             any preceding
9331                                                             global/generic
9332                                                             load/store/
9333                                                             load atomic/store atomic/
9334                                                             atomicrmw.
9335                                                           - s_waitcnt lgkmcnt(0)
9336                                                             must happen after
9337                                                             any preceding
9338                                                             local/generic
9339                                                             load/load
9340                                                             atomic/store/store
9341                                                             atomic/atomicrmw.
9342                                                           - Must happen before
9343                                                             any following
9344                                                             global/generic
9345                                                             load/load
9346                                                             atomic/store/store
9347                                                             atomic/atomicrmw.
9348                                                           - Ensures that all
9349                                                             memory operations
9350                                                             have
9351                                                             completed before
9352                                                             performing any
9353                                                             following global
9354                                                             memory operations.
9355                                                           - Ensures that the
9356                                                             preceding
9357                                                             local/generic load
9358                                                             atomic/atomicrmw
9359                                                             with an equal or
9360                                                             wider sync scope
9361                                                             and memory ordering
9362                                                             stronger than
9363                                                             unordered (this is
9364                                                             termed the
9365                                                             acquire-fence-paired-atomic)
9366                                                             has completed
9367                                                             before following
9368                                                             global memory
9369                                                             operations. This
9370                                                             satisfies the
9371                                                             requirements of
9372                                                             acquire.
9373                                                           - Ensures that all
9374                                                             previous memory
9375                                                             operations have
9376                                                             completed before a
9377                                                             following
9378                                                             local/generic store
9379                                                             atomic/atomicrmw
9380                                                             with an equal or
9381                                                             wider sync scope
9382                                                             and memory ordering
9383                                                             stronger than
9384                                                             unordered (this is
9385                                                             termed the
9386                                                             release-fence-paired-atomic).
9387                                                             This satisfies the
9388                                                             requirements of
9389                                                             release.
9390                                                           - Must happen before
9391                                                             the following
9392                                                             buffer_wbinvl1_vol.
9393                                                           - Ensures that the
9394                                                             acquire-fence-paired-atomic
9395                                                             has completed
9396                                                             before invalidating
9397                                                             the
9398                                                             cache. Therefore
9399                                                             any following
9400                                                             locations read must
9401                                                             be no older than
9402                                                             the value read by
9403                                                             the
9404                                                             acquire-fence-paired-atomic.
9405
9406                                                         2. buffer_wbinvl1_vol
9407
9408                                                           - If not TgSplit execution
9409                                                             mode, omit.
9410                                                           - Ensures that
9411                                                             following
9412                                                             loads will not see
9413                                                             stale data.
9414
9415     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9416                                                            vmcnt(0)
9417
9418                                                           - If TgSplit execution mode,
9419                                                             omit lgkmcnt(0).
9420                                                           - If OpenCL and
9421                                                             address space is
9422                                                             not generic, omit
9423                                                             lgkmcnt(0).
9424                                                           - See :ref:`amdgpu-fence-as` for
9425                                                             more details on fencing specific
9426                                                             address spaces.
9427                                                           - Could be split into
9428                                                             separate s_waitcnt
9429                                                             vmcnt(0) and
9430                                                             s_waitcnt
9431                                                             lgkmcnt(0) to allow
9432                                                             them to be
9433                                                             independently moved
9434                                                             according to the
9435                                                             following rules.
9436                                                           - s_waitcnt vmcnt(0)
9437                                                             must happen after
9438                                                             any preceding
9439                                                             global/generic
9440                                                             load/store/load
9441                                                             atomic/store
9442                                                             atomic/atomicrmw.
9443                                                           - s_waitcnt lgkmcnt(0)
9444                                                             must happen after
9445                                                             any preceding
9446                                                             local/generic
9447                                                             load/store/load
9448                                                             atomic/store
9449                                                             atomic/atomicrmw.
9450                                                           - Must happen before
9451                                                             the following
9452                                                             buffer_wbinvl1_vol.
9453                                                           - Ensures that the
9454                                                             preceding
9455                                                             global/local/generic
9456                                                             load
9457                                                             atomic/atomicrmw
9458                                                             with an equal or
9459                                                             wider sync scope
9460                                                             and memory ordering
9461                                                             stronger than
9462                                                             unordered (this is
9463                                                             termed the
9464                                                             acquire-fence-paired-atomic)
9465                                                             has completed
9466                                                             before invalidating
9467                                                             the cache. This
9468                                                             satisfies the
9469                                                             requirements of
9470                                                             acquire.
9471                                                           - Ensures that all
9472                                                             previous memory
9473                                                             operations have
9474                                                             completed before a
9475                                                             following
9476                                                             global/local/generic
9477                                                             store
9478                                                             atomic/atomicrmw
9479                                                             with an equal or
9480                                                             wider sync scope
9481                                                             and memory ordering
9482                                                             stronger than
9483                                                             unordered (this is
9484                                                             termed the
9485                                                             release-fence-paired-atomic).
9486                                                             This satisfies the
9487                                                             requirements of
9488                                                             release.
9489
9490                                                         2. buffer_wbinvl1_vol
9491
9492                                                           - Must happen before
9493                                                             any following
9494                                                             global/generic
9495                                                             load/load
9496                                                             atomic/store/store
9497                                                             atomic/atomicrmw.
9498                                                           - Ensures that
9499                                                             following loads
9500                                                             will not see stale
9501                                                             global data. This
9502                                                             satisfies the
9503                                                             requirements of
9504                                                             acquire.
9505
9506     fence        acq_rel      - system       *none*     1. buffer_wbl2
9507
9508                                                           - If OpenCL and
9509                                                             address space is
9510                                                             local, omit.
9511                                                           - Must happen before
9512                                                             following s_waitcnt.
9513                                                           - Performs L2 writeback to
9514                                                             ensure previous
9515                                                             global/generic
9516                                                             store/atomicrmw are
9517                                                             visible at system scope.
9518
9519                                                         2. s_waitcnt lgkmcnt(0) &
9520                                                            vmcnt(0)
9521
9522                                                           - If TgSplit execution mode,
9523                                                             omit lgkmcnt(0).
9524                                                           - If OpenCL and
9525                                                             address space is
9526                                                             not generic, omit
9527                                                             lgkmcnt(0).
9528                                                           - See :ref:`amdgpu-fence-as` for
9529                                                             more details on fencing specific
9530                                                             address spaces.
9531                                                           - Could be split into
9532                                                             separate s_waitcnt
9533                                                             vmcnt(0) and
9534                                                             s_waitcnt
9535                                                             lgkmcnt(0) to allow
9536                                                             them to be
9537                                                             independently moved
9538                                                             according to the
9539                                                             following rules.
9540                                                           - s_waitcnt vmcnt(0)
9541                                                             must happen after
9542                                                             any preceding
9543                                                             global/generic
9544                                                             load/store/load
9545                                                             atomic/store
9546                                                             atomic/atomicrmw.
9547                                                           - s_waitcnt lgkmcnt(0)
9548                                                             must happen after
9549                                                             any preceding
9550                                                             local/generic
9551                                                             load/store/load
9552                                                             atomic/store
9553                                                             atomic/atomicrmw.
9554                                                           - Must happen before
9555                                                             the following buffer_invl2 and
9556                                                             buffer_wbinvl1_vol.
9557                                                           - Ensures that the
9558                                                             preceding
9559                                                             global/local/generic
9560                                                             load
9561                                                             atomic/atomicrmw
9562                                                             with an equal or
9563                                                             wider sync scope
9564                                                             and memory ordering
9565                                                             stronger than
9566                                                             unordered (this is
9567                                                             termed the
9568                                                             acquire-fence-paired-atomic)
9569                                                             has completed
9570                                                             before invalidating
9571                                                             the cache. This
9572                                                             satisfies the
9573                                                             requirements of
9574                                                             acquire.
9575                                                           - Ensures that all
9576                                                             previous memory
9577                                                             operations have
9578                                                             completed before a
9579                                                             following
9580                                                             global/local/generic
9581                                                             store
9582                                                             atomic/atomicrmw
9583                                                             with an equal or
9584                                                             wider sync scope
9585                                                             and memory ordering
9586                                                             stronger than
9587                                                             unordered (this is
9588                                                             termed the
9589                                                             release-fence-paired-atomic).
9590                                                             This satisfies the
9591                                                             requirements of
9592                                                             release.
9593
9594                                                         3.  buffer_invl2;
9595                                                             buffer_wbinvl1_vol
9596
9597                                                           - Must happen before
9598                                                             any following
9599                                                             global/generic
9600                                                             load/load
9601                                                             atomic/store/store
9602                                                             atomic/atomicrmw.
9603                                                           - Ensures that
9604                                                             following
9605                                                             loads will not see
9606                                                             stale L1 global data,
9607                                                             nor see stale L2 MTYPE
9608                                                             NC global data.
9609                                                             MTYPE RW and CC memory will
9610                                                             never be stale in L2 due to
9611                                                             the memory probes.
9612
9613     **Sequential Consistent Atomic**
9614     ------------------------------------------------------------------------------------
9615     load atomic  seq_cst      - singlethread - global   *Same as corresponding
9616                               - wavefront    - local    load atomic acquire,
9617                                              - generic  except must generate
9618                                                         all instructions even
9619                                                         for OpenCL.*
9620     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9621                                              - generic
9622                                                           - Use lgkmcnt(0) if not
9623                                                             TgSplit execution mode
9624                                                             and vmcnt(0) if TgSplit
9625                                                             execution mode.
9626                                                           - s_waitcnt lgkmcnt(0) must
9627                                                             happen after
9628                                                             preceding
9629                                                             local/generic load
9630                                                             atomic/store
9631                                                             atomic/atomicrmw
9632                                                             with memory
9633                                                             ordering of seq_cst
9634                                                             and with equal or
9635                                                             wider sync scope.
9636                                                             (Note that seq_cst
9637                                                             fences have their
9638                                                             own s_waitcnt
9639                                                             lgkmcnt(0) and so do
9640                                                             not need to be
9641                                                             considered.)
9642                                                           - s_waitcnt vmcnt(0)
9643                                                             must happen after
9644                                                             preceding
9645                                                             global/generic load
9646                                                             atomic/store
9647                                                             atomic/atomicrmw
9648                                                             with memory
9649                                                             ordering of seq_cst
9650                                                             and with equal or
9651                                                             wider sync scope.
9652                                                             (Note that seq_cst
9653                                                             fences have their
9654                                                             own s_waitcnt
9655                                                             vmcnt(0) and so do
9656                                                             not need to be
9657                                                             considered.)
9658                                                           - Ensures any
9659                                                             preceding
9660                                                             sequentially
9661                                                             consistent global/local
9662                                                             memory instructions
9663                                                             have completed
9664                                                             before executing
9665                                                             this sequentially
9666                                                             consistent
9667                                                             instruction. This
9668                                                             prevents reordering
9669                                                             a seq_cst store
9670                                                             followed by a
9671                                                             seq_cst load. (Note
9672                                                             that seq_cst is
9673                                                             stronger than
9674                                                             acquire/release as
9675                                                             the reordering of
9676                                                             load acquire
9677                                                             followed by a store
9678                                                             release is
9679                                                             prevented by the
9680                                                             s_waitcnt of
9681                                                             the release, but
9682                                                             there is nothing
9683                                                             preventing a store
9684                                                             release followed by
9685                                                             load acquire from
9686                                                             completing out of
9687                                                             order. The s_waitcnt
9688                                                             could be placed after
9689                                                             the seq_cst store or
9690                                                             before the seq_cst
9691                                                             load. We choose the load to
9692                                                             make the s_waitcnt be
9693                                                             as late as possible
9694                                                             so that the store
9695                                                             may have already
9696                                                             completed.)
9697
9698                                                         2. *Following
9699                                                            instructions same as
9700                                                            corresponding load
9701                                                            atomic acquire,
9702                                                            except must generate
9703                                                            all instructions even
9704                                                            for OpenCL.*
9705     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
9706                                                         local address space cannot
9707                                                         be used.*
9708
9709                                                         *Same as corresponding
9710                                                         load atomic acquire,
9711                                                         except must generate
9712                                                         all instructions even
9713                                                         for OpenCL.*
9714
9715     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9716                               - system       - generic     vmcnt(0)
9717
9718                                                           - If TgSplit execution mode,
9719                                                             omit lgkmcnt(0).
9720                                                           - Could be split into
9721                                                             separate s_waitcnt
9722                                                             vmcnt(0)
9723                                                             and s_waitcnt
9724                                                             lgkmcnt(0) to allow
9725                                                             them to be
9726                                                             independently moved
9727                                                             according to the
9728                                                             following rules.
9729                                                           - s_waitcnt lgkmcnt(0)
9730                                                             must happen after
9731                                                             preceding
9732                                                             local/generic load
9733                                                             atomic/store
9734                                                             atomic/atomicrmw
9735                                                             with memory
9736                                                             ordering of seq_cst
9737                                                             and with equal or
9738                                                             wider sync scope.
9739                                                             (Note that seq_cst
9740                                                             fences have their
9741                                                             own s_waitcnt
9742                                                             lgkmcnt(0) and so do
9743                                                             not need to be
9744                                                             considered.)
9745                                                           - s_waitcnt vmcnt(0)
9746                                                             must happen after
9747                                                             preceding
9748                                                             global/generic load
9749                                                             atomic/store
9750                                                             atomic/atomicrmw
9751                                                             with memory
9752                                                             ordering of seq_cst
9753                                                             and with equal or
9754                                                             wider sync scope.
9755                                                             (Note that seq_cst
9756                                                             fences have their
9757                                                             own s_waitcnt
9758                                                             vmcnt(0) and so do
9759                                                             not need to be
9760                                                             considered.)
9761                                                           - Ensures any
9762                                                             preceding
9763                                                             sequentially
9764                                                             consistent global
9765                                                             memory instructions
9766                                                             have completed
9767                                                             before executing
9768                                                             this sequentially
9769                                                             consistent
9770                                                             instruction. This
9771                                                             prevents reordering
9772                                                             a seq_cst store
9773                                                             followed by a
9774                                                             seq_cst load. (Note
9775                                                             that seq_cst is
9776                                                             stronger than
9777                                                             acquire/release as
9778                                                             the reordering of
9779                                                             load acquire
9780                                                             followed by a store
9781                                                             release is
9782                                                             prevented by the
9783                                                             s_waitcnt of
9784                                                             the release, but
9785                                                             there is nothing
9786                                                             preventing a store
9787                                                             release followed by
9788                                                             load acquire from
9789                                                             completing out of
9790                                                             order. The s_waitcnt
9791                                                             could be placed after
9792                                                             the seq_cst store or
9793                                                             before the seq_cst
9794                                                             load. We choose the load to
9795                                                             make the s_waitcnt be
9796                                                             as late as possible
9797                                                             so that the store
9798                                                             may have already
9799                                                             completed.)
9800
9801                                                         2. *Following
9802                                                            instructions same as
9803                                                            corresponding load
9804                                                            atomic acquire,
9805                                                            except must generate
9806                                                            all instructions even
9807                                                            for OpenCL.*
9808     store atomic seq_cst      - singlethread - global   *Same as corresponding
9809                               - wavefront    - local    store atomic release,
9810                               - workgroup    - generic  except must generate
9811                               - agent                   all instructions even
9812                               - system                  for OpenCL.*
9813     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
9814                               - wavefront    - local    atomicrmw acq_rel,
9815                               - workgroup    - generic  except must generate
9816                               - agent                   all instructions even
9817                               - system                  for OpenCL.*
9818     fence        seq_cst      - singlethread *none*     *Same as corresponding
9819                               - wavefront               fence acq_rel,
9820                               - workgroup               except must generate
9821                               - agent                   all instructions even
9822                               - system                  for OpenCL.*
9823     ============ ============ ============== ========== ================================
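
As an illustration of how a row of the table above is read, the following is a
minimal sketch of the instructions listed for an LLVM ``fence acq_rel`` at
system scope (the default sync scope). It assumes a non-OpenCL source language,
a fence that covers the generic address space, and non-TgSplit execution mode,
so none of the listed instructions is omitted. It is an illustration of the
table row, not verbatim compiler output:

.. code-block:: nasm

  ; fence acq_rel (system scope)
  buffer_wbl2                      ; 1. write back L2 so previous stores are visible at system scope
  s_waitcnt vmcnt(0) lgkmcnt(0)    ; 2. wait for preceding vector memory and LDS operations
  buffer_invl2                     ; 3. invalidate stale L2 MTYPE NC global data
  buffer_wbinvl1_vol               ;    invalidate stale L1 global data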
9824
9825.. _amdgpu-amdhsa-memory-model-gfx942:
9826
9827Memory Model GFX942
9828+++++++++++++++++++
9829
9830For GFX942:
9831
9832* Each agent has multiple shader arrays (SA).
9833* Each SA has multiple compute units (CU).
9834* Each CU has multiple SIMDs that execute wavefronts.
9835* The wavefronts for a single work-group are executed in the same CU but may be
9836  executed by different SIMDs. The exception is tgsplit execution mode, in
9837  which the wavefronts may be executed by different SIMDs in different CUs.
9838* Each CU has a single LDS memory shared by the wavefronts of the work-groups
9839  executing on it. The exception is tgsplit execution mode, in which no LDS is
9840  allocated because wavefronts of the same work-group can be in different CUs.
9841* All LDS operations of a CU are performed as wavefront wide operations in a
9842  global order and involve no caching. Completion is reported to a wavefront in
9843  execution order.
9844* The LDS memory has multiple request queues shared by the SIMDs of a
9845  CU. Therefore, the LDS operations performed by different wavefronts of a
9846  work-group can be reordered relative to each other, which can result in
9847  reordering the visibility of vector memory operations with respect to LDS
9848  operations of other wavefronts in the same work-group. A ``s_waitcnt
9849  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
9850  vector memory operations between wavefronts of a work-group, but not between
9851  operations performed by the same wavefront.
9852* The vector memory operations are performed as wavefront wide operations and
9853  completion is reported to a wavefront in execution order. The exception is
9854  that ``flat_load/store/atomic`` instructions can report out of vector memory
9855  order if they access LDS memory, and out of LDS operation order if they access
9856  global memory.
9857* The vector memory operations access a single vector L1 cache shared by all
9858  SIMDs of a CU. Therefore:
9859
9860  * No special action is required for coherence between the lanes of a single
9861    wavefront.
9862
9863  * No special action is required for coherence between wavefronts in the same
9864    work-group since they execute on the same CU. The exception is tgsplit
9865    execution mode, in which wavefronts of the same work-group can be in
9866    different CUs, so a ``buffer_inv sc0`` is required to invalidate the L1
9867    cache.
9868
9869  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
9870    between wavefronts executing in different work-groups as they may be
9871    executing on different CUs.
9872
9873  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
9874    Therefore, they do not use the sc0 bit for coherence and instead use it to
9875    indicate if the instruction returns the original value being updated. They
9876    do use sc1 to indicate system or agent scope coherence.
9877
9878* The scalar memory operations access a scalar L1 cache shared by all wavefronts
9879  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
9880  scalar operations are used in a restricted way so do not impact the memory
9881  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
9882* The vector and scalar memory operations use an L2 cache.
9883
9884  * The gfx942 can be configured as a number of smaller agents with each having
9885    a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
9886    larger agents with groups of CUs on each agent each sharing separate L2
9887    caches.
9888  * The L2 cache has independent channels to service disjoint ranges of virtual
9889    addresses.
9890  * Each CU has a separate request queue per channel for its associated L2.
9891    Therefore, the vector and scalar memory operations performed by wavefronts
9892    executing with different L1 caches and the same L2 cache can be reordered
9893    relative to each other.
9894  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
9895    vector memory operations of different CUs. It ensures a previous vector
9896    memory operation has completed before executing a subsequent vector memory
9897    or LDS operation and so can be used to meet the requirements of acquire and
9898    release.
9899  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
9900    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
9901    the PTE C-bit set for memory not local to the L2.
9902
9903    * Any local memory cache lines will be automatically invalidated by writes
9904      from CUs associated with other L2 caches, or writes from the CPU, due to
9905      the cache probe caused by the PTE C-bit.
9906    * XGMI accesses from the CPU to local memory may be cached on the CPU.
9907      Subsequent access from the GPU will automatically invalidate or writeback
9908      the CPU cache due to the L2 probe filter.
9909    * To ensure coherence of local memory writes of CUs with different L1 caches
9910      in the same agent a ``buffer_wbl2 sc1`` is required. It does nothing if the
9911      agent is configured to have a single L2, or will writeback dirty L2 cache
9912      lines if configured to have multiple L2 caches.
9913    * To ensure coherence of local memory writes of CUs in different agents a
9914      ``buffer_wbl2 sc0 sc1`` is required. It will writeback dirty L2 cache lines.
9915    * To ensure coherence of local memory reads of CUs with different L1 caches
9916      in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
9917      agent is configured to have a single L2, or will invalidate non-local L2
9918      cache lines if configured to have multiple L2 caches.
9919    * To ensure coherence of local memory reads of CUs in different agents a
9920      ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
9921      lines if configured to have multiple L2 caches.
9922
9923  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
9924    UC (uncached) which bypasses the L2.
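
The invalidation rules above can be illustrated with a minimal sketch of an
agent-scope acquire of a global memory location. The register numbers are
placeholders, and the ``sc1`` modifier on the load is an assumption based on
the agent-scope rows of the code sequence table rather than on the list above:

.. code-block:: nasm

  global_load_dword v0, v[2:3], off sc1  ; agent-scope load (placeholder registers)
  s_waitcnt vmcnt(0)                     ; the load must complete before the invalidate
  buffer_inv sc1                         ; invalidate L1, and non-local L2 lines if the
                                         ; agent is configured with multiple L2 caches

For a system-scope acquire the list above calls for ``buffer_inv sc0 sc1``
instead of ``buffer_inv sc1``.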
9925
9926Scalar memory operations are only used to access memory that is proven to not
9927change during the execution of the kernel dispatch. This includes constant
9928address space and global address space for program scope ``const`` variables.
9929Therefore, the kernel machine code does not have to maintain the scalar cache to
9930ensure it is coherent with the vector caches. The scalar and vector caches are
9931invalidated between kernel dispatches by CP since constant address space data
9932may change between kernel dispatch executions. See
9933:ref:`amdgpu-amdhsa-memory-spaces`.
9934
9935The one exception is if scalar writes are used to spill SGPR registers. In this
9936case the AMDGPU backend ensures the memory location used to spill is never
9937accessed by vector memory operations at the same time. If scalar writes are used
9938then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
9939return since the locations may be used for vector memory instructions by a
9940future wavefront that uses the same scratch area, or a function call that
9941creates a frame at the same address, respectively. There is no need for a
9942``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
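
A minimal sketch of the resulting kernel epilogue, assuming scalar writes were
used to spill SGPRs earlier in the kernel (the spill instructions themselves
are not shown):

.. code-block:: nasm

  s_dcache_wb    ; write back the scalar data cache so the spill locations are
                 ; up to date for a future wavefront using the same scratch area
  s_endpgm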
9943
For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe, the kernarg backing memory is allocated in host
  memory and accessed as MTYPE UC (uncached), which avoids the need to
  invalidate the L2 cache. This also causes it to be treated as non-volatile,
  so it is not invalidated by ``*_vol``.
* On APU, the kernarg backing memory is accessed as MTYPE CC (cache coherent),
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence, all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX940, GFX941, and
GFX942 are defined in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table`.

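
As a reading aid for the acquire ``load atomic`` rows on the global address
space in the table below, the following Python sketch returns the listed steps
for a given sync scope. It is an illustration only and not part of LLVM: the
names ``SCOPE_BITS`` and ``acquire_global_load_steps`` and the ``tgsplit``
parameter are hypothetical, a single ``global_load`` stands in for the load
alternatives the table lists, and the strings use the table's ``sc0=1`` and
``sc1=1`` notation rather than assembler syntax.

.. code-block:: python

  # Cache-control bits that the table lists for each sync scope.
  SCOPE_BITS = {
      "singlethread": "",
      "wavefront": "",
      "workgroup": "sc0=1",
      "agent": "sc1=1",
      "system": "sc0=1 sc1=1",
  }


  def acquire_global_load_steps(scope, tgsplit=False):
      """Return the table's steps for an acquire load atomic on global memory."""
      bits = SCOPE_BITS[scope]
      load = ("global_load " + bits).strip()
      # singlethread and wavefront scopes need only the load itself.
      if not bits:
          return [load]
      # At workgroup scope the wait and the invalidate are omitted unless the
      # kernel runs in TgSplit execution mode.
      if scope == "workgroup" and not tgsplit:
          return [load]
      # The load must complete before the invalidate, and the invalidate must
      # precede any following global/generic load.
      return [load, "s_waitcnt vmcnt(0)", "buffer_inv " + bits]

For example, ``acquire_global_load_steps("agent")`` yields the load with
``sc1=1``, then ``s_waitcnt vmcnt(0)``, then ``buffer_inv sc1=1``, matching the
agent-scope row.
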
  .. table:: AMDHSA Memory Model Code Sequences GFX940, GFX941, GFX942
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX940, GFX941, GFX942
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              nt=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              sc0=1 sc1=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. GFX940, GFX941
                                              - constant        buffer/global/flat_store
                                                                sc0=1 sc1=1
                                                              GFX942
                                                                buffer/global/flat_store

                                                         - !volatile & nontemporal

                                                           1. GFX940, GFX941
                                                                buffer/global/flat_store
                                                                nt=1 sc0=1 sc1=1
                                                              GFX942
                                                                buffer/global/flat_store
                                                                nt=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                              sc0=1 sc1=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     sc0=1
     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     sc1=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     sc0=1 sc1=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
                                              - generic     sc0=1
     store atomic monotonic    - agent        - global   1. buffer/global/flat_store
                                              - generic     sc1=1
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic     sc0=1 sc1=1
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic     sc1=1
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before the
                                                             following buffer_inv.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load  sc0=1
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                                                            sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            sc0=1 sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     load atomic  acquire      - agent        - generic  1. flat_load sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the caches.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before the
                                                             following buffer_inv.
                                                           - Ensures the atomicrmw
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local
                                                             atomicrmw value
                                                             being acquired.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local
                                                             atomicrmw value
                                                             being acquired.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
                                                            sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
10432                                                             atomic/atomicrmw.
10433                                                           - Ensures that
10434                                                             following loads
10435                                                             will not see stale
10436                                                             global data.
10437
10438     atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
10439                                                         2. s_waitcnt vmcnt(0) &
10440                                                            lgkmcnt(0)
10441
10442                                                           - If TgSplit execution mode,
10443                                                             omit lgkmcnt(0).
10444                                                           - If OpenCL, omit
10445                                                             lgkmcnt(0).
10446                                                           - Must happen before
10447                                                             following
10448                                                             buffer_inv.
10449                                                           - Ensures the
10450                                                             atomicrmw has
10451                                                             completed before
10452                                                             invalidating the
10453                                                             caches.
10454
10455                                                         3. buffer_inv sc0=1 sc1=1
10456
10457                                                           - Must happen before
10458                                                             any following
10459                                                             global/generic
10460                                                             load/load
10461                                                             atomic/atomicrmw.
10462                                                           - Ensures that
10463                                                             following
10464                                                             loads will not see
10465                                                             stale MTYPE NC global data.
10466                                                             MTYPE RW and CC memory will
10467                                                             never be stale due to the
10468                                                             memory probes.
10469
10470     fence        acquire      - singlethread *none*     *none*
10471                               - wavefront
10472     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10473
10474                                                           - Use lgkmcnt(0) if not
10475                                                             TgSplit execution mode
10476                                                             and vmcnt(0) if TgSplit
10477                                                             execution mode.
10478                                                           - If OpenCL and
10479                                                             address space is
10480                                                             not generic, omit
10481                                                             lgkmcnt(0).
10482                                                           - If OpenCL and
10483                                                             address space is
10484                                                             local, omit
10485                                                             vmcnt(0).
10486                                                           - See :ref:`amdgpu-fence-as` for
10487                                                             more details on fencing specific
10488                                                             address spaces.
10489                                                           - s_waitcnt vmcnt(0)
10490                                                             must happen after
10491                                                             any preceding
10492                                                             global/generic load
10493                                                             atomic/
10494                                                             atomicrmw
10495                                                             with an equal or
10496                                                             wider sync scope
10497                                                             and memory ordering
10498                                                             stronger than
10499                                                             unordered (this is
10500                                                             termed the
10501                                                             fence-paired-atomic).
10502                                                           - s_waitcnt lgkmcnt(0)
10503                                                             must happen after
10504                                                             any preceding
10505                                                             local/generic load
10506                                                             atomic/atomicrmw
10507                                                             with an equal or
10508                                                             wider sync scope
10509                                                             and memory ordering
10510                                                             stronger than
10511                                                             unordered (this is
10512                                                             termed the
10513                                                             fence-paired-atomic).
10514                                                           - Must happen before
10515                                                             the following
10516                                                             buffer_inv and
10517                                                             any following
10518                                                             global/generic
10519                                                             load/load
10520                                                             atomic/store/store
10521                                                             atomic/atomicrmw.
10522                                                           - Ensures any
10523                                                             following global
10524                                                             data read is no
10525                                                             older than the
10526                                                             value read by the
10527                                                             fence-paired-atomic.
10528
10529                                                         2. buffer_inv sc0=1
10530
10531                                                           - If not TgSplit execution
10532                                                             mode, omit.
10533                                                           - Ensures that
10534                                                             following
10535                                                             loads will not see
10536                                                             stale data.
10537
10538     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
10539                                                            vmcnt(0)
10540
10541                                                           - If TgSplit execution mode,
10542                                                             omit lgkmcnt(0).
10543                                                           - If OpenCL and
10544                                                             address space is
10545                                                             not generic, omit
10546                                                             lgkmcnt(0).
10547                                                           - See :ref:`amdgpu-fence-as` for
10548                                                             more details on fencing specific
10549                                                             address spaces.
10550                                                           - Could be split into
10551                                                             separate s_waitcnt
10552                                                             vmcnt(0) and
10553                                                             s_waitcnt
10554                                                             lgkmcnt(0) to allow
10555                                                             them to be
10556                                                             independently moved
10557                                                             according to the
10558                                                             following rules.
10559                                                           - s_waitcnt vmcnt(0)
10560                                                             must happen after
10561                                                             any preceding
10562                                                             global/generic load
10563                                                             atomic/atomicrmw
10564                                                             with an equal or
10565                                                             wider sync scope
10566                                                             and memory ordering
10567                                                             stronger than
10568                                                             unordered (this is
10569                                                             termed the
10570                                                             fence-paired-atomic).
10571                                                           - s_waitcnt lgkmcnt(0)
10572                                                             must happen after
10573                                                             any preceding
10574                                                             local/generic load
10575                                                             atomic/atomicrmw
10576                                                             with an equal or
10577                                                             wider sync scope
10578                                                             and memory ordering
10579                                                             stronger than
10580                                                             unordered (this is
10581                                                             termed the
10582                                                             fence-paired-atomic).
10583                                                           - Must happen before
10584                                                             the following
10585                                                             buffer_inv.
10586                                                           - Ensures that the
10587                                                             fence-paired-atomic
10588                                                             has completed
10589                                                             before invalidating
10590                                                             the
10591                                                             cache. Therefore
10592                                                             any following
10593                                                             locations read must
10594                                                             be no older than
10595                                                             the value read by
10596                                                             the
10597                                                             fence-paired-atomic.
10598
10599                                                         2. buffer_inv sc1=1
10600
10601                                                           - Must happen before any
10602                                                             following global/generic
10603                                                             load/load
10604                                                             atomic/store/store
10605                                                             atomic/atomicrmw.
10606                                                           - Ensures that
10607                                                             following loads
10608                                                             will not see stale
10609                                                             global data.
10610
10611     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
10612                                                            vmcnt(0)
10613
10614                                                           - If TgSplit execution mode,
10615                                                             omit lgkmcnt(0).
10616                                                           - If OpenCL and
10617                                                             address space is
10618                                                             not generic, omit
10619                                                             lgkmcnt(0).
10620                                                           - See :ref:`amdgpu-fence-as` for
10621                                                             more details on fencing specific
10622                                                             address spaces.
10623                                                           - Could be split into
10624                                                             separate s_waitcnt
10625                                                             vmcnt(0) and
10626                                                             s_waitcnt
10627                                                             lgkmcnt(0) to allow
10628                                                             them to be
10629                                                             independently moved
10630                                                             according to the
10631                                                             following rules.
10632                                                           - s_waitcnt vmcnt(0)
10633                                                             must happen after
10634                                                             any preceding
10635                                                             global/generic load
10636                                                             atomic/atomicrmw
10637                                                             with an equal or
10638                                                             wider sync scope
10639                                                             and memory ordering
10640                                                             stronger than
10641                                                             unordered (this is
10642                                                             termed the
10643                                                             fence-paired-atomic).
10644                                                           - s_waitcnt lgkmcnt(0)
10645                                                             must happen after
10646                                                             any preceding
10647                                                             local/generic load
10648                                                             atomic/atomicrmw
10649                                                             with an equal or
10650                                                             wider sync scope
10651                                                             and memory ordering
10652                                                             stronger than
10653                                                             unordered (this is
10654                                                             termed the
10655                                                             fence-paired-atomic).
10656                                                           - Must happen before
10657                                                             the following
10658                                                             buffer_inv.
10659                                                           - Ensures that the
10660                                                             fence-paired-atomic
10661                                                             has completed
10662                                                             before invalidating
10663                                                             the
10664                                                             cache. Therefore
10665                                                             any following
10666                                                             locations read must
10667                                                             be no older than
10668                                                             the value read by
10669                                                             the
10670                                                             fence-paired-atomic.
10671
10672                                                         2. buffer_inv sc0=1 sc1=1
10673
10674                                                           - Must happen before any
10675                                                             following global/generic
10676                                                             load/load
10677                                                             atomic/store/store
10678                                                             atomic/atomicrmw.
10679                                                           - Ensures that
10680                                                             following loads
10681                                                             will not see stale
10682                                                             global data.
10683
10684     **Release Atomic**
10685     ------------------------------------------------------------------------------------
10686     store atomic release      - singlethread - global   1. GFX940, GFX941
10687                               - wavefront    - generic       buffer/global/flat_store
10688                                                              sc0=1 sc1=1
10689                                                            GFX942
10690                                                              buffer/global/flat_store
10691
10692     store atomic release      - singlethread - local    *If TgSplit execution mode,
10693                               - wavefront               local address space cannot
10694                                                         be used.*
10695
10696                                                         1. ds_store
10697     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10698                                              - generic
10699                                                           - Use lgkmcnt(0) if not
10700                                                             TgSplit execution mode
10701                                                             and vmcnt(0) if TgSplit
10702                                                             execution mode.
10703                                                           - If OpenCL, omit lgkmcnt(0).
10704                                                           - s_waitcnt vmcnt(0)
10705                                                             must happen after
10706                                                             any preceding
10707                                                             global/generic load/store/
10708                                                             load atomic/store atomic/
10709                                                             atomicrmw.
10710                                                           - s_waitcnt lgkmcnt(0)
10711                                                             must happen after
10712                                                             any preceding
10713                                                             local/generic
10714                                                             load/store/load
10715                                                             atomic/store
10716                                                             atomic/atomicrmw.
10717                                                           - Must happen before
10718                                                             the following
10719                                                             store.
10720                                                           - Ensures that all
10721                                                             memory operations
10722                                                             have
10723                                                             completed before
10724                                                             performing the
10725                                                             store that is being
10726                                                             released.
10727
10728                                                         2. GFX940, GFX941
10729                                                              buffer/global/flat_store
10730                                                              sc0=1 sc1=1
10731                                                            GFX942
10732                                                              buffer/global/flat_store
10733                                                              sc0=1
10734     store atomic release      - workgroup    - local    *If TgSplit execution mode,
10735                                                         local address space cannot
10736                                                         be used.*
10737
10738                                                         1. ds_store
10739     store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
10740                                              - generic
10741                                                           - Must happen before
10742                                                             following s_waitcnt.
10743                                                           - Performs L2 writeback to
10744                                                             ensure previous
10745                                                             global/generic
10746                                                             store/atomicrmw are
10747                                                             visible at agent scope.
10748
10749                                                         2. s_waitcnt lgkmcnt(0) &
10750                                                            vmcnt(0)
10751
10752                                                           - If TgSplit execution mode,
10753                                                             omit lgkmcnt(0).
10754                                                           - If OpenCL and
10755                                                             address space is
10756                                                             not generic, omit
10757                                                             lgkmcnt(0).
10758                                                           - Could be split into
10759                                                             separate s_waitcnt
10760                                                             vmcnt(0) and
10761                                                             s_waitcnt
10762                                                             lgkmcnt(0) to allow
10763                                                             them to be
10764                                                             independently moved
10765                                                             according to the
10766                                                             following rules.
10767                                                           - s_waitcnt vmcnt(0)
10768                                                             must happen after
10769                                                             any preceding
10770                                                             global/generic
10771                                                             load/store/load
10772                                                             atomic/store
10773                                                             atomic/atomicrmw.
10774                                                           - s_waitcnt lgkmcnt(0)
10775                                                             must happen after
10776                                                             any preceding
10777                                                             local/generic
10778                                                             load/store/load
10779                                                             atomic/store
10780                                                             atomic/atomicrmw.
10781                                                           - Must happen before
10782                                                             the following
10783                                                             store.
10784                                                           - Ensures that all
10785                                                             memory operations
10786                                                             have
10787                                                             completed before
10788                                                             performing the
10789                                                             store that is being
10790                                                             released.
10791
10792                                                         3. GFX940, GFX941
10793                                                              buffer/global/flat_store
10794                                                              sc0=1 sc1=1
10795                                                            GFX942
10796                                                              buffer/global/flat_store
10797                                                              sc1=1
10798     store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
10799                                              - generic
10800                                                           - Must happen before
10801                                                             following s_waitcnt.
10802                                                           - Performs L2 writeback to
10803                                                             ensure previous
10804                                                             global/generic
10805                                                             store/atomicrmw are
10806                                                             visible at system scope.
10807
10808                                                         2. s_waitcnt lgkmcnt(0) &
10809                                                            vmcnt(0)
10810
10811                                                           - If TgSplit execution mode,
10812                                                             omit lgkmcnt(0).
10813                                                           - If OpenCL and
10814                                                             address space is
10815                                                             not generic, omit
10816                                                             lgkmcnt(0).
10817                                                           - Could be split into
10818                                                             separate s_waitcnt
10819                                                             vmcnt(0) and
10820                                                             s_waitcnt
10821                                                             lgkmcnt(0) to allow
10822                                                             them to be
10823                                                             independently moved
10824                                                             according to the
10825                                                             following rules.
10826                                                           - s_waitcnt vmcnt(0)
10827                                                             must happen after any
10828                                                             preceding
10829                                                             global/generic
10830                                                             load/store/load
10831                                                             atomic/store
10832                                                             atomic/atomicrmw.
10833                                                           - s_waitcnt lgkmcnt(0)
10834                                                             must happen after any
10835                                                             preceding
10836                                                             local/generic
10837                                                             load/store/load
10838                                                             atomic/store
10839                                                             atomic/atomicrmw.
10840                                                           - Must happen before
10841                                                             the following
10842                                                             store.
10843                                                           - Ensures that all
10844                                                             memory operations
10845                                                             to memory and the L2
10846                                                             writeback have
10847                                                             completed before
10848                                                             performing the
10849                                                             store that is being
10850                                                             released.
10851
10852                                                         3. buffer/global/flat_store
10853                                                            sc0=1 sc1=1
10854     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
10855                               - wavefront    - generic
10856     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
10857                               - wavefront               local address space cannot
10858                                                         be used.*
10859
10860                                                         1. ds_atomic
10861     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10862                                              - generic
10863                                                           - Use lgkmcnt(0) if not
10864                                                             TgSplit execution mode
10865                                                             and vmcnt(0) if TgSplit
10866                                                             execution mode.
10867                                                           - If OpenCL, omit
10868                                                             lgkmcnt(0).
10869                                                           - s_waitcnt vmcnt(0)
10870                                                             must happen after
10871                                                             any preceding
10872                                                             global/generic load/store/
10873                                                             load atomic/store atomic/
10874                                                             atomicrmw.
10875                                                           - s_waitcnt lgkmcnt(0)
10876                                                             must happen after
10877                                                             any preceding
10878                                                             local/generic
10879                                                             load/store/load
10880                                                             atomic/store
10881                                                             atomic/atomicrmw.
10882                                                           - Must happen before
10883                                                             the following
10884                                                             atomicrmw.
10885                                                           - Ensures that all
10886                                                             memory operations
10887                                                             have
10888                                                             completed before
10889                                                             performing the
10890                                                             atomicrmw that is
10891                                                             being released.
10892
10893                                                         2. buffer/global/flat_atomic sc0=1
10894     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
10895                                                         local address space cannot
10896                                                         be used.*
10897
10898                                                         1. ds_atomic
10899     atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
10900                                              - generic
10901                                                           - Must happen before
10902                                                             following s_waitcnt.
10903                                                           - Performs L2 writeback to
10904                                                             ensure previous
10905                                                             global/generic
10906                                                             store/atomicrmw are
10907                                                             visible at agent scope.
10908
10909                                                         2. s_waitcnt lgkmcnt(0) &
10910                                                            vmcnt(0)
10911
10912                                                           - If TgSplit execution mode,
10913                                                             omit lgkmcnt(0).
10914                                                           - If OpenCL, omit
10915                                                             lgkmcnt(0).
10916                                                           - Could be split into
10917                                                             separate s_waitcnt
10918                                                             vmcnt(0) and
10919                                                             s_waitcnt
10920                                                             lgkmcnt(0) to allow
10921                                                             them to be
10922                                                             independently moved
10923                                                             according to the
10924                                                             following rules.
10925                                                           - s_waitcnt vmcnt(0)
10926                                                             must happen after
10927                                                             any preceding
10928                                                             global/generic
10929                                                             load/store/load
10930                                                             atomic/store
10931                                                             atomic/atomicrmw.
10932                                                           - s_waitcnt lgkmcnt(0)
10933                                                             must happen after
10934                                                             any preceding
10935                                                             local/generic
10936                                                             load/store/load
10937                                                             atomic/store
10938                                                             atomic/atomicrmw.
10939                                                           - Must happen before
10940                                                             the following
10941                                                             atomicrmw.
10942                                                           - Ensures that all
10943                                                             memory operations
10944                                                             to global and local
10945                                                             have completed
10946                                                             before performing
10947                                                             the atomicrmw that
10948                                                             is being released.
10949
10950                                                         3. buffer/global/flat_atomic sc1=1
10951     atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
10952                                              - generic
10953                                                           - Must happen before
10954                                                             following s_waitcnt.
10955                                                           - Performs L2 writeback to
10956                                                             ensure previous
10957                                                             global/generic
10958                                                             store/atomicrmw are
10959                                                             visible at system scope.
10960
10961                                                         2. s_waitcnt lgkmcnt(0) &
10962                                                            vmcnt(0)
10963
10964                                                           - If TgSplit execution mode,
10965                                                             omit lgkmcnt(0).
10966                                                           - If OpenCL, omit
10967                                                             lgkmcnt(0).
10968                                                           - Could be split into
10969                                                             separate s_waitcnt
10970                                                             vmcnt(0) and
10971                                                             s_waitcnt
10972                                                             lgkmcnt(0) to allow
10973                                                             them to be
10974                                                             independently moved
10975                                                             according to the
10976                                                             following rules.
10977                                                           - s_waitcnt vmcnt(0)
10978                                                             must happen after
10979                                                             any preceding
10980                                                             global/generic
10981                                                             load/store/load
10982                                                             atomic/store
10983                                                             atomic/atomicrmw.
10984                                                           - s_waitcnt lgkmcnt(0)
10985                                                             must happen after
10986                                                             any preceding
10987                                                             local/generic
10988                                                             load/store/load
10989                                                             atomic/store
10990                                                             atomic/atomicrmw.
10991                                                           - Must happen before
10992                                                             the following
10993                                                             atomicrmw.
10994                                                           - Ensures that all
10995                                                             memory operations
10996                                                             to memory and the L2
10997                                                             writeback have
10998                                                             completed before
10999                                                             performing the
11000                                                             atomicrmw that is
11001                                                             being released.
11002
11003                                                         3. buffer/global/flat_atomic
11004                                                            sc0=1 sc1=1
11005     fence        release      - singlethread *none*     *none*
11006                               - wavefront
11007     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
11008
11009                                                           - Use lgkmcnt(0) if not
11010                                                             TgSplit execution mode
11011                                                             and vmcnt(0) if TgSplit
11012                                                             execution mode.
11013                                                           - If OpenCL and
11014                                                             address space is
11015                                                             not generic, omit
11016                                                             lgkmcnt(0).
11017                                                           - If OpenCL and
11018                                                             address space is
11019                                                             local, omit
11020                                                             vmcnt(0).
11021                                                           - See :ref:`amdgpu-fence-as` for
11022                                                             more details on fencing specific
11023                                                             address spaces.
11024                                                           - s_waitcnt vmcnt(0)
11025                                                             must happen after
11026                                                             any preceding
11027                                                             global/generic
11028                                                             load/store/
11029                                                             load atomic/store atomic/
11030                                                             atomicrmw.
11031                                                           - s_waitcnt lgkmcnt(0)
11032                                                             must happen after
11033                                                             any preceding
11034                                                             local/generic
11035                                                             load/store/load
11036                                                             atomic/store
11037                                                             atomic/atomicrmw.
11038                                                           - Must happen before
11039                                                             any following store
11040                                                             atomic/atomicrmw
11041                                                             with an equal or
11042                                                             wider sync scope
11043                                                             and memory ordering
11044                                                             stronger than
11045                                                             unordered (this is
11046                                                             termed the
11047                                                             fence-paired-atomic).
11048                                                           - Ensures that all
11049                                                             memory operations
11050                                                             have
11051                                                             completed before
11052                                                             performing the
11053                                                             following
11054                                                             fence-paired-atomic.
11055
11056     fence        release      - agent        *none*     1. buffer_wbl2 sc1=1
11057
11058                                                           - If OpenCL and
11059                                                             address space is
11060                                                             local, omit.
11061                                                           - Must happen before
11062                                                             following s_waitcnt.
11063                                                           - Performs L2 writeback to
11064                                                             ensure previous
11065                                                             global/generic
11066                                                             store/atomicrmw are
11067                                                             visible at agent scope.
11068
11069                                                         2. s_waitcnt lgkmcnt(0) &
11070                                                            vmcnt(0)
11071
11072                                                           - If TgSplit execution mode,
11073                                                             omit lgkmcnt(0).
11074                                                           - If OpenCL and
11075                                                             address space is
11076                                                             not generic, omit
11077                                                             lgkmcnt(0).
11078                                                           - If OpenCL and
11079                                                             address space is
11080                                                             local, omit
11081                                                             vmcnt(0).
11082                                                           - See :ref:`amdgpu-fence-as` for
11083                                                             more details on fencing specific
11084                                                             address spaces.
11085                                                           - Could be split into
11086                                                             separate s_waitcnt
11087                                                             vmcnt(0) and
11088                                                             s_waitcnt
11089                                                             lgkmcnt(0) to allow
11090                                                             them to be
11091                                                             independently moved
11092                                                             according to the
11093                                                             following rules.
11094                                                           - s_waitcnt vmcnt(0)
11095                                                             must happen after
11096                                                             any preceding
11097                                                             global/generic
11098                                                             load/store/load
11099                                                             atomic/store
11100                                                             atomic/atomicrmw.
11101                                                           - s_waitcnt lgkmcnt(0)
11102                                                             must happen after
11103                                                             any preceding
11104                                                             local/generic
11105                                                             load/store/load
11106                                                             atomic/store
11107                                                             atomic/atomicrmw.
11108                                                           - Must happen before
11109                                                             any following store
11110                                                             atomic/atomicrmw
11111                                                             with an equal or
11112                                                             wider sync scope
11113                                                             and memory ordering
11114                                                             stronger than
11115                                                             unordered (this is
11116                                                             termed the
11117                                                             fence-paired-atomic).
11118                                                           - Ensures that all
11119                                                             memory operations
11120                                                             have
11121                                                             completed before
11122                                                             performing the
11123                                                             following
11124                                                             fence-paired-atomic.
11125
11126     fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
11127
11128                                                           - Must happen before
11129                                                             following s_waitcnt.
11130                                                           - Performs L2 writeback to
11131                                                             ensure previous
11132                                                             global/generic
11133                                                             store/atomicrmw are
11134                                                             visible at system scope.
11135
11136                                                         2. s_waitcnt lgkmcnt(0) &
11137                                                            vmcnt(0)
11138
11139                                                           - If TgSplit execution mode,
11140                                                             omit lgkmcnt(0).
11141                                                           - If OpenCL and
11142                                                             address space is
11143                                                             not generic, omit
11144                                                             lgkmcnt(0).
11145                                                           - If OpenCL and
11146                                                             address space is
11147                                                             local, omit
11148                                                             vmcnt(0).
11149                                                           - See :ref:`amdgpu-fence-as` for
11150                                                             more details on fencing specific
11151                                                             address spaces.
11152                                                           - Could be split into
11153                                                             separate s_waitcnt
11154                                                             vmcnt(0) and
11155                                                             s_waitcnt
11156                                                             lgkmcnt(0) to allow
11157                                                             them to be
11158                                                             independently moved
11159                                                             according to the
11160                                                             following rules.
11161                                                           - s_waitcnt vmcnt(0)
11162                                                             must happen after
11163                                                             any preceding
11164                                                             global/generic
11165                                                             load/store/load
11166                                                             atomic/store
11167                                                             atomic/atomicrmw.
11168                                                           - s_waitcnt lgkmcnt(0)
11169                                                             must happen after
11170                                                             any preceding
11171                                                             local/generic
11172                                                             load/store/load
11173                                                             atomic/store
11174                                                             atomic/atomicrmw.
11175                                                           - Must happen before
11176                                                             any following store
11177                                                             atomic/atomicrmw
11178                                                             with an equal or
11179                                                             wider sync scope
11180                                                             and memory ordering
11181                                                             stronger than
11182                                                             unordered (this is
11183                                                             termed the
11184                                                             fence-paired-atomic).
11185                                                           - Ensures that all
11186                                                             memory operations
11187                                                             have
11188                                                             completed before
11189                                                             performing the
11190                                                             following
11191                                                             fence-paired-atomic.
11192
11193     **Acquire-Release Atomic**
11194     ------------------------------------------------------------------------------------
11195     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
11196                               - wavefront    - generic
11197     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
11198                               - wavefront               local address space cannot
11199                                                         be used.*
11200
11201                                                         1. ds_atomic
11202     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
11203
11204                                                           - Use lgkmcnt(0) if not
11205                                                             TgSplit execution mode
11206                                                             and vmcnt(0) if TgSplit
11207                                                             execution mode.
11208                                                           - If OpenCL, omit
11209                                                             lgkmcnt(0).
11216                                                           - s_waitcnt vmcnt(0)
11217                                                             must happen after
11218                                                             any preceding
11219                                                             global/generic load/store/
11220                                                             load atomic/store atomic/
11221                                                             atomicrmw.
11222                                                           - s_waitcnt lgkmcnt(0)
11223                                                             must happen after
11224                                                             any preceding
11225                                                             local/generic
11226                                                             load/store/load
11227                                                             atomic/store
11228                                                             atomic/atomicrmw.
11229                                                           - Must happen before
11230                                                             the following
11231                                                             atomicrmw.
11232                                                           - Ensures that all
11233                                                             memory operations
11234                                                             have
11235                                                             completed before
11236                                                             performing the
11237                                                             atomicrmw that is
11238                                                             being released.
11239
11240                                                         2. buffer/global_atomic
11241                                                         3. s_waitcnt vmcnt(0)
11242
11243                                                           - If not TgSplit execution
11244                                                             mode, omit.
11245                                                           - Must happen before
11246                                                             the following
11247                                                             buffer_inv.
11248                                                           - Ensures any
11249                                                             following global
11250                                                             data read is no
11251                                                             older than the
11252                                                             atomicrmw value
11253                                                             being acquired.
11254
11255                                                         4. buffer_inv sc0=1
11256
11257                                                           - If not TgSplit execution
11258                                                             mode, omit.
11259                                                           - Ensures that
11260                                                             following
11261                                                             loads will not see
11262                                                             stale data.
11263
11264     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
11265                                                         local address space cannot
11266                                                         be used.*
11267
11268                                                         1. ds_atomic
11269                                                         2. s_waitcnt lgkmcnt(0)
11270
11271                                                           - If OpenCL, omit.
11272                                                           - Must happen before
11273                                                             any following
11274                                                             global/generic
11275                                                             load/load
11276                                                             atomic/store/store
11277                                                             atomic/atomicrmw.
11278                                                           - Ensures any
11279                                                             following global
11280                                                             data read is no
11281                                                             older than the local
11282                                                             atomicrmw value being
11283                                                             acquired.
11284
11285     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
11286
11287                                                           - Use lgkmcnt(0) if not
11288                                                             TgSplit execution mode
11289                                                             and vmcnt(0) if TgSplit
11290                                                             execution mode.
11291                                                           - If OpenCL, omit
11292                                                             lgkmcnt(0).
11293                                                           - s_waitcnt vmcnt(0)
11294                                                             must happen after
11295                                                             any preceding
11296                                                             global/generic load/store/
11297                                                             load atomic/store atomic/
11298                                                             atomicrmw.
11299                                                           - s_waitcnt lgkmcnt(0)
11300                                                             must happen after
11301                                                             any preceding
11302                                                             local/generic
11303                                                             load/store/load
11304                                                             atomic/store
11305                                                             atomic/atomicrmw.
11306                                                           - Must happen before
11307                                                             the following
11308                                                             atomicrmw.
11309                                                           - Ensures that all
11310                                                             memory operations
11311                                                             have
11312                                                             completed before
11313                                                             performing the
11314                                                             atomicrmw that is
11315                                                             being released.
11316
11317                                                         2. flat_atomic
11318                                                         3. s_waitcnt lgkmcnt(0) &
11319                                                            vmcnt(0)
11320
11321                                                           - If not TgSplit execution
11322                                                             mode, omit vmcnt(0).
11323                                                           - If OpenCL, omit
11324                                                             lgkmcnt(0).
11325                                                           - Must happen before
11326                                                             the following
11327                                                             buffer_inv and
11328                                                             any following
11329                                                             global/generic
11330                                                             load/load
11331                                                             atomic/store/store
11332                                                             atomic/atomicrmw.
11333                                                           - Ensures any
11334                                                             following global
11335                                                             data read is no
11336                                                             older than a local
11337                                                             atomicrmw value being
11338                                                             acquired.
11339
11340                                                         4. buffer_inv sc0=1
11341
11342                                                           - If not TgSplit execution
11343                                                             mode, omit.
11344                                                           - Ensures that
11345                                                             following
11346                                                             loads will not see
11347                                                             stale data.
11348
11349     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1
11350
11351                                                           - Must happen before
11352                                                             following s_waitcnt.
11353                                                           - Performs L2 writeback to
11354                                                             ensure previous
11355                                                             global/generic
11356                                                             store/atomicrmw are
11357                                                             visible at agent scope.
11358
11359                                                         2. s_waitcnt lgkmcnt(0) &
11360                                                            vmcnt(0)
11361
11362                                                           - If TgSplit execution mode,
11363                                                             omit lgkmcnt(0).
11364                                                           - If OpenCL, omit
11365                                                             lgkmcnt(0).
11366                                                           - Could be split into
11367                                                             separate s_waitcnt
11368                                                             vmcnt(0) and
11369                                                             s_waitcnt
11370                                                             lgkmcnt(0) to allow
11371                                                             them to be
11372                                                             independently moved
11373                                                             according to the
11374                                                             following rules.
11375                                                           - s_waitcnt vmcnt(0)
11376                                                             must happen after
11377                                                             any preceding
11378                                                             global/generic
11379                                                             load/store/load
11380                                                             atomic/store
11381                                                             atomic/atomicrmw.
11382                                                           - s_waitcnt lgkmcnt(0)
11383                                                             must happen after
11384                                                             any preceding
11385                                                             local/generic
11386                                                             load/store/load
11387                                                             atomic/store
11388                                                             atomic/atomicrmw.
11389                                                           - Must happen before
11390                                                             the following
11391                                                             atomicrmw.
11392                                                           - Ensures that all
11393                                                             memory operations
11394                                                             to global have
11395                                                             completed before
11396                                                             performing the
11397                                                             atomicrmw that is
11398                                                             being released.
11399
11400                                                         3. buffer/global_atomic
11401                                                         4. s_waitcnt vmcnt(0)
11402
11403                                                           - Must happen before
11404                                                             following
11405                                                             buffer_inv.
11406                                                           - Ensures the
11407                                                             atomicrmw has
11408                                                             completed before
11409                                                             invalidating the
11410                                                             cache.
11411
11412                                                         5. buffer_inv sc1=1
11413
11414                                                           - Must happen before
11415                                                             any following
11416                                                             global/generic
11417                                                             load/load
11418                                                             atomic/atomicrmw.
11419                                                           - Ensures that
11420                                                             following loads
11421                                                             will not see stale
11422                                                             global data.
11423
11424     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
11425
11426                                                           - Must happen before
11427                                                             following s_waitcnt.
11428                                                           - Performs L2 writeback to
11429                                                             ensure previous
11430                                                             global/generic
11431                                                             store/atomicrmw are
11432                                                             visible at system scope.
11433
11434                                                         2. s_waitcnt lgkmcnt(0) &
11435                                                            vmcnt(0)
11436
11437                                                           - If TgSplit execution mode,
11438                                                             omit lgkmcnt(0).
11439                                                           - If OpenCL, omit
11440                                                             lgkmcnt(0).
11441                                                           - Could be split into
11442                                                             separate s_waitcnt
11443                                                             vmcnt(0) and
11444                                                             s_waitcnt
11445                                                             lgkmcnt(0) to allow
11446                                                             them to be
11447                                                             independently moved
11448                                                             according to the
11449                                                             following rules.
11450                                                           - s_waitcnt vmcnt(0)
11451                                                             must happen after
11452                                                             any preceding
11453                                                             global/generic
11454                                                             load/store/load
11455                                                             atomic/store
11456                                                             atomic/atomicrmw.
11457                                                           - s_waitcnt lgkmcnt(0)
11458                                                             must happen after
11459                                                             any preceding
11460                                                             local/generic
11461                                                             load/store/load
11462                                                             atomic/store
11463                                                             atomic/atomicrmw.
11464                                                           - Must happen before
11465                                                             the following
11466                                                             atomicrmw.
11467                                                           - Ensures that all
11468                                                             memory operations
11469                                                             to global and L2 writeback
11470                                                             have completed before
11471                                                             performing the
11472                                                             atomicrmw that is
11473                                                             being released.
11474
11475                                                         3. buffer/global_atomic
11476                                                            sc1=1
11477                                                         4. s_waitcnt vmcnt(0)
11478
11479                                                           - Must happen before
11480                                                             following
11481                                                             buffer_inv.
11482                                                           - Ensures the
11483                                                             atomicrmw has
11484                                                             completed before
11485                                                             invalidating the
11486                                                             caches.
11487
11488                                                         5. buffer_inv sc0=1 sc1=1
11489
11490                                                           - Must happen before
11491                                                             any following
11492                                                             global/generic
11493                                                             load/load
11494                                                             atomic/atomicrmw.
11495                                                           - Ensures that
11496                                                             following loads
11497                                                             will not see stale
11498                                                             MTYPE NC global data.
11499                                                             MTYPE RW and CC memory will
11500                                                             never be stale due to the
11501                                                             memory probes.
11502
11503     atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1
11504
11505                                                           - Must happen before
11506                                                             following s_waitcnt.
11507                                                           - Performs L2 writeback to
11508                                                             ensure previous
11509                                                             global/generic
11510                                                             store/atomicrmw are
11511                                                             visible at agent scope.
11512
11513                                                         2. s_waitcnt lgkmcnt(0) &
11514                                                            vmcnt(0)
11515
11516                                                           - If TgSplit execution mode,
11517                                                             omit lgkmcnt(0).
11518                                                           - If OpenCL, omit
11519                                                             lgkmcnt(0).
11520                                                           - Could be split into
11521                                                             separate s_waitcnt
11522                                                             vmcnt(0) and
11523                                                             s_waitcnt
11524                                                             lgkmcnt(0) to allow
11525                                                             them to be
11526                                                             independently moved
11527                                                             according to the
11528                                                             following rules.
11529                                                           - s_waitcnt vmcnt(0)
11530                                                             must happen after
11531                                                             any preceding
11532                                                             global/generic
11533                                                             load/store/load
11534                                                             atomic/store
11535                                                             atomic/atomicrmw.
11536                                                           - s_waitcnt lgkmcnt(0)
11537                                                             must happen after
11538                                                             any preceding
11539                                                             local/generic
11540                                                             load/store/load
11541                                                             atomic/store
11542                                                             atomic/atomicrmw.
11543                                                           - Must happen before
11544                                                             the following
11545                                                             atomicrmw.
11546                                                           - Ensures that all
11547                                                             memory operations
11548                                                             to global have
11549                                                             completed before
11550                                                             performing the
11551                                                             atomicrmw that is
11552                                                             being released.
11553
11554                                                         3. flat_atomic
11555                                                         4. s_waitcnt vmcnt(0) &
11556                                                            lgkmcnt(0)
11557
11558                                                           - If TgSplit execution mode,
11559                                                             omit lgkmcnt(0).
11560                                                           - If OpenCL, omit
11561                                                             lgkmcnt(0).
11562                                                           - Must happen before
11563                                                             following
11564                                                             buffer_inv.
11565                                                           - Ensures the
11566                                                             atomicrmw has
11567                                                             completed before
11568                                                             invalidating the
11569                                                             cache.
11570
11571                                                         5. buffer_inv sc1=1
11572
11573                                                           - Must happen before
11574                                                             any following
11575                                                             global/generic
11576                                                             load/load
11577                                                             atomic/atomicrmw.
11578                                                           - Ensures that
11579                                                             following loads
11580                                                             will not see stale
11581                                                             global data.
11582
11583     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1
11584
11585                                                           - Must happen before
11586                                                             following s_waitcnt.
11587                                                           - Performs L2 writeback to
11588                                                             ensure previous
11589                                                             global/generic
11590                                                             store/atomicrmw are
11591                                                             visible at system scope.
11592
11593                                                         2. s_waitcnt lgkmcnt(0) &
11594                                                            vmcnt(0)
11595
11596                                                           - If TgSplit execution mode,
11597                                                             omit lgkmcnt(0).
11598                                                           - If OpenCL, omit
11599                                                             lgkmcnt(0).
11600                                                           - Could be split into
11601                                                             separate s_waitcnt
11602                                                             vmcnt(0) and
11603                                                             s_waitcnt
11604                                                             lgkmcnt(0) to allow
11605                                                             them to be
11606                                                             independently moved
11607                                                             according to the
11608                                                             following rules.
11609                                                           - s_waitcnt vmcnt(0)
11610                                                             must happen after
11611                                                             any preceding
11612                                                             global/generic
11613                                                             load/store/load
11614                                                             atomic/store
11615                                                             atomic/atomicrmw.
11616                                                           - s_waitcnt lgkmcnt(0)
11617                                                             must happen after
11618                                                             any preceding
11619                                                             local/generic
11620                                                             load/store/load
11621                                                             atomic/store
11622                                                             atomic/atomicrmw.
11623                                                           - Must happen before
11624                                                             the following
11625                                                             atomicrmw.
11626                                                           - Ensures that all
11627                                                             memory operations
11628                                                             to global and L2 writeback
11629                                                             have completed before
11630                                                             performing the
11631                                                             atomicrmw that is
11632                                                             being released.
11633
11634                                                         3. flat_atomic sc1=1
11635                                                         4. s_waitcnt vmcnt(0) &
11636                                                            lgkmcnt(0)
11637
11638                                                           - If TgSplit execution mode,
11639                                                             omit lgkmcnt(0).
11640                                                           - If OpenCL, omit
11641                                                             lgkmcnt(0).
11642                                                           - Must happen before
11643                                                             following
11644                                                             buffer_inv.
11645                                                           - Ensures the
11646                                                             atomicrmw has
11647                                                             completed before
11648                                                             invalidating the
11649                                                             caches.
11650
11651                                                         5. buffer_inv sc0=1 sc1=1
11652
11653                                                           - Must happen before
11654                                                             any following
11655                                                             global/generic
11656                                                             load/load
11657                                                             atomic/atomicrmw.
11658                                                           - Ensures that
11659                                                             following loads
11660                                                             will not see stale
11661                                                             MTYPE NC global data.
11662                                                             MTYPE RW and CC memory will
11663                                                             never be stale due to the
11664                                                             memory probes.
11665
11666     fence        acq_rel      - singlethread *none*     *none*
11667                               - wavefront
11668     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
11669
11670                                                           - Use lgkmcnt(0) if not
11671                                                             TgSplit execution mode
11672                                                             and vmcnt(0) if TgSplit
11673                                                             execution mode.
11674                                                           - If OpenCL and
11675                                                             address space is
11676                                                             not generic, omit
11677                                                             lgkmcnt(0).
11678                                                           - If OpenCL and
11679                                                             address space is
11680                                                             local, omit
11681                                                             vmcnt(0).
11682                                                           - However, since
11683                                                             LLVM currently has
11684                                                             no address space on
11685                                                             the fence, the
11686                                                             s_waitcnt must
11687                                                             conservatively
11688                                                             always be generated
11689                                                             (see comment for
11690                                                             previous fence).
11691                                                           - s_waitcnt vmcnt(0)
11692                                                             must happen after
11693                                                             any preceding
11694                                                             global/generic
11695                                                             load/store/
11696                                                             load atomic/store atomic/
11697                                                             atomicrmw.
11698                                                           - s_waitcnt lgkmcnt(0)
11699                                                             must happen after
11700                                                             any preceding
11701                                                             local/generic
11702                                                             load/load
11703                                                             atomic/store/store
11704                                                             atomic/atomicrmw.
11705                                                           - Must happen before
11706                                                             any following
11707                                                             global/generic
11708                                                             load/load
11709                                                             atomic/store/store
11710                                                             atomic/atomicrmw.
11711                                                           - Ensures that all
11712                                                             memory operations
11713                                                             have
11714                                                             completed before
11715                                                             performing any
11716                                                             following global
11717                                                             memory operations.
11718                                                           - Ensures that the
11719                                                             preceding
11720                                                             local/generic load
11721                                                             atomic/atomicrmw
11722                                                             with an equal or
11723                                                             wider sync scope
11724                                                             and memory ordering
11725                                                             stronger than
11726                                                             unordered (this is
11727                                                             termed the
11728                                                             acquire-fence-paired-atomic)
11729                                                             has completed
11730                                                             before following
11731                                                             global memory
11732                                                             operations. This
11733                                                             satisfies the
11734                                                             requirements of
11735                                                             acquire.
11736                                                           - Ensures that all
11737                                                             previous memory
11738                                                             operations have
11739                                                             completed before a
11740                                                             following
11741                                                             local/generic store
11742                                                             atomic/atomicrmw
11743                                                             with an equal or
11744                                                             wider sync scope
11745                                                             and memory ordering
11746                                                             stronger than
11747                                                             unordered (this is
11748                                                             termed the
11749                                                             release-fence-paired-atomic).
11750                                                             This satisfies the
11751                                                             requirements of
11752                                                             release.
11753                                                           - Must happen before
11754                                                             the following
11755                                                             buffer_inv.
11756                                                           - Ensures that the
11757                                                             acquire-fence-paired-atomic
11758                                                             has completed
11759                                                             before invalidating
11760                                                             the
11761                                                             cache. Therefore
11762                                                             any following
11763                                                             locations read must
11764                                                             be no older than
11765                                                             the value read by
11766                                                             the
11767                                                             acquire-fence-paired-atomic.
11768
11769                                                         2. buffer_inv sc0=1
11770
11771                                                           - If not TgSplit execution
11772                                                             mode, omit.
11773                                                           - Ensures that
11774                                                             following
11775                                                             loads will not see
11776                                                             stale data.
11777
11778     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1
11779
11780                                                           - If OpenCL and
11781                                                             address space is
11782                                                             local, omit.
11783                                                           - Must happen before
11784                                                             following s_waitcnt.
11785                                                           - Performs L2 writeback to
11786                                                             ensure previous
11787                                                             global/generic
11788                                                             store/atomicrmw are
11789                                                             visible at agent scope.
11790
11791                                                         2. s_waitcnt lgkmcnt(0) &
11792                                                            vmcnt(0)
11793
11794                                                           - If TgSplit execution mode,
11795                                                             omit lgkmcnt(0).
11796                                                           - If OpenCL and
11797                                                             address space is
11798                                                             not generic, omit
11799                                                             lgkmcnt(0).
11800                                                           - See :ref:`amdgpu-fence-as` for
11801                                                             more details on fencing specific
11802                                                             address spaces.
11803                                                           - Could be split into
11804                                                             separate s_waitcnt
11805                                                             vmcnt(0) and
11806                                                             s_waitcnt
11807                                                             lgkmcnt(0) to allow
11808                                                             them to be
11809                                                             independently moved
11810                                                             according to the
11811                                                             following rules.
11812                                                           - s_waitcnt vmcnt(0)
11813                                                             must happen after
11814                                                             any preceding
11815                                                             global/generic
11816                                                             load/store/load
11817                                                             atomic/store
11818                                                             atomic/atomicrmw.
11819                                                           - s_waitcnt lgkmcnt(0)
11820                                                             must happen after
11821                                                             any preceding
11822                                                             local/generic
11823                                                             load/store/load
11824                                                             atomic/store
11825                                                             atomic/atomicrmw.
11826                                                           - Must happen before
11827                                                             the following
11828                                                             buffer_inv.
11829                                                           - Ensures that the
11830                                                             preceding
11831                                                             global/local/generic
11832                                                             load
11833                                                             atomic/atomicrmw
11834                                                             with an equal or
11835                                                             wider sync scope
11836                                                             and memory ordering
11837                                                             stronger than
11838                                                             unordered (this is
11839                                                             termed the
11840                                                             acquire-fence-paired-atomic)
11841                                                             has completed
11842                                                             before invalidating
11843                                                             the cache. This
11844                                                             satisfies the
11845                                                             requirements of
11846                                                             acquire.
11847                                                           - Ensures that all
11848                                                             previous memory
11849                                                             operations have
11850                                                             completed before a
11851                                                             following
11852                                                             global/local/generic
11853                                                             store
11854                                                             atomic/atomicrmw
11855                                                             with an equal or
11856                                                             wider sync scope
11857                                                             and memory ordering
11858                                                             stronger than
11859                                                             unordered (this is
11860                                                             termed the
11861                                                             release-fence-paired-atomic).
11862                                                             This satisfies the
11863                                                             requirements of
11864                                                             release.
11865
11866                                                         3. buffer_inv sc1=1
11867
11868                                                           - Must happen before
11869                                                             any following
11870                                                             global/generic
11871                                                             load/load
11872                                                             atomic/store/store
11873                                                             atomic/atomicrmw.
11874                                                           - Ensures that
11875                                                             following loads
11876                                                             will not see stale
11877                                                             global data. This
11878                                                             satisfies the
11879                                                             requirements of
11880                                                             acquire.
11881
11882     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
11883
11884                                                           - If OpenCL and
11885                                                             address space is
11886                                                             local, omit.
11887                                                           - Must happen before
11888                                                             following s_waitcnt.
11889                                                           - Performs L2 writeback to
11890                                                             ensure previous
11891                                                             global/generic
11892                                                             store/atomicrmw are
11893                                                             visible at system scope.
11894
11895                                                         1. s_waitcnt lgkmcnt(0) &
11896                                                            vmcnt(0)
11897
11898                                                           - If TgSplit execution mode,
11899                                                             omit lgkmcnt(0).
11900                                                           - If OpenCL and
11901                                                             address space is
11902                                                             not generic, omit
11903                                                             lgkmcnt(0).
11904                                                           - See :ref:`amdgpu-fence-as` for
11905                                                             more details on fencing specific
11906                                                             address spaces.
11907                                                           - Could be split into
11908                                                             separate s_waitcnt
11909                                                             vmcnt(0) and
11910                                                             s_waitcnt
11911                                                             lgkmcnt(0) to allow
11912                                                             them to be
11913                                                             independently moved
11914                                                             according to the
11915                                                             following rules.
11916                                                           - s_waitcnt vmcnt(0)
11917                                                             must happen after
11918                                                             any preceding
11919                                                             global/generic
11920                                                             load/store/load
11921                                                             atomic/store
11922                                                             atomic/atomicrmw.
11923                                                           - s_waitcnt lgkmcnt(0)
11924                                                             must happen after
11925                                                             any preceding
11926                                                             local/generic
11927                                                             load/store/load
11928                                                             atomic/store
11929                                                             atomic/atomicrmw.
11930                                                           - Must happen before
11931                                                             the following
11932                                                             buffer_inv.
11933                                                           - Ensures that the
11934                                                             preceding
11935                                                             global/local/generic
11936                                                             load
11937                                                             atomic/atomicrmw
11938                                                             with an equal or
11939                                                             wider sync scope
11940                                                             and memory ordering
11941                                                             stronger than
11942                                                             unordered (this is
11943                                                             termed the
11944                                                             acquire-fence-paired-atomic)
11945                                                             has completed
11946                                                             before invalidating
11947                                                             the cache. This
11948                                                             satisfies the
11949                                                             requirements of
11950                                                             acquire.
11951                                                           - Ensures that all
11952                                                             previous memory
11953                                                             operations have
11954                                                             completed before a
11955                                                             following
11956                                                             global/local/generic
11957                                                             store
11958                                                             atomic/atomicrmw
11959                                                             with an equal or
11960                                                             wider sync scope
11961                                                             and memory ordering
11962                                                             stronger than
11963                                                             unordered (this is
11964                                                             termed the
11965                                                             release-fence-paired-atomic).
11966                                                             This satisfies the
11967                                                             requirements of
11968                                                             release.
11969
11970                                                         2. buffer_inv sc0=1 sc1=1
11971
11972                                                           - Must happen before
11973                                                             any following
11974                                                             global/generic
11975                                                             load/load
11976                                                             atomic/store/store
11977                                                             atomic/atomicrmw.
11978                                                           - Ensures that
11979                                                             following loads
11980                                                             will not see stale
11981                                                             MTYPE NC global data.
11982                                                             MTYPE RW and CC memory will
11983                                                             never be stale due to the
11984                                                             memory probes.
11985
11986     **Sequentially Consistent Atomic**
11987     ------------------------------------------------------------------------------------
11988     load atomic  seq_cst      - singlethread - global   *Same as corresponding
11989                               - wavefront    - local    load atomic acquire,
11990                                              - generic  except must generate
11991                                                         all instructions even
11992                                                         for OpenCL.*
11993     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
11994                                              - generic
11995                                                           - Use lgkmcnt(0) if not
11996                                                             TgSplit execution mode
11997                                                             and vmcnt(0) if TgSplit
11998                                                             execution mode.
11999                                                           - s_waitcnt lgkmcnt(0) must
12000                                                             happen after
12001                                                             preceding
12002                                                             local/generic load
12003                                                             atomic/store
12004                                                             atomic/atomicrmw
12005                                                             with memory
12006                                                             ordering of seq_cst
12007                                                             and with equal or
12008                                                             wider sync scope.
12009                                                             (Note that seq_cst
12010                                                             fences have their
12011                                                             own s_waitcnt
12012                                                             lgkmcnt(0) and so do
12013                                                             not need to be
12014                                                             considered.)
12015                                                           - s_waitcnt vmcnt(0)
12016                                                             must happen after
12017                                                             preceding
12018                                                             global/generic load
12019                                                             atomic/store
12020                                                             atomic/atomicrmw
12021                                                             with memory
12022                                                             ordering of seq_cst
12023                                                             and with equal or
12024                                                             wider sync scope.
12025                                                             (Note that seq_cst
12026                                                             fences have their
12027                                                             own s_waitcnt
12028                                                             vmcnt(0) and so do
12029                                                             not need to be
12030                                                             considered.)
12031                                                           - Ensures any
12032                                                             preceding
12033                                                             sequentially
12034                                                             consistent global/local
12035                                                             memory instructions
12036                                                             have completed
                                                             before executing
                                                             this sequentially
                                                             consistent
                                                             instruction. This
                                                             prevents reordering
                                                             a seq_cst store
                                                             followed by a
                                                             seq_cst load. (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst load. We
                                                             choose the load to
                                                             make the s_waitcnt be
                                                             as late as possible
                                                             so that the store
                                                             may have already
                                                             completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0)
                                                             and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             preceding
                                                             global/generic load
                                                             atomic/store
                                                             atomic/atomicrmw
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             lgkmcnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             preceding
                                                             global/generic load
                                                             atomic/store
                                                             atomic/atomicrmw
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             vmcnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - Ensures any
                                                             preceding
                                                             sequentially
                                                             consistent global
                                                             memory instructions
                                                             have completed
                                                             before executing
                                                             this sequentially
                                                             consistent
                                                             instruction. This
                                                             prevents reordering
                                                             a seq_cst store
                                                             followed by a
                                                             seq_cst load. (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst load. We
                                                             choose the load to
                                                             make the s_waitcnt be
                                                             as late as possible
                                                             so that the store
                                                             may have already
                                                             completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10-gfx11:

Memory Model GFX10-GFX11
++++++++++++++++++++++++

For GFX10-GFX11:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU
  memory. The MALL cache is fully coherent with GPU memory and has no impact on
  system coherence. All agents (GPU and CPU) access GPU memory through the MALL
  cache.
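
As an illustrative sketch of how these hardware levels relate to LLVM sync
scopes, the hypothetical function below (the name ``@acquire_scopes`` is an
assumption made for this example) performs the same acquire load at four
scopes. The comments only restate the coherence requirements listed above. The
exact instructions generated for each case are those given in the code sequence
table for this target.

.. code-block:: llvm

  ; Hypothetical example, for illustration only.
  define void @acquire_scopes(ptr addrspace(1) %p) {
    ; Lanes of one wavefront share an L0 cache: no invalidate is required.
    %a = load atomic i32, ptr addrspace(1) %p syncscope("wavefront") acquire, align 4
    ; Wavefronts of one work-group may run on different CUs of the WGP, each
    ; with its own L0: buffer_gl0_inv may be required.
    %b = load atomic i32, ptr addrspace(1) %p syncscope("workgroup") acquire, align 4
    ; Work-groups of one agent may run on different WGPs and SAs:
    ; buffer_gl0_inv and buffer_gl1_inv may both be required.
    %c = load atomic i32, ptr addrspace(1) %p syncscope("agent") acquire, align 4
    ; Omitting syncscope selects system scope, the widest scope.
    %d = load atomic i32, ptr addrspace(1) %p acquire, align 4
    ret void
  }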

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.
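
For example, in the following sketch (the kernel name and its arguments are
hypothetical), the load reads the constant address space (address space 4 on
``amdgcn``) through a uniform kernel argument pointer, so the backend is free
to use a scalar memory instruction for it. The store writes the global address
space (address space 1) and uses a vector memory instruction. Whether a scalar
load is actually selected depends on the backend's analysis and is not
guaranteed by this example.

.. code-block:: llvm

  ; Hypothetical example, for illustration only.
  define amdgpu_kernel void @copy_const(ptr addrspace(4) %in,
                                        ptr addrspace(1) %out) {
    ; Constant address space: the data cannot change during the dispatch,
    ; so a scalar load that goes through the scalar L0 cache is permitted.
    %v = load i32, ptr addrspace(4) %in, align 4
    ; Global address space: stored with a vector memory instruction.
    store i32 %v, ptr addrspace(1) %out, align 4
    ret void
  }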

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are
used, then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
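
The split between the two counters can be seen with a volatile load followed by
a volatile store. The sketch below is illustrative only (the function name is
hypothetical). Its comments quote the waits listed for volatile accesses in the
GFX10-GFX11 code sequence table, written in that table's notation: the load is
tracked by ``vmcnt`` and the store by ``vscnt``.

.. code-block:: llvm

  ; Hypothetical example, for illustration only.
  define void @volatile_copy(ptr addrspace(1) %src, ptr addrspace(1) %dst) {
    ; Volatile load: the table lists a following s_waitcnt vmcnt(0).
    %v = load volatile i32, ptr addrspace(1) %src, align 4
    ; Volatile store: the table lists a following s_waitcnt vscnt(0).
    store volatile i32 %v, ptr addrspace(1) %dst, align 4
    ret void
  }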

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also, accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group go through the same L0, which in turn ensures L1 accesses are
  ordered. As a result, no explicit management of the caches is required for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
:ref:`amdgpu-target-features`.
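
As a sketch of how the execution mode changes the generated code, consider a
work-group scope acquire load from the global address space. The function name
below is hypothetical, and the sequences in the comments are taken from the
corresponding ``load atomic acquire`` row of the GFX10-GFX11 code sequence
table rather than from compiler output. The mode is typically selected through
the ``cumode`` target feature (for example with the Clang option ``-mcumode``).

.. code-block:: llvm

  ; Hypothetical example, for illustration only.
  ;
  ; WGP wavefront execution mode:      CU wavefront execution mode:
  ;   buffer/global_load glc=1           buffer/global_load
  ;   s_waitcnt vmcnt(0)
  ;   buffer_gl0_inv
  define i32 @acquire_workgroup(ptr addrspace(1) %flag) {
    %v = load atomic i32, ptr addrspace(1) %flag syncscope("workgroup") acquire, align 4
    ret i32 %v
  }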

The code sequences used to implement the memory model for GFX10-GFX11 are
defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10-GFX11
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              slc=1 dlc=1

                                                            - If GFX10, omit dlc=1.

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1 dlc=1

                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1 dlc=1

                                                            - If GFX10, omit dlc=1.

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                              dlc=1

                                                            - If GFX10, omit dlc=1.

                                                           2. s_waitcnt vscnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1 dlc=1

                                                           - If GFX11, omit dlc=1.

     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv and any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1 dlc=1

                                                           - If GFX11, omit dlc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
12552                                                             before invalidating
12553                                                             the caches.
12554
12555                                                         3. buffer_gl1_inv;
12556                                                            buffer_gl0_inv
12557
12558                                                           - Must happen before
12559                                                             any following
12560                                                             global/generic
12561                                                             load/load
12562                                                             atomic/atomicrmw.
12563                                                           - Ensures that
12564                                                             following
12565                                                             loads will not see
12566                                                             stale global data.
12567
12568     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
12569                               - system
12570                                                           - If GFX11, omit dlc=1.
12571
12572                                                         2. s_waitcnt vmcnt(0) &
12573                                                            lgkmcnt(0)
12574
12575                                                           - If OpenCL, omit
12576                                                             lgkmcnt(0).
12577                                                           - Must happen before
12578                                                             following
12579                                                             buffer_gl*_inv.
12580                                                           - Ensures the flat_load
12581                                                             has completed
12582                                                             before invalidating
12583                                                             the caches.
12584
12585                                                         3. buffer_gl1_inv;
12586                                                            buffer_gl0_inv
12587
12588                                                           - Must happen before
12589                                                             any following
12590                                                             global/generic
12591                                                             load/load
12592                                                             atomic/atomicrmw.
12593                                                           - Ensures that
12594                                                             following loads
12595                                                             will not see stale
12596                                                             global data.
12597
12598     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
12599                               - wavefront    - local
12600                                              - generic
12601     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
12602                                                         2. s_waitcnt vm/vscnt(0)
12603
12604                                                           - If CU wavefront execution
12605                                                             mode, omit.
12606                                                           - Use vmcnt(0) if atomic with
12607                                                             return and vscnt(0) if
12608                                                             atomic with no-return.
12609                                                           - Must happen before
12610                                                             the following buffer_gl0_inv
12611                                                             and before any following
12612                                                             global/generic
12613                                                             load/load
12614                                                             atomic/store/store
12615                                                             atomic/atomicrmw.
12616
12617                                                         3. buffer_gl0_inv
12618
12619                                                           - If CU wavefront execution
12620                                                             mode, omit.
12621                                                           - Ensures that
12622                                                             following
12623                                                             loads will not see
12624                                                             stale data.
12625
12626     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
12627                                                         2. s_waitcnt lgkmcnt(0)
12628
12629                                                           - If OpenCL, omit.
12630                                                           - Must happen before
12631                                                             the following
12632                                                             buffer_gl0_inv.
12633                                                           - Ensures any
12634                                                             following global
12635                                                             data read is no
12636                                                             older than the local
12637                                                             atomicrmw value
12638                                                             being acquired.
12639
12640                                                         3. buffer_gl0_inv
12641
12642                                                           - If OpenCL, omit.
12643                                                           - Ensures that
12644                                                             following
12645                                                             loads will not see
12646                                                             stale data.
12647
12648     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
12649                                                         2. s_waitcnt lgkmcnt(0) &
12650                                                            vm/vscnt(0)
12651
12652                                                           - If CU wavefront execution
12653                                                             mode, omit vm/vscnt(0).
12654                                                           - If OpenCL, omit lgkmcnt(0).
12655                                                           - Use vmcnt(0) if atomic with
12656                                                             return and vscnt(0) if
12657                                                             atomic with no-return.
12658                                                           - Must happen before
12659                                                             the following
12660                                                             buffer_gl0_inv.
12661                                                           - Ensures any
12662                                                             following global
12663                                                             data read is no
12664                                                             older than a local
12665                                                             atomicrmw value
12666                                                             being acquired.
12667
12668                                                         3. buffer_gl0_inv
12669
12670                                                           - If CU wavefront execution
12671                                                             mode, omit.
12672                                                           - Ensures that
12673                                                             following
12674                                                             loads will not see
12675                                                             stale data.
12676
12677     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
12678                               - system                  2. s_waitcnt vm/vscnt(0)
12679
12680                                                           - Use vmcnt(0) if atomic with
12681                                                             return and vscnt(0) if
12682                                                             atomic with no-return.
12683                                                           - Must happen before
12684                                                             following
12685                                                             buffer_gl*_inv.
12686                                                           - Ensures the
12687                                                             atomicrmw has
12688                                                             completed before
12689                                                             invalidating the
12690                                                             caches.
12691
12692                                                         3. buffer_gl1_inv;
12693                                                            buffer_gl0_inv
12694
12695                                                           - Must happen before
12696                                                             any following
12697                                                             global/generic
12698                                                             load/load
12699                                                             atomic/atomicrmw.
12700                                                           - Ensures that
12701                                                             following loads
12702                                                             will not see stale
12703                                                             global data.
12704
12705     atomicrmw    acquire      - agent        - generic  1. flat_atomic
12706                               - system                  2. s_waitcnt vm/vscnt(0) &
12707                                                            lgkmcnt(0)
12708
12709                                                           - If OpenCL, omit
12710                                                             lgkmcnt(0).
12711                                                           - Use vmcnt(0) if atomic with
12712                                                             return and vscnt(0) if
12713                                                             atomic with no-return.
12714                                                           - Must happen before
12715                                                             following
12716                                                             buffer_gl*_inv.
12717                                                           - Ensures the
12718                                                             atomicrmw has
12719                                                             completed before
12720                                                             invalidating the
12721                                                             caches.
12722
12723                                                         3. buffer_gl1_inv;
12724                                                            buffer_gl0_inv
12725
12726                                                           - Must happen before
12727                                                             any following
12728                                                             global/generic
12729                                                             load/load
12730                                                             atomic/atomicrmw.
12731                                                           - Ensures that
12732                                                             following loads
12733                                                             will not see stale
12734                                                             global data.
12735
12736     fence        acquire      - singlethread *none*     *none*
12737                               - wavefront
12738     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12739                                                            vmcnt(0) & vscnt(0)
12740
12741                                                           - If CU wavefront execution
12742                                                             mode, omit vmcnt(0) and
12743                                                             vscnt(0).
12744                                                           - If OpenCL and
12745                                                             address space is
12746                                                             not generic, omit
12747                                                             lgkmcnt(0).
12748                                                           - If OpenCL and
12749                                                             address space is
12750                                                             local, omit
12751                                                             vmcnt(0) and vscnt(0).
12752                                                           - See :ref:`amdgpu-fence-as` for
12753                                                             more details on fencing specific
12754                                                             address spaces.
12755                                                           - Could be split into
12756                                                             separate s_waitcnt
12757                                                             vmcnt(0), s_waitcnt
12758                                                             vscnt(0) and s_waitcnt
12759                                                             lgkmcnt(0) to allow
12760                                                             them to be
12761                                                             independently moved
12762                                                             according to the
12763                                                             following rules.
12764                                                           - s_waitcnt vmcnt(0)
12765                                                             must happen after
12766                                                             any preceding
12767                                                             global/generic load
12768                                                             atomic/
12769                                                             atomicrmw-with-return-value
12770                                                             with an equal or
12771                                                             wider sync scope
12772                                                             and memory ordering
12773                                                             stronger than
12774                                                             unordered (this is
12775                                                             termed the
12776                                                             fence-paired-atomic).
12777                                                           - s_waitcnt vscnt(0)
12778                                                             must happen after
12779                                                             any preceding
12780                                                             global/generic
12781                                                             atomicrmw-no-return-value
12782                                                             with an equal or
12783                                                             wider sync scope
12784                                                             and memory ordering
12785                                                             stronger than
12786                                                             unordered (this is
12787                                                             termed the
12788                                                             fence-paired-atomic).
12789                                                           - s_waitcnt lgkmcnt(0)
12790                                                             must happen after
12791                                                             any preceding
12792                                                             local/generic load
12793                                                             atomic/atomicrmw
12794                                                             with an equal or
12795                                                             wider sync scope
12796                                                             and memory ordering
12797                                                             stronger than
12798                                                             unordered (this is
12799                                                             termed the
12800                                                             fence-paired-atomic).
12801                                                           - Must happen before
12802                                                             the following
12803                                                             buffer_gl0_inv.
12804                                                           - Ensures that the
12805                                                             fence-paired-atomic
12806                                                             has completed
12807                                                             before invalidating
12808                                                             the
12809                                                             cache. Therefore
12810                                                             any following
12811                                                             locations read must
12812                                                             be no older than
12813                                                             the value read by
12814                                                             the
12815                                                             fence-paired-atomic.
12816
12817                                                         2. buffer_gl0_inv
12818
12819                                                           - If CU wavefront execution
12820                                                             mode, omit.
12821                                                           - Ensures that
12822                                                             following
12823                                                             loads will not see
12824                                                             stale data.
12825
12826     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12827                               - system                     vmcnt(0) & vscnt(0)
12828
12829                                                           - If OpenCL and
12830                                                             address space is
12831                                                             not generic, omit
12832                                                             lgkmcnt(0).
12833                                                           - If OpenCL and
12834                                                             address space is
12835                                                             local, omit
12836                                                             vmcnt(0) and vscnt(0).
12837                                                           - See :ref:`amdgpu-fence-as` for
12838                                                             more details on fencing specific
12839                                                             address spaces.
12840                                                           - Could be split into
12841                                                             separate s_waitcnt
12842                                                             vmcnt(0), s_waitcnt
12843                                                             vscnt(0) and s_waitcnt
12844                                                             lgkmcnt(0) to allow
12845                                                             them to be
12846                                                             independently moved
12847                                                             according to the
12848                                                             following rules.
12849                                                           - s_waitcnt vmcnt(0)
12850                                                             must happen after
12851                                                             any preceding
12852                                                             global/generic load
12853                                                             atomic/
12854                                                             atomicrmw-with-return-value
12855                                                             with an equal or
12856                                                             wider sync scope
12857                                                             and memory ordering
12858                                                             stronger than
12859                                                             unordered (this is
12860                                                             termed the
12861                                                             fence-paired-atomic).
12862                                                           - s_waitcnt vscnt(0)
12863                                                             must happen after
12864                                                             any preceding
12865                                                             global/generic
12866                                                             atomicrmw-no-return-value
12867                                                             with an equal or
12868                                                             wider sync scope
12869                                                             and memory ordering
12870                                                             stronger than
12871                                                             unordered (this is
12872                                                             termed the
12873                                                             fence-paired-atomic).
12874                                                           - s_waitcnt lgkmcnt(0)
12875                                                             must happen after
12876                                                             any preceding
12877                                                             local/generic load
12878                                                             atomic/atomicrmw
12879                                                             with an equal or
12880                                                             wider sync scope
12881                                                             and memory ordering
12882                                                             stronger than
12883                                                             unordered (this is
12884                                                             termed the
12885                                                             fence-paired-atomic).
12886                                                           - Must happen before
12887                                                             the following
12888                                                             buffer_gl*_inv.
12889                                                           - Ensures that the
12890                                                             fence-paired-atomic
12891                                                             has completed
12892                                                             before invalidating
12893                                                             the
12894                                                             caches. Therefore
12895                                                             any following
12896                                                             locations read must
12897                                                             be no older than
12898                                                             the value read by
12899                                                             the
12900                                                             fence-paired-atomic.
12901
12902                                                         2. buffer_gl1_inv;
12903                                                            buffer_gl0_inv
12904
12905                                                           - Must happen before any
12906                                                             following global/generic
12907                                                             load/load
12908                                                             atomic/store/store
12909                                                             atomic/atomicrmw.
12910                                                           - Ensures that
12911                                                             following loads
12912                                                             will not see stale
12913                                                             global data.
12914
12915     **Release Atomic**
12916     ------------------------------------------------------------------------------------
12917     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
12918                               - wavefront    - local
12919                                              - generic
12920     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12921                                              - generic     vmcnt(0) & vscnt(0)
12922
12923                                                           - If CU wavefront execution
12924                                                             mode, omit vmcnt(0) and
12925                                                             vscnt(0).
12926                                                           - If OpenCL, omit
12927                                                             lgkmcnt(0).
12928                                                           - Could be split into
12929                                                             separate s_waitcnt
12930                                                             vmcnt(0), s_waitcnt
12931                                                             vscnt(0) and s_waitcnt
12932                                                             lgkmcnt(0) to allow
12933                                                             them to be
12934                                                             independently moved
12935                                                             according to the
12936                                                             following rules.
12937                                                           - s_waitcnt vmcnt(0)
12938                                                             must happen after
12939                                                             any preceding
12940                                                             global/generic load/load
12941                                                             atomic/
12942                                                             atomicrmw-with-return-value.
12943                                                           - s_waitcnt vscnt(0)
12944                                                             must happen after
12945                                                             any preceding
12946                                                             global/generic
12947                                                             store/store
12948                                                             atomic/
12949                                                             atomicrmw-no-return-value.
12950                                                           - s_waitcnt lgkmcnt(0)
12951                                                             must happen after
12952                                                             any preceding
12953                                                             local/generic
12954                                                             load/store/load
12955                                                             atomic/store
12956                                                             atomic/atomicrmw.
12957                                                           - Must happen before
12958                                                             the following
12959                                                             store.
12960                                                           - Ensures that all
12961                                                             memory operations
12962                                                             have
12963                                                             completed before
12964                                                             performing the
12965                                                             store that is being
12966                                                             released.
12967
12968                                                         2. buffer/global/flat_store
12969     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12970
12971                                                           - If CU wavefront execution
12972                                                             mode, omit.
12973                                                           - If OpenCL, omit.
12974                                                           - Could be split into
12975                                                             separate s_waitcnt
12976                                                             vmcnt(0) and s_waitcnt
12977                                                             vscnt(0) to allow
12978                                                             them to be
12979                                                             independently moved
12980                                                             according to the
12981                                                             following rules.
12982                                                           - s_waitcnt vmcnt(0)
12983                                                             must happen after
12984                                                             any preceding
12985                                                             global/generic load/load
12986                                                             atomic/
12987                                                             atomicrmw-with-return-value.
12988                                                           - s_waitcnt vscnt(0)
12989                                                             must happen after
12990                                                             any preceding
12991                                                             global/generic
12992                                                             store/store atomic/
12993                                                             atomicrmw-no-return-value.
12994                                                           - Must happen before
12995                                                             the following
12996                                                             store.
12997                                                           - Ensures that all
12998                                                             global memory
12999                                                             operations have
13000                                                             completed before
13001                                                             performing the
13002                                                             store that is being
13003                                                             released.
13004
13005                                                         2. ds_store
13006     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13007                               - system       - generic     vmcnt(0) & vscnt(0)
13008
13009                                                           - If OpenCL and
13010                                                             address space is
13011                                                             not generic, omit
13012                                                             lgkmcnt(0).
13013                                                           - Could be split into
13014                                                             separate s_waitcnt
13015                                                             vmcnt(0), s_waitcnt vscnt(0)
13016                                                             and s_waitcnt
13017                                                             lgkmcnt(0) to allow
13018                                                             them to be
13019                                                             independently moved
13020                                                             according to the
13021                                                             following rules.
13022                                                           - s_waitcnt vmcnt(0)
13023                                                             must happen after
13024                                                             any preceding
13025                                                             global/generic
13026                                                             load/load
13027                                                             atomic/
13028                                                             atomicrmw-with-return-value.
13029                                                           - s_waitcnt vscnt(0)
13030                                                             must happen after
13031                                                             any preceding
13032                                                             global/generic
13033                                                             store/store atomic/
13034                                                             atomicrmw-no-return-value.
13035                                                           - s_waitcnt lgkmcnt(0)
13036                                                             must happen after
13037                                                             any preceding
13038                                                             local/generic
13039                                                             load/store/load
13040                                                             atomic/store
13041                                                             atomic/atomicrmw.
13042                                                           - Must happen before
13043                                                             the following
13044                                                             store.
13045                                                           - Ensures that all
13046                                                             memory operations
13047                                                             have
13048                                                             completed before
13049                                                             performing the
13050                                                             store that is being
13051                                                             released.
13052
13053                                                         2. buffer/global/flat_store
13054     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
13055                               - wavefront    - local
13056                                              - generic
13057     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
13058                                              - generic     vmcnt(0) & vscnt(0)
13059
13060                                                           - If CU wavefront execution
13061                                                             mode, omit vmcnt(0) and
13062                                                             vscnt(0).
13063                                                           - If OpenCL, omit lgkmcnt(0).
13064                                                           - Could be split into
13065                                                             separate s_waitcnt
13066                                                             vmcnt(0), s_waitcnt
13067                                                             vscnt(0) and s_waitcnt
13068                                                             lgkmcnt(0) to allow
13069                                                             them to be
13070                                                             independently moved
13071                                                             according to the
13072                                                             following rules.
13073                                                           - s_waitcnt vmcnt(0)
13074                                                             must happen after
13075                                                             any preceding
13076                                                             global/generic load/load
13077                                                             atomic/
13078                                                             atomicrmw-with-return-value.
13079                                                           - s_waitcnt vscnt(0)
13080                                                             must happen after
13081                                                             any preceding
13082                                                             global/generic
13083                                                             store/store
13084                                                             atomic/
13085                                                             atomicrmw-no-return-value.
13086                                                           - s_waitcnt lgkmcnt(0)
13087                                                             must happen after
13088                                                             any preceding
13089                                                             local/generic
13090                                                             load/store/load
13091                                                             atomic/store
13092                                                             atomic/atomicrmw.
13093                                                           - Must happen before
13094                                                             the following
13095                                                             atomicrmw.
13096                                                           - Ensures that all
13097                                                             memory operations
13098                                                             have
13099                                                             completed before
13100                                                             performing the
13101                                                             atomicrmw that is
13102                                                             being released.
13103
13104                                                         2. buffer/global/flat_atomic
13105     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
13106
13107                                                           - If CU wavefront execution
13108                                                             mode, omit.
13109                                                           - If OpenCL, omit.
13110                                                           - Could be split into
13111                                                             separate s_waitcnt
13112                                                             vmcnt(0) and s_waitcnt
13113                                                             vscnt(0) to allow
13114                                                             them to be
13115                                                             independently moved
13116                                                             according to the
13117                                                             following rules.
13118                                                           - s_waitcnt vmcnt(0)
13119                                                             must happen after
13120                                                             any preceding
13121                                                             global/generic load/load
13122                                                             atomic/
13123                                                             atomicrmw-with-return-value.
13124                                                           - s_waitcnt vscnt(0)
13125                                                             must happen after
13126                                                             any preceding
13127                                                             global/generic
13128                                                             store/store atomic/
13129                                                             atomicrmw-no-return-value.
13130                                                           - Must happen before
13131                                                             the following
13132                                                             store.
13133                                                           - Ensures that all
13134                                                             global memory
13135                                                             operations have
13136                                                             completed before
13137                                                             performing the
13138                                                             store that is being
13139                                                             released.
13140
13141                                                         2. ds_atomic
13142     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13143                               - system       - generic      vmcnt(0) & vscnt(0)
13144
13145                                                           - If OpenCL, omit
13146                                                             lgkmcnt(0).
13147                                                           - Could be split into
13148                                                             separate s_waitcnt
13149                                                             vmcnt(0), s_waitcnt
13150                                                             vscnt(0) and s_waitcnt
13151                                                             lgkmcnt(0) to allow
13152                                                             them to be
13153                                                             independently moved
13154                                                             according to the
13155                                                             following rules.
13156                                                           - s_waitcnt vmcnt(0)
13157                                                             must happen after
13158                                                             any preceding
13159                                                             global/generic
13160                                                             load/load atomic/
13161                                                             atomicrmw-with-return-value.
13162                                                           - s_waitcnt vscnt(0)
13163                                                             must happen after
13164                                                             any preceding
13165                                                             global/generic
13166                                                             store/store atomic/
13167                                                             atomicrmw-no-return-value.
13168                                                           - s_waitcnt lgkmcnt(0)
13169                                                             must happen after
13170                                                             any preceding
13171                                                             local/generic
13172                                                             load/store/load
13173                                                             atomic/store
13174                                                             atomic/atomicrmw.
13175                                                           - Must happen before
13176                                                             the following
13177                                                             atomicrmw.
13178                                                           - Ensures that all
13179                                                             memory operations
13180                                                             to global and local
13181                                                             have completed
13182                                                             before performing
13183                                                             the atomicrmw that
13184                                                             is being released.
13185
13186                                                         2. buffer/global/flat_atomic
13187     fence        release      - singlethread *none*     *none*
13188                               - wavefront
13189     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
13190                                                            vmcnt(0) & vscnt(0)
13191
13192                                                           - If CU wavefront execution
13193                                                             mode, omit vmcnt(0) and
13194                                                             vscnt(0).
13195                                                           - If OpenCL and
13196                                                             address space is
13197                                                             not generic, omit
13198                                                             lgkmcnt(0).
13199                                                           - If OpenCL and
13200                                                             address space is
13201                                                             local, omit
13202                                                             vmcnt(0) and vscnt(0).
13203                                                           - See :ref:`amdgpu-fence-as` for
13204                                                             more details on fencing specific
13205                                                             address spaces.
13206                                                           - Could be split into
13207                                                             separate s_waitcnt
13208                                                             vmcnt(0), s_waitcnt
13209                                                             vscnt(0) and s_waitcnt
13210                                                             lgkmcnt(0) to allow
13211                                                             them to be
13212                                                             independently moved
13213                                                             according to the
13214                                                             following rules.
13215                                                           - s_waitcnt vmcnt(0)
13216                                                             must happen after
13217                                                             any preceding
13218                                                             global/generic
13219                                                             load/load
13220                                                             atomic/
13221                                                             atomicrmw-with-return-value.
13222                                                           - s_waitcnt vscnt(0)
13223                                                             must happen after
13224                                                             any preceding
13225                                                             global/generic
13226                                                             store/store atomic/
13227                                                             atomicrmw-no-return-value.
13228                                                           - s_waitcnt lgkmcnt(0)
13229                                                             must happen after
13230                                                             any preceding
13231                                                             local/generic
13232                                                             load/store/load
13233                                                             atomic/store atomic/
13234                                                             atomicrmw.
13235                                                           - Must happen before
13236                                                             any following store
13237                                                             atomic/atomicrmw
13238                                                             with an equal or
13239                                                             wider sync scope
13240                                                             and memory ordering
13241                                                             stronger than
13242                                                             unordered (this is
13243                                                             termed the
13244                                                             fence-paired-atomic).
13245                                                           - Ensures that all
13246                                                             memory operations
13247                                                             have
13248                                                             completed before
13249                                                             performing the
13250                                                             following
13251                                                             fence-paired-atomic.
13252
13253     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
13254                               - system                     vmcnt(0) & vscnt(0)
13255
13256                                                           - If OpenCL and
13257                                                             address space is
13258                                                             not generic, omit
13259                                                             lgkmcnt(0).
13260                                                           - If OpenCL and
13261                                                             address space is
13262                                                             local, omit
13263                                                             vmcnt(0) and vscnt(0).
13264                                                           - See :ref:`amdgpu-fence-as` for
13265                                                             more details on fencing specific
13266                                                             address spaces.
13267                                                           - Could be split into
13268                                                             separate s_waitcnt
13269                                                             vmcnt(0), s_waitcnt
13270                                                             vscnt(0) and s_waitcnt
13271                                                             lgkmcnt(0) to allow
13272                                                             them to be
13273                                                             independently moved
13274                                                             according to the
13275                                                             following rules.
13276                                                           - s_waitcnt vmcnt(0)
13277                                                             must happen after
13278                                                             any preceding
13279                                                             global/generic
13280                                                             load/load atomic/
13281                                                             atomicrmw-with-return-value.
13282                                                           - s_waitcnt vscnt(0)
13283                                                             must happen after
13284                                                             any preceding
13285                                                             global/generic
13286                                                             store/store atomic/
13287                                                             atomicrmw-no-return-value.
13288                                                           - s_waitcnt lgkmcnt(0)
13289                                                             must happen after
13290                                                             any preceding
13291                                                             local/generic
13292                                                             load/store/load
13293                                                             atomic/store
13294                                                             atomic/atomicrmw.
13295                                                           - Must happen before
13296                                                             any following store
13297                                                             atomic/atomicrmw
13298                                                             with an equal or
13299                                                             wider sync scope
13300                                                             and memory ordering
13301                                                             stronger than
13302                                                             unordered (this is
13303                                                             termed the
13304                                                             fence-paired-atomic).
13305                                                           - Ensures that all
13306                                                             memory operations
13307                                                             have
13308                                                             completed before
13309                                                             performing the
13310                                                             following
13311                                                             fence-paired-atomic.
13312
13313     **Acquire-Release Atomic**
13314     ------------------------------------------------------------------------------------
13315     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
13316                               - wavefront    - local
13317                                              - generic
13318     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
13319                                                            vmcnt(0) & vscnt(0)
13320
13321                                                           - If CU wavefront execution
13322                                                             mode, omit vmcnt(0) and
13323                                                             vscnt(0).
13324                                                           - If OpenCL, omit
13325                                                             lgkmcnt(0).
13326                                                           - Must happen after
13327                                                             any preceding
13328                                                             local/generic
13329                                                             load/store/load
13330                                                             atomic/store
13331                                                             atomic/atomicrmw.
13332                                                           - Could be split into
13333                                                             separate s_waitcnt
13334                                                             vmcnt(0), s_waitcnt
13335                                                             vscnt(0), and s_waitcnt
13336                                                             lgkmcnt(0) to allow
13337                                                             them to be
13338                                                             independently moved
13339                                                             according to the
13340                                                             following rules.
13341                                                           - s_waitcnt vmcnt(0)
13342                                                             must happen after
13343                                                             any preceding
13344                                                             global/generic load/load
13345                                                             atomic/
13346                                                             atomicrmw-with-return-value.
13347                                                           - s_waitcnt vscnt(0)
13348                                                             must happen after
13349                                                             any preceding
13350                                                             global/generic
13351                                                             store/store
13352                                                             atomic/
13353                                                             atomicrmw-no-return-value.
13354                                                           - s_waitcnt lgkmcnt(0)
13355                                                             must happen after
13356                                                             any preceding
13357                                                             local/generic
13358                                                             load/store/load
13359                                                             atomic/store
13360                                                             atomic/atomicrmw.
13361                                                           - Must happen before
13362                                                             the following
13363                                                             atomicrmw.
13364                                                           - Ensures that all
13365                                                             memory operations
13366                                                             have
13367                                                             completed before
13368                                                             performing the
13369                                                             atomicrmw that is
13370                                                             being released.
13371
13372                                                         2. buffer/global_atomic
13373                                                         3. s_waitcnt vm/vscnt(0)
13374
13375                                                           - If CU wavefront execution
13376                                                             mode, omit.
13377                                                           - Use vmcnt(0) if atomic with
13378                                                             return and vscnt(0) if
13379                                                             atomic with no-return.
13380                                                           - Must happen before
13381                                                             the following
13382                                                             buffer_gl0_inv.
13383                                                           - Ensures any
13384                                                             following global
13385                                                             data read is no
13386                                                             older than the
13387                                                             atomicrmw value
13388                                                             being acquired.
13389
13390                                                         4. buffer_gl0_inv
13391
13392                                                           - If CU wavefront execution
13393                                                             mode, omit.
13394                                                           - Ensures that
13395                                                             following
13396                                                             loads will not see
13397                                                             stale data.
13398
13399     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
13400
13401                                                           - If CU wavefront execution
13402                                                             mode, omit.
13403                                                           - If OpenCL, omit.
13404                                                           - Could be split into
13405                                                             separate s_waitcnt
13406                                                             vmcnt(0) and s_waitcnt
13407                                                             vscnt(0) to allow
13408                                                             them to be
13409                                                             independently moved
13410                                                             according to the
13411                                                             following rules.
13412                                                           - s_waitcnt vmcnt(0)
13413                                                             must happen after
13414                                                             any preceding
13415                                                             global/generic load/load
13416                                                             atomic/
13417                                                             atomicrmw-with-return-value.
13418                                                           - s_waitcnt vscnt(0)
13419                                                             must happen after
13420                                                             any preceding
13421                                                             global/generic
13422                                                             store/store atomic/
13423                                                             atomicrmw-no-return-value.
13424                                                           - Must happen before
13425                                                             the following
13426                                                             atomicrmw.
13427                                                           - Ensures that all
13428                                                             global memory
13429                                                             operations have
13430                                                             completed before
13431                                                             performing the
13432                                                             atomicrmw that is
13433                                                             being released.
13434
13435                                                         2. ds_atomic
13436                                                         3. s_waitcnt lgkmcnt(0)
13437
13438                                                           - If OpenCL, omit.
13439                                                           - Must happen before
13440                                                             the following
13441                                                             buffer_gl0_inv.
13442                                                           - Ensures any
13443                                                             following global
13444                                                             data read is no
13445                                                             older than the local
13446                                                             atomicrmw value being
13447                                                             acquired.
13448
13449                                                         4. buffer_gl0_inv
13450
13451                                                           - If CU wavefront execution
13452                                                             mode, omit.
13453                                                           - If OpenCL, omit.
13454                                                           - Ensures that
13455                                                             following
13456                                                             loads will not see
13457                                                             stale data.
13458
13459     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
13460                                                            vmcnt(0) & vscnt(0)
13461
13462                                                           - If CU wavefront execution
13463                                                             mode, omit vmcnt(0) and
13464                                                             vscnt(0).
13465                                                           - If OpenCL, omit lgkmcnt(0).
13466                                                           - Could be split into
13467                                                             separate s_waitcnt
13468                                                             vmcnt(0), s_waitcnt
13469                                                             vscnt(0) and s_waitcnt
13470                                                             lgkmcnt(0) to allow
13471                                                             them to be
13472                                                             independently moved
13473                                                             according to the
13474                                                             following rules.
13475                                                           - s_waitcnt vmcnt(0)
13476                                                             must happen after
13477                                                             any preceding
13478                                                             global/generic load/load
13479                                                             atomic/
13480                                                             atomicrmw-with-return-value.
13481                                                           - s_waitcnt vscnt(0)
13482                                                             must happen after
13483                                                             any preceding
13484                                                             global/generic
13485                                                             store/store
13486                                                             atomic/
13487                                                             atomicrmw-no-return-value.
13488                                                           - s_waitcnt lgkmcnt(0)
13489                                                             must happen after
13490                                                             any preceding
13491                                                             local/generic
13492                                                             load/store/load
13493                                                             atomic/store
13494                                                             atomic/atomicrmw.
13495                                                           - Must happen before
13496                                                             the following
13497                                                             atomicrmw.
13498                                                           - Ensures that all
13499                                                             memory operations
13500                                                             have
13501                                                             completed before
13502                                                             performing the
13503                                                             atomicrmw that is
13504                                                             being released.
13505
13506                                                         2. flat_atomic
13507                                                         3. s_waitcnt lgkmcnt(0) &
13508                                                            vmcnt(0) & vscnt(0)
13509
13510                                                           - If CU wavefront execution
13511                                                             mode, omit vmcnt(0) and
13512                                                             vscnt(0).
13513                                                           - If OpenCL, omit lgkmcnt(0).
13514                                                           - Must happen before
13515                                                             the following
13516                                                             buffer_gl0_inv.
13517                                                           - Ensures any
13518                                                             following global
13519                                                             data read is no
13520                                                             older than the
13521                                                             atomicrmw value being
13522                                                             acquired.
13523
13524                                                         4. buffer_gl0_inv
13525
13526                                                           - If CU wavefront execution
13527                                                             mode, omit.
13528                                                           - Ensures that
13529                                                             following
13530                                                             loads will not see
13531                                                             stale data.
13532
13533     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13534                               - system                     vmcnt(0) & vscnt(0)
13535
13536                                                           - If OpenCL, omit
13537                                                             lgkmcnt(0).
13538                                                           - Could be split into
13539                                                             separate s_waitcnt
13540                                                             vmcnt(0), s_waitcnt
13541                                                             vscnt(0) and s_waitcnt
13542                                                             lgkmcnt(0) to allow
13543                                                             them to be
13544                                                             independently moved
13545                                                             according to the
13546                                                             following rules.
13547                                                           - s_waitcnt vmcnt(0)
13548                                                             must happen after
13549                                                             any preceding
13550                                                             global/generic
13551                                                             load/load atomic/
13552                                                             atomicrmw-with-return-value.
13553                                                           - s_waitcnt vscnt(0)
13554                                                             must happen after
13555                                                             any preceding
13556                                                             global/generic
13557                                                             store/store atomic/
13558                                                             atomicrmw-no-return-value.
13559                                                           - s_waitcnt lgkmcnt(0)
13560                                                             must happen after
13561                                                             any preceding
13562                                                             local/generic
13563                                                             load/store/load
13564                                                             atomic/store
13565                                                             atomic/atomicrmw.
13566                                                           - Must happen before
13567                                                             the following
13568                                                             atomicrmw.
13569                                                           - Ensures that all
13570                                                             memory operations
13571                                                             to global have
13572                                                             completed before
13573                                                             performing the
13574                                                             atomicrmw that is
13575                                                             being released.
13576
13577                                                         2. buffer/global_atomic
13578                                                         3. s_waitcnt vm/vscnt(0)
13579
13580                                                           - Use vmcnt(0) if atomic with
13581                                                             return and vscnt(0) if
13582                                                             atomic with no-return.
13583                                                           - Must happen before
13584                                                             following
13585                                                             buffer_gl*_inv.
13586                                                           - Ensures the
13587                                                             atomicrmw has
13588                                                             completed before
13589                                                             invalidating the
13590                                                             caches.
13591
13592                                                         4. buffer_gl1_inv;
13593                                                            buffer_gl0_inv
13594
13595                                                           - Must happen before
13596                                                             any following
13597                                                             global/generic
13598                                                             load/load
13599                                                             atomic/atomicrmw.
13600                                                           - Ensures that
13601                                                             following loads
13602                                                             will not see stale
13603                                                             global data.
13604
13605     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
13606                               - system                     vmcnt(0) & vscnt(0)
13607
13608                                                           - If OpenCL, omit
13609                                                             lgkmcnt(0).
13610                                                           - Could be split into
13611                                                             separate s_waitcnt
13612                                                             vmcnt(0), s_waitcnt
13613                                                             vscnt(0), and s_waitcnt
13614                                                             lgkmcnt(0) to allow
13615                                                             them to be
13616                                                             independently moved
13617                                                             according to the
13618                                                             following rules.
13619                                                           - s_waitcnt vmcnt(0)
13620                                                             must happen after
13621                                                             any preceding
13622                                                             global/generic
13623                                                             load/load atomic/
13624                                                             atomicrmw-with-return-value.
13625                                                           - s_waitcnt vscnt(0)
13626                                                             must happen after
13627                                                             any preceding
13628                                                             global/generic
13629                                                             store/store atomic/
13630                                                             atomicrmw-no-return-value.
13631                                                           - s_waitcnt lgkmcnt(0)
13632                                                             must happen after
13633                                                             any preceding
13634                                                             local/generic
13635                                                             load/store/load
13636                                                             atomic/store
13637                                                             atomic/atomicrmw.
13638                                                           - Must happen before
13639                                                             the following
13640                                                             atomicrmw.
13641                                                           - Ensures that all
13642                                                             memory operations
13643                                                             have
13644                                                             completed before
13645                                                             performing the
13646                                                             atomicrmw that is
13647                                                             being released.
13648
13649                                                         2. flat_atomic
13650                                                         3. s_waitcnt vm/vscnt(0) &
13651                                                            lgkmcnt(0)
13652
13653                                                           - If OpenCL, omit
13654                                                             lgkmcnt(0).
13655                                                           - Use vmcnt(0) if atomic with
13656                                                             return and vscnt(0) if
13657                                                             atomic with no-return.
13658                                                           - Must happen before
13659                                                             following
13660                                                             buffer_gl*_inv.
13661                                                           - Ensures the
13662                                                             atomicrmw has
13663                                                             completed before
13664                                                             invalidating the
13665                                                             caches.
13666
13667                                                         4. buffer_gl1_inv;
13668                                                            buffer_gl0_inv
13669
13670                                                           - Must happen before
13671                                                             any following
13672                                                             global/generic
13673                                                             load/load
13674                                                             atomic/atomicrmw.
13675                                                           - Ensures that
13676                                                             following loads
13677                                                             will not see stale
13678                                                             global data.
13679
13680     fence        acq_rel      - singlethread *none*     *none*
13681                               - wavefront
13682     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
13683                                                            vmcnt(0) & vscnt(0)
13684
13685                                                           - If CU wavefront execution
13686                                                             mode, omit vmcnt(0) and
13687                                                             vscnt(0).
13688                                                           - If OpenCL and
13689                                                             address space is
13690                                                             not generic, omit
13691                                                             lgkmcnt(0).
13692                                                           - If OpenCL and
13693                                                             address space is
13694                                                             local, omit
13695                                                             vmcnt(0) and vscnt(0).
13696                                                           - However,
13697                                                             since LLVM
13698                                                             currently has no
13699                                                             address space on
13700                                                             the fence, all waits
13701                                                             must conservatively
13702                                                             always be generated
13703                                                             (see comment for
13704                                                             previous fence).
13705                                                           - Could be split into
13706                                                             separate s_waitcnt
13707                                                             vmcnt(0), s_waitcnt
13708                                                             vscnt(0) and s_waitcnt
13709                                                             lgkmcnt(0) to allow
13710                                                             them to be
13711                                                             independently moved
13712                                                             according to the
13713                                                             following rules.
13714                                                           - s_waitcnt vmcnt(0)
13715                                                             must happen after
13716                                                             any preceding
13717                                                             global/generic
13718                                                             load/load
13719                                                             atomic/
13720                                                             atomicrmw-with-return-value.
13721                                                           - s_waitcnt vscnt(0)
13722                                                             must happen after
13723                                                             any preceding
13724                                                             global/generic
13725                                                             store/store atomic/
13726                                                             atomicrmw-no-return-value.
13727                                                           - s_waitcnt lgkmcnt(0)
13728                                                             must happen after
13729                                                             any preceding
13730                                                             local/generic
13731                                                             load/store/load
13732                                                             atomic/store atomic/
13733                                                             atomicrmw.
13734                                                           - Must happen before
13735                                                             any following
13736                                                             global/generic
13737                                                             load/load
13738                                                             atomic/store/store
13739                                                             atomic/atomicrmw.
13740                                                           - Ensures that all
13741                                                             memory operations
13742                                                             have
13743                                                             completed before
13744                                                             performing any
13745                                                             following global
13746                                                             memory operations.
13747                                                           - Ensures that the
13748                                                             preceding
13749                                                             local/generic load
13750                                                             atomic/atomicrmw
13751                                                             with an equal or
13752                                                             wider sync scope
13753                                                             and memory ordering
13754                                                             stronger than
13755                                                             unordered (this is
13756                                                             termed the
13757                                                             acquire-fence-paired-atomic)
13758                                                             has completed
13759                                                             before following
13760                                                             global memory
13761                                                             operations. This
13762                                                             satisfies the
13763                                                             requirements of
13764                                                             acquire.
13765                                                           - Ensures that all
13766                                                             previous memory
13767                                                             operations have
13768                                                             completed before a
13769                                                             following
13770                                                             local/generic store
13771                                                             atomic/atomicrmw
13772                                                             with an equal or
13773                                                             wider sync scope
13774                                                             and memory ordering
13775                                                             stronger than
13776                                                             unordered (this is
13777                                                             termed the
13778                                                             release-fence-paired-atomic).
13779                                                             This satisfies the
13780                                                             requirements of
13781                                                             release.
13782                                                           - Must happen before
13783                                                             the following
13784                                                             buffer_gl0_inv.
13785                                                           - Ensures that the
13786                                                             acquire-fence-paired
13787                                                             atomic has completed
13788                                                             before invalidating
13789                                                             the
13790                                                             cache. Therefore
13791                                                             any following
13792                                                             locations read must
13793                                                             be no older than
13794                                                             the value read by
13795                                                             the
13796                                                             acquire-fence-paired-atomic.
13797
13798                                                         3. buffer_gl0_inv
13799
13800                                                           - If CU wavefront execution
13801                                                             mode, omit.
13802                                                           - Ensures that
13803                                                             following
13804                                                             loads will not see
13805                                                             stale data.
13806
13807     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
13808                               - system                     vmcnt(0) & vscnt(0)
13809
13810                                                           - If OpenCL and
13811                                                             address space is
13812                                                             not generic, omit
13813                                                             lgkmcnt(0).
13814                                                           - If OpenCL and
13815                                                             address space is
13816                                                             local, omit
13817                                                             vmcnt(0) and vscnt(0).
13818                                                           - See :ref:`amdgpu-fence-as` for
13819                                                             more details on fencing specific
13820                                                             address spaces.
13821                                                           - Could be split into
13822                                                             separate s_waitcnt
13823                                                             vmcnt(0), s_waitcnt
13824                                                             vscnt(0) and s_waitcnt
13825                                                             lgkmcnt(0) to allow
13826                                                             them to be
13827                                                             independently moved
13828                                                             according to the
13829                                                             following rules.
13830                                                           - s_waitcnt vmcnt(0)
13831                                                             must happen after
13832                                                             any preceding
13833                                                             global/generic
13834                                                             load/load
13835                                                             atomic/
13836                                                             atomicrmw-with-return-value.
13837                                                           - s_waitcnt vscnt(0)
13838                                                             must happen after
13839                                                             any preceding
13840                                                             global/generic
13841                                                             store/store atomic/
13842                                                             atomicrmw-no-return-value.
13843                                                           - s_waitcnt lgkmcnt(0)
13844                                                             must happen after
13845                                                             any preceding
13846                                                             local/generic
13847                                                             load/store/load
13848                                                             atomic/store
13849                                                             atomic/atomicrmw.
13850                                                           - Must happen before
13851                                                             the following
13852                                                             buffer_gl*_inv.
13853                                                           - Ensures that the
13854                                                             preceding
13855                                                             global/local/generic
13856                                                             load
13857                                                             atomic/atomicrmw
13858                                                             with an equal or
13859                                                             wider sync scope
13860                                                             and memory ordering
13861                                                             stronger than
13862                                                             unordered (this is
13863                                                             termed the
13864                                                             acquire-fence-paired-atomic)
13865                                                             has completed
13866                                                             before invalidating
13867                                                             the caches. This
13868                                                             satisfies the
13869                                                             requirements of
13870                                                             acquire.
13871                                                           - Ensures that all
13872                                                             previous memory
13873                                                             operations have
13874                                                             completed before a
13875                                                             following
13876                                                             global/local/generic
13877                                                             store
13878                                                             atomic/atomicrmw
13879                                                             with an equal or
13880                                                             wider sync scope
13881                                                             and memory ordering
13882                                                             stronger than
13883                                                             unordered (this is
13884                                                             termed the
13885                                                             release-fence-paired-atomic).
13886                                                             This satisfies the
13887                                                             requirements of
13888                                                             release.
13889
13890                                                         2. buffer_gl1_inv;
13891                                                            buffer_gl0_inv
13892
13893                                                           - Must happen before
13894                                                             any following
13895                                                             global/generic
13896                                                             load/load
13897                                                             atomic/store/store
13898                                                             atomic/atomicrmw.
13899                                                           - Ensures that
13900                                                             following loads
13901                                                             will not see stale
13902                                                             global data. This
13903                                                             satisfies the
13904                                                             requirements of
13905                                                             acquire.
13906
13907     **Sequential Consistent Atomic**
13908     ------------------------------------------------------------------------------------
13909     load atomic  seq_cst      - singlethread - global   *Same as corresponding
13910                               - wavefront    - local    load atomic acquire,
13911                                              - generic  except must generate
13912                                                         all instructions even
13913                                                         for OpenCL.*
13914     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
13915                                              - generic     vmcnt(0) & vscnt(0)
13916
13917                                                           - If CU wavefront execution
13918                                                             mode, omit vmcnt(0) and
13919                                                             vscnt(0).
13920                                                           - Could be split into
13921                                                             separate s_waitcnt
13922                                                             vmcnt(0), s_waitcnt
13923                                                             vscnt(0), and s_waitcnt
13924                                                             lgkmcnt(0) to allow
13925                                                             them to be
13926                                                             independently moved
13927                                                             according to the
13928                                                             following rules.
13929                                                           - s_waitcnt lgkmcnt(0) must
13930                                                             happen after
13931                                                             preceding
13932                                                             local/generic load
13933                                                             atomic/store
13934                                                             atomic/atomicrmw
13935                                                             with memory
13936                                                             ordering of seq_cst
13937                                                             and with equal or
13938                                                             wider sync scope.
13939                                                             (Note that seq_cst
13940                                                             fences have their
13941                                                             own s_waitcnt
13942                                                             lgkmcnt(0) and so do
13943                                                             not need to be
13944                                                             considered.)
13945                                                           - s_waitcnt vmcnt(0)
13946                                                             must happen after
13947                                                             preceding
13948                                                             global/generic load
13949                                                             atomic/
13950                                                             atomicrmw-with-return-value
13951                                                             with memory
13952                                                             ordering of seq_cst
13953                                                             and with equal or
13954                                                             wider sync scope.
13955                                                             (Note that seq_cst
13956                                                             fences have their
13957                                                             own s_waitcnt
13958                                                             vmcnt(0) and so do
13959                                                             not need to be
13960                                                             considered.)
13961                                                           - s_waitcnt vscnt(0)
13962                                                             must happen after
13963                                                             preceding
13964                                                             global/generic store
13965                                                             atomic/
13966                                                             atomicrmw-no-return-value
13967                                                             with memory
13968                                                             ordering of seq_cst
13969                                                             and with equal or
13970                                                             wider sync scope.
13971                                                             (Note that seq_cst
13972                                                             fences have their
13973                                                             own s_waitcnt
13974                                                             vscnt(0) and so do
13975                                                             not need to be
13976                                                             considered.)
13977                                                           - Ensures any
13978                                                             preceding
13979                                                             sequentially
13980                                                             consistent global/local
13981                                                             memory instructions
13982                                                             have completed
13983                                                             before executing
13984                                                             this sequentially
13985                                                             consistent
13986                                                             instruction. This
13987                                                             prevents reordering
13988                                                             a seq_cst store
13989                                                             followed by a
13990                                                             seq_cst load. (Note
13991                                                             that seq_cst is
13992                                                             stronger than
13993                                                             acquire/release as
13994                                                             the reordering of
13995                                                             load acquire
13996                                                             followed by a store
13997                                                             release is
13998                                                             prevented by the
13999                                                             s_waitcnt of
14000                                                             the release, but
14001                                                             there is nothing
14002                                                             preventing a store
14003                                                             release followed by
14004                                                             load acquire from
14005                                                             completing out of
14006                                                             order. The s_waitcnt
14007                                                             could be placed after
14008                                                             the seq_cst store or
14009                                                             before the seq_cst load. We
14010                                                             choose the load to
14011                                                             make the s_waitcnt be
14012                                                             as late as possible
14013                                                             so that the store
14014                                                             may have already
14015                                                             completed.)
14016
14017                                                         2. *Following
14018                                                            instructions same as
14019                                                            corresponding load
14020                                                            atomic acquire,
14021                                                            except must generate
14022                                                            all instructions even
14023                                                            for OpenCL.*
14024     load atomic  seq_cst      - workgroup    - local
14025
14026                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
14027
14028                                                           - If CU wavefront execution
14029                                                             mode, omit.
14030                                                           - Could be split into
14031                                                             separate s_waitcnt
14032                                                             vmcnt(0) and s_waitcnt
14033                                                             vscnt(0) to allow
14034                                                             them to be
14035                                                             independently moved
14036                                                             according to the
14037                                                             following rules.
14038                                                           - s_waitcnt vmcnt(0)
14039                                                             must happen after
14040                                                             preceding
14041                                                             global/generic load
14042                                                             atomic/
14043                                                             atomicrmw-with-return-value
14044                                                             with memory
14045                                                             ordering of seq_cst
14046                                                             and with equal or
14047                                                             wider sync scope.
14048                                                             (Note that seq_cst
14049                                                             fences have their
14050                                                             own s_waitcnt
14051                                                             vmcnt(0) and so do
14052                                                             not need to be
14053                                                             considered.)
14054                                                           - s_waitcnt vscnt(0)
14055                                                             must happen after
14056                                                             preceding
14057                                                             global/generic store
14058                                                             atomic/
14059                                                             atomicrmw-no-return-value
14060                                                             with memory
14061                                                             ordering of seq_cst
14062                                                             and with equal or
14063                                                             wider sync scope.
14064                                                             (Note that seq_cst
14065                                                             fences have their
14066                                                             own s_waitcnt
14067                                                             vscnt(0) and so do
14068                                                             not need to be
14069                                                             considered.)
14070                                                           - Ensures any
14071                                                             preceding
14072                                                             sequentially
14073                                                             consistent global
14074                                                             memory instructions
14075                                                             have completed
14076                                                             before executing
14077                                                             this sequentially
14078                                                             consistent
14079                                                             instruction. This
14080                                                             prevents reordering
14081                                                             a seq_cst store
14082                                                             followed by a
14083                                                             seq_cst load. (Note
14084                                                             that seq_cst is
14085                                                             stronger than
14086                                                             acquire/release as
14087                                                             the reordering of
14088                                                             load acquire
14089                                                             followed by a store
14090                                                             release is
14091                                                             prevented by the
14092                                                             s_waitcnt of
14093                                                             the release, but
14094                                                             there is nothing
14095                                                             preventing a store
14096                                                             release followed by
14097                                                             load acquire from
14098                                                             completing out of
14099                                                             order. The s_waitcnt
14100                                                             could be placed after
14101                                                             the seq_cst store or
14102                                                             before the seq_cst load. We
14103                                                             choose the load to
14104                                                             make the s_waitcnt be
14105                                                             as late as possible
14106                                                             so that the store
14107                                                             may have already
14108                                                             completed.)
14109
14110                                                         2. *Following
14111                                                            instructions same as
14112                                                            corresponding load
14113                                                            atomic acquire,
14114                                                            except must generate
14115                                                            all instructions even
14116                                                            for OpenCL.*
14117
14118     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
14119                               - system       - generic     vmcnt(0) & vscnt(0)
14120
14121                                                           - Could be split into
14122                                                             separate s_waitcnt
14123                                                             vmcnt(0), s_waitcnt
14124                                                             vscnt(0) and s_waitcnt
14125                                                             lgkmcnt(0) to allow
14126                                                             them to be
14127                                                             independently moved
14128                                                             according to the
14129                                                             following rules.
14130                                                           - s_waitcnt lgkmcnt(0)
14131                                                             must happen after
14132                                                             preceding
14133                                                             local load
14134                                                             atomic/store
14135                                                             atomic/atomicrmw
14136                                                             with memory
14137                                                             ordering of seq_cst
14138                                                             and with equal or
14139                                                             wider sync scope.
14140                                                             (Note that seq_cst
14141                                                             fences have their
14142                                                             own s_waitcnt
14143                                                             lgkmcnt(0) and so do
14144                                                             not need to be
14145                                                             considered.)
14146                                                           - s_waitcnt vmcnt(0)
14147                                                             must happen after
14148                                                             preceding
14149                                                             global/generic load
14150                                                             atomic/
14151                                                             atomicrmw-with-return-value
14152                                                             with memory
14153                                                             ordering of seq_cst
14154                                                             and with equal or
14155                                                             wider sync scope.
14156                                                             (Note that seq_cst
14157                                                             fences have their
14158                                                             own s_waitcnt
14159                                                             vmcnt(0) and so do
14160                                                             not need to be
14161                                                             considered.)
14162                                                           - s_waitcnt vscnt(0)
14163                                                             must happen after
14164                                                             preceding
14165                                                             global/generic store
14166                                                             atomic/
14167                                                             atomicrmw-no-return-value
14168                                                             with memory
14169                                                             ordering of seq_cst
14170                                                             and with equal or
14171                                                             wider sync scope.
14172                                                             (Note that seq_cst
14173                                                             fences have their
14174                                                             own s_waitcnt
14175                                                             vscnt(0) and so do
14176                                                             not need to be
14177                                                             considered.)
                                                           - Ensures any
                                                             preceding
                                                             sequentially
                                                             consistent global
                                                             memory instructions
                                                             have completed
                                                             before executing
                                                             this sequentially
                                                             consistent
                                                             instruction. This
                                                             prevents reordering
                                                             a seq_cst store
                                                             followed by a
                                                             seq_cst load. (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We choose the
                                                             load to make the
                                                             s_waitcnt be as late
                                                             as possible so that
                                                             the store may have
                                                             already completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================


.. _amdgpu-amdhsa-memory-model-gfx12:

Memory Model GFX12
++++++++++++++++++

For GFX12:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP.

  * In CU wavefront execution mode the wavefronts may be executed by different
    SIMDs in the same CU.
  * In WGP wavefront execution mode the wavefronts may be executed by different
    SIMDs in different CUs in the same WGP.

* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. An
  ``s_wait_dscnt 0x0`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Vector memory operations are divided into different types. Completion of a
  vector memory operation is reported to a wavefront in-order within a type,
  but may be out of order between types. The types of vector memory operations
  (and their associated ``s_wait`` instructions) are:

  * LDS: ``s_wait_dscnt``
  * Load (global, scratch, flat, buffer and image): ``s_wait_loadcnt``
  * Store (global, scratch, flat, buffer and image): ``s_wait_storecnt``
  * Sample and Gather4: ``s_wait_samplecnt``
  * BVH: ``s_wait_bvhcnt``

* Vector and scalar memory instructions contain a ``SCOPE`` field with values
  corresponding to each cache level. The ``SCOPE`` determines whether a cache
  can complete an operation locally or whether it needs to forward the operation
  to the next cache level. The ``SCOPE`` values are:

  * ``SCOPE_CU``: Compute Unit (NOTE: not affected by CU/WGP mode)
  * ``SCOPE_SE``: Shader Engine
  * ``SCOPE_DEV``: Device/Agent
  * ``SCOPE_SYS``: System

* When a memory operation with a given ``SCOPE`` reaches a cache with a smaller
  ``SCOPE`` value, it is forwarded to the next level of cache.
* When a memory operation with a given ``SCOPE`` reaches a cache with a
  ``SCOPE`` value greater than or equal to its own, the operation can proceed:

  * Reads can hit in this cache.
  * Writes can happen in this cache and the transaction is acknowledged
    from this level of cache.
  * RMW operations can be done locally.

* ``global_inv``, ``global_wb`` and ``global_wbinv`` instructions are used to
  invalidate, write-back and write-back+invalidate caches. The affected
  cache(s) are controlled by the ``SCOPE:`` of the instruction.
* ``global_inv`` invalidates caches whose scope is strictly smaller than the
  instruction's. The invalidation requests cannot be reordered with pending or
  upcoming memory operations.
* ``global_wb`` is a writeback operation that additionally ensures previous
  memory operations done at a lower scope level have reached the ``SCOPE:``
  of the ``global_wb``.

  * ``global_wb`` can be omitted for scopes other than ``SCOPE_SYS`` in
    gfx120x.

* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. To achieve coherence between wavefronts executing in the same
  work-group:

  * In CU wavefront execution mode, no special action is required.
  * In WGP wavefront execution mode, a ``global_inv scope:SCOPE_SE`` is required
    as wavefronts may be executing on SIMDs of different CUs that access
    different L0s.

* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 buffer shared by all WGPs on
  the same SA. The L1 buffer acts as a bridge to L2 for clients within an SA.
* The L1 buffers have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. Some or all of the wait instructions below are required to ensure
  synchronization between vector memory operations of different wavefronts.
  They ensure a previous vector memory operation has completed before executing
  a subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.

  * ``s_wait_loadcnt 0x0``
  * ``s_wait_samplecnt 0x0``
  * ``s_wait_bvhcnt 0x0``
  * ``s_wait_storecnt 0x0``

* The L1 buffers use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. Some or all of the wait instructions below
  are required to ensure synchronization between vector memory operations of
  different SAs. They ensure a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.

  * ``s_wait_loadcnt 0x0``
  * ``s_wait_samplecnt 0x0``
  * ``s_wait_bvhcnt 0x0``
  * ``s_wait_storecnt 0x0``

* The L2 cache can be kept coherent with other agents, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* A memory attached last level (MALL) cache exists for GPU memory.
  The MALL cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
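
The ``SCOPE`` forwarding and ``global_inv`` rules listed above can be sketched
as a small Python model. It only restates those rules and is not a description
of how the hardware or the backend is implemented. The ``Scope`` enum, its
numeric values (only their ordering matters), both function names and the
three-level example path are assumptions made for this sketch.

.. code-block:: python

  from enum import IntEnum


  class Scope(IntEnum):
      # Ordered from smallest to largest scope. The numeric values are
      # illustrative only and are not the hardware encoding.
      CU = 0
      SE = 1
      DEV = 2
      SYS = 3


  def servicing_level(op_scope, cache_scopes):
      """Return the index of the first cache that can complete the operation.

      A cache whose scope is smaller than the operation's scope forwards the
      operation to the next level. None means every listed cache forwarded it.
      """
      for level, cache_scope in enumerate(cache_scopes):
          if cache_scope >= op_scope:
              return level
      return None


  def invalidated_levels(inv_scope, cache_scopes):
      """Return the indices of the caches invalidated by a global_inv.

      Only caches whose scope is strictly smaller than the scope of the
      instruction are invalidated.
      """
      return [level for level, cache_scope in enumerate(cache_scopes)
              if cache_scope < inv_scope]


  # Hypothetical path of caches, nearest to the wavefront first.
  path = [Scope.CU, Scope.SE, Scope.DEV]
  print(servicing_level(Scope.SE, path))     # 1
  print(invalidated_levels(Scope.SE, path))  # [0]

Under these assumptions, a ``SCOPE_SE`` operation passes the first cache and
completes in the second, and a ``global_inv scope:SCOPE_SE`` invalidates only
the first cache.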

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

For kernarg backing memory:

* CP invalidates caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from L0.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also, accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0, which in turn ensures L1 accesses are
  ordered, and so explicit management of the caches is not required for
  work-group synchronization.

See the ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
:ref:`amdgpu-target-features`.
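
As a rough illustration of the difference between the two execution modes, the
following Python sketch lists what the text above asks for when synchronizing
global memory within a work-group. The function name, the dictionary keys and
the ``WAIT_CANDIDATES`` tuple are assumptions made for the sketch. It covers
only the cache management described in this section (LDS synchronization with
``s_wait_dscnt`` is deliberately left out), and the waits are candidates because
only some or all of them are required for a given code sequence.

.. code-block:: python

  # Wait instructions named in this section for ordering vector memory
  # operations of different wavefronts.
  WAIT_CANDIDATES = (
      "s_wait_loadcnt 0x0",
      "s_wait_samplecnt 0x0",
      "s_wait_bvhcnt 0x0",
      "s_wait_storecnt 0x0",
  )


  def workgroup_sync_requirements(wgp_mode):
      """Return the global memory actions for work-group synchronization.

      wgp_mode is True for WGP wavefront execution mode and False for CU
      wavefront execution mode.
      """
      if wgp_mode:
          # Wavefronts may run on different CUs that use different L0 caches,
          # so the L0 must be invalidated and accesses explicitly ordered.
          return {
              "invalidate": ["global_inv scope:SCOPE_SE"],
              "wait_candidates": list(WAIT_CANDIDATES),
          }
      # All wavefronts of the work-group share a single L0 cache.
      return {"invalidate": [], "wait_candidates": []}


  print(workgroup_sync_requirements(False))
  print(workgroup_sync_requirements(True)["invalidate"])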

The code sequences used to implement the memory model for GFX12 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.

The mapping of LLVM IR syncscope to GFX12 instruction ``scope`` operands is
defined in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.

The scopes table applies if and only if it is directly referenced by an entry in
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`, and it only applies to
the instruction in the code sequence that references the table.

  .. table:: AMDHSA Memory Model Code Sequences GFX12 - Instruction Scopes
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table

     =================== =================== ===================
     LLVM syncscope      CU wavefront        WGP wavefront
                         execution           execution
                         mode                mode
     =================== =================== ===================
     *none*              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     system              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     agent               ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
     workgroup           *none*              ``scope:SCOPE_SE``
     wavefront           *none*              *none*
     singlethread        *none*              *none*
     one-as              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     system-one-as       ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     agent-one-as        ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
     workgroup-one-as    *none*              ``scope:SCOPE_SE``
     wavefront-one-as    *none*              *none*
     singlethread-one-as *none*              *none*
     =================== =================== ===================
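
The scopes table can also be read as a lookup function. The Python sketch below
is only an illustration of the table and not the backend's implementation. The
function name ``scope_operand``, the use of an empty string for the *none*
syncscope and the use of ``None`` for "no ``scope`` operand" are assumptions
made for the sketch.

.. code-block:: python

  ONE_AS_SUFFIX = "-one-as"


  def scope_operand(syncscope, wgp_mode):
      """Return the scope operand for an LLVM syncscope, or None if omitted.

      wgp_mode is True for WGP wavefront execution mode and False for CU
      wavefront execution mode. An empty string stands for the *none* row.
      """
      name = syncscope
      # The one-as variants map to the same operand as their base scope.
      if name == "one-as":
          name = ""
      elif name.endswith(ONE_AS_SUFFIX):
          name = name[:-len(ONE_AS_SUFFIX)]

      if name in ("", "system"):
          return "scope:SCOPE_SYS"
      if name == "agent":
          return "scope:SCOPE_DEV"
      if name == "workgroup":
          # Only WGP mode needs an explicit scope at work-group level.
          return "scope:SCOPE_SE" if wgp_mode else None
      if name in ("wavefront", "singlethread"):
          return None
      raise ValueError("unknown syncscope: " + repr(syncscope))


  print(scope_operand("workgroup", wgp_mode=True))      # scope:SCOPE_SE
  print(scope_operand("workgroup", wgp_mode=False))     # None
  print(scope_operand("agent-one-as", wgp_mode=False))  # scope:SCOPE_DEV

As stated above, such a lookup is only meaningful for an instruction whose entry
in the code sequences table explicitly refers to the scopes table.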

  .. table:: AMDHSA Memory Model Code Sequences GFX12
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX12
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              ``th:TH_LOAD_NT``

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              ``scope:SCOPE_SYS``

                                                           2. ``s_wait_loadcnt 0x0``

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              ``th:TH_STORE_NT``

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                              ``scope:SCOPE_SYS``

                                                           2. ``s_wait_storecnt 0x0``

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
                               - workgroup                 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
                               - agent
                               - system
     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup                 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup                 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load ``scope:SCOPE_SE``

                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.

                                                         2. ``s_wait_loadcnt 0x0``

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Must happen before
                                                             the following ``global_inv``
                                                             and before any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.

                                                         3. ``global_inv scope:SCOPE_SE``

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. ``s_wait_dscnt 0x0``

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following ``global_inv``
                                                             and before any following
                                                             global/generic load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

                                                         3. ``global_inv scope:SCOPE_SE``

                                                           - If OpenCL or CU wavefront
                                                             execution mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - generic  1. flat_load

                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.

                                                         2. | ``s_wait_loadcnt 0x0``
                                                            | ``s_wait_dscnt 0x0``
                                                            | **CU wavefront execution mode:**
                                                            | ``s_wait_dscnt 0x0``

                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
                                                           - Must happen before
                                                             the following
                                                             ``global_inv`` and any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         3. ``global_inv scope:SCOPE_SE``

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.
14635
14636     load atomic  acquire      - agent        - global   1. buffer/global_load
14637                               - system
14638                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14639
14640                                                         2.  ``s_wait_loadcnt 0x0``
14641
14642                                                           - Must happen before
14643                                                             following
14644                                                             ``global_inv``.
14645                                                           - Ensures the load
14646                                                             has completed
14647                                                             before invalidating
14648                                                             the caches.
14649
14650                                                         3. ``global_inv``
14651
14652                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14653                                                           - Must happen before
14654                                                             any following
14655                                                             global/generic
14656                                                             load/load
14657                                                             atomic/atomicrmw.
14658                                                           - Ensures that
14659                                                             following
14660                                                             loads will not see
14661                                                             stale global data.
14662
14663     load atomic  acquire      - agent        - generic  1. flat_load
14664                               - system
14665                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14666
14667                                                         2. | ``s_wait_loadcnt 0x0``
14668                                                            | ``s_wait_dscnt 0x0``
14669
14670                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
14671                                                           - Must happen before
14672                                                             following
14673                                                             ``global_inv``.
14674                                                           - Ensures the flat_load
14675                                                             has completed
14676                                                             before invalidating
14677                                                             the caches.
14678
14679                                                         3. ``global_inv``
14680
14681                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14682                                                           - Must happen before
14683                                                             any following
14684                                                             global/generic
14685                                                             load/load
14686                                                             atomic/atomicrmw.
14687                                                           - Ensures that
14688                                                             following loads
14689                                                             will not see stale
14690                                                             global data.
14691
14692     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
14693                               - wavefront    - local
14694                                              - generic
14695     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
14696
14697                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14698                                                           - If atomic with return,
14699                                                             use ``th:TH_ATOMIC_RETURN``
14700
14701                                                         2. | **Atomic with return:**
14702                                                            | ``s_wait_loadcnt 0x0``
14703                                                            | **Atomic without return:**
14704                                                            | ``s_wait_storecnt 0x0``
14705
14706                                                           - If CU wavefront execution
14707                                                             mode, omit.
14708                                                           - Must happen before
14709                                                             the following ``global_inv``
14710                                                             and before any following
14711                                                             global/generic
14712                                                             load/load
14713                                                             atomic/store/store
14714                                                             atomic/atomicrmw.
14715
14716                                                         3. ``global_inv scope:SCOPE_SE``
14717
14718                                                           - If CU wavefront execution
14719                                                             mode, omit.
14720                                                           - Ensures that
14721                                                             following
14722                                                             loads will not see
14723                                                             stale data.
14724
14725     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
14726                                                         2. ``s_wait_dscnt 0x0``
14727
14728                                                           - If OpenCL, omit.
14729                                                           - Must happen before
14730                                                             the following
14731                                                             ``global_inv``.
14732                                                           - Ensures any
14733                                                             following global
14734                                                             data read is no
14735                                                             older than the local
14736                                                             atomicrmw value
14737                                                             being acquired.
14738
14739                                                         3. ``global_inv scope:SCOPE_SE``
14740
14741                                                           - If OpenCL, omit.
14742                                                           - If CU wavefront execution
14743                                                             mode, omit.
14744                                                           - Ensures that
14745                                                             following
14746                                                             loads will not see
14747                                                             stale data.
14748
14749     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
14750
14751                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14752                                                           - If atomic with return,
14753                                                             use ``th:TH_ATOMIC_RETURN``
14754
14755                                                         2. | **Atomic with return:**
14756                                                            | ``s_wait_loadcnt 0x0``
14757                                                            | ``s_wait_dscnt 0x0``
14758                                                            | **Atomic without return:**
14759                                                            | ``s_wait_storecnt 0x0``
14760                                                            | ``s_wait_dscnt 0x0``
14761
14762                                                           - If CU wavefront execution mode,
14763                                                             omit all for atomics without
14764                                                             return, and only emit
14765                                                             ``s_wait_dscnt 0x0`` for atomics
14766                                                             with return.
14767                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
14768                                                           - Must happen before
14769                                                             the following
14770                                                             ``global_inv``.
14771                                                           - Ensures any
14772                                                             following global
14773                                                             data read is no
14774                                                             older than a local
14775                                                             atomicrmw value
14776                                                             being acquired.
14777
14778                                                         3. ``global_inv scope:SCOPE_SE``
14779
14780                                                           - If CU wavefront execution
14781                                                             mode, omit.
14782                                                           - Ensures that
14783                                                             following
14784                                                             loads will not see
14785                                                             stale data.
14786
14787     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
14788                               - system
14789                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14790                                                           - If atomic with return,
14791                                                             use ``th:TH_ATOMIC_RETURN``
14792
14793                                                         2. | **Atomic with return:**
14794                                                            | ``s_wait_loadcnt 0x0``
14795                                                            | **Atomic without return:**
14796                                                            | ``s_wait_storecnt 0x0``
14797
14798                                                           - Must happen before
14799                                                             following ``global_inv``.
14800                                                           - Ensures the
14801                                                             atomicrmw has
14802                                                             completed before
14803                                                             invalidating the
14804                                                             caches.
14805
14806                                                         3. ``global_inv``
14807
14808                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14809                                                           - Must happen before
14810                                                             any following
14811                                                             global/generic
14812                                                             load/load
14813                                                             atomic/atomicrmw.
14814                                                           - Ensures that
14815                                                             following loads
14816                                                             will not see stale
14817                                                             global data.
14818
14819     atomicrmw    acquire      - agent        - generic  1. flat_atomic
14820                               - system
14821                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14822                                                           - If atomic with return,
14823                                                             use ``th:TH_ATOMIC_RETURN``
14824
14825                                                         2. | **Atomic with return:**
14826                                                            | ``s_wait_loadcnt 0x0``
14827                                                            | ``s_wait_dscnt 0x0``
14828                                                            | **Atomic without return:**
14829                                                            | ``s_wait_storecnt 0x0``
14830                                                            | ``s_wait_dscnt 0x0``
14831
14832                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
14833                                                           - Must happen before
14834                                                             following
14835                                                             ``global_inv``.
14836                                                           - Ensures the
14837                                                             atomicrmw has
14838                                                             completed before
14839                                                             invalidating the
14840                                                             caches.
14841
14842                                                         3. ``global_inv``
14843
14844                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
14845                                                           - Must happen before
14846                                                             any following
14847                                                             global/generic
14848                                                             load/load
14849                                                             atomic/atomicrmw.
14850                                                           - Ensures that
14851                                                             following loads
14852                                                             will not see stale
14853                                                             global data.
14854
14855     fence        acquire      - singlethread *none*     *none*
14856                               - wavefront
14857     fence        acquire      - workgroup    *none*     1. | ``s_wait_storecnt 0x0``
14858                                                            | ``s_wait_loadcnt 0x0``
14859                                                            | ``s_wait_dscnt 0x0``
14860                                                            | **CU wavefront execution mode:**
14861                                                            | ``s_wait_dscnt 0x0``
14862
14863                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
14864                                                           - If OpenCL and address space is local,
14865                                                             omit all.
14866                                                           - See :ref:`amdgpu-fence-as` for
14867                                                             more details on fencing specific
14868                                                             address spaces.
14869                                                           - Note: we don't have to use
14870                                                             ``s_wait_samplecnt 0x0`` or
14871                                                             ``s_wait_bvhcnt 0x0`` because
14872                                                             there are no atomic sample or
14873                                                             BVH instructions that the fence
14874                                                             could pair with.
14875                                                           - The waits can be
14876                                                             independently moved
14877                                                             according to the
14878                                                             following rules:
14879                                                           - ``s_wait_loadcnt 0x0``
14880                                                             must happen after
14881                                                             any preceding
14882                                                             global/generic load
14883                                                             atomic/
14884                                                             atomicrmw-with-return-value
14885                                                             with an equal or
14886                                                             wider sync scope
14887                                                             and memory ordering
14888                                                             stronger than
14889                                                             unordered (this is
14890                                                             termed the
14891                                                             fence-paired-atomic).
14892                                                           - ``s_wait_storecnt 0x0``
14893                                                             must happen after
14894                                                             any preceding
14895                                                             global/generic
14896                                                             atomicrmw-no-return-value
14897                                                             with an equal or
14898                                                             wider sync scope
14899                                                             and memory ordering
14900                                                             stronger than
14901                                                             unordered (this is
14902                                                             termed the
14903                                                             fence-paired-atomic).
14904                                                           - ``s_wait_dscnt 0x0``
14905                                                             must happen after
14906                                                             any preceding
14907                                                             local/generic load
14908                                                             atomic/atomicrmw
14909                                                             with an equal or
14910                                                             wider sync scope
14911                                                             and memory ordering
14912                                                             stronger than
14913                                                             unordered (this is
14914                                                             termed the
14915                                                             fence-paired-atomic).
14916                                                           - Must happen before
14917                                                             the following
14918                                                             ``global_inv``.
14919                                                           - Ensures that the
14920                                                             fence-paired-atomic
14921                                                             has completed
14922                                                             before invalidating
14923                                                             the
14924                                                             cache. Therefore
14925                                                             any following
14926                                                             locations read must
14927                                                             be no older than
14928                                                             the value read by
14929                                                             the
14930                                                             fence-paired-atomic.
14931
14932                                                         2. ``global_inv scope:SCOPE_SE``
14933
14934                                                           - If CU wavefront execution
14935                                                             mode, omit.
14936                                                           - Ensures that
14937                                                             following
14938                                                             loads will not see
14939                                                             stale data.
14940
14941     fence        acquire      - agent        *none*     1. | ``s_wait_storecnt 0x0``
14942                                                            | ``s_wait_loadcnt 0x0``
14943                                                            | ``s_wait_dscnt 0x0``
14944
14945                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
14946                                                           - If OpenCL and address space is
14947                                                             local, omit all.
14948                                                           - See :ref:`amdgpu-fence-as` for
14949                                                             more details on fencing specific
14950                                                             address spaces.
14951                                                           - Note: we don't have to use
14952                                                             ``s_wait_samplecnt 0x0`` or
14953                                                             ``s_wait_bvhcnt 0x0`` because
14954                                                             there are no atomic sample or
14955                                                             BVH instructions that the fence
14956                                                             could pair with.
14957                                                           - The waits can be
14958                                                             independently moved
14959                                                             according to the
14960                                                             following rules:
14961                                                           - ``s_wait_loadcnt 0x0``
14962                                                             must happen after
14963                                                             any preceding
14964                                                             global/generic load
14965                                                             atomic/
14966                                                             atomicrmw-with-return-value
14967                                                             with an equal or
14968                                                             wider sync scope
14969                                                             and memory ordering
14970                                                             stronger than
14971                                                             unordered (this is
14972                                                             termed the
14973                                                             fence-paired-atomic).
14974                                                           - ``s_wait_storecnt 0x0``
14975                                                             must happen after
14976                                                             any preceding
14977                                                             global/generic
14978                                                             atomicrmw-no-return-value
14979                                                             with an equal or
14980                                                             wider sync scope
14981                                                             and memory ordering
14982                                                             stronger than
14983                                                             unordered (this is
14984                                                             termed the
14985                                                             fence-paired-atomic).
14986                                                           - ``s_wait_dscnt 0x0``
14987                                                             must happen after
14988                                                             any preceding
14989                                                             local/generic load
14990                                                             atomic/atomicrmw
14991                                                             with an equal or
14992                                                             wider sync scope
14993                                                             and memory ordering
14994                                                             stronger than
14995                                                             unordered (this is
14996                                                             termed the
14997                                                             fence-paired-atomic).
14998                                                           - Must happen before
14999                                                             the following
15000                                                             ``global_inv``.
15001                                                           - Ensures that the
15002                                                             fence-paired-atomic
15003                                                             has completed
15004                                                             before invalidating the
15005                                                             caches. Therefore
15006                                                             any following
15007                                                             locations read must
15008                                                             be no older than
15009                                                             the value read by
15010                                                             the
15011                                                             fence-paired-atomic.
15012
15013                                                         2. ``global_inv``
15014
15015                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15016                                                           - Ensures that
15017                                                             following
15018                                                             loads will not see
15019                                                             stale data.
15020
15021     **Release Atomic**
15022     ------------------------------------------------------------------------------------
15023     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
15024                               - wavefront    - local
15025                                              - generic
15026     store atomic release      - workgroup    - global   1. | ``s_wait_bvhcnt 0x0``
15027                                                            | ``s_wait_samplecnt 0x0``
15028                                                            | ``s_wait_storecnt 0x0``
15029                                                            | ``s_wait_loadcnt 0x0``
15030                                                            | ``s_wait_dscnt 0x0``
15031                                                            | **CU wavefront execution mode:**
15032                                                            | ``s_wait_dscnt 0x0``
15033
15034                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15035                                                           - The waits can be
15036                                                             independently moved
15037                                                             according to the
15038                                                             following rules:
15039                                                           - ``s_wait_loadcnt 0x0``,
15040                                                             ``s_wait_samplecnt 0x0`` and
15041                                                             ``s_wait_bvhcnt 0x0``
15042                                                             must happen after
15043                                                             any preceding
15044                                                             global/generic load/load
15045                                                             atomic/
15046                                                             atomicrmw-with-return-value.
15047                                                           - ``s_wait_storecnt 0x0``
15048                                                             must happen after
15049                                                             any preceding
15050                                                             global/generic
15051                                                             store/store
15052                                                             atomic/
15053                                                             atomicrmw-no-return-value.
15054                                                           - ``s_wait_dscnt 0x0``
15055                                                             must happen after
15056                                                             any preceding
15057                                                             local/generic
15058                                                             load/store/load
15059                                                             atomic/store
15060                                                             atomic/atomicrmw.
15061                                                           - Ensures that all
15062                                                             memory operations
15063                                                             have
15064                                                             completed before
15065                                                             performing the
15066                                                             store that is being
15067                                                             released.
15068
15069                                                         3. buffer/global/flat_store
15070
15071                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15072
15073     store atomic release      - workgroup    - local    1. | ``s_wait_bvhcnt 0x0``
15074                                                            | ``s_wait_samplecnt 0x0``
15075                                                            | ``s_wait_storecnt 0x0``
15076                                                            | ``s_wait_loadcnt 0x0``
15077                                                            | ``s_wait_dscnt 0x0``
15078                                                            | **CU wavefront execution mode:**
15079                                                            | ``s_wait_dscnt 0x0``
15080
15081                                                           - If OpenCL, omit all.
15082                                                           - The waits can be
15083                                                             independently moved
15084                                                             according to the
15085                                                             following rules:
15086                                                           - ``s_wait_loadcnt 0x0``,
15087                                                             ``s_wait_samplecnt 0x0`` and
15088                                                             ``s_wait_bvhcnt 0x0``
15089                                                             must happen after
15090                                                             any preceding
15091                                                             global/generic load/load
15092                                                             atomic/
15093                                                             atomicrmw-with-return-value.
15094                                                           - ``s_wait_storecnt 0x0``
15095                                                             must happen after
15096                                                             any preceding
15097                                                             global/generic
15098                                                             store/store
15099                                                             atomic/
15100                                                             atomicrmw-no-return-value.
15101                                                           - Must happen before the
15102                                                             following store.
15103                                                           - Ensures that all
15104                                                             global memory
15105                                                             operations have
15106                                                             completed before
15107                                                             performing the
15108                                                             store that is being
15109                                                             released.
15110
15111                                                         2. ds_store
15112     store atomic release      - agent        - global   1. ``global_wb scope:SCOPE_SYS``
15113                               - system       - generic
15114                                                            - If agent scope, omit.
15115
15116                                                         2. | ``s_wait_bvhcnt 0x0``
15117                                                            | ``s_wait_samplecnt 0x0``
15118                                                            | ``s_wait_storecnt 0x0``
15119                                                            | ``s_wait_loadcnt 0x0``
15120                                                            | ``s_wait_dscnt 0x0``
15121
15122                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15123                                                           - The waits can be
15124                                                             independently moved
15125                                                             according to the
15126                                                             following rules:
15127                                                           - ``s_wait_loadcnt 0x0``,
15128                                                             ``s_wait_samplecnt 0x0`` and
15129                                                             ``s_wait_bvhcnt 0x0``
15130                                                             must happen after
15131                                                             any preceding
15132                                                             global/generic
15133                                                             load/load
15134                                                             atomic/
15135                                                             atomicrmw-with-return-value.
15136                                                           - ``s_wait_storecnt 0x0``
15137                                                             must happen after
15138                                                             ``global_wb`` if present, or
15139                                                             any preceding
15140                                                             global/generic
15141                                                             store/store
15142                                                             atomic/
15143                                                             atomicrmw-no-return-value.
15144                                                           - ``s_wait_dscnt 0x0``
15145                                                             must happen after
15146                                                             any preceding
15147                                                             local/generic
15148                                                             load/store/load
15149                                                             atomic/store
15150                                                             atomic/atomicrmw.
15151                                                           - Must happen before the
15152                                                             following store.
15153                                                           - Ensures that all
15154                                                             memory operations
15155                                                             have
15156                                                             completed before
15157                                                             performing the
15158                                                             store that is being
15159                                                             released.
15160
15161                                                         3. buffer/global/flat_store
15162
15163                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15164
15165     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
15166                               - wavefront    - local
15167                                              - generic
15168     atomicrmw    release      - workgroup    - global   1. | ``s_wait_bvhcnt 0x0``
15169                                              - generic     | ``s_wait_samplecnt 0x0``
15170                                                            | ``s_wait_storecnt 0x0``
15171                                                            | ``s_wait_loadcnt 0x0``
15172                                                            | ``s_wait_dscnt 0x0``
15173                                                            | **CU wavefront execution mode:**
15174                                                            | ``s_wait_dscnt 0x0``
15175
15176                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15177                                                           - If OpenCL and CU wavefront
15178                                                             execution mode, omit all.
15179                                                           - The waits can be
15180                                                             independently moved
15181                                                             according to the
15182                                                             following rules:
15183                                                           - ``s_wait_loadcnt 0x0``,
15184                                                             ``s_wait_samplecnt 0x0`` and
15185                                                             ``s_wait_bvhcnt 0x0``
15186                                                             must happen after
15187                                                             any preceding
15188                                                             global/generic load/load
15189                                                             atomic/
15190                                                             atomicrmw-with-return-value.
15191                                                           - ``s_wait_storecnt 0x0``
15192                                                             must happen after
15193                                                             any preceding
15194                                                             global/generic
15195                                                             store/store
15196                                                             atomic/
15197                                                             atomicrmw-no-return-value.
15198                                                           - ``s_wait_dscnt 0x0``
15199                                                             must happen after
15200                                                             any preceding
15201                                                             local/generic
15202                                                             load/store/load
15203                                                             atomic/store
15204                                                             atomic/atomicrmw.
15205                                                           - Must happen before the
15206                                                             following atomic.
15207                                                           - Ensures that all
15208                                                             memory operations
15209                                                             have
15210                                                             completed before
15211                                                             performing the
15212                                                             atomicrmw that is
15213                                                             being released.
15214
15215                                                         2. buffer/global/flat_atomic
15216
15217                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15218
15219     atomicrmw    release      - workgroup    - local    1. | ``s_wait_bvhcnt 0x0``
15220                                                            | ``s_wait_samplecnt 0x0``
15221                                                            | ``s_wait_storecnt 0x0``
15222                                                            | ``s_wait_loadcnt 0x0``
15223                                                            | ``s_wait_dscnt 0x0``
15224                                                            | **CU wavefront execution mode:**
15225                                                            | ``s_wait_dscnt 0x0``
15226
15227                                                           - If OpenCL, omit all.
15228                                                           - The waits can be
15229                                                             independently moved
15230                                                             according to the
15231                                                             following rules:
15232                                                           - ``s_wait_loadcnt 0x0``,
15233                                                             ``s_wait_samplecnt 0x0`` and
15234                                                             ``s_wait_bvhcnt 0x0``
15235                                                             must happen after
15236                                                             any preceding
15237                                                             global/generic load/load
15238                                                             atomic/
15239                                                             atomicrmw-with-return-value.
15240                                                           - ``s_wait_storecnt 0x0``
15241                                                             must happen after
15242                                                             any preceding
15243                                                             global/generic
15244                                                             store/store
15245                                                             atomic/
15246                                                             atomicrmw-no-return-value.
15247                                                           - Must happen before the
15248                                                             following atomic.
15249                                                           - Ensures that all
15250                                                             global memory
15251                                                             operations have
15252                                                             completed before
15253                                                             performing the
15254                                                             atomicrmw that is
15255                                                             being released.
15256
15257                                                         2. ds_atomic
15258     atomicrmw    release      - agent        - global   1. ``global_wb scope:SCOPE_SYS``
15259                               - system       - generic
15260                                                           - If agent scope, omit.
15261
15262                                                         2. | ``s_wait_bvhcnt 0x0``
15263                                                            | ``s_wait_samplecnt 0x0``
15264                                                            | ``s_wait_storecnt 0x0``
15265                                                            | ``s_wait_loadcnt 0x0``
15266                                                            | ``s_wait_dscnt 0x0``
15267
15268                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15269                                                           - The waits can be
15270                                                             independently moved
15271                                                             according to the
15272                                                             following rules:
15273                                                           - ``s_wait_loadcnt 0x0``,
15274                                                             ``s_wait_samplecnt 0x0`` and
15275                                                             ``s_wait_bvhcnt 0x0``
15276                                                             must happen after
15277                                                             any preceding
15278                                                             global/generic
15279                                                             load/load atomic/
15280                                                             atomicrmw-with-return-value.
15281                                                           - ``s_wait_storecnt 0x0``
15282                                                             must happen after
15283                                                             ``global_wb`` if present, or
15284                                                             any preceding
15285                                                             global/generic
15286                                                             store/store
15287                                                             atomic/
15288                                                             atomicrmw-no-return-value.
15289                                                           - ``s_wait_dscnt 0x0``
15290                                                             must happen after
15291                                                             any preceding
15292                                                             local/generic
15293                                                             load/store/load
15294                                                             atomic/store
15295                                                             atomic/atomicrmw.
15296                                                           - Must happen before the
15297                                                             following atomic.
15298                                                           - Ensures that all
15299                                                             memory operations
15300                                                             to global and local
15301                                                             have completed
15302                                                             before performing
15303                                                             the atomicrmw that
15304                                                             is being released.
15305
15306                                                         3. buffer/global/flat_atomic
15307
15308                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15309
15310     fence        release      - singlethread *none*     *none*
15311                               - wavefront
15312     fence        release      - workgroup    *none*     1. | ``s_wait_bvhcnt 0x0``
15313                                                            | ``s_wait_samplecnt 0x0``
15314                                                            | ``s_wait_storecnt 0x0``
15315                                                            | ``s_wait_loadcnt 0x0``
15316                                                            | ``s_wait_dscnt 0x0``
15317                                                            | **CU wavefront execution mode:**
15318                                                            | ``s_wait_dscnt 0x0``
15319
15320                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15321                                                           - If OpenCL and
15322                                                             address space is
15323                                                             local, omit all.
15324                                                           - See :ref:`amdgpu-fence-as` for
15325                                                             more details on fencing specific
15326                                                             address spaces.
15327                                                           - The waits can be
15328                                                             independently moved
15329                                                             according to the
15330                                                             following rules:
15331                                                           - ``s_wait_loadcnt 0x0``,
15332                                                             ``s_wait_samplecnt 0x0`` and
15333                                                             ``s_wait_bvhcnt 0x0``
15334                                                             must happen after
15335                                                             any preceding
15336                                                             global/generic
15337                                                             load/load
15338                                                             atomic/
15339                                                             atomicrmw-with-return-value.
15340                                                           - ``s_wait_storecnt 0x0``
15341                                                             must happen after
15342                                                             any preceding
15343                                                             global/generic
15344                                                             store/store
15345                                                             atomic/
15346                                                             atomicrmw-no-return-value.
15347                                                           - ``s_wait_dscnt 0x0``
15348                                                             must happen after
15349                                                             any preceding
15350                                                             local/generic
15351                                                             load/store/load
15352                                                             atomic/store atomic/
15353                                                             atomicrmw.
15354                                                           - Must happen before
15355                                                             any following store
15356                                                             atomic/atomicrmw
15357                                                             with an equal or
15358                                                             wider sync scope
15359                                                             and memory ordering
15360                                                             stronger than
15361                                                             unordered (this is
15362                                                             termed the
15363                                                             fence-paired-atomic).
15364                                                           - Ensures that all
15365                                                             memory operations
15366                                                             have
15367                                                             completed before
15368                                                             performing the
15369                                                             following
15370                                                             fence-paired-atomic.
15371
15372     fence        release      - agent        *none*     1. ``global_wb scope:SCOPE_SYS``
15373                               - system
15374                                                           - If agent scope, omit.
15375
15376                                                         2. | ``s_wait_bvhcnt 0x0``
15377                                                            | ``s_wait_samplecnt 0x0``
15378                                                            | ``s_wait_storecnt 0x0``
15379                                                            | ``s_wait_loadcnt 0x0``
15380                                                            | ``s_wait_dscnt 0x0``
15381                                                            | **OpenCL:**
15382                                                            | ``s_wait_bvhcnt 0x0``
15383                                                            | ``s_wait_samplecnt 0x0``
15384                                                            | ``s_wait_storecnt 0x0``
15385                                                            | ``s_wait_loadcnt 0x0``
15386
15387                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15388                                                           - If OpenCL and address space is local,
15389                                                             omit all.
15390                                                           - See :ref:`amdgpu-fence-as` for
15391                                                             more details on fencing specific
15392                                                             address spaces.
15393                                                           - The waits can be
15394                                                             independently moved
15395                                                             according to the
15396                                                             following rules:
15397                                                           - ``s_wait_loadcnt 0x0``,
15398                                                             ``s_wait_samplecnt 0x0`` and
15399                                                             ``s_wait_bvhcnt 0x0``
15400                                                             must happen after
15401                                                             any preceding
15402                                                             global/generic
15403                                                             load/load atomic/
15404                                                             atomicrmw-with-return-value.
15405                                                           - ``s_wait_storecnt 0x0``
15406                                                             must happen after
15407                                                             ``global_wb`` if present, or
15408                                                             any preceding
15409                                                             global/generic
15410                                                             store/store
15411                                                             atomic/
15412                                                             atomicrmw-no-return-value.
15413                                                           - ``s_wait_dscnt 0x0``
15414                                                             must happen after
15415                                                             any preceding
15416                                                             local/generic
15417                                                             load/store/load
15418                                                             atomic/store
15419                                                             atomic/atomicrmw.
15420                                                           - Must happen before
15421                                                             any following store
15422                                                             atomic/atomicrmw
15423                                                             with an equal or
15424                                                             wider sync scope
15425                                                             and memory ordering
15426                                                             stronger than
15427                                                             unordered (this is
15428                                                             termed the
15429                                                             fence-paired-atomic).
15430                                                           - Ensures that all
15431                                                             memory operations
15432                                                             have
15433                                                             completed before
15434                                                             performing the
15435                                                             following
15436                                                             fence-paired-atomic.
15437
15438     **Acquire-Release Atomic**
15439     ------------------------------------------------------------------------------------
15440     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
15441                               - wavefront    - local
15442                                              - generic
15443     atomicrmw    acq_rel      - workgroup    - global   1. | ``s_wait_bvhcnt 0x0``
15444                                                            | ``s_wait_samplecnt 0x0``
15445                                                            | ``s_wait_storecnt 0x0``
15446                                                            | ``s_wait_loadcnt 0x0``
15447                                                            | ``s_wait_dscnt 0x0``
15448                                                            | **CU wavefront execution mode:**
15449                                                            | ``s_wait_dscnt 0x0``
15450
15451                                                           - If OpenCL, omit ``s_wait_dscnt 0x0``.
15452                                                           - Must happen after
15453                                                             any preceding
15454                                                             local/generic
15455                                                             load/store/load
15456                                                             atomic/store
15457                                                             atomic/atomicrmw.
15458                                                           - The waits can be
15459                                                             independently moved
15460                                                             according to the
15461                                                             following rules:
15462                                                           - ``s_wait_loadcnt 0x0``,
15463                                                             ``s_wait_samplecnt 0x0`` and
15464                                                             ``s_wait_bvhcnt 0x0``
15465                                                             must happen after
15466                                                             any preceding
15467                                                             global/generic load/load
15468                                                             atomic/
15469                                                             atomicrmw-with-return-value.
15470                                                           - ``s_wait_storecnt 0x0``
15471                                                             must happen after
15472                                                             any preceding
15473                                                             global/generic
15474                                                             store/store
15475                                                             atomic/
15476                                                             atomicrmw-no-return-value.
15477                                                           - ``s_wait_dscnt 0x0``
15478                                                             must happen after
15479                                                             any preceding
15480                                                             local/generic
15481                                                             load/store/load
15482                                                             atomic/store
15483                                                             atomic/atomicrmw.
15484                                                           - Must happen before
15485                                                             the following
15486                                                             atomicrmw.
15487                                                           - Ensures that all
15488                                                             memory operations
15489                                                             have
15490                                                             completed before
15491                                                             performing the
15492                                                             atomicrmw that is
15493                                                             being released.
15494
15495                                                         2. buffer/global_atomic
15496
15497                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15498                                                           - If atomic with return, use
15499                                                             ``th:TH_ATOMIC_RETURN``.
15500
15501                                                         3. | **Atomic with return:**
15502                                                            | ``s_wait_loadcnt 0x0``
15503                                                            | **Atomic without return:**
15504                                                            | ``s_wait_storecnt 0x0``
15505
15506                                                           - If CU wavefront execution
15507                                                             mode, omit.
15508                                                           - Must happen before
15509                                                             the following
15510                                                             ``global_inv``.
15511                                                           - Ensures any
15512                                                             following global
15513                                                             data read is no
15514                                                             older than the
15515                                                             atomicrmw value
15516                                                             being acquired.
15517
15518                                                         4. ``global_inv scope:SCOPE_SE``
15519
15520                                                           - If CU wavefront execution
15521                                                             mode, omit.
15522                                                           - Ensures that
15523                                                             following
15524                                                             loads will not see
15525                                                             stale data.
15526
          atomicrmw    acq_rel      - workgroup    - local    1. | ``s_wait_bvhcnt 0x0``
15528                                                            | ``s_wait_samplecnt 0x0``
15529                                                            | ``s_wait_storecnt 0x0``
15530                                                            | ``s_wait_loadcnt 0x0``
15531                                                            | ``s_wait_dscnt 0x0``
15532                                                            | **CU wavefront execution mode:**
15533                                                            | ``s_wait_dscnt 0x0``
15534
15535                                                           - If OpenCL, omit.
15536                                                           - The waits can be
15537                                                             independently moved
15538                                                             according to the
15539                                                             following rules:
15540                                                           - ``s_wait_loadcnt 0x0``,
15541                                                             ``s_wait_samplecnt 0x0`` and
15542                                                             ``s_wait_bvhcnt 0x0``
15543                                                             must happen after
15544                                                             any preceding
15545                                                             global/generic load/load
15546                                                             atomic/
15547                                                             atomicrmw-with-return-value.
15548                                                           - ``s_wait_storecnt 0x0``
15549                                                             must happen after
15550                                                             any preceding
15551                                                             global/generic
15552                                                             store/store
15553                                                             atomic/
15554                                                             atomicrmw-no-return-value.
                                                                - Must happen before
                                                                  the following
                                                                  atomicrmw.
                                                                - Ensures that all
                                                                  global memory
                                                                  operations have
                                                                  completed before
                                                                  performing the
                                                                  atomicrmw that is
                                                                  being released.
15565
15566                                                         2. ds_atomic
15567                                                         3. ``s_wait_dscnt 0x0``
15568
15569                                                           - If OpenCL, omit.
15570                                                           - Must happen before
15571                                                             the following
15572                                                             ``global_inv``.
15573                                                           - Ensures any
15574                                                             following global
15575                                                             data read is no
                                                                  older than the local
                                                                  atomicrmw value being
15578                                                             acquired.
15579
15580                                                         4. ``global_inv scope:SCOPE_SE``
15581
15582                                                           - If CU wavefront execution
15583                                                             mode, omit.
                                                                - If OpenCL, omit.
15585                                                           - Ensures that
15586                                                             following
15587                                                             loads will not see
15588                                                             stale data.
15589
15590     atomicrmw    acq_rel      - workgroup    - generic  1. | ``s_wait_bvhcnt 0x0``
15591                                                            | ``s_wait_samplecnt 0x0``
15592                                                            | ``s_wait_storecnt 0x0``
15593                                                            | ``s_wait_loadcnt 0x0``
15594                                                            | ``s_wait_dscnt 0x0``
15595                                                            | **CU wavefront execution mode:**
15596                                                            | ``s_wait_dscnt 0x0``
15597
                                                                - If OpenCL, omit ``s_wait_dscnt 0x0``.
15599                                                           - The waits can be
15600                                                             independently moved
15601                                                             according to the
15602                                                             following rules:
15603                                                           - ``s_wait_loadcnt 0x0``,
15604                                                             ``s_wait_samplecnt 0x0`` and
15605                                                             ``s_wait_bvhcnt 0x0``
15606                                                             must happen after
15607                                                             any preceding
15608                                                             global/generic load/load
15609                                                             atomic/
15610                                                             atomicrmw-with-return-value.
15611                                                           - ``s_wait_storecnt 0x0``
15612                                                             must happen after
15613                                                             any preceding
15614                                                             global/generic
15615                                                             store/store
15616                                                             atomic/
15617                                                             atomicrmw-no-return-value.
15618                                                           - ``s_wait_dscnt 0x0``
15619                                                             must happen after
15620                                                             any preceding
15621                                                             local/generic
15622                                                             load/store/load
15623                                                             atomic/store
15624                                                             atomic/atomicrmw.
15625                                                           - Must happen before
15626                                                             the following
15627                                                             atomicrmw.
15628                                                           - Ensures that all
15629                                                             memory operations
15630                                                             have
15631                                                             completed before
15632                                                             performing the
15633                                                             atomicrmw that is
15634                                                             being released.
15635
15636                                                         2. flat_atomic
15637
15638                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15639                                                           - If atomic with return,
15640                                                             use ``th:TH_ATOMIC_RETURN``.
15641
15642                                                         3. | **Atomic without return:**
15643                                                            | ``s_wait_dscnt 0x0``
15644                                                            | ``s_wait_storecnt 0x0``
15645                                                            | **Atomic with return:**
15646                                                            | ``s_wait_loadcnt 0x0``
15647                                                            | ``s_wait_dscnt 0x0``
15648                                                            | **CU wavefront execution mode:**
15649                                                            | ``s_wait_dscnt 0x0``
15650
                                                                - If OpenCL, omit ``s_wait_dscnt 0x0``.
15652                                                           - Must happen before
15653                                                             the following
15654                                                             ``global_inv``.
15655                                                           - Ensures any
15656                                                             following global
15657                                                             data read is no
                                                                  older than the
                                                                  atomicrmw value being
15660                                                             acquired.
15661
15662                                                         4. ``global_inv scope:SCOPE_SE``
15663
15664                                                           - If CU wavefront execution
15665                                                             mode, omit.
15666                                                           - Ensures that
15667                                                             following
15668                                                             loads will not see
15669                                                             stale data.
15670
15671     atomicrmw    acq_rel      - agent        - global   1. ``global_wb scope:SCOPE_SYS``
15672                               - system
15673                                                           - If agent scope, omit.
15674
15675                                                         2. | ``s_wait_bvhcnt 0x0``
15676                                                            | ``s_wait_samplecnt 0x0``
15677                                                            | ``s_wait_storecnt 0x0``
15678                                                            | ``s_wait_loadcnt 0x0``
15679                                                            | ``s_wait_dscnt 0x0``
15680
15681                                                           - If OpenCL, omit
                                                                  ``s_wait_dscnt 0x0``.
15683                                                           - The waits can be
15684                                                             independently moved
15685                                                             according to the
15686                                                             following rules:
15687                                                           - ``s_wait_loadcnt 0x0``,
15688                                                             ``s_wait_samplecnt 0x0`` and
15689                                                             ``s_wait_bvhcnt 0x0``
15690                                                             must happen after
15691                                                             any preceding
15692                                                             global/generic
15693                                                             load/load atomic/
15694                                                             atomicrmw-with-return-value.
15695                                                           - ``s_wait_storecnt 0x0``
15696                                                             must happen after
15697                                                             ``global_wb`` if present, or
15698                                                             any preceding
15699                                                             global/generic
15700                                                             store/store
15701                                                             atomic/
15702                                                             atomicrmw-no-return-value.
15703                                                           - ``s_wait_dscnt 0x0``
15704                                                             must happen after
15705                                                             any preceding
15706                                                             local/generic
15707                                                             load/store/load
15708                                                             atomic/store
15709                                                             atomic/atomicrmw.
15710                                                           - Must happen before
15711                                                             the following
15712                                                             atomicrmw.
15713                                                           - Ensures that all
15714                                                             memory operations
15715                                                             to global have
15716                                                             completed before
15717                                                             performing the
15718                                                             atomicrmw that is
15719                                                             being released.
15720
15721                                                         3. buffer/global_atomic
15722
15723                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15724                                                           - If atomic with return, use
15725                                                             ``th:TH_ATOMIC_RETURN``.
15726
15727                                                         4. | **Atomic with return:**
15728                                                            | ``s_wait_loadcnt 0x0``
15729                                                            | **Atomic without return:**
15730                                                            | ``s_wait_storecnt 0x0``
15731
15732                                                           - Must happen before
                                                                  the following
15734                                                             ``global_inv``.
15735                                                           - Ensures the
15736                                                             atomicrmw has
15737                                                             completed before
15738                                                             invalidating the
15739                                                             caches.
15740
15741                                                         5. ``global_inv``
15742
15743                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15744                                                           - Must happen before
15745                                                             any following
15746                                                             global/generic
15747                                                             load/load
15748                                                             atomic/atomicrmw.
15749                                                           - Ensures that
15750                                                             following loads
15751                                                             will not see stale
15752                                                             global data.
15753
15754     atomicrmw    acq_rel      - agent        - generic  1. ``global_wb scope:SCOPE_SYS``
15755                               - system
15756                                                           - If agent scope, omit.
15757
15758                                                         2. | ``s_wait_bvhcnt 0x0``
15759                                                            | ``s_wait_samplecnt 0x0``
15760                                                            | ``s_wait_storecnt 0x0``
15761                                                            | ``s_wait_loadcnt 0x0``
15762                                                            | ``s_wait_dscnt 0x0``
15763
15764                                                           - If OpenCL, omit
                                                                  ``s_wait_dscnt 0x0``.
15766                                                           - The waits can be
15767                                                             independently moved
15768                                                             according to the
15769                                                             following rules:
15770                                                           - ``s_wait_loadcnt 0x0``,
15771                                                             ``s_wait_samplecnt 0x0`` and
15772                                                             ``s_wait_bvhcnt 0x0``
15773                                                             must happen after
15774                                                             any preceding
15775                                                             global/generic
                                                                  load/load atomic/
15777                                                             atomicrmw-with-return-value.
15778                                                           - ``s_wait_storecnt 0x0``
15779                                                             must happen after
15780                                                             ``global_wb`` if present, or
15781                                                             any preceding
15782                                                             global/generic
15783                                                             store/store atomic/
15784                                                             atomicrmw-no-return-value.
15785                                                           - ``s_wait_dscnt 0x0``
15786                                                             must happen after
15787                                                             any preceding
15788                                                             local/generic
15789                                                             load/store/load
15790                                                             atomic/store
15791                                                             atomic/atomicrmw.
15792                                                           - Must happen before
15793                                                             the following
15794                                                             atomicrmw.
15795                                                           - Ensures that all
15796                                                             memory operations
15797                                                             have
15798                                                             completed before
15799                                                             performing the
15800                                                             atomicrmw that is
15801                                                             being released.
15802
15803                                                         3. flat_atomic
15804
15805                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15806                                                           - If atomic with return, use
15807                                                             ``th:TH_ATOMIC_RETURN``.
15808
15809                                                         4. | **Atomic with return:**
15810                                                            | ``s_wait_loadcnt 0x0``
15811                                                            | ``s_wait_dscnt 0x0``
15812                                                            | **Atomic without return:**
15813                                                            | ``s_wait_storecnt 0x0``
15814                                                            | ``s_wait_dscnt 0x0``
15815
15817                                                           - If OpenCL, omit
15818                                                             ``s_wait_dscnt 0x0``.
15819                                                           - Must happen before
                                                                  the following
15821                                                             ``global_inv``.
15822                                                           - Ensures the
15823                                                             atomicrmw has
15824                                                             completed before
15825                                                             invalidating the
15826                                                             caches.
15827
15828                                                         5. ``global_inv``
15829
15830                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
15831                                                           - Must happen before
15832                                                             any following
15833                                                             global/generic
15834                                                             load/load
15835                                                             atomic/atomicrmw.
15836                                                           - Ensures that
15837                                                             following loads
15838                                                             will not see stale
15839                                                             global data.
15840
15841     fence        acq_rel      - singlethread *none*     *none*
15842                               - wavefront
15843     fence        acq_rel      - workgroup    *none*     1. | ``s_wait_bvhcnt 0x0``
15844                                                            | ``s_wait_samplecnt 0x0``
15845                                                            | ``s_wait_storecnt 0x0``
15846                                                            | ``s_wait_loadcnt 0x0``
15847                                                            | ``s_wait_dscnt 0x0``
15848                                                            | **CU wavefront execution mode:**
15849                                                            | ``s_wait_dscnt 0x0``
15850
15851                                                           - If OpenCL and
15852                                                             address space is
15853                                                             not generic, omit
                                                                  ``s_wait_dscnt 0x0``.
15855                                                           - If OpenCL and
15856                                                             address space is
15857                                                             local, omit
15858                                                             all but ``s_wait_dscnt 0x0``.
15859                                                           - See :ref:`amdgpu-fence-as` for
15860                                                             more details on fencing specific
15861                                                             address spaces.
15862                                                           - The waits can be
15863                                                             independently moved
15864                                                             according to the
15865                                                             following rules:
15866                                                           - ``s_wait_loadcnt 0x0``,
15867                                                             ``s_wait_samplecnt 0x0`` and
15868                                                             ``s_wait_bvhcnt 0x0``
15869                                                             must happen after
15870                                                             any preceding
15871                                                             global/generic
15872                                                             load/load
15873                                                             atomic/
15874                                                             atomicrmw-with-return-value.
15875                                                           - ``s_wait_storecnt 0x0``
15876                                                             must happen after
15877                                                             any preceding
15878                                                             global/generic
15879                                                             store/store atomic/
15880                                                             atomicrmw-no-return-value.
15881                                                           - ``s_wait_dscnt 0x0``
15882                                                             must happen after
15883                                                             any preceding
15884                                                             local/generic
15885                                                             load/store/load
15886                                                             atomic/store atomic/
15887                                                             atomicrmw.
15888                                                           - Must happen before
15889                                                             any following
15890                                                             global/generic
15891                                                             load/load
15892                                                             atomic/store/store
15893                                                             atomic/atomicrmw.
15894                                                           - Ensures that all
15895                                                             memory operations
15896                                                             have
15897                                                             completed before
15898                                                             performing any
15899                                                             following global
15900                                                             memory operations.
15901                                                           - Ensures that the
15902                                                             preceding
15903                                                             local/generic load
15904                                                             atomic/atomicrmw
15905                                                             with an equal or
15906                                                             wider sync scope
15907                                                             and memory ordering
15908                                                             stronger than
15909                                                             unordered (this is
15910                                                             termed the
15911                                                             acquire-fence-paired-atomic)
15912                                                             has completed
15913                                                             before following
15914                                                             global memory
15915                                                             operations. This
15916                                                             satisfies the
15917                                                             requirements of
15918                                                             acquire.
15919                                                           - Ensures that all
15920                                                             previous memory
15921                                                             operations have
15922                                                             completed before a
15923                                                             following
15924                                                             local/generic store
15925                                                             atomic/atomicrmw
15926                                                             with an equal or
15927                                                             wider sync scope
15928                                                             and memory ordering
15929                                                             stronger than
15930                                                             unordered (this is
15931                                                             termed the
15932                                                             release-fence-paired-atomic).
15933                                                             This satisfies the
15934                                                             requirements of
15935                                                             release.
15936                                                           - Must happen before
15937                                                             the following
15938                                                             ``global_inv``.
15939                                                           - Ensures that the
15940                                                             acquire-fence-paired
15941                                                             atomic has completed
15942                                                             before invalidating
15943                                                             the
15944                                                             cache. Therefore
15945                                                             any following
15946                                                             locations read must
15947                                                             be no older than
15948                                                             the value read by
15949                                                             the
15950                                                             acquire-fence-paired-atomic.
15951
15952                                                         2. ``global_inv scope:SCOPE_SE``
15953
15954                                                           - If CU wavefront execution
15955                                                             mode, omit.
15956                                                           - Ensures that
15957                                                             following
15958                                                             loads will not see
15959                                                             stale data.
15960
15961     fence        acq_rel      - agent        *none*     1.  ``global_wb scope:SCOPE_SYS``
15962                               - system
15963                                                           - If agent scope, omit.
15964
15965                                                         2. | ``s_wait_bvhcnt 0x0``
15966                                                            | ``s_wait_samplecnt 0x0``
15967                                                            | ``s_wait_storecnt 0x0``
15968                                                            | ``s_wait_loadcnt 0x0``
15969                                                            | ``s_wait_dscnt 0x0``
15970
15971                                                           - If OpenCL and
15972                                                             address space is
15973                                                             not generic, omit
15974                                                             ``s_wait_dscnt 0x0``.
15975                                                           - If OpenCL and
15976                                                             address space is
15977                                                             local, omit
15978                                                             all but ``s_wait_dscnt 0x0``.
15979                                                           - See :ref:`amdgpu-fence-as` for
15980                                                             more details on fencing specific
15981                                                             address spaces.
15982                                                           - The waits can be
15983                                                             independently moved
15984                                                             according to the
15985                                                             following rules:
15986                                                           - ``s_wait_loadcnt 0x0``,
15987                                                             ``s_wait_samplecnt 0x0`` and
15988                                                             ``s_wait_bvhcnt 0x0``
15989                                                             must happen after
15990                                                             any preceding
15991                                                             global/generic
15992                                                             load/load
15993                                                             atomic/
15994                                                             atomicrmw-with-return-value.
15995                                                           - ``s_wait_storecnt 0x0``
15996                                                             must happen after
15997                                                             ``global_wb`` if present, or
15998                                                             any preceding
15999                                                             global/generic
16000                                                             store/store atomic/
16001                                                             atomicrmw-no-return-value.
16002                                                           - ``s_wait_dscnt 0x0``
16003                                                             must happen after
16004                                                             any preceding
16005                                                             local/generic
16006                                                             load/store/load
16007                                                             atomic/store
16008                                                             atomic/atomicrmw.
16009                                                           - Must happen before
16010                                                             the following
16011                                                             ``global_inv``.
16012                                                           - Ensures that the
16013                                                             preceding
16014                                                             global/local/generic
16015                                                             load
16016                                                             atomic/atomicrmw
16017                                                             with an equal or
16018                                                             wider sync scope
16019                                                             and memory ordering
16020                                                             stronger than
16021                                                             unordered (this is
16022                                                             termed the
16023                                                             acquire-fence-paired-atomic)
16024                                                             has completed
16025                                                             before invalidating
16026                                                             the caches. This
16027                                                             satisfies the
16028                                                             requirements of
16029                                                             acquire.
16030                                                           - Ensures that all
16031                                                             previous memory
16032                                                             operations have
16033                                                             completed before a
16034                                                             following
16035                                                             global/local/generic
16036                                                             store
16037                                                             atomic/atomicrmw
16038                                                             with an equal or
16039                                                             wider sync scope
16040                                                             and memory ordering
16041                                                             stronger than
16042                                                             unordered (this is
16043                                                             termed the
16044                                                             release-fence-paired-atomic).
16045                                                             This satisfies the
16046                                                             requirements of
16047                                                             release.
16048
16049                                                         3. ``global_inv scope:``
16050
16051                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
16052                                                           - Must happen before
16053                                                             any following
16054                                                             global/generic
16055                                                             load/load
16056                                                             atomic/store/store
16057                                                             atomic/atomicrmw.
16058                                                           - Ensures that
16059                                                             following loads
16060                                                             will not see stale
16061                                                             global data. This
16062                                                             satisfies the
16063                                                             requirements of
16064                                                             acquire.
16065
16066     **Sequential Consistent Atomic**
16067     ------------------------------------------------------------------------------------
16068     load atomic  seq_cst      - singlethread - global   *Same as corresponding
16069                               - wavefront    - local    load atomic acquire,
16070                                              - generic  except must generate
16071                                                         all instructions even
16072                                                         for OpenCL.*
16073     load atomic  seq_cst      - workgroup    - global   1. | ``s_wait_bvhcnt 0x0``
16074                                              - generic     | ``s_wait_samplecnt 0x0``
16075                                                            | ``s_wait_storecnt 0x0``
16076                                                            | ``s_wait_loadcnt 0x0``
16077                                                            | ``s_wait_dscnt 0x0``
16078                                                            | **CU wavefront execution mode:**
16079                                                            | ``s_wait_dscnt 0x0``
16080
16081                                                           - If OpenCL, omit
16082                                                             ``s_wait_dscnt 0x0``.
16083                                                           - The waits can be
16084                                                             independently moved
16085                                                             according to the
16086                                                             following rules:
16087                                                           - ``s_wait_dscnt 0x0`` must
16088                                                             happen after
16089                                                             preceding
16090                                                             local/generic load
16091                                                             atomic/store
16092                                                             atomic/atomicrmw
16093                                                             with memory
16094                                                             ordering of seq_cst
16095                                                             and with equal or
16096                                                             wider sync scope.
16097                                                             (Note that seq_cst
16098                                                             fences have their
16099                                                             own ``s_wait_dscnt 0x0``
16100                                                             and so do not need to be
16101                                                             considered.)
16102                                                           - ``s_wait_loadcnt 0x0``\,
16103                                                             ``s_wait_samplecnt 0x0`` and
16104                                                             ``s_wait_bvhcnt 0x0``
16105                                                             must happen after
16106                                                             preceding
16107                                                             global/generic load
16108                                                             atomic/
16109                                                             atomicrmw-with-return-value
16110                                                             with memory
16111                                                             ordering of seq_cst
16112                                                             and with equal or
16113                                                             wider sync scope.
16114                                                             (Note that seq_cst
16115                                                             fences have their
16116                                                             own waits and so do
16117                                                             not need to be
16118                                                             considered.)
16119                                                           - ``s_wait_storecnt 0x0``
16120                                                             must happen after
16121                                                             preceding
16122                                                             global/generic store
16123                                                             atomic/
16124                                                             atomicrmw-no-return-value
16125                                                             with memory
16126                                                             ordering of seq_cst
16127                                                             and with equal or
16128                                                             wider sync scope.
16129                                                             (Note that seq_cst
16130                                                             fences have their
16131                                                             own ``s_wait_storecnt 0x0``
16132                                                             and so do not need to be
16133                                                             considered.)
16134                                                           - Ensures any
16135                                                             preceding
16136                                                             sequentially
16137                                                             consistent global/local
16138                                                             memory instructions
16139                                                             have completed
16140                                                             before executing
16141                                                             this sequentially
16142                                                             consistent
16143                                                             instruction. This
16144                                                             prevents reordering
16145                                                             a seq_cst store
16146                                                             followed by a
16147                                                             seq_cst load. (Note
16148                                                             that seq_cst is
16149                                                             stronger than
16150                                                             acquire/release as
16151                                                             the reordering of
16152                                                             load acquire
16153                                                             followed by a store
16154                                                             release is
16155                                                             prevented by the
16156                                                             ``s_wait``\s of
16157                                                             the release, but
16158                                                             there is nothing
16159                                                             preventing a store
16160                                                             release followed by
16161                                                             load acquire from
16162                                                             completing out of
16163                                                             order. The ``s_wait``\s
16164                                                             could be placed after the
16165                                                             seq_cst store or before
16166                                                             the seq_cst load. We
16167                                                             choose the load to
16168                                                             make the ``s_wait``\s be
16169                                                             as late as possible
16170                                                             so that the store
16171                                                             may have already
16172                                                             completed.)
16173
16174                                                         2. *Following
16175                                                            instructions same as
16176                                                            corresponding load
16177                                                            atomic acquire,
16178                                                            except must generate
16179                                                            all instructions even
16180                                                            for OpenCL.*
16181     load atomic  seq_cst      - workgroup    - local    1. | ``s_wait_bvhcnt 0x0``
16182                                                            | ``s_wait_samplecnt 0x0``
16183                                                            | ``s_wait_storecnt 0x0``
16184                                                            | ``s_wait_loadcnt 0x0``
16185                                                            | ``s_wait_dscnt 0x0``
16186                                                            | **CU wavefront execution mode:**
16187                                                            | ``s_wait_dscnt 0x0``
16188
16189                                                           - If OpenCL, omit all.
16190                                                           - The waits can be
16191                                                             independently moved
16192                                                             according to the
16193                                                             following rules:
16194                                                           - ``s_wait_loadcnt 0x0``\,
16195                                                             ``s_wait_samplecnt 0x0`` and
16196                                                             ``s_wait_bvhcnt 0x0``
16197                                                             must happen after
16198                                                             preceding
16199                                                             global/generic load
16200                                                             atomic/
16201                                                             atomicrmw-with-return-value
16202                                                             with memory
16203                                                             ordering of seq_cst
16204                                                             and with equal or
16205                                                             wider sync scope.
16206                                                             (Note that seq_cst
16207                                                             fences have their
16208                                                             own ``s_wait``\s and so do
16209                                                             not need to be
16210                                                             considered.)
16211                                                           - ``s_wait_storecnt 0x0``
16212                                                             must happen after
16213                                                             preceding
16214                                                             global/generic store
16215                                                             atomic/
16216                                                             atomicrmw-no-return-value
16217                                                             with memory
16218                                                             ordering of seq_cst
16219                                                             and with equal or
16220                                                             wider sync scope.
16221                                                             (Note that seq_cst
16222                                                             fences have their
16223                                                             own ``s_wait_storecnt 0x0``
16224                                                             and so do
16225                                                             not need to be
16226                                                             considered.)
16227                                                           - Ensures any
16228                                                             preceding
16229                                                             sequentially
16230                                                             consistent global
16231                                                             memory instructions
16232                                                             have completed
16233                                                             before executing
16234                                                             this sequentially
16235                                                             consistent
16236                                                             instruction. This
16237                                                             prevents reordering
16238                                                             a seq_cst store
16239                                                             followed by a
16240                                                             seq_cst load. (Note
16241                                                             that seq_cst is
16242                                                             stronger than
16243                                                             acquire/release as
16244                                                             the reordering of
16245                                                             load acquire
16246                                                             followed by a store
16247                                                             release is
16248                                                             prevented by the
16249                                                             ``s_wait``\s of
16250                                                             the release, but
16251                                                             there is nothing
16252                                                             preventing a store
16253                                                             release followed by
16254                                                             load acquire from
16255                                                             completing out of
16256                                                             order. The ``s_wait``\s
16257                                                             could be placed after the
16258                                                             seq_cst store or before
16259                                                             the seq_cst load. We
16260                                                             choose the load to
16261                                                             make the ``s_wait``\s be
16262                                                             as late as possible
16263                                                             so that the store
16264                                                             may have already
16265                                                             completed.)
16266
16267                                                         2. *Following
16268                                                            instructions same as
16269                                                            corresponding load
16270                                                            atomic acquire,
16271                                                            except must generate
16272                                                            all instructions even
16273                                                            for OpenCL.*
16274
16275     load atomic  seq_cst      - agent        - global   1. | ``s_wait_bvhcnt 0x0``
16276                               - system       - generic     | ``s_wait_samplecnt 0x0``
16277                                                            | ``s_wait_storecnt 0x0``
16278                                                            | ``s_wait_loadcnt 0x0``
16279                                                            | ``s_wait_dscnt 0x0``
16280
16281                                                           - If OpenCL, omit
16282                                                             ``s_wait_dscnt 0x0``
16283                                                           - The waits can be
16284                                                             independently moved
16285                                                             according to the
16286                                                             following rules:
16287                                                           - ``s_wait_dscnt 0x0``
16288                                                             must happen after
16289                                                             preceding
16290                                                             local load
16291                                                             atomic/store
16292                                                             atomic/atomicrmw
16293                                                             with memory
16294                                                             ordering of seq_cst
16295                                                             and with equal or
16296                                                             wider sync scope.
16297                                                             (Note that seq_cst
16298                                                             fences have their
16299                                                             own ``s_wait_dscnt 0x0``
16300                                                             and so do
16301                                                             not need to be
16302                                                             considered.)
16303                                                           - ``s_wait_loadcnt 0x0``\,
16304                                                             ``s_wait_samplecnt 0x0`` and
16305                                                             ``s_wait_bvhcnt 0x0``
16306                                                             must happen after
16307                                                             preceding
16308                                                             global/generic load
16309                                                             atomic/
16310                                                             atomicrmw-with-return-value
16311                                                             with memory
16312                                                             ordering of seq_cst
16313                                                             and with equal or
16314                                                             wider sync scope.
16315                                                             (Note that seq_cst
16316                                                             fences have their
16317                                                             own ``s_wait``\s and so do
16318                                                             not need to be
16319                                                             considered.)
16320                                                           - ``s_wait_storecnt 0x0``
16321                                                             must happen after
16322                                                             preceding
16323                                                             global/generic store
16324                                                             atomic/
16325                                                             atomicrmw-no-return-value
16326                                                             with memory
16327                                                             ordering of seq_cst
16328                                                             and with equal or
16329                                                             wider sync scope.
16330                                                             (Note that seq_cst
16331                                                             fences have their
16332                                                             own
16333                                                             ``s_wait_storecnt 0x0`` and so do
16334                                                             not need to be
16335                                                             considered.)
16336                                                           - Ensures any
16337                                                             preceding
16338                                                             sequentially
16339                                                             consistent global
16340                                                             memory instructions
16341                                                             have completed
16342                                                             before executing
16343                                                             this sequentially
16344                                                             consistent
16345                                                             instruction. This
16346                                                             prevents reordering
16347                                                             a seq_cst store
16348                                                             followed by a
16349                                                             seq_cst load. (Note
16350                                                             that seq_cst is
16351                                                             stronger than
16352                                                             acquire/release as
16353                                                             the reordering of
16354                                                             load acquire
16355                                                             followed by a store
16356                                                             release is
16357                                                             prevented by the
16358                                                             ``s_wait``\s of
16359                                                             the release, but
16360                                                             there is nothing
16361                                                             preventing a store
16362                                                             release followed by
16363                                                             load acquire from
16364                                                             completing out of
16365                                                             order. The ``s_wait``\s
16366                                                             could be placed after
16367                                                             seq_store or before
16368                                                             the seq_load. We
16369                                                             choose the load to
16370                                                             make the ``s_wait``\s be
16371                                                             as late as possible
16372                                                             so that the store
16373                                                             may have already
16374                                                             completed.)
16375
16376                                                         2. *Following
16377                                                            instructions same as
16378                                                            corresponding load
16379                                                            atomic acquire,
16380                                                            except must generate
16381                                                            all instructions even
16382                                                            for OpenCL.*
16383     store atomic seq_cst      - singlethread - global   *Same as corresponding
16384                               - wavefront    - local    store atomic release,
16385                               - workgroup    - generic  except must generate
16386                               - agent                   all instructions even
16387                               - system                  for OpenCL.*
16388     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
16389                               - wavefront    - local    atomicrmw acq_rel,
16390                               - workgroup    - generic  except must generate
16391                               - agent                   all instructions even
16392                               - system                  for OpenCL.*
16393     fence        seq_cst      - singlethread *none*     *Same as corresponding
16394                               - wavefront               fence acq_rel,
16395                               - workgroup               except must generate
16396                               - agent                   all instructions even
16397                               - system                  for OpenCL.*
16398     ============ ============ ============== ========== ================================
16399
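As an illustrative sketch only (not backend code), the wait sequence listed in
the table above for ``load atomic seq_cst`` at agent and system scope can be
written out as data. The Python names ``SEQ_CST_LOAD_WAITS`` and
``seq_cst_load_waits`` are hypothetical and exist only for this example. The
sketch follows the order in the table row and does not model the rules that
allow each wait to be moved independently.

.. code-block:: python

  # Waits that precede a seq_cst atomic load of global/generic memory at
  # agent or system scope, in the order given in the table above.
  SEQ_CST_LOAD_WAITS = (
      "s_wait_bvhcnt 0x0",
      "s_wait_samplecnt 0x0",
      "s_wait_storecnt 0x0",
      "s_wait_loadcnt 0x0",
      "s_wait_dscnt 0x0",
  )


  def seq_cst_load_waits(is_opencl):
      """Return the wait instructions emitted before the seq_cst load."""
      if is_opencl:
          # The table says to omit ``s_wait_dscnt 0x0`` for OpenCL.
          return [w for w in SEQ_CST_LOAD_WAITS if w != "s_wait_dscnt 0x0"]
      return list(SEQ_CST_LOAD_WAITS)

Calling ``seq_cst_load_waits(True)`` yields the four non-DS waits, while
``seq_cst_load_waits(False)`` yields all five.
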
16400.. _amdgpu-amdhsa-trap-handler-abi:
16401
16402Trap Handler ABI
16403~~~~~~~~~~~~~~~~
16404
16405For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
16406runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
16407supports the ``s_trap`` instruction. For usage see:
16408
16409- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
16410- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
16411- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
16412
16413  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
16414     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
16415
16416     =================== =============== =============== =======================================
16417     Usage               Code Sequence   Trap Handler    Description
16418                                         Inputs
16419     =================== =============== =============== =======================================
16420     reserved            ``s_trap 0x00``                 Reserved by hardware.
16421     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
16422                                           ``queue_ptr`` intrinsic (not implemented).
16423                                         ``VGPR0``:
16424                                           ``arg``
16425     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
16426                                           ``queue_ptr`` the trap instruction. The associated
16427                                                         queue is signalled to put it into the
16428                                                         error state.  When the queue is put in
16429                                                         the error state, the waves executing
16430                                                         dispatches on the queue will be
16431                                                         terminated.
16432     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
16433                                                           as a no-operation. The trap handler
16434                                                           is entered and immediately returns to
16435                                                           continue execution of the wavefront.
16436                                                         - If the debugger is enabled, causes
16437                                                           the debug trap to be reported by the
16438                                                           debugger and the wavefront is put in
16439                                                           the halt state with the PC at the
16440                                                           instruction.  The debugger must
16441                                                           increment the PC and resume the wave.
16442     reserved            ``s_trap 0x04``                 Reserved.
16443     reserved            ``s_trap 0x05``                 Reserved.
16444     reserved            ``s_trap 0x06``                 Reserved.
16445     reserved            ``s_trap 0x07``                 Reserved.
16446     reserved            ``s_trap 0x08``                 Reserved.
16447     reserved            ``s_trap 0xfe``                 Reserved.
16448     reserved            ``s_trap 0xff``                 Reserved.
16449     =================== =============== =============== =======================================
16450
16451..
16452
16453  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
16454     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
16455
16456     =================== =============== =============== =======================================
16457     Usage               Code Sequence   Trap Handler    Description
16458                                         Inputs
16459     =================== =============== =============== =======================================
16460     reserved            ``s_trap 0x00``                 Reserved by hardware.
16461     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
16462                                                         breakpoints. Causes wave to be halted
16463                                                         with the PC at the trap instruction.
16464                                                         The debugger is responsible for resuming
16465                                                         the wave, including the instruction
16466                                                         that the breakpoint overwrote.
16467     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
16468                                           ``queue_ptr`` the trap instruction. The associated
16469                                                         queue is signalled to put it into the
16470                                                         error state.  When the queue is put in
16471                                                         the error state, the waves executing
16472                                                         dispatches on the queue will be
16473                                                         terminated.
16474     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
16475                                                           as a no-operation. The trap handler
16476                                                           is entered and immediately returns to
16477                                                           continue execution of the wavefront.
16478                                                         - If the debugger is enabled, causes
16479                                                           the debug trap to be reported by the
16480                                                           debugger and the wavefront is put in
16481                                                           the halt state with the PC at the
16482                                                           instruction.  The debugger must
16483                                                           increment the PC and resume the wave.
16484     reserved            ``s_trap 0x04``                 Reserved.
16485     reserved            ``s_trap 0x05``                 Reserved.
16486     reserved            ``s_trap 0x06``                 Reserved.
16487     reserved            ``s_trap 0x07``                 Reserved.
16488     reserved            ``s_trap 0x08``                 Reserved.
16489     reserved            ``s_trap 0xfe``                 Reserved.
16490     reserved            ``s_trap 0xff``                 Reserved.
16491     =================== =============== =============== =======================================
16492
16493..
16494
16495  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
16496     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
16497
16498     =================== =============== ================ ================= =======================================
16499     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
16500     =================== =============== ================ ================= =======================================
16501     reserved            ``s_trap 0x00``                                    Reserved by hardware.
16502     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
16503                                                                            breakpoints. Causes wave to be halted
16504                                                                            with the PC at the trap instruction.
16505                                                                            The debugger is responsible for resuming
16506                                                                            the wave, including the instruction
16507                                                                            that the breakpoint overwrote.
16508     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
16509                                           ``queue_ptr``                    the trap instruction. The associated
16510                                                                            queue is signalled to put it into the
16511                                                                            error state.  When the queue is put in
16512                                                                            the error state, the waves executing
16513                                                                            dispatches on the queue will be
16514                                                                            terminated.
16515     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled, behaves
16516                                                                              as a no-operation. The trap handler
16517                                                                              is entered and immediately returns to
16518                                                                              continue execution of the wavefront.
16519                                                                            - If the debugger is enabled, causes
16520                                                                              the debug trap to be reported by the
16521                                                                              debugger and the wavefront is put in
16522                                                                              the halt state with the PC at the
16523                                                                              instruction.  The debugger must
16524                                                                              increment the PC and resume the wave.
16525     reserved            ``s_trap 0x04``                                    Reserved.
16526     reserved            ``s_trap 0x05``                                    Reserved.
16527     reserved            ``s_trap 0x06``                                    Reserved.
16528     reserved            ``s_trap 0x07``                                    Reserved.
16529     reserved            ``s_trap 0x08``                                    Reserved.
16530     reserved            ``s_trap 0xfe``                                    Reserved.
16531     reserved            ``s_trap 0xff``                                    Reserved.
16532     =================== =============== ================ ================= =======================================
16533
16534.. _amdgpu-amdhsa-function-call-convention:
16535
16536Call Convention
16537~~~~~~~~~~~~~~~
16538
16539.. note::
16540
16541  This section is currently incomplete and has inaccuracies. It is a work in
16542  progress that will be updated as information is determined.
16543
16544See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
16545addresses. Unswizzled addresses are normal linear addresses.
16546
16547.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
16548
16549Kernel Functions
16550++++++++++++++++
16551
16552This section describes the call convention ABI for the outer kernel function.
16553
16554See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
16555convention.
16556
16557The following is not part of the AMDGPU kernel calling convention but describes
16558how the AMDGPU backend implements function calls:
16559
165601.  Clang decides the kernarg layout to match the *HSA Programmer's Language
16561    Reference* [HSA]_.
16562
16563    - All structs are passed directly.
16564    - Lambda values are passed *TBA*.
16565
16566    .. TODO::
16567
16568      - Does this really follow HSA rules? Or are structs >16 bytes passed
16569        by-value struct?
16570      - What is ABI for lambda values?
16571
165722.  The kernel performs certain setup in its prolog, as described in
16573    :ref:`amdgpu-amdhsa-kernel-prolog`.
16574
16575.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
16576
16577Non-Kernel Functions
16578++++++++++++++++++++
16579
16580This section describes the call convention ABI for functions other than the
16581outer kernel function.
16582
16583If a kernel has function calls then scratch is always allocated and used for
16584the call stack which grows from low address to high address using the swizzled
16585scratch address space.
16586
16587On entry to a function:
16588
165891.  SGPR0-3 contain a V# with the following properties (see
16590    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
16591
16592    * Base address pointing to the beginning of the wavefront scratch backing
16593      memory.
16594    * Swizzled with dword element size and stride of wavefront size elements.
16595
165962.  The FLAT_SCRATCH register pair is set up. See
16597    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
165983.  GFX6-GFX8: M0 register set to the size of LDS in bytes. See
16599    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
166004.  The EXEC register is set to the lanes active on entry to the function.
166015.  MODE register: *TBD*
166026.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
16603    below.
166047.  SGPR30-31 return address (RA). The code address that the function must
16605    return to when it completes. The value is undefined if the function is *no
16606    return*.
166078.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
16608    offset relative to the beginning of the wavefront scratch backing memory.
16609
16610    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
16611    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
16612    manner.
16613
16614    The unswizzled SP value can be converted into the swizzled SP value by:
16615
16616      | swizzled SP = unswizzled SP / wavefront size
16617
16618    This may be used to obtain the private address space address of stack
16619    objects and to convert this address to a flat address by adding the flat
16620    scratch aperture base address.
16621
16622    The swizzled SP value is always 4-byte aligned for the ``r600``
16623    architecture and 16-byte aligned for the ``amdgcn`` architecture.
16624
16625    .. note::
16626
16627      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
16628      OpenCL language which has the largest base type defined as 16 bytes.
16629
16630    On entry, the swizzled SP value is the address of the first function
16631    argument passed on the stack. Other stack passed arguments are positive
16632    offsets from the entry swizzled SP value.
16633
16634    The function may use positive offsets beyond the last stack passed argument
16635    for stack allocated local variables and register spill slots. If necessary,
16636    the function may align these to greater alignment than 16 bytes. After these
16637    the function may dynamically allocate space for such things as runtime sized
16638    ``alloca`` local allocations.
16639
16640    If the function calls another function, it will place any stack allocated
16641    arguments after the last local allocation and adjust SGPR32 to the address
16642    after the last local allocation.
16643
166449.  All other registers are unspecified.
1664510. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
16646    to the function.
1664711. Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
16648    arguments in the C ABI. The callee is responsible for allocating stack
16649    memory and copying the value of the struct if it is modified. Note that the
16650    backend still supports byval for struct arguments.
16651
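The stack pointer arithmetic described in item 8 above can be sketched in a few
lines of Python. This is only an illustration of the stated formula: the
function names and the ``flat_scratch_aperture_base`` parameter are
hypothetical, and the sketch assumes the unswizzled offset is a multiple of the
wavefront size so that integer division is exact.

.. code-block:: python

  def swizzled_sp(unswizzled_sp, wavefront_size):
      """Convert the unswizzled scratch offset (SGPR32) to the swizzled SP."""
      if wavefront_size <= 0:
          raise ValueError("wavefront size must be positive")
      # swizzled SP = unswizzled SP / wavefront size
      return unswizzled_sp // wavefront_size


  def flat_stack_address(unswizzled_sp, wavefront_size,
                         flat_scratch_aperture_base):
      """Private (swizzled) address converted to a flat address.

      The flat address is formed by adding the flat scratch aperture base
      address to the private address space address.
      """
      return flat_scratch_aperture_base + swizzled_sp(unswizzled_sp,
                                                      wavefront_size)

For example, with a wavefront size of 64, an unswizzled SP of 1024 corresponds
to a swizzled SP of 16, which satisfies the 16-byte alignment stated for
``amdgcn``.
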
16652On exit from a function:
16653
166541.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
16655    described below. Any registers used are considered clobbered registers.
166562.  The following registers are preserved and have the same value as on entry:
16657
16658    * FLAT_SCRATCH
16659    * EXEC
16660    * GFX6-GFX8: M0
16661    * All SGPR registers except the clobbered registers of SGPR4-31.
16662    * VGPR40-47
16663    * VGPR56-63
16664    * VGPR72-79
16665    * VGPR88-95
16666    * VGPR104-111
16667    * VGPR120-127
16668    * VGPR136-143
16669    * VGPR152-159
16670    * VGPR168-175
16671    * VGPR184-191
16672    * VGPR200-207
16673    * VGPR216-223
16674    * VGPR232-239
16675    * VGPR248-255
16676
16677        .. note::
16678
16679          Except for the argument registers, the clobbered VGPRs and the preserved
16680          registers are intermixed at regular intervals in order to keep a
16681          similar ratio independent of the number of allocated VGPRs.
16682
16683    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
16684    * Lanes of all VGPRs that are inactive at the call site.
16685
16686      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
16687      optimization may mark some of the clobbered SGPR and VGPR registers as
16688      preserved if it can be determined that the called function does not change
16689      their value.
16690
166913.  The PC is set to the RA provided on entry.
166924.  MODE register: *TBD*.
166935.  All other registers are clobbered.
166946.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
16695    the function is available to the caller.
16696
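The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, the first 8 registers of every group of 16 are preserved. The Python
sketch below encodes that observation. The closed-form test is inferred from
the list rather than stated by the ABI, the function name is hypothetical, and
IPRA adjustments are not modelled.

.. code-block:: python

  def is_preserved_vgpr(n):
      """Return True if VGPR ``n`` is in one of the preserved ranges above."""
      if not 0 <= n <= 255:
          raise ValueError("VGPR index out of range")
      # VGPR0-39 are never in the preserved list; from VGPR40 onwards the
      # pattern is 8 preserved registers followed by 8 clobbered registers.
      return n >= 40 and (n - 40) % 16 < 8


  # First register of each preserved range: 40, 56, 72, ..., 248.
  PRESERVED_RANGE_STARTS = [n for n in range(256)
                            if is_preserved_vgpr(n) and (n - 40) % 16 == 0]
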
16697.. TODO::
16698
16699  - How are function results returned? The address of structured types is passed
16700    by reference, but what about other types?
16701
16702The function input arguments are made up of the formal arguments explicitly
16703declared by the source language function plus the implicit input arguments used
16704by the implementation.
16705
16706The source language input arguments are:
16707
167081. Any source language implicit ``this`` or ``self`` argument comes first as a
16709   pointer type.
167102. Followed by the function formal arguments in left to right source order.
16711
16712The source language result arguments are:
16713
167141. The function result argument.
16715
16716The source language input or result struct type arguments that are less than or
16717equal to 16 bytes are decomposed recursively into their base type fields, and
16718each field is passed as if a separate argument. For input arguments, if the
16719called function requires the struct to be in memory, for example because its
16720address is taken, then the function body is responsible for allocating a stack
16721location and copying the field arguments into it. Clang terms this *direct
16722struct*.
16723
16724The source language input struct type arguments that are greater than 16 bytes
16725are passed by reference. The caller is responsible for allocating a stack
16726location to hold a copy of the struct value and for passing its address as the
16727input argument. The called function is responsible for performing the
16728dereference when accessing the input argument. Clang terms this *by-value struct*.
16729
16730A source language result struct type argument that is greater than 16 bytes is
16731returned by reference. The caller is responsible for allocating a stack location
16732to hold the result value and passes the address as the last input argument
16733(before the implicit input arguments). In this case there are no result
16734arguments. The called function is responsible for performing the dereference
16735when storing the result value. Clang terms this *structured return (sret)*.
16736
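The size-based rules in the three preceding paragraphs can be summarized with a
small Python sketch. It only restates the text as written, which this section
itself flags as possibly inaccurate; the function name ``classify_struct`` and
the constant ``DECOMPOSE_LIMIT_BYTES`` are hypothetical.

.. code-block:: python

  DECOMPOSE_LIMIT_BYTES = 16  # threshold quoted in the text above


  def classify_struct(size_in_bytes, is_result=False):
      """Return the Clang term the text above uses for a struct of this size."""
      if size_in_bytes <= 0:
          raise ValueError("struct size must be positive")
      if size_in_bytes <= DECOMPOSE_LIMIT_BYTES:
          # Decomposed recursively; each field is passed as a separate argument.
          return "direct struct"
      if is_result:
          # Caller allocates the result slot and passes its address.
          return "structured return (sret)"
      # Caller makes a stack copy and passes its address.
      return "by-value struct"
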
16737*TODO: correct the ``sret`` definition.*
16738
16739.. TODO::
16740
16741  Is this definition correct? Or is ``sret`` only used if passing in registers, and
16742  pass as non-decomposed struct as stack argument? Or something else? Is the
16743  memory location in the caller stack frame, or a stack memory argument and so
16744  no address is passed as the caller can directly write to the argument stack
16745  location? But then the stack location is still live after return. If an
16746  argument stack location is it the first stack argument or the last one?
16747
16748Lambda argument types are treated as struct types with an implementation defined
16749set of fields.
16750
16751.. TODO::
16752
16753  Need to specify the ABI for lambda types for AMDGPU.
16754
16755For AMDGPU backend all source language arguments (including the decomposed
16756struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
16757they are passed in SGPRs.
16758
16759The AMDGPU backend walks the function call graph from the leaves to determine
16760which implicit input arguments are used, propagating to each caller of the
16761function. The used implicit arguments are appended to the function arguments
16762after the source language arguments in the following order:
16763
16764.. TODO::
16765
16766  Is recursion or external functions supported?
16767
167681.  Work-Item ID (1 VGPR)
16769
16770    The X, Y and Z work-item IDs are packed into a single VGPR with the following
16771    layout. Only fields actually used by the function are set. The other bits
16772    are undefined.
16773
16774    The values come from the initial kernel execution state. See
16775    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
16776
16777    .. table:: Work-item implicit argument layout
16778      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
16779
16780      ======= ======= ==============
16781      Bits    Size    Field Name
16782      ======= ======= ==============
16783      9:0     10 bits X Work-Item ID
16784      19:10   10 bits Y Work-Item ID
16785      29:20   10 bits Z Work-Item ID
16786      31:30   2 bits  Unused
16787      ======= ======= ==============
16788
167892.  Dispatch Ptr (2 SGPRs)
16790
16791    The value comes from the initial kernel execution state. See
16792    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16793
167943.  Queue Ptr (2 SGPRs)
16795
16796    The value comes from the initial kernel execution state. See
16797    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16798
167994.  Kernarg Segment Ptr (2 SGPRs)
16800
16801    The value comes from the initial kernel execution state. See
16802    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16803
168045.  Dispatch id (2 SGPRs)
16805
16806    The value comes from the initial kernel execution state. See
16807    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16808
168096.  Work-Group ID X (1 SGPR)
16810
16811    The value comes from the initial kernel execution state. See
16812    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16813
168147.  Work-Group ID Y (1 SGPR)
16815
16816    The value comes from the initial kernel execution state. See
16817    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16818
168198.  Work-Group ID Z (1 SGPR)
16820
16821    The value comes from the initial kernel execution state. See
16822    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
16823
168249.  Implicit Argument Ptr (2 SGPRs)
16825
16826    The value is computed by adding an offset to Kernarg Segment Ptr to get the
16827    global address space pointer to the first kernarg implicit argument.
16828
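
The packed Work-Item ID layout from item 1 can be illustrated with a small
sketch. The function names below are hypothetical and are not part of the
backend; the code only mirrors the bit ranges given in the layout table and
treats bits 31:30 as ignorable.

```python
FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide


def pack_work_item_id(x, y, z):
    """Pack X, Y and Z work-item IDs into one 32-bit value (illustrative)."""
    return (x & FIELD_MASK) | ((y & FIELD_MASK) << 10) | ((z & FIELD_MASK) << 20)


def unpack_work_item_id(packed):
    """Return (x, y, z) extracted from bits 9:0, 19:10 and 29:20."""
    x = packed & FIELD_MASK
    y = (packed >> 10) & FIELD_MASK
    z = (packed >> 20) & FIELD_MASK
    return x, y, z
```

Because unused fields are undefined, a consumer should only read the fields
that the function is known to use.
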
The input and result arguments are assigned in order in the following manner:

.. note::

  There are likely some errors and omissions in the following description that
  need correction.

  .. TODO::

    Check the Clang source code to decipher how function arguments and return
    results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

    How are over-aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

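
As a rough illustration of the assignment rules above, the following sketch
places arguments into VGPR0-VGPR31 or SGPR0-SGPR29 and sends the overflow to
the stack. It is a simplified model and not the backend's algorithm: the
function name is hypothetical, every argument is assumed to occupy exactly one
32-bit register, and stack offsets are assumed to advance in 4-byte steps. The
description it models is itself marked above as possibly incomplete.

```python
LAST_ARG_VGPR = 31  # VGPR arguments use VGPR0..VGPR31
LAST_ARG_SGPR = 29  # SGPR arguments use SGPR0..SGPR29


def assign_arguments(args):
    """Map each (name, inreg) pair to a register or stack location.

    Assumption: every argument is a single 32-bit value.
    """
    next_vgpr = 0
    next_sgpr = 0
    stack_offset = 0
    locations = {}
    for name, inreg in args:
        if inreg and next_sgpr <= LAST_ARG_SGPR:
            locations[name] = f"s{next_sgpr}"
            next_sgpr += 1
        elif not inreg and next_vgpr <= LAST_ARG_VGPR:
            locations[name] = f"v{next_vgpr}"
            next_vgpr += 1
        else:
            # Registers exhausted: allocate the next naturally aligned slot.
            locations[name] = f"stack+{stack_offset}"
            stack_offset += 4
    return locations
```

For example, ``assign_arguments([("a", False), ("b", True)])`` places ``a`` in
``v0`` and ``b`` in ``s0``.
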
.. TODO::

  So, a struct which can pass some fields as decomposed register arguments, will
  pass the rest as decomposed stack elements? But an argument that will not start
  in registers will not be decomposed and will be passed as a non-decomposed
  stack value?

The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU backend implements function calls:

1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
    unswizzled scratch address. It is only needed if runtime-sized ``alloca``
    instructions are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.

3.  Allocating SGPR arguments on the stack is not supported.

4.  No CFI is currently generated. See
    :ref:`amdgpu-dwarf-call-frame-information`.

    .. note::

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address space
      of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront size
      (since CFA is always at least dword aligned which matches the scratch
      swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated arguments
      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
      local variables and register spill slots are accessed as positive offsets
      relative to ``DW_AT_frame_base``.

5.  Function argument passing is implemented by copying the input physical
    registers to virtual registers on entry. The register allocator can spill if
    necessary. These are copied back to physical registers at call sites. The
    net effect is that each function call can have these values in entirely
    distinct locations. IPRA can help avoid shuffling argument registers.
6.  Call sites are implemented by setting up the arguments at positive offsets
    from SP. Then SP is incremented to account for the known frame size before
    the call and decremented after the call.

    .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.

7.  4 byte spill slots are used in the stack frame. One slot is allocated for an
    emergency spill slot. Buffer instructions are used for stack accesses rather
    than the ``flat_scratch`` instruction.

    .. TODO::

      Explain when the emergency spill slot is used.

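
The relationship between the CFA and ``DW_AT_frame_base`` described in the note
under item 4 is a single division. The sketch below only restates that
arithmetic; the function name is hypothetical, and since no CFI is currently
generated, it should be read as an illustration of the planned definition
rather than of existing behavior.

```python
def frame_base_from_cfa(cfa_unswizzled, wavefront_size):
    """Convert an unswizzled CFA into the swizzled frame base.

    Assumption: the CFA is a byte offset that is at least dword aligned,
    as stated in the note above.
    """
    if cfa_unswizzled % 4 != 0:
        raise ValueError("CFA is expected to be at least dword aligned")
    return cfa_unswizzled // wavefront_size
```

With a wavefront size of 64, an unswizzled CFA of 1024 corresponds to a
swizzled frame base of 16.
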
.. TODO::

  Possible broken issues:

  - Stack arguments must be aligned to required alignment.
  - Stack is aligned to max(16, max formal argument alignment)
  - Direct argument < 64 bits should check register budget.
  - Register budget calculation should respect ``inreg`` for SGPR.
  - SGPR overflow is not handled.
  - struct with 1 member unpeeling is not checking size of member.
  - ``sret`` is after ``this`` pointer.
  - Caller is not implementing stack realignment: need an extra pointer.
  - Should say AMDGPU passes FP rather than SP.
  - Should CFI define CFA as address of locals or arguments. Difference is
    apparent when have implemented dynamic alignment.
  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
    highest address of stack frame and use negative offset for locals. Would
    allow SP to be the same as FP and could support signal-handler-like as now
    have a real SP for the top of the stack.
  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
    arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

  The metadata is currently in development and is subject to major
  changes. Only the current version is supported. *When this document
  was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the keys
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

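
The key prefix convention can be illustrated by expanding an abbreviated key
back to its fully prefixed form. This is a minimal sketch: the helper name is
hypothetical, and ``acme.tool`` is a made-up vendor name used only for the
example.

```python
def expand_key(key, vendor_name):
    """Return the fully prefixed form of a possibly abbreviated key.

    A key starting with "." is taken to be abbreviated relative to the
    vendor that added the enclosing map.
    """
    if key.startswith("."):
        return vendor_name + key
    return key
```

For example, ``expand_key(".name", "amdpal")`` yields ``"amdpal.name"``, while
a key that already carries a prefix, such as ``"acme.tool.info"``, is returned
unchanged.
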
  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:

                                                                       - "VsPs"
                                                                       - "Gs"
                                                                       - "Cs"
                                                                       - "Ngg"
                                                                       - "Tess"
                                                                       - "GsTess"
                                                                       - "NggTess"

     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
                                            2 integers               64 bits is the "stable" portion of the hash, used
                                                                     for e.g. shader replacement lookup. Upper 64 bits
                                                                     is the "unique" portion of the hash, used for
                                                                     e.g. pipeline cache lookup. The value is
                                                                     implementation defined, and cannot be relied on
                                                                     between different builds of the compiler.
     ".shaders"                             map                      Per-API shader metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".hardware_stages"                     map                      Per-hardware stage metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".shader_functions"                    map                      Per-shader function metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".registers"                           map            Required  Hardware register configuration. See
                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".user_data_limit"                     integer                  Number of user data entries accessed by this
                                                                     pipeline.
     ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
                                                                     NoUserDataSpilling.
     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
                                                                     viewport array index feature. Pipelines which use
                                                                     this feature can render into all 16 viewports,
                                                                     whereas pipelines which do not use it are
                                                                     restricted to viewport #0.
     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
                                                                     handling data-passing between the ES and GS
                                                                     shader stages. This can be zero if the data is
                                                                     passed using off-chip buffers. This value should
                                                                     be used to program all user-SGPRs which have been
                                                                     marked with "UserDataMapping::EsGsLdsSize"
                                                                     (typically only the GS and VS HW stages will ever
                                                                     have a user-SGPR so marked).
     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
                                                                     (maximum number of threads in a subgroup).
     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
     ".api"                                 string                   Name of the client graphics API.
     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
                                                                     be defined by the driver using the compiler if
                                                                     they want to be able to correlate API-specific
                                                                     information used during creation at a later time.
     ====================================== ============== ========= ===================================================

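
The "Required" columns of the two tables above can be checked mechanically once
the Message Pack data has been decoded into ordinary maps and sequences. The
following sketch assumes that decoding has already happened and works on plain
Python dictionaries. The helper names, the example register key and the hash
values are all hypothetical, and the order of the two hash integers (lower 64
bits first) is an assumption that the tables do not state.

```python
REQUIRED_TOP_LEVEL = ("amdpal.version", "amdpal.pipelines")
REQUIRED_PER_PIPELINE = (".internal_pipeline_hash", ".registers")


def missing_required_keys(metadata):
    """List the required keys that are absent from decoded metadata."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
        for key in REQUIRED_PER_PIPELINE:
            if key not in pipeline:
                missing.append(f"amdpal.pipelines[{index}]{key}")
    return missing


def combine_pipeline_hash(parts):
    """Join the two 64-bit integers into one 128-bit value.

    Assumption: parts[0] is the lower ("stable") half and parts[1] is the
    upper ("unique") half.
    """
    stable, unique = parts
    return (unique << 64) | stable


# Illustrative metadata only; none of these values come from a real pipeline.
example_metadata = {
    "amdpal.version": [2, 6],
    "amdpal.pipelines": [
        {
            ".name": "example",
            ".type": "Cs",
            ".internal_pipeline_hash": [0x1111, 0x2222],
            ".registers": {0x1234: 0},  # hypothetical register offset
        }
    ],
}
```

Calling ``missing_required_keys(example_metadata)`` returns an empty list,
since all required keys are present.
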
..

  .. table:: AMDPAL Code Object Shader Map
     :name: amdgpu-amdpal-code-object-shader-map-table


     +-------------+--------------+-------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                        |
     +=============+==============+===================================================================+
     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
     |- ".vertex"  |              |for the definition of the keys included in that map.               |
     |- ".hull"    |              |                                                                   |
     |- ".domain"  |              |                                                                   |
     |- ".geometry"|              |                                                                   |
     |- ".pixel"   |              |                                                                   |
     +-------------+--------------+-------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object API Shader Metadata Map
     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table

     ==================== ============== ========= =====================================================================
     String Key           Value Type     Required? Description
     ==================== ============== ========= =====================================================================
     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
                          2 integers               is implementation defined, and cannot be relied on between
                                                   different builds of the compiler.
     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
                          string                   include:

                                                     - ".ls"
                                                     - ".hs"
                                                     - ".es"
                                                     - ".gs"
                                                     - ".vs"
                                                     - ".ps"
                                                     - ".cs"

     ==================== ============== ========= =====================================================================

..

  .. table:: AMDPAL Code Object Hardware Stage Map
     :name: amdgpu-amdpal-code-object-hardware-stage-map-table

     +-------------+--------------+-----------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                            |
     +=============+==============+=======================================================================+
     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
     |- ".hs"      |              |for the definition of the keys included in that map.                   |
     |- ".es"      |              |                                                                       |
     |- ".gs"      |              |                                                                       |
     |- ".vs"      |              |                                                                       |
     |- ".ps"      |              |                                                                       |
     |- ".cs"      |              |                                                                       |
     +-------------+--------------+-----------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table

     ========================== ============== ========= ===============================================================
     String Key                 Value Type     Required? Description
     ========================== ============== ========= ===============================================================
     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
     ".lds_size"                integer                  Local Data Share size in bytes.
     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
     ".vgpr_count"              integer                  Number of VGPRs used.
     ".agpr_count"              integer                  Number of AGPRs used.
     ".sgpr_count"              integer                  Number of SGPRs used.
     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
                                                         directive to instruct the compiler to limit the VGPR usage to
                                                         be less than or equal to the specified value (only set if
                                                         different from HW default).
     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
                                                         default).
     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
                                3 integers
     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
     ".writes_depth"            boolean                  The shader writes out a depth value.
     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
                                                         memory or GDS.
     ".uses_prim_id"            boolean                  The shader uses PrimID.
     ========================== ============== ========= ===============================================================

..

  .. table:: AMDPAL Code Object Shader Function Map
     :name: amdgpu-amdpal-code-object-shader-function-map-table

     =============== ============== ====================================================================
     String Key      Value Type     Description
     =============== ============== ====================================================================
     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                    entry address. The value is the function's metadata. See
                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
     =============== ============== ====================================================================

..

  .. table:: AMDPAL Code Object Shader Function Metadata Map
     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table

     ============================= ============== =================================================================
     String Key                    Value Type     Description
     ============================= ============== =================================================================
     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
                                   2 integers     is implementation defined, and cannot be relied on between
                                                  different builds of the compiler.
     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
     ".lds_size"                   integer        Size in bytes of LDS memory.
     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
     ".shader_subtype"             string         Shader subtype/kind. Values include:

                                                    - "Unknown"

     ============================= ============== =================================================================

..

  .. table:: AMDPAL Code Object Register Map
     :name: amdgpu-amdpal-code-object-register-map-table

     ========================== ============== ====================================================================
     32-bit Integer Key         Value Type     Description
     ========================== ============== ====================================================================
     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
                                               a GRBM register (i.e., driver accessible GPU register number, not
                                               shader GPR register number). The driver is required to program each
                                               specified register to the corresponding specified value when
                                               executing this pipeline. Typically, the ``reg offsets`` are the
                                               ``uint16_t`` offsets to each register as defined by the hardware
                                               chip headers. The register is set to the provided value. However, a
                                               ``reg offset`` that specifies a user data register (e.g.,
                                               COMPUTE_USER_DATA_0) needs special treatment. See
                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
                                               information.
     ========================== ============== ====================================================================

.. _amdgpu-amdpal-code-object-user-data-section:

User Data
+++++++++

Each hardware stage has a set of 32-bit physical SPI *user data registers*
(either 16 or 32 based on graphics IP and the stage) which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware
shader.

PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline a client can use to pass arguments from a command
buffer to one or more shaders in that pipeline. The ELF code object must
specify a mapping from virtualized *user data entries* to physical *user
data registers*, and PAL is responsible for implementing that mapping,
including spilling overflow *user data entries* to memory if needed.

Since the *user data registers* are GRBM-accessible SPI registers, this
mapping is actually embedded in the ``.registers`` metadata entry. For
most registers, the value in that map is a literal 32-bit value that
should be written to the register by the driver. However, when the
register is a *user data register* (any USER_DATA register e.g.,
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
the driver to write either a *user data entry* value or one of several
driver-internal values to the register. This encoding is described in
the following table:

.. note::

  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
  always be programmed to the address of the GlobalTable, and *user data
  register* 1 must always be programmed to the address of the PerShaderTable.

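
The encoding of a *user data register* value can be sketched as a simple
lookup. The function and constant names below are hypothetical, and the
dictionary covers only a subset of the mapping values listed in the AMDPAL User
Data Mapping table, so any other value is reported as unknown rather than
interpreted.

```python
MAX_USER_DATA_ENTRIES = 128  # PAL exposes 128 user data entries per pipeline

# Subset of the driver-internal mapping values documented in this section.
SPECIAL_USER_DATA = {
    0x10000000: "GlobalTable",
    0x10000001: "PerShaderTable",
    0x10000002: "SpillTable",
    0x10000003: "BaseVertex",
    0x10000004: "BaseInstance",
    0x10000005: "DrawIndex",
    0x10000006: "Workgroup",
    0x1000000A: "EsGsLdsSize",
    0x1000000B: "ViewId",
    0x1000000C: "StreamOutTable",
    0x1000000D: "PerShaderPerfData",
    0x1000000F: "VertexBufferTable",
    0x10000010: "UavExportTable",
    0x10000011: "NggCullingData",
}


def describe_user_data_value(value):
    """Describe what a user data register's metadata value refers to."""
    if 0 <= value < MAX_USER_DATA_ENTRIES:
        return f"user_data_entry[{value}]"
    return SPECIAL_USER_DATA.get(value, "unknown")
```

A value of 5 therefore denotes *user data entry* 5, while ``0x10000002``
denotes the spill table pointer.
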
..

  .. table:: AMDPAL User Data Mapping
     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table

     ==========  =================  ===============================================================================
     Value       Name               Description
     ==========  =================  ===============================================================================
     0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
     0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
                                    always point to *user data register* 0).
     0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
                                    for more detail (should always point to *user data register* 1).
     0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
                                    more detail.
     0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
                                    reference the draw index in the vertex shader. Only supported by the first
                                    stage in a graphics pipeline.
     0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
                                    a graphics pipeline.
     0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
                                    graphics pipeline.
     0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
                                    a buffer containing the grid dimensions for a Compute dispatch operation. The
                                    high half of the address is stored in the next sequential user-SGPR. Only
                                    supported by compute pipelines.
     0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
                                    space used for the ES/GS pseudo-ring-buffer for passing data between shader
                                    stages.
     0x1000000B  ViewId             View ID (32-bit unsigned integer) that identifies a view of graphics
                                    pipeline instancing.
     0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
                                    can only appear for one shader stage per pipeline.
     0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
     0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
                                    only appear for one shader stage per pipeline.
     0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
                                    only appear for one shader stage per pipeline (PS). These replace color targets
                                    and are completely separate from any UAVs used by the shader. This is optional,
                                    and only used by the PS when UAV exports are used to replace color-target
                                    exports to optimize specific shaders.
     0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
                                    some NGG pipelines to perform culling.  This value contains the address of the
17286                                    first of two consecutive registers which provide the full GPU address.
17287     0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
17288     ==========  =================  ===============================================================================
17289
17290.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
17291
17292Per-Shader Table
17293################
17294
17295Low 32 bits of the GPU address for an optional buffer in the ``.data``
17296section of the ELF. The high 32 bits of the address match the high 32 bits
17297of the shader's program counter.
17298
The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.

Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``, where *xs* corresponds to the
hardware shader stage the data is for. For example,
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
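
As a minimal sketch of the address reconstruction and symbol naming described
in this section, the following Python helpers combine the 32-bit user data
value with the high half of the shader's program counter and build the
per-stage symbol name. The function names and sample values are hypothetical
and only illustrate the arithmetic; they are not part of PAL or LLVM.

.. code-block:: python

  def per_shader_table_address(table_lo32, shader_pc):
      """Rebuild the 64-bit table address (hypothetical helper).

      The low 32 bits come from the user data register; the high 32 bits
      are taken from the shader's program counter.
      """
      return (shader_pc & 0xFFFFFFFF00000000) | (table_lo32 & 0xFFFFFFFF)


  def per_shader_table_symbol(stage):
      """Return the symbol name for a hardware stage abbreviation such as 'cs'."""
      return "_amdgpu_" + stage + "_shdr_intrl_data"

Because only the low half is passed, this scheme relies on the table and the
shader code sharing the same upper 32 address bits.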
17309
17310.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
17311
17312Spill Table
17313###########
17314
It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, with
one user data register pointing to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address. The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.
17325
17326Unspecified OS
17327--------------
17328
17329This section provides code conventions used when the target triple OS is
17330empty (see :ref:`amdgpu-target-triples`).
17331
17332Trap Handler ABI
17333~~~~~~~~~~~~~~~~
17334
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
17338
17339  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
17340     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
17341
17342     =============== =============== ===========================================
17343     Usage           Code Sequence   Description
17344     =============== =============== ===========================================
17345     llvm.trap       s_endpgm        Causes wavefront to be terminated.
17346     llvm.debugtrap  *none*          Compiler warning given that there is no
17347                                     trap handler installed.
17348     =============== =============== ===========================================
17349
17350Core file format
17351================
17352
This section describes the format of core files supporting AMDGPU. Core dumps
for an AMDGPU program can come in two flavors: split or unified core files.
17355
17356The split layout consists of one host core file containing the information to
17357rebuild the image of the host process and one AMDGPU core file that contains
17358the information for the AMDGPU agents used in the process.  The AMDGPU core
17359file consists of:
17360
17361* A note describing the state of the AMDGPU agents, AMDGPU queues, and AMDGPU
17362  runtime for the process (see :ref:`amdgpu_corefile_note`).
17363* A list of load segments containing an image of the AMDGPU agents' memory (see
17364  :ref:`amdgpu_corefile_memory`).
17365
17366The unified core file is the union of all the information contained in
17367the two files of the split layout (all notes and load segments).  It contains
17368all the information required to reconstruct the image of the process across all
17369the agents.
17370
17371Core file header
17372----------------
17373
An AMDGPU core file is an ``ELF64`` core file.  The content of the header
differs between the unified core file layout and the split (AMDGPU) core file
layout.
17376
17377Split files
17378~~~~~~~~~~~
17379
17380In the split files layout, the AMDGPU core file is an ``ELF64`` file with the
17381header configured as described in :ref:`amdgpu-corefile-headers-table`:
17382
  .. table:: AMDGPU corefile headers
     :name: amdgpu-corefile-headers-table

     ========================== ===================================
     Field                      Value
     ========================== ===================================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64`` (``0x2``)
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB`` (``0x1``)
     ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA`` (``0x40``)
     ``e_type``                 ``ET_CORE`` (``0x4``)
     ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA_V5``
     ``e_machine``              ``EM_AMDGPU`` (``0xe0``)
     ========================== ===================================
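
The header values in the table above can be checked with a few lines of
Python. This is a minimal sketch, assuming the standard ``ELF64`` layout
(``e_ident`` in the first 16 bytes, followed by the 16-bit ``e_type`` and
``e_machine`` fields); the function name is hypothetical.
``e_ident[EI_ABIVERSION]`` is not checked because the table gives no numeric
value for it.

.. code-block:: python

  import struct

  # Numeric values taken from the table above.
  ELFCLASS64 = 0x2
  ELFDATA2LSB = 0x1
  ELFOSABI_AMDGPU_HSA = 0x40
  ET_CORE = 0x4
  EM_AMDGPU = 0xE0


  def is_amdgpu_split_core_header(header):
      """Return True if the ELF header bytes match the split-layout table."""
      if len(header) < 20 or header[:4] != b"\x7fELF":
          return False
      ei_class = header[4]   # e_ident[EI_CLASS]
      ei_data = header[5]    # e_ident[EI_DATA]
      ei_osabi = header[7]   # e_ident[EI_OSABI]
      # e_type and e_machine follow the 16-byte e_ident array.
      e_type, e_machine = struct.unpack_from("<HH", header, 16)
      return (
          ei_class == ELFCLASS64
          and ei_data == ELFDATA2LSB
          and ei_osabi == ELFOSABI_AMDGPU_HSA
          and e_type == ET_CORE
          and e_machine == EM_AMDGPU
      )

A unified core file fails this check by design, since its header describes the
host architecture and process.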
17396
17397Unified file
17398~~~~~~~~~~~~
17399
17400In the unified core file mode, the ``ELF64`` headers are set to describe
17401the host architecture and process.
17402
17403.. _amdgpu_corefile_note:
17404
17405Core file notes
17406---------------
17407
17408An AMDGPU core file must contain one snapshot note in a ``PT_NOTE`` segment.
17409When using a split core file layout, this note is in the AMDGPU file.
17410
The note record vendor field is "``AMDGPU``" and the record type is
"``NT_AMDGPU_KFD_CORE_STATE``" (see :ref:`amdgpu-note-records-v3-onwards`).
17413
17414The content of the note is defined in table
17415:ref:`amdgpu-core-snapshot-note-layout-table-v1`:
17416
17417  .. table:: AMDGPU snapshot note format V1
17418     :name: amdgpu-core-snapshot-note-layout-table-v1
17419
17420     ================================ ======================================= ======================= ============== ===========================
17421     Field                            Type                                    Size (bytes)            Byte alignment Comment
17422     ================================ ======================================= ======================= ============== ===========================
17423     ``version_major``                ``uint32``                              4                       4              ``KFD_IOCTL_MAJOR_VERSION``
17424     ``version_minor``                ``uint32``                              4                       4              ``KFD_IOCTL_MINOR_VERSION``
17425     ``runtime_info_size``            ``uint64``                              8                       8              Must be a multiple of 8
17426     ``n_agents``                     ``uint32``                              4                       8
17427     ``agent_info_entry_size``        ``uint32``                              4                       4              Must be a multiple of 8
17428     ``n_queues``                     ``uint32``                              4                       8
17429     ``queue_info_entry_size``        ``uint32``                              4                       4              Must be a multiple of 8
17430     ``runtime_info``                 ``kfd_runtime_info``                    ``runtime_info_size``   8
17431     ``agents_info``                  ``kfd_dbg_device_info_entry[n_agents]`` ``n_agents *            8
17432                                                                              agent_info_entry_size``
17433     ``queues_info``                  ``kfd_queue_snapshot_entry[n_queues]``  ``n_queues *
17434                                                                              queue_info_entry_size`` 8
17435     ================================ ======================================= ======================= ============== ===========================
17436
The definitions of all the ``kfd_*`` types come from the
``include/uapi/linux/kfd_ioctl.h`` header file from the KFD repository. It is
usually installed in ``/usr/include/linux/kfd_ioctl.h``. The version of the
``kfd_ioctl.h`` file used must define values for
``KFD_IOCTL_MAJOR_VERSION`` and ``KFD_IOCTL_MINOR_VERSION`` matching
the values of ``version_major`` and ``version_minor`` from the
note.
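
The layout in the table above can be walked with a short Python sketch. It
assumes little-endian encoding (consistent with ``ELFDATA2LSB``) and that the
fields are packed in table order; with the listed sizes and alignments this
yields a 32-byte fixed header with no padding. The function name and the
returned dictionary keys are hypothetical, and the ``kfd_*`` payloads are left
as raw bytes rather than decoded.

.. code-block:: python

  import struct

  # version_major, version_minor, runtime_info_size, n_agents,
  # agent_info_entry_size, n_queues, queue_info_entry_size
  _FIXED_HEADER = struct.Struct("<IIQIIII")


  def parse_snapshot_note(desc):
      """Split the note descriptor into its fixed header and raw payloads."""
      (major, minor, runtime_info_size, n_agents, agent_entry_size,
       n_queues, queue_entry_size) = _FIXED_HEADER.unpack_from(desc, 0)

      # The table requires these sizes to be multiples of 8.
      for size in (runtime_info_size, agent_entry_size, queue_entry_size):
          if size % 8 != 0:
              raise ValueError("size fields must be multiples of 8")

      total = (_FIXED_HEADER.size + runtime_info_size
               + n_agents * agent_entry_size + n_queues * queue_entry_size)
      if total > len(desc):
          raise ValueError("note descriptor is truncated")

      offset = _FIXED_HEADER.size
      runtime_info = desc[offset:offset + runtime_info_size]
      offset += runtime_info_size

      agents = [desc[offset + i * agent_entry_size:
                     offset + (i + 1) * agent_entry_size]
                for i in range(n_agents)]
      offset += n_agents * agent_entry_size

      queues = [desc[offset + i * queue_entry_size:
                     offset + (i + 1) * queue_entry_size]
                for i in range(n_queues)]

      return {
          "version": (major, minor),
          "runtime_info": runtime_info,
          "agents": agents,
          "queues": queues,
      }

Carrying the entry sizes in the note lets a reader step over entries even when
its own ``kfd_ioctl.h`` declares a different structure size.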
17444
17445.. _amdgpu_corefile_memory:
17446
17447Memory segments
17448---------------
17449
17450An AMDGPU core file must contain an image of the AMDGPU agents' memory in load
17451segments (of type ``PT_LOAD``).  Those segments must correspond to the memory
17452regions where the content of the agent memory is mapped into the host process
17453by the ROCr runtime (note that those memory mappings are usually not readable
17454by the process itself).
17455
17456When using the split core file layout, those segments must be included in the
17457AMDGPU core file.
17458
17459Source Languages
17460================
17461
17462.. _amdgpu-opencl:
17463
17464OpenCL
17465------
17466
17467When the language is OpenCL the following differences occur:
17468
174691. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
174702. The AMDGPU backend appends additional arguments to the kernel's explicit
17471   arguments for the AMDHSA OS (see
17472   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
174733. Additional metadata is generated
17474   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
17475
  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================
17494
17495.. _amdgpu-hcc:
17496
17497HCC
17498---
17499
17500When the language is HCC the following differences occur:
17501
175021. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
17503
17504.. _amdgpu-assembler:
17505
17506Assembler
17507---------
17508
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX11.

This section describes the general syntax for instructions and operands.
17513
17514Instructions
17515~~~~~~~~~~~~
17516
17517An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
17518
17519  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
17520    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
17521
17522:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
17523:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
17524
17525The order of operands and modifiers is fixed.
17526Most modifiers are optional and may be omitted.
17527
17528Links to detailed instruction syntax description may be found in the following
17529table. Note that features under development are not included
17530in this description.
17531
17532    ============= ============================================= =======================================
17533    Architecture  Core ISA                                      ISA Variants and Extensions
17534    ============= ============================================= =======================================
17535    GCN 2         :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`             \-
17536    GCN 3, GCN 4  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`             \-
17537    GCN 5         :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
17538
17539                                                                :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
17540
17541                                                                :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
17542
17543                                                                :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
17544
17545                                                                :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
17546
17547                                                                :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
17548
17549    CDNA 1        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
17550
17551    CDNA 2        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
17552
17553    CDNA 3        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
17554
17555                                                                :doc:`gfx941<AMDGPU/AMDGPUAsmGFX940>`
17556
17557                                                                :doc:`gfx942<AMDGPU/AMDGPUAsmGFX940>`
17558
17559    RDNA 1        :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>`     :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
17560
17561                                                                :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
17562
17563                                                                :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
17564
17565                                                                :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
17566
17567    RDNA 2        :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>`   :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
17568
17569                                                                :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
17570
17571                                                                :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
17572
17573                                                                :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
17574
17575                                                                :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
17576
17577                                                                :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
17578
17579                                                                :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
17580
17581    RDNA 3        :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>`           :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
17582
17583                                                                :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
17584
17585                                                                :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
17586
17587                                                                :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
17588    ============= ============================================= =======================================
17589
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_,
[AMD-GCN-GFX940-GFX942-CDNA3]_, [AMD-GCN-GFX10-RDNA1]_, [AMD-GCN-GFX10-RDNA2]_,
[AMD-GCN-GFX11-RDNA3]_ and [AMD-GCN-GFX11-RDNA3.5]_.
17597
17598Operands
17599~~~~~~~~
17600
A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.
17602
17603Modifiers
17604~~~~~~~~~
17605
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
17608
17609Instruction Examples
17610~~~~~~~~~~~~~~~~~~~~
17611
17612DS
17613++
17614
.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
17624
17625FLAT
17626++++
17627
.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the
ISA Manual.
17638
17639MUBUF
17640+++++
17641
.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the
ISA Manual.
17652
17653SMRD/SMEM
17654+++++++++
17655
.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.
17666
17667SOP1
17668++++
17669
.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA Manual.
17682
17683SOP2
17684++++
17685
.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA Manual.
17700
17701SOPC
17702++++
17703
.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA Manual.
17713
17714SOPP
17715++++
17716
.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
17737
17738VALU
17739++++
17740
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically selects the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:
17744
17745* _e32 for 32-bit VOP1/VOP2/VOPC
17746* _e64 for 64-bit VOP3
17747* _dpp for VOP_DPP
17748* _e64_dpp for VOP3 with DPP
17749* _sdwa for VOP_SDWA
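
The suffix list above can be summarized as a small lookup. The following
Python sketch is illustrative only and is not how the assembler is
implemented; the function name is hypothetical. Note that ``_e64_dpp`` has to
be tested before ``_dpp``, because the shorter suffix would also match it.

.. code-block:: python

  # Longest suffix first, so that "_e64_dpp" is not mistaken for "_dpp".
  _ENCODING_SUFFIXES = (
      ("_e64_dpp", "VOP3 with DPP"),
      ("_sdwa", "VOP_SDWA"),
      ("_dpp", "VOP_DPP"),
      ("_e32", "32-bit VOP1/VOP2/VOPC"),
      ("_e64", "64-bit VOP3"),
  )


  def forced_encoding(opcode):
      """Return (base opcode, forced encoding), or (opcode, None) if unforced."""
      for suffix, encoding in _ENCODING_SUFFIXES:
          if opcode.endswith(suffix):
              return opcode[:-len(suffix)], encoding
      return opcode, None

For example, ``forced_encoding("v_mov_b32")`` returns ``("v_mov_b32", None)``,
meaning the choice of encoding is left to the assembler.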
17750
17751VOP1/VOP2/VOP3/VOPC examples:
17752
.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3
17767
17768VOP_DPP examples:
17769
.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

17782VOP3_DPP examples (Available on GFX11+):
17783
.. code-block:: nasm

  v_add_f32_e64_dpp v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
  v_sqrt_f32_e64_dpp v0, v1 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_ldexp_f32 v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
17789
17790VOP_SDWA examples:
17791
.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.
17801
17802.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
17803
17804Code Object V2 Predefined Symbols
17805~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17806
17807.. warning::
17808  Code object V2 generation is no longer supported by this version of LLVM.
17809
17810The AMDGPU assembler defines and updates some symbols automatically. These
17811symbols do not affect code generation.
17812
17813.option.machine_version_major
17814+++++++++++++++++++++++++++++
17815
17816Set to the GFX major generation number of the target being assembled for. For
17817example, when assembling for a "GFX9" target this will be set to the integer
17818value "9". The possible GFX major generation numbers are presented in
17819:ref:`amdgpu-processors`.
17820
17821.option.machine_version_minor
17822+++++++++++++++++++++++++++++
17823
17824Set to the GFX minor generation number of the target being assembled for. For
17825example, when assembling for a "GFX810" target this will be set to the integer
17826value "1". The possible GFX minor generation numbers are presented in
17827:ref:`amdgpu-processors`.
17828
17829.option.machine_version_stepping
17830++++++++++++++++++++++++++++++++
17831
17832Set to the GFX stepping generation number of the target being assembled for.
17833For example, when assembling for a "GFX704" target this will be set to the
17834integer value "4". The possible GFX stepping generation numbers are presented
17835in :ref:`amdgpu-processors`.
17836
17837.kernel.vgpr_count
17838++++++++++++++++++
17839
17840Set to zero each time a
17841:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
17842encountered. At each instruction, if the current value of this symbol is less
17843than or equal to the maximum VGPR number explicitly referenced within that
17844instruction then the symbol value is updated to equal that VGPR number plus
17845one.
17846
17847.kernel.sgpr_count
17848++++++++++++++++++
17849
Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.
17856
17857.. _amdgpu-amdhsa-assembler-directives-v2:
17858
17859Code Object V2 Directives
17860~~~~~~~~~~~~~~~~~~~~~~~~~
17861
17862.. warning::
17863  Code object V2 generation is no longer supported by this version of LLVM.
17864
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
17867
17868.hsa_code_object_version major, minor
17869+++++++++++++++++++++++++++++++++++++
17870
17871*major* and *minor* are integers that specify the version of the HSA code
17872object that will be generated by the assembler.
17873
17874.hsa_code_object_isa [major, minor, stepping, vendor, arch]
17875+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
17876
17877
17878*major*, *minor*, and *stepping* are all integers that describe the instruction
17879set architecture (ISA) version of the assembly program.
17880
17881*vendor* and *arch* are quoted strings. *vendor* should always be equal to
17882"AMD" and *arch* should always be equal to "AMDGPU".
17883
17884By default, the assembler will derive the ISA version, *vendor*, and *arch*
17885from the value of the -mcpu option that is passed to the assembler.
17886
17887.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
17888
17889.amdgpu_hsa_kernel (name)
17890+++++++++++++++++++++++++
17891
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
17895
17896.amd_kernel_code_t
17897++++++++++++++++++
17898
17899This directive marks the beginning of a list of key / value pairs that are used
17900to specify the amd_kernel_code_t object that will be emitted by the assembler.
17901The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
17902amd_kernel_code_t values that are unspecified a default value will be used. The
17903default value for all keys is 0, with the following exceptions:
17904
- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
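
Several of the defaults above are stored as exponents rather than as sizes.
The following Python sketch shows the *wavefront_size* default and the
power-of-two decoding; the function names are hypothetical and the logic only
restates the rules in the list.

.. code-block:: python

  def default_wavefront_size_log2(gfx_major, wavefrontsize64):
      """Default value of the wavefront_size key for a given target."""
      if gfx_major < 10:
          return 6
      return 6 if wavefrontsize64 else 5


  def decode_power_of_two(n):
      """Convert a stored exponent n into the size or alignment 2^n."""
      return 1 << n

With these rules, the default segment alignment value of 4 corresponds to a
16-byte alignment, and a *wavefront_size* of 5 corresponds to 32 lanes.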
17925
17926The *.amd_kernel_code_t* directive must be placed immediately after the
17927function label and before any instructions.
17928
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.
17931
17932.. _amdgpu-amdhsa-assembler-example-v2:
17933
17934Code Object V2 Example Source Code
17935~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17936
17937.. warning::
17938  Code object V2 generation is no longer supported by this version of LLVM.
17939
17940Here is an example of a minimal assembly source file, defining one HSA kernel:
17941
.. code::
   :number-lines:

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

     .amd_kernel_code_t
        enable_sgpr_kernarg_segment_ptr = 1
        is_ptr64 = 1
        compute_pgm_rsrc1_vgprs = 0
        compute_pgm_rsrc1_sgprs = 0
        compute_pgm_rsrc2_user_sgpr = 2
        compute_pgm_rsrc1_wgp_mode = 0
        compute_pgm_rsrc1_mem_ordered = 0
        compute_pgm_rsrc1_fwd_progress = 1
     .end_amd_kernel_code_t

     s_load_dwordx2 s[0:1], s[0:1], 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:

Code Object V3 and Above Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.
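
As a sketch of how the three generation symbols might be used, the fragment
below selects an instruction based on the major generation number and then
emits the three values as data. It assumes the assembler accepts these symbols
in absolute expressions such as ``.if`` conditions; the choice of store
instructions is purely illustrative.

.. code::

   // Use a global store on GFX9 and later, and a flat store otherwise.
   .if .amdgcn.gfx_generation_number >= 9
     global_store_dword v[1:2], v0, off
   .else
     flat_store_dword v[1:2], v0
   .endif

   // When assembling for gfx900 these emit the bytes 9, 0 and 0.
   .byte .amdgcn.gfx_generation_number
   .byte .amdgcn.gfx_generation_minor
   .byte .amdgcn.gfx_generation_stepping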

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.
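
The update rule for both tracking symbols can be traced on a short instruction
sequence. The fragment below is only a sketch: the instructions are arbitrary
and the comments show the symbol values that follow from the rule described
above.

.. code::

   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   v_mov_b32 v0, 0          // highest VGPR is v0: next_free_vgpr becomes 1
   v_mov_b32 v7, v0         // highest VGPR is v7: next_free_vgpr becomes 8
   v_mov_b32 v3, v7         // highest VGPR is v7: next_free_vgpr stays 8
   s_mov_b32 s5, 0          // highest SGPR is s5: next_free_sgpr becomes 6
   s_mov_b64 s[2:3], 0      // highest SGPR is s3: next_free_sgpr stays 6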

.. _amdgpu-amdhsa-assembler-directives-v3-onwards:

Code Object V3 and Above Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.. _amdgpu-assembler-directive-amdgcn-target:

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.. note::

  The target ID syntax used for code object V2 to V3 for this directive differs
  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
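
As an illustration, the two spellings of the directive might look as follows
for a GFX900 target with XNACK enabled. This is a sketch that assumes the
target ID forms described in :ref:`amdgpu-target-id` and
:ref:`amdgpu-target-id-v2-v3`; a source file would contain only the form that
matches the code object version being assembled.

.. code::

   // Code object V2 to V3 form
   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack"

   // Code object V4 and above form, assuming the canonical target ID syntax
   .amdgcn_target "amdgcn-amd-amdhsa--gfx900:xnack+"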

.. _amdgpu-assembler-directive-amdhsa-code-object-version:

.amdhsa_code_object_version <version>
+++++++++++++++++++++++++++++++++++++

Optional directive which declares the code object version to be generated by the
assembler. If not present, a default value will be used.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== =================== ============ ===================
     Directive                                                Default             Supported On Description
     ======================================================== =================== ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX12   Controls GROUP_SEGMENT_FIXED_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX12   Controls PRIVATE_SEGMENT_FIXED_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX12   Controls KERNARG_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX12   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  GFX940)
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX12   Controls ENABLE_SGPR_DISPATCH_PTR in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX12   Controls ENABLE_SGPR_QUEUE_PTR in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX12   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX12   Controls ENABLE_SGPR_DISPATCH_ID in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  GFX940)
     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX12   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_wavefront_size32``                             Target              GFX10-GFX12  Controls ENABLE_WAVEFRONT_SIZE32 in
                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Specific
                                                              (wavefrontsize64)
     ``.amdhsa_uses_dynamic_stack``                           0                   GFX6-GFX12   Controls USES_DYNAMIC_STACK in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
                                                                                  (except      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
                                                                                  GFX940)
     ``.amdhsa_enable_private_segment``                       0                   GFX940,      Controls ENABLE_PRIVATE_SEGMENT in
                                                                                  GFX11-GFX12  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_ID_X in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_INFO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX12   Controls ENABLE_VGPR_WORKITEM_ID in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX12   Maximum VGPR number explicitly referenced, plus one.
                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX12   Maximum SGPR number explicitly referenced, plus one.
                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file.
                                                                                  GFX940       Used to calculate ACCUM_OFFSET in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX12   Whether the kernel may use the special VCC SGPR.
                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
                                                                                  (except      scratch memory. Used to calculate
                                                                                  GFX940)      GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                              (xnack)
     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX12   Controls FLOAT_ROUND_MODE_32 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX12   Controls FLOAT_ROUND_MODE_16_64 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX12   Controls FLOAT_DENORM_MODE_32 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX12   Controls FLOAT_DENORM_MODE_16_64 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX11   Controls ENABLE_DX10_CLAMP in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX11   Controls ENABLE_IEEE_MODE in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_round_robin_scheduling``                       0                   GFX12        Controls ENABLE_WG_RR_EN in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX12   Controls FP16_OVFL in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in
                                                              Feature             GFX940,      :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                              Specific            GFX11-GFX12
                                                              (tgsplit)
     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10-GFX12  Controls ENABLE_WGP_MODE in
                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Specific
                                                              (cumode)
     ``.amdhsa_memory_ordered``                               1                   GFX10-GFX12  Controls MEM_ORDERED in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_forward_progress``                             0                   GFX10-GFX12  Controls FWD_PROGRESS in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_shared_vgpr_count``                            0                   GFX10-GFX11  Controls SHARED_VGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_user_sgpr_kernarg_preload_length``             0                   GFX90A,      Controls KERNARG_PRELOAD_SPEC_LENGTH in
                                                                                  GFX940       :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_preload_offset``             0                   GFX90A,      Controls KERNARG_PRELOAD_SPEC_OFFSET in
                                                                                  GFX940       :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ======================================================== =================== ============ ===================
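
As a sketch of how the directives in the table combine, the fragment below
describes a hypothetical kernel named ``my_kernel`` for a GFX9 target. The
kernel name and all values are assumptions chosen for illustration. Directives
that are omitted take the defaults listed above, and only the two directives
marked "Required" for this target must be present.

.. code::

   .rodata
   .p2align 6
   .amdhsa_kernel my_kernel
     // 256 bytes of group segment and an 8 byte kernarg segment.
     .amdhsa_group_segment_fixed_size 256
     .amdhsa_kernarg_size 8
     // Request the kernarg segment pointer in user SGPRs.
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     // Required: highest explicitly referenced register number, plus one.
     .amdhsa_next_free_vgpr 4
     .amdhsa_next_free_sgpr 8
   .end_amdhsa_kernel

On GFX90A and GFX940 the table also marks ``.amdhsa_accum_offset`` as
"Required", so the same block would need that directive there.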

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3-onwards:

Code Object V3 and Above Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   .text
   .globl hello_world
   .p2align 8
   .type hello_world,@function
   hello_world:
     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

   .rodata
   .p2align 6
   .amdhsa_kernel hello_world
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 0
   amdhsa.kernels:
     - .name: hello_world
       .symbol: hello_world.kd
       .kernarg_segment_size: 48
       .group_segment_fixed_size: 0
       .private_segment_fixed_size: 0
       .kernarg_segment_align: 4
       .wavefront_size: 64
       .sgpr_count: 2
       .vgpr_count: 3
       .max_flat_workgroup_size: 256
       .args:
         - .size: 8
           .offset: 0
           .value_kind: global_buffer
           .address_space: global
           .actual_access: write_only
   //...
   .end_amdgpu_metadata

This kernel is equivalent to the following HIP program:

.. code::
   :number-lines:

   __global__ void hello_world(float *p) {
       *p = 3.14159f;
   }

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // gpr tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:
     // ...
     s_endpgm
   .Lkern0_end:
     .size   kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .hidden func1
   .globl func1
   .p2align 2
   .type func1,@function
   func1:
     // ...
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size   func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     // ...
     s_getpc_b64 s[4:5]
     // +4 and +12 are the offsets of the two literals from the address
     // returned by s_getpc_b64
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
     s_swappc_b64 s[30:31], s[4:5]
     // ...
     s_endpgm
   .Lkern1_end:
     .size   kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

The assembler cannot identify connected components, so these symbols cannot
automatically track the usage of each kernel. However, careful organization of
the kernels and functions in the source file can often keep the additional
effort required to calculate GPR usage accurately to a minimum.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
.. [AMD-GCN-GFX940-GFX942-CDNA3] `AMD Instinct MI300 Instruction Set Architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
.. [AMD-GCN-GFX11-RDNA3.5] `AMD RDNA 3.5 Instruction Set Architecture <https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna35_instruction_set_architecture.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <https://research.cs.wisc.edu/multifacet/papers/asplos14_hrf.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__