=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPU/AMDGPUAsmGFX11
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa``     Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features.
See :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported on the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_ [AMD-GCN-GFX940-GFX942-CDNA3]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - AMD Instinct MI210 Accelerator
                                                    - tgsplit           flat                          - AMD Instinct MI250 Accelerator
                                                    - xnack             scratch                       - AMD Instinct MI250X Accelerator
                                                    - kernarg preload - Packed
                                                      (except MI210)    work-item
                                                                        IDs

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                    - kernarg preload - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx941``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                    - kernarg preload - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx942``                  ``amdgcn``   dGPU  - sramecc         - Architected                   - AMD Instinct MI300X
                                                    - tgsplit           flat                          - AMD Instinct MI300A
                                                    - xnack             scratch
                                                    - kernarg preload - Packed
                                                                        work-item
                                                                        IDs

     ``gfx950``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                    - kernarg preload - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
                                                                                                      - Radeon PRO W6800
                                                                                                      - Radeon PRO V620
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.
     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  - Radeon PRO W7900 Dual Slot
                                                    - wavefrontsize64   flat                          - Radeon PRO W7900
                                                                        scratch                       - Radeon PRO W7800
                                                                      - Packed                        - Radeon RX 7900 XTX
                                                                        work-item                     - Radeon RX 7900 XT
                                                                        IDs                           - Radeon RX 7900 GRE

     ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     **GCN GFX11 (RDNA 3.5)** [AMD-GCN-GFX11-RDNA3.5]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1150``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx1151``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx1152``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx1153``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     **GCN GFX12 (RDNA 4)** [AMD-GCN-GFX12-RDNA4]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1200``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx1201``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     =========== =============== ============ ===== ================= =============== =============== ======================

Generic processors allow execution of a single code object on any of the
processors it supports. Such code objects may not perform as well as those
built for the non-generic processors.

Generic processors are only available on code object V6 and above (see
:ref:`amdgpu-elf-code-object`).

Generic processor code objects are versioned. See
:ref:`amdgpu-generic-processor-versioning` for more information on how
versioning works.

  .. table:: AMDGPU Generic Processors
     :name: amdgpu-generic-processor-table

     ==================== ============== ================= ================== ================= =================================
     Processor            Target         Supported         Target Features    Target Properties Target Restrictions
                          Triple         Processors        Supported
                          Architecture
     ==================== ============== ================= ================== ================= =================================
     ``gfx9-generic``     ``amdgcn``     - ``gfx900``      - xnack            - Absolute flat   - ``v_mad_mix`` instructions
                                         - ``gfx902``                           scratch           are not available on
                                         - ``gfx904``                                             ``gfx900``, ``gfx902``,
                                         - ``gfx906``                                             ``gfx909``, ``gfx90c``
                                         - ``gfx909``                                           - ``v_fma_mix`` instructions
                                         - ``gfx90c``                                             are not available on ``gfx904``
                                                                                                - sramecc is not available on
                                                                                                  ``gfx906``
                                                                                                - The following instructions
                                                                                                  are not available on ``gfx906``:

                                                                                                  - ``v_fmac_f32``
                                                                                                  - ``v_xnor_b32``
                                                                                                  - ``v_dot4_i32_i8``
                                                                                                  - ``v_dot8_i32_i4``
                                                                                                  - ``v_dot2_i32_i16``
                                                                                                  - ``v_dot2_u32_u16``
                                                                                                  - ``v_dot4_u32_u8``
                                                                                                  - ``v_dot8_u32_u4``
                                                                                                  - ``v_dot2_f32_f16``

     ``gfx9-4-generic``   ``amdgcn``     - ``gfx940``      - xnack            - Absolute flat   FP8 and BF8 instructions,
                                         - ``gfx941``      - sramecc            scratch         FP8 and BF8 conversion instructions,
                                         - ``gfx942``                                           as well as instructions with XF32 format support
                                         - ``gfx950``                                           are not available.

     ``gfx10-1-generic``  ``amdgcn``     - ``gfx1010``     - xnack            - Absolute flat   - The following instructions are
                                         - ``gfx1011``     - wavefrontsize64    scratch           not available on ``gfx1011``
                                         - ``gfx1012``     - cumode                               and ``gfx1012``
                                         - ``gfx1013``
                                                                                                  - ``v_dot4_i32_i8``
                                                                                                  - ``v_dot8_i32_i4``
                                                                                                  - ``v_dot2_i32_i16``
                                                                                                  - ``v_dot2_u32_u16``
                                                                                                  - ``v_dot2c_f32_f16``
                                                                                                  - ``v_dot4c_i32_i8``
                                                                                                  - ``v_dot4_u32_u8``
                                                                                                  - ``v_dot8_u32_u4``
                                                                                                  - ``v_dot2_f32_f16``

                                                                                                - BVH Ray Tracing instructions
                                                                                                  are not available on
                                                                                                  ``gfx1013``

     ``gfx10-3-generic``  ``amdgcn``     - ``gfx1030``     - wavefrontsize64  - Absolute flat   No restrictions.
                                         - ``gfx1031``     - cumode             scratch
                                         - ``gfx1032``
                                         - ``gfx1033``
                                         - ``gfx1034``
                                         - ``gfx1035``
                                         - ``gfx1036``

     ``gfx11-generic``    ``amdgcn``     - ``gfx1100``     - wavefrontsize64  - Architected     Various codegen pessimizations
                                         - ``gfx1101``     - cumode             flat scratch    are applied to work around some
                                         - ``gfx1102``                        - Packed          hazards specific to some targets
                                         - ``gfx1103``                          work-item       within this family.
                                         - ``gfx1150``                          IDs
                                         - ``gfx1151``
                                         - ``gfx1152``
                                         - ``gfx1153``                                          Not all VGPRs can be used on:

                                                                                                - ``gfx1100``
                                                                                                - ``gfx1101``
                                                                                                - ``gfx1151``

                                                                                                SALU floating point instructions
                                                                                                are not available on:

                                                                                                - ``gfx1150``
                                                                                                - ``gfx1151``
                                                                                                - ``gfx1152``
                                                                                                - ``gfx1153``

                                                                                                SGPRs are not supported for src1
                                                                                                in dpp instructions for:

                                                                                                - ``gfx1150``
                                                                                                - ``gfx1151``
                                                                                                - ``gfx1152``
                                                                                                - ``gfx1153``

     ``gfx12-generic``    ``amdgcn``     - ``gfx1200``     - wavefrontsize64  - Architected     No restrictions.
                                         - ``gfx1201``     - cumode             flat scratch
                                                                              - Packed
                                                                                work-item
                                                                                IDs
     ==================== ============== ================= ================== ================= =================================

.. _amdgpu-generic-processor-versioning:

Generic Processor Versioning
----------------------------

Generic processor (see :ref:`amdgpu-generic-processor-table`) code objects are
versioned (see :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`) between 1 and
255. The version of non-generic code objects is always set to 0.

For a generic code object, adding a new supported processor may require the
code generated for the generic target to be changed so it can continue to
execute on the previously supported processors as well as on the new one. When
this happens, the generic code object version number is incremented at the same
time as the generic target is updated.

Each supported processor of a generic target is mapped to the version in which
it was introduced. A generic code object can execute on a supported processor
if the version of the code object being loaded is greater than or equal to the
version in which the processor was added to the generic target.

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processors`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature
  as optional components of the target ID. If omitted, the target feature has
  the ``any`` value. See :ref:`amdgpu-target-id`.
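
  As a rough illustration of the feature-component semantics (this is not part
  of Clang; the processor and feature names below are just examples), a target
  ID can be parsed like this:

  .. code:: python

     def parse_target_id(target_id):
         """Split '<processor>[:<feature>(+|-)]...' into its components.

         Features not mentioned in the target ID default to 'any'.
         """
         processor, *features = target_id.split(":")
         settings = {}
         for component in features:
             name, sign = component[:-1], component[-1]
             if sign not in "+-":
                 raise ValueError("feature must end in + or -: " + component)
             settings[name] = "on" if sign == "+" else "off"
         return processor, settings

     # xnack is explicitly on; sramecc is unmentioned, so it has the any value.
     print(parse_target_id("gfx90a:xnack+"))  # ('gfx90a', {'xnack': 'on'})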

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 or above without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 or above without specifying XNACK replay,
                                                  on a device that does not have XNACK replay
                                                  enabled will execute correctly but may be less
                                                  performant than code generated for XNACK replay
                                                  disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.
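
  As an illustrative sketch (not Clang's implementation), canonicalizing a
  target ID maps an alternative processor name to its primary name and sorts
  the feature components alphabetically; the single mapping entry below is
  just an example:

  .. code:: python

     # Example subset of the alternative-to-primary processor name mapping.
     PRIMARY_NAMES = {"kaveri": "gfx700"}

     def canonicalize_target_id(target_id):
         """Return the canonical form of a possibly non-canonical target ID."""
         processor, *features = target_id.split(":")
         processor = PRIMARY_NAMES.get(processor, processor)
         return ":".join([processor] + sorted(features))

     print(canonicalize_target_id("gfx90a:xnack+:sramecc+"))
     # gfx90a:sramecc+:xnack+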

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off* and present if *On* or *Any*.

.. note::

  Code objects V2 to V3 cannot represent *Any* and treat it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces.
The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ===================================== =============== =========== ================ ======= ============================
     ..                                                                                 64-Bit Process Address Space
     ------------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                    LLVM IR Address HSA Segment Hardware         Address NULL Value
                                           Space Number    Name        Name             Size
     ===================================== =============== =========== ================ ======= ============================
     Generic                               0               flat        flat             64      0x0000000000000000
     Global                                1               global      global           64      0x0000000000000000
     Region                                2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                                 3               group       LDS              32      0xFFFFFFFF
     Constant                              4               constant    *same as global* 64      0x0000000000000000
     Private                               5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                       6               *TODO*                               0x00000000
     Buffer Fat Pointer                    7               N/A         N/A              160     0
     Buffer Resource                       8               N/A         V#               128     0x00000000000000000000000000000000
     Buffer Strided Pointer (experimental) 9               *TODO*
     Streamout Registers                   128             N/A         GS_REGS
     ===================================== =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.
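
  For processors that do support it, the hardware's classification of a flat
  address can be sketched as follows. This is an illustrative sketch only: the
  aperture base values below are invented, and the real bases are
  implementation-specific values described in the aperture discussion that
  follows.

  .. code:: python

     # Invented example aperture bases; real bases are implementation-specific.
     PRIVATE_APERTURE = 0x1000_0000_0000_0000
     SHARED_APERTURE = 0x2000_0000_0000_0000
     APERTURE_SIZE = 1 << 32  # 2^32-byte apertures aligned to 2^32

     def classify_flat_address(addr):
         """Return which memory a generic (flat) address selects."""
         if PRIVATE_APERTURE <= addr < PRIVATE_APERTURE + APERTURE_SIZE:
             return "private (scratch)"
         if SHARED_APERTURE <= addr < SHARED_APERTURE + APERTURE_SIZE:
             return "group (LDS)"
         return "global"

     print(classify_flat_address(0x4000))  # global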

  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
  GFX9-GFX11 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32, which makes it easier to convert between flat and segment
  addresses.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.
  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely
  be assumed to be a global pointer, since only the device global memory is
  visible and managed on the host side. The vector and scalar L1 caches are
  invalidated of volatile data before each kernel dispatch execution to allow
  constant memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by
  wavefronts executing on different devices will access different memory. It
  is higher performance than global memory. It is allocated by the runtime.
  The data store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data
  store (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it
  when a wavefront terminates.
  The memory accessed by a lane of a wavefront for any given private address
  will be different from the memory accessed by another lane of the same or a
  different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The
  lanes of the wavefront access this using dword (4 byte) interleaving. The
  mapping used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX11.

  Code that manipulates the stack values in other lanes of a wavefront, such
  as by ``addrspacecast``-ing stack pointers to generic ones and taking
  offsets that reach other lanes or by explicitly constructing the scratch
  buffer descriptor, triggers undefined behavior when it modifies the scratch
  values of other lanes. The compiler may assume that such modifications do
  not occur. When using code object V5, ``LIBOMPTARGET_STACK_SIZE`` may be
  used to provide the private segment size in bytes, for cases where a
  dynamic stack is used.
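  As a minimal illustrative sketch (the function name is hypothetical), stack
  objects are allocated in the private address space and may be cast to the
  generic address space for flat access:

  .. code-block:: llvm

    define void @private_example() {
      ; Stack objects live in the private address space (5).
      %slot = alloca i32, align 4, addrspace(5)
      store i32 42, ptr addrspace(5) %slot, align 4
      ; Casting to the generic address space (0) yields a flat address that
      ; falls in the private aperture. Offsets that reach another lane's
      ; stack trigger the undefined behavior described above.
      %flat = addrspacecast ptr addrspace(5) %slot to ptr
      %v = load i32, ptr %flat, align 4
      ret void
    }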
**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.

  The buffer descriptor used to construct a buffer fat pointer must be *raw*:
  the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
  must be off, and the extent must be measured in bytes. (On subtargets where
  bounds checking may be disabled, buffer fat pointers may choose to enable
  it or not).

**Buffer Resource**
  The buffer resource pointer, in address space 8, is the newer form for
  representing buffer descriptors in AMDGPU IR, replacing their previous
  representation as `<4 x i32>`. It is a non-integral pointer that represents
  a 128-bit buffer descriptor resource (`V#`).

  Since, in general, a buffer resource supports complex addressing modes that
  cannot be easily represented in LLVM (such as implicit swizzled access to
  structured buffers), it is **illegal** to perform non-trivial address
  computations, such as ``getelementptr`` operations, on buffer resources.
  They may be passed to AMDGPU buffer intrinsics, and they may be converted
  to and from ``i128``.

  Casting a buffer resource to a buffer fat pointer is permitted and adds an
  offset of 0.
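  As an illustrative sketch (the function name is hypothetical), the permitted
  conversions can be written as:

  .. code-block:: llvm

    ; A buffer resource (address space 8) may be created from an i128 and
    ; cast to a buffer fat pointer (address space 7), which appends a zero
    ; offset. getelementptr directly on the resource would be illegal.
    define ptr addrspace(7) @to_fat_pointer(i128 %bits) {
      %rsrc = inttoptr i128 %bits to ptr addrspace(8)
      %fat = addrspacecast ptr addrspace(8) %rsrc to ptr addrspace(7)
      ret ptr addrspace(7) %fat
    }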
  Buffer resources can be created from 64-bit pointers (which should be
  either generic or global) using the `llvm.amdgcn.make.buffer.rsrc`
  intrinsic, which takes the pointer, which becomes the base of the resource,
  the 16-bit stride (and swizzle control) field stored in bits `63:48` of a
  `V#`, the 32-bit NumRecords/extent field (bits `95:64`), and the 32-bit
  flags field (bits `127:96`). The specific interpretation of these fields
  varies by the target architecture and is detailed in the ISA descriptions.

**Buffer Strided Pointer**
  The buffer index pointer is an experimental address space. It represents
  a 128-bit buffer descriptor and a 32-bit offset, like the **Buffer Fat
  Pointer**. Additionally, it contains an index into the buffer, which
  allows the direct addressing of structured elements. These components
  appear in that order, i.e., the descriptor comes first, then the 32-bit
  offset followed by the 32-bit index.

  The bits in the buffer descriptor must meet the following requirements:
  the stride must be the size of a structured element, the "add tid" flag
  must be 0, and the swizzle enable bits must be off.

**Streamout Registers**
  Dedicated registers used by the GS NGG Streamout Instructions. The register
  file is modelled as a memory in a distinct address space because it is
  indexed by an address-like offset in place of named registers, and because
  register accesses affect LGKMcnt. This is an internal address space used
  only by the compiler. Do not use this address space for IR pointers.

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
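For example (an illustrative fragment; the function name is hypothetical), a
sync scope is specified as a string on the atomic operation:

.. code-block:: llvm

  ; This increment only synchronizes with other operations whose scope
  ; instance is inclusive of the same work-group.
  define void @scope_example(ptr addrspace(1) %p) {
    %old = atomicrmw add ptr addrspace(1) %p, i32 1 syncscope("workgroup") seq_cst
    ret void
  }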
The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of
scope, and synchronizes-with allows the memory scope instances to be inclusive
(see table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different from the OpenCL [OpenCL]_ memory model, which does not have
scope inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ======================= ===================================================
     LLVM Sync Scope         Description
     ======================= ===================================================
     *none*                  The default: ``system``.

                             Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``.
                             - ``agent`` and executed by a thread on the same
                               agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``agent``               Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system`` or ``agent`` and executed by a thread
                               on the same agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.
     ``workgroup``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent`` or ``workgroup`` and
                               executed by a thread in the same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. table:: AMDGPU LLVM IR Intrinsics
   :name: amdgpu-llvm-ir-intrinsics-table

   ============================================== ==========================================================
   LLVM Intrinsic                                 Description
   ============================================== ==========================================================
   llvm.amdgcn.sqrt                               Provides direct access to v_sqrt_f64, v_sqrt_f32 and
                                                  v_sqrt_f16 (on targets with half support). Performs the
                                                  sqrt function.

   llvm.amdgcn.log                                Provides direct access to v_log_f32 and v_log_f16
                                                  (on targets with half support). Performs the log2
                                                  function.

   llvm.amdgcn.exp2                               Provides direct access to v_exp_f32 and v_exp_f16
                                                  (on targets with half support). Performs the exp2
                                                  function.

   :ref:`llvm.frexp <int_frexp>`                  Implemented for half, float and double.

   :ref:`llvm.log2 <int_log2>`                    Implemented for float and half (and vectors of float or
                                                  half). Not implemented for double. Hardware provides
                                                  1ULP accuracy for float, and 0.51ULP for half. The
                                                  float instruction does not natively support denormal
                                                  inputs.

   :ref:`llvm.sqrt <int_sqrt>`                    Implemented for double, float and half (and vectors).

   :ref:`llvm.log <int_log>`                      Implemented for float and half (and vectors).

   :ref:`llvm.exp <int_exp>`                      Implemented for float and half (and vectors).

   :ref:`llvm.log10 <int_log10>`                  Implemented for float and half (and vectors).

   :ref:`llvm.exp2 <int_exp2>`                    Implemented for float and half (and vectors of float or
                                                  half). Not implemented for double. Hardware provides
                                                  1ULP accuracy for float, and 0.51ULP for half. The
                                                  float instruction does not natively support denormal
                                                  inputs.
   :ref:`llvm.stacksave.p5 <int_stacksave>`       Implemented, must use the alloca address space.

   :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.

   :ref:`llvm.get.fpmode.i32 <int_get_fpmode>`    The natural floating-point mode type is i32. This is
                                                  implemented by extracting relevant bits out of the MODE
                                                  register with s_getreg_b32. The first 10 bits are the
                                                  core floating-point mode. Bits 12:18 are the exception
                                                  mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
                                                  relevant to floating-point instructions are 0s.

   :ref:`llvm.get.rounding<int_get_rounding>`     AMDGPU supports two separately controllable rounding
                                                  modes depending on the floating-point type. One
                                                  controls float, and the other controls both double and
                                                  half operations. If both modes are the same, returns
                                                  one of the standard return values. If the modes are
                                                  different, returns one of :ref:`12 extended values
                                                  <amdgpu-rounding-mode-enumeration-values-table>`
                                                  describing the two modes.

                                                  To nearest, ties away from zero is not a supported
                                                  mode. The raw rounding mode values in the MODE
                                                  register do not exactly match the FLT_ROUNDS values,
                                                  so a conversion is performed.

   :ref:`llvm.set.rounding<int_set_rounding>`     The input value is expected to be one of the valid
                                                  results from '``llvm.get.rounding``'. The rounding mode
                                                  is undefined if not passed a valid input. This should
                                                  be a wave uniform value. In case of a divergent input
                                                  value, the first active lane's value will be used.

   :ref:`llvm.get.fpenv<int_get_fpenv>`           Returns the current value of the AMDGPU floating-point
                                                  environment. This stores information related to the
                                                  current rounding mode, denormalization mode, enabled
                                                  traps, and floating-point exceptions. The format is a
                                                  64-bit concatenation of the MODE and TRAPSTS registers.
   :ref:`llvm.set.fpenv<int_set_fpenv>`           Sets the floating-point environment to the specified
                                                  state.

   llvm.amdgcn.readfirstlane                      Provides direct access to v_readfirstlane_b32. Returns
                                                  the value in the lowest active lane of the input
                                                  operand. Currently implemented for i16, i32, float,
                                                  half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>,
                                                  i64, double, pointers, and multiples of the 32-bit
                                                  vectors.

   llvm.amdgcn.readlane                           Provides direct access to v_readlane_b32. Returns the
                                                  value in the specified lane of the first input operand.
                                                  The second operand specifies the lane to read from.
                                                  Currently implemented for i16, i32, float, half,
                                                  bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64,
                                                  double, pointers, and multiples of the 32-bit vectors.

   llvm.amdgcn.writelane                          Provides direct access to v_writelane_b32. Writes the
                                                  value in the first input operand to the specified lane
                                                  of the divergent output. The second operand specifies
                                                  the lane to write. Currently implemented for i16, i32,
                                                  float, half, bfloat, <2 x i16>, <2 x half>,
                                                  <2 x bfloat>, i64, double, pointers, and multiples of
                                                  the 32-bit vectors.

   llvm.amdgcn.wave.reduce.umin                   Performs an arithmetic unsigned min reduction on the
                                                  unsigned values provided by each lane in the wavefront.
                                                  The intrinsic takes a hint for the reduction strategy
                                                  as its second operand:
                                                  0: Target default preference,
                                                  1: `Iterative strategy`, and
                                                  2: `DPP`.
                                                  If the target does not support the DPP operations
                                                  (e.g. gfx6/7), the reduction will be performed using
                                                  the default iterative strategy.
                                                  The intrinsic is currently only implemented for i32.

   llvm.amdgcn.wave.reduce.umax                   Performs an arithmetic unsigned max reduction on the
                                                  unsigned values provided by each lane in the wavefront.
                                                  The intrinsic takes a hint for the reduction strategy
                                                  as its second operand:
                                                  0: Target default preference,
                                                  1: `Iterative strategy`, and
                                                  2: `DPP`.
                                                  If the target does not support the DPP operations
                                                  (e.g. gfx6/7), the reduction will be performed using
                                                  the default iterative strategy.
                                                  The intrinsic is currently only implemented for i32.

   llvm.amdgcn.permlane16                         Provides direct access to v_permlane16_b32. Performs an
                                                  arbitrary gather-style operation within a row (16
                                                  contiguous lanes) of the second input operand. The
                                                  third and fourth inputs must be scalar values. These
                                                  are combined into a single 64-bit value representing
                                                  lane selects used to swizzle within each row. Currently
                                                  implemented for i16, i32, float, half, bfloat,
                                                  <2 x i16>, <2 x half>, <2 x bfloat>, i64, double,
                                                  pointers, and multiples of the 32-bit vectors.

   llvm.amdgcn.permlanex16                        Provides direct access to v_permlanex16_b32. Performs
                                                  an arbitrary gather-style operation across two rows of
                                                  the second input operand (each row is 16 contiguous
                                                  lanes). The third and fourth inputs must be scalar
                                                  values. These are combined into a single 64-bit value
                                                  representing lane selects used to swizzle within each
                                                  row. Currently implemented for i16, i32, float, half,
                                                  bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64,
                                                  double, pointers, and multiples of the 32-bit vectors.

   llvm.amdgcn.permlane64                         Provides direct access to v_permlane64_b32. Performs a
                                                  specific permutation across lanes of the input operand
                                                  where the high half and low half of a wave64 are
                                                  swapped. Performs no operation in wave32 mode.
                                                  Currently implemented for i16, i32, float, half,
                                                  bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64,
                                                  double, pointers, and multiples of the 32-bit vectors.

   llvm.amdgcn.udot2                              Provides direct access to v_dot2_u32_u16 across targets
                                                  which support such instructions. This performs an
                                                  unsigned dot product with two v2i16 operands, summed
                                                  with the third i32 operand. The fourth i1 operand is
                                                  used to clamp the output.
   llvm.amdgcn.udot4                              Provides direct access to v_dot4_u32_u8 across targets
                                                  which support such instructions. This performs an
                                                  unsigned dot product with two i32 operands (holding a
                                                  vector of 4 8-bit values), summed with the third i32
                                                  operand. The fourth i1 operand is used to clamp the
                                                  output.

   llvm.amdgcn.udot8                              Provides direct access to v_dot8_u32_u4 across targets
                                                  which support such instructions. This performs an
                                                  unsigned dot product with two i32 operands (holding a
                                                  vector of 8 4-bit values), summed with the third i32
                                                  operand. The fourth i1 operand is used to clamp the
                                                  output.

   llvm.amdgcn.sdot2                              Provides direct access to v_dot2_i32_i16 across targets
                                                  which support such instructions. This performs a
                                                  signed dot product with two v2i16 operands, summed
                                                  with the third i32 operand. The fourth i1 operand is
                                                  used to clamp the output.
                                                  When applicable (e.g. no clamping), this is lowered
                                                  into v_dot2c_i32_i16 for targets which support it.

   llvm.amdgcn.sdot4                              Provides direct access to v_dot4_i32_i8 across targets
                                                  which support such instructions. This performs a
                                                  signed dot product with two i32 operands (holding a
                                                  vector of 4 8-bit values), summed with the third i32
                                                  operand. The fourth i1 operand is used to clamp the
                                                  output.
                                                  When applicable (i.e. no clamping / operand modifiers),
                                                  this is lowered into v_dot4c_i32_i8 for targets which
                                                  support it.
                                                  RDNA3 does not offer v_dot4_i32_i8, and rather offers
                                                  v_dot4_i32_iu8 which has operands to hold the
                                                  signedness of the vector operands. Thus, this intrinsic
                                                  lowers to the signed version of this instruction for
                                                  gfx11 targets.

   llvm.amdgcn.sdot8                              Provides direct access to v_dot8_i32_i4 across targets
                                                  which support such instructions. This performs a
                                                  signed dot product with two i32 operands (holding a
                                                  vector of 8 4-bit values), summed with the third i32
                                                  operand.
                                                  The fourth i1 operand is used to clamp the output.
                                                  When applicable (i.e. no clamping / operand modifiers),
                                                  this is lowered into v_dot8c_i32_i4 for targets which
                                                  support it.
                                                  RDNA3 does not offer v_dot8_i32_i4, and rather offers
                                                  v_dot8_i32_iu4 which has operands to hold the
                                                  signedness of the vector operands. Thus, this intrinsic
                                                  lowers to the signed version of this instruction for
                                                  gfx11 targets.

   llvm.amdgcn.sudot4                             Provides direct access to v_dot4_i32_iu8 on gfx11
                                                  targets. This performs a dot product with two i32
                                                  operands (holding a vector of 4 8-bit values), summed
                                                  with the fifth i32 operand. The sixth i1 operand is
                                                  used to clamp the output. The i1s preceding the vector
                                                  operands decide the signedness.

   llvm.amdgcn.sudot8                             Provides direct access to v_dot8_i32_iu4 on gfx11
                                                  targets. This performs a dot product with two i32
                                                  operands (holding a vector of 8 4-bit values), summed
                                                  with the fifth i32 operand. The sixth i1 operand is
                                                  used to clamp the output. The i1s preceding the vector
                                                  operands decide the signedness.

   llvm.amdgcn.sched.barrier                      Controls the types of instructions that may be allowed
                                                  to cross the intrinsic during instruction scheduling.
                                                  The parameter is a mask for the instruction types that
                                                  can cross the intrinsic.

                                                  - 0x0000: No instructions may be scheduled across
                                                    sched_barrier.
                                                  - 0x0001: All non-memory, non-side-effect-producing
                                                    instructions may be scheduled across sched_barrier,
                                                    *i.e.* allow ALU instructions to pass.
                                                  - 0x0002: VALU instructions may be scheduled across
                                                    sched_barrier.
                                                  - 0x0004: SALU instructions may be scheduled across
                                                    sched_barrier.
                                                  - 0x0008: MFMA/WMMA instructions may be scheduled
                                                    across sched_barrier.
                                                  - 0x0010: All VMEM instructions may be scheduled
                                                    across sched_barrier.
                                                  - 0x0020: VMEM read instructions may be scheduled
                                                    across sched_barrier.
                                                  - 0x0040: VMEM write instructions may be scheduled
                                                    across sched_barrier.
                                                  - 0x0080: All DS instructions may be scheduled across
                                                    sched_barrier.
                                                  - 0x0100: All DS read instructions may be scheduled
                                                    across sched_barrier.
                                                  - 0x0200: All DS write instructions may be scheduled
                                                    across sched_barrier.
                                                  - 0x0400: All Transcendental (e.g. V_EXP) instructions
                                                    may be scheduled across sched_barrier.

   llvm.amdgcn.sched.group.barrier                Creates schedule groups with specific properties to
                                                  create custom scheduling pipelines. The ordering
                                                  between groups is enforced by the instruction
                                                  scheduler. The intrinsic applies to the code that
                                                  precedes the intrinsic. The intrinsic takes three
                                                  values that control the behavior of the schedule
                                                  groups.

                                                  - Mask : Classify instruction groups using the
                                                    llvm.amdgcn.sched_barrier mask values.
                                                  - Size : The number of instructions that are in the
                                                    group.
                                                  - SyncID : Order is enforced between groups with
                                                    matching values.

                                                  The mask can include multiple instruction types. It is
                                                  undefined behavior to set values beyond the range of
                                                  valid masks.

                                                  Combining multiple sched_group_barrier intrinsics
                                                  enables an ordering of specific instruction types
                                                  during instruction scheduling. For example, the
                                                  following enforces a sequence of 1 VMEM read, followed
                                                  by 1 VALU instruction, followed by 5 MFMA instructions.

                                                  | ``// 1 VMEM read``
                                                  | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
                                                  | ``// 1 VALU``
                                                  | ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
                                                  | ``// 5 MFMA``
                                                  | ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``

   llvm.amdgcn.iglp.opt                           An **experimental** intrinsic for instruction group
                                                  level parallelism. The intrinsic implements predefined
                                                  instruction scheduling orderings. The intrinsic
                                                  applies to the surrounding scheduling region. The
                                                  intrinsic takes a value that specifies the strategy.
                                                  The following strategies are implemented.

                                                  0. Interleave DS and MFMA instructions for small GEMM
                                                     kernels.
                                                  1. Interleave DS and MFMA instructions for single
                                                     wave small GEMM kernels.
                                                  2. Interleave TRANS and MFMA instructions, as well as
                                                     their VALU and DS predecessors, for attention
                                                     kernels.
                                                  3. Interleave TRANS and MFMA instructions, with no
                                                     predecessor interleaving, for attention kernels.

                                                  Only one iglp_opt intrinsic may be used in a
                                                  scheduling region. The iglp_opt intrinsic cannot be
                                                  combined with sched_barrier or sched_group_barrier.

                                                  The iglp_opt strategy implementations are subject to
                                                  change.

   llvm.amdgcn.atomic.cond.sub.u32                Provides direct access to flat_atomic_cond_sub_u32,
                                                  global_atomic_cond_sub_u32 and ds_cond_sub_u32 based
                                                  on the address space on gfx12 targets. This performs
                                                  subtraction only if the memory value is greater than
                                                  or equal to the data value.

   llvm.amdgcn.s.getpc                            Provides access to the s_getpc_b64 instruction, but
                                                  with the return value sign-extended from the width of
                                                  the underlying PC hardware register even on processors
                                                  where the s_getpc_b64 instruction returns a
                                                  zero-extended value.

   llvm.amdgcn.ballot                             Returns a bitfield (i32 or i64) containing the result
                                                  of its i1 argument in all active lanes, and zero in
                                                  all inactive lanes. Provides a way to convert an i1 in
                                                  LLVM IR to an i32 or i64 lane mask, the bitfield used
                                                  by hardware to control active lanes when used in the
                                                  EXEC register. For example, ballot(i1 true) returns
                                                  the EXEC mask.

   llvm.amdgcn.mfma.scale.f32.16x16x128.f8f6f4    Emits `v_mfma_scale_f32_16x16x128_f8f6f4` to set the
                                                  scale factor. The last 4 operands correspond to the
                                                  scale inputs.
                                                  - 2-bit byte index to use for each lane for matrix A
                                                  - Matrix A scale values
                                                  - 2-bit byte index to use for each lane for matrix B
                                                  - Matrix B scale values

   llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4     Emits `v_mfma_scale_f32_32x32x64_f8f6f4`.

   llvm.amdgcn.permlane16.swap                    Provides direct access to the `v_permlane16_swap_b32`
                                                  instruction on supported targets. Swaps the values
                                                  across lanes of the first two operands. Odd rows of
                                                  the first operand are swapped with even rows of the
                                                  second operand (one row is 16 lanes). Returns a pair
                                                  for the swapped registers. The first element of the
                                                  return corresponds to the swapped element of the first
                                                  argument.

   llvm.amdgcn.permlane32.swap                    Provides direct access to the `v_permlane32_swap_b32`
                                                  instruction on supported targets. Swaps the values
                                                  across lanes of the first two operands. Rows 2 and 3
                                                  of the first operand are swapped with rows 0 and 1 of
                                                  the second operand (one row is 16 lanes). Returns a
                                                  pair for the swapped registers. The first element of
                                                  the return corresponds to the swapped element of the
                                                  first argument.

   llvm.amdgcn.mov.dpp                            The llvm.amdgcn.mov.dpp.`<type>` intrinsic represents
                                                  the mov.dpp operation in AMDGPU. This operation is
                                                  being deprecated and can be replaced with
                                                  llvm.amdgcn.update.dpp.

   llvm.amdgcn.update.dpp                         The llvm.amdgcn.update.dpp.`<type>` intrinsic
                                                  represents the update.dpp operation in AMDGPU. It
                                                  takes an old value, a source operand, a DPP control
                                                  operand, a row mask, a bank mask, and a bound control.
                                                  Various data types are supported, including bf16,
                                                  f16, f32, f64, i16, i32, i64, p0, p3, p5, v2f16,
                                                  v2f32, v2i16, v2i32, v2p0, v3i32, v4i32, v8f16. This
                                                  operation is equivalent to a sequence of v_mov_b32
                                                  operations. It is preferred over
                                                  llvm.amdgcn.mov.dpp.`<type>` for future use.
                                                  `llvm.amdgcn.update.dpp.<type> <old> <src> <dpp_ctrl>
                                                  <row_mask> <bank_mask> <bound_ctrl>`
                                                  should be equivalent to:

                                                  - `v_mov_b32 <dest> <old>`
                                                  - `v_mov_b32 <dest> <src> <dpp_ctrl> <row_mask>
                                                    <bank_mask> <bound_ctrl>`

   ============================================== ==========================================================

.. TODO::

   List AMDGPU intrinsics.

.. _amdgpu_metadata:

LLVM IR Metadata
================

The AMDGPU backend implements the following target custom LLVM IR
metadata.

.. _amdgpu_last_use:

'``amdgpu.last.use``' Metadata
------------------------------

Sets the TH_LOAD_LU temporal hint on load instructions that support it.
Takes priority over the nontemporal hint (TH_LOAD_NT). This takes no
arguments.

.. code-block:: llvm

  %val = load i32, ptr %in, align 4, !amdgpu.last.use !{}

'``amdgpu.no.remote.memory``' Metadata
--------------------------------------

Asserts a memory operation does not access bytes in host memory, or
remote connected peer device memory (the address must be device
local). This is intended for use with :ref:`atomicrmw <i_atomicrmw>`
and other atomic instructions. This is required to emit a native
hardware instruction for some :ref:`system scope
<amdgpu-memory-scopes>` atomic operations on some subtargets. For most
integer atomic operations, this is a sufficient restriction to emit a
native atomic instruction.

An :ref:`atomicrmw <i_atomicrmw>` without metadata will be treated
conservatively as required to preserve the operation behavior in all
cases. This will typically be used in conjunction with
:ref:`\!amdgpu.no.fine.grained.memory<amdgpu_no_fine_grained_memory>`.


.. code-block:: llvm

  ; Indicates the atomic does not access fine-grained memory, or
  ; remote device memory.
  %old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0

  ; Indicates the atomic does not access peer device memory.
  %old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.remote.memory !0

  !0 = !{}

.. _amdgpu_no_fine_grained_memory:

'``amdgpu.no.fine.grained.memory``' Metadata
--------------------------------------------

Asserts a memory access does not access bytes allocated in
fine-grained allocated memory. This is intended for use with
:ref:`atomicrmw <i_atomicrmw>` and other atomic instructions. This is
required to emit a native hardware instruction for some :ref:`system
scope <amdgpu-memory-scopes>` atomic operations on some subtargets. An
:ref:`atomicrmw <i_atomicrmw>` without metadata will be treated
conservatively as required to preserve the operation behavior in all
cases. This will typically be used in conjunction with
:ref:`\!amdgpu.no.remote.memory.access<amdgpu_no_remote_memory_access>`.

.. code-block:: llvm

  ; Indicates the access does not access fine-grained memory, or
  ; remote device memory.
  %old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory.access !0

  ; Indicates the access does not access fine-grained memory
  %old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.fine.grained.memory !0

  !0 = !{}

.. _amdgpu_no_remote_memory_access:

'``amdgpu.ignore.denormal.mode``' Metadata
------------------------------------------

For use with :ref:`atomicrmw <i_atomicrmw>` floating-point
operations. Indicates the handling of denormal inputs and results is
insignificant and may be inconsistent with the expected floating-point
mode.
This is necessary to emit a native atomic instruction on some
targets for some address spaces where float denormals are
unconditionally flushed. This is typically used in conjunction with
:ref:`\!amdgpu.no.remote.memory<amdgpu_no_remote_memory_access>`
and
:ref:`\!amdgpu.no.fine.grained.memory<amdgpu_no_fine_grained_memory>`.

.. code-block:: llvm

   %res0 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0
   %res1 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0

   !0 = !{}


LLVM IR Attributes
==================

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute
                                             [CLANG-ATTR]_. The IR implied default value is 1,1024.
                                             Clang may emit this attribute with more restrictive bounds
                                             depending on language defaults. If the actual block or
                                             workgroup size exceeds the limit at any point during
                                             execution, the behavior is undefined. For example, even if
                                             there is only one active thread but the thread local id
                                             exceeds the limit, the behavior is undefined.

     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments.
                                             This varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).

     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.

     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.

     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_. This is an optimization
                                             hint, and the backend may not be able to satisfy the
                                             request. If the specified range is incompatible with the
                                             function's "amdgpu-flat-work-group-size" value, the
                                             occupancy bounds implied by the workgroup size take
                                             precedence.

     "amdgpu-ieee"                           true/false. GFX6-GFX11 only.
                                             Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default
                                             for the calling convention.

     "amdgpu-dx10-clamp"                     true/false. GFX6-GFX11 only.
                                             Specify whether the function expects the DX10_CLAMP field
                                             of the mode register to be set on entry. Overrides the
                                             default for the calling convention.

     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is
                                             marked with this attribute, or reached through a call site
                                             marked with this attribute, and that intrinsic is called,
                                             the behavior of the program is undefined. (Whole-program
                                             undefined behavior is used here because, for example, the
                                             absence of a required workitem ID in the preloaded
                                             register set can mean that all other preloaded registers
                                             are earlier than the compilation assumed they would be.)
                                             The backend can generally infer this during code
                                             generation, so typically there is no benefit to frontends
                                             marking functions with this.

     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workitem.id.y intrinsic.

     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workitem.id.z intrinsic.

     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.x intrinsic.

     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.y intrinsic.

     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.z intrinsic.

     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.dispatch.ptr intrinsic.

     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.implicitarg.ptr intrinsic.

     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.dispatch.id intrinsic.

     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the
                                             other ABI hint attributes, the queue pointer may be
                                             required in situations where the intrinsic call does not
                                             directly appear in the program. Some subtargets require
                                             the queue pointer to handle some addrspacecasts, as well
                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private,
                                             llvm.trap, and llvm.debugtrap intrinsics.

     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the pointer to the
                                             hostcall buffer. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr is also removed.

     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the pointer to an
                                             initialized memory buffer that conforms to the
                                             requirements of the malloc/free device library V1 version
                                             implementation. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr is also removed.

     "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the multigrid
                                             synchronization pointer. If this attribute is absent, then
                                             the amdgpu-no-implicitarg-ptr is also removed.

     "amdgpu-no-default-queue"               Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the default queue
                                             pointer. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr is also removed.

     "amdgpu-no-completion-action"           Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the completion
                                             action pointer. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr is also removed.

     "amdgpu-lds-size"="min[,max]"           Min is the minimum number of bytes that will be allocated
                                             in the Local Data Store at address zero. Variables are
                                             allocated within this frame using absolute symbol
                                             metadata, primarily by the AMDGPULowerModuleLDS pass.
                                             Optional max is the maximum number of bytes that will be
                                             allocated. Note that min==max indicates that no further
                                             variables can be added to the frame. This is an internal
                                             detail of how LDS variables are lowered; language front
                                             ends should not set this attribute.

     "amdgpu-gds-size"                       Bytes expected to be allocated at the start of GDS memory
                                             at entry.

     "amdgpu-git-ptr-high"                   The hard-wired high half of the address of the global
                                             information table for the AMDPAL OS type. 0xffffffff
                                             represents no hard-wired high half, since current hardware
                                             only allows a 16-bit value.

     "amdgpu-32bit-address-high-bits"        Assumed high 32 bits for 32-bit address spaces that are
                                             really truncated 64-bit addresses (i.e., addrspace(6)).

     "amdgpu-color-export"                   Indicates the shader exports color information if set
                                             to 1. Defaults to 1 for :ref:`amdgpu_ps <amdgpu-cc>`, and
                                             0 for other calling conventions. Determines the necessity
                                             and type of null exports when a shader terminates early by
                                             killing lanes.

     "amdgpu-depth-export"                   Indicates the shader exports depth information if set
                                             to 1. Determines the necessity and type of null exports
                                             when a shader terminates early by killing lanes. A
                                             depth-only shader will export to the depth channel when no
                                             null export target is available (GFX11+).

     "InitialPSInputAddr"                    Set the initial value of the ``spi_ps_input_addr``
                                             register for :ref:`amdgpu_ps <amdgpu-cc>` shaders. Any
                                             bits enabled by this value will be enabled in the final
                                             register value.

     "amdgpu-wave-priority-threshold"        VALU instruction count threshold for adjusting wave
                                             priority. If exceeded, temporarily raise the wave priority
                                             at the start of the shader function until its last VMEM
                                             instructions, to allow younger waves to issue their VMEM
                                             instructions as well.

     "amdgpu-memory-bound"                   Set internally by the backend.

     "amdgpu-wave-limiter"                   Set internally by the backend.

     "amdgpu-unroll-threshold"               Set the base cost threshold preference for loop unrolling
                                             within this function; the default is 300. The actual
                                             threshold may be varied by per-loop metadata or reduced by
                                             heuristics.

     "amdgpu-max-num-workgroups"="x,y,z"     Specify the maximum number of work groups for the kernel
                                             dispatch in the X, Y, and Z dimensions. Each number must
                                             be >= 1. Generated by the
                                             ``amdgpu_max_num_work_groups`` CLANG attribute
                                             [CLANG-ATTR]_.
                                             Clang only emits this attribute when all three numbers
                                             are >= 1.

     "amdgpu-no-agpr"                        Indicates the function will not require allocating AGPRs.
                                             This is only relevant on subtargets with AGPRs. The
                                             behavior is undefined if a function which requires AGPRs
                                             is reached through any function marked with this
                                             attribute.

     "amdgpu-hidden-argument"                This attribute is used internally by the backend to mark
                                             function arguments as hidden. Hidden arguments are managed
                                             by the compiler and are not part of the explicit arguments
                                             supplied by the user.

     ======================================= ==========================================================

Calling Conventions
===================

The AMDGPU backend supports the following calling conventions:

  .. table:: AMDGPU Calling Conventions
     :name: amdgpu-cc

     =============================== ==========================================================
     Calling Convention              Description
     =============================== ==========================================================
     ``ccc``                         The C calling convention. Used by default.
                                     See :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`
                                     for more details.

     ``fastcc``                      The fast calling convention. Mostly the same as the ``ccc``.

     ``coldcc``                      The cold calling convention. Mostly the same as the ``ccc``.

     ``amdgpu_cs``                   Used for Mesa/AMDPAL compute shaders.

                                     .. TODO::

                                        Describe.

     ``amdgpu_cs_chain``             Similar to ``amdgpu_cs``, with differences described below.

                                     Functions with this calling convention cannot be called
                                     directly. They must instead be launched via the
                                     ``llvm.amdgcn.cs.chain`` intrinsic.

                                     Arguments are passed in SGPRs, starting at s0, if they have
                                     the ``inreg`` attribute, and in VGPRs otherwise, starting
                                     at v8. Using more SGPRs or VGPRs than available in the
                                     subtarget is not allowed.
                                     On subtargets that use a scratch buffer descriptor (as
                                     opposed to ``scratch_{load,store}_*`` instructions), the
                                     scratch buffer descriptor is passed in s[48:51]. This
                                     limits the SGPR / ``inreg`` arguments to the equivalent of
                                     48 dwords; using more than that is not allowed.

                                     The return type must be void.
                                     Varargs, sret, byval, byref, inalloca, and preallocated
                                     are not supported.

                                     Values in scalar registers as well as v0-v7 are not
                                     preserved. Values in VGPRs starting at v8 are not
                                     preserved for the active lanes, but must be saved by the
                                     callee for inactive lanes when using WWM. (A notable
                                     exception is when the llvm.amdgcn.init.whole.wave
                                     intrinsic is used in the function; in this case the
                                     backend assumes that there are no inactive lanes upon
                                     entry, and any inactive lanes that need to be preserved
                                     must be explicitly present in the IR.)

                                     Wave scratch is "empty" at function boundaries. There is
                                     no stack pointer input or output value, but functions are
                                     free to use scratch starting from an initial stack
                                     pointer. Calls to ``amdgpu_gfx`` functions are allowed and
                                     behave like they do in ``amdgpu_cs`` functions.

                                     All counters (``lgkmcnt``, ``vmcnt``, ``storecnt``, etc.)
                                     are presumed to be in an unknown state at function entry.

                                     A function may have multiple exits (e.g. one chain exit
                                     and one plain ``ret void`` for when the wave ends), but
                                     all ``llvm.amdgcn.cs.chain`` exits must be in uniform
                                     control flow.

     ``amdgpu_cs_chain_preserve``    Same as ``amdgpu_cs_chain``, but active lanes for VGPRs
                                     starting at v8 are preserved. Calls to ``amdgpu_gfx``
                                     functions are not allowed, and any calls to
                                     ``llvm.amdgcn.cs.chain`` must not pass more VGPR
                                     arguments than the caller's VGPR function parameters.

     ``amdgpu_es``                   Used for the AMDPAL shader stage before the geometry
                                     shader if geometry is in use, so either the domain
                                     (= tessellation evaluation) shader if tessellation is in
                                     use, or otherwise the vertex shader.

                                     .. TODO::

                                        Describe.

     ``amdgpu_gfx``                  Used for AMD graphics targets. Functions with this calling
                                     convention cannot be used as entry points.

                                     .. TODO::

                                        Describe.

     ``amdgpu_gs``                   Used for Mesa/AMDPAL geometry shaders.

                                     .. TODO::

                                        Describe.

     ``amdgpu_hs``                   Used for Mesa/AMDPAL hull shaders (= tessellation control
                                     shaders).

                                     .. TODO::

                                        Describe.

     ``amdgpu_kernel``               See :ref:`amdgpu-amdhsa-function-call-convention-kernel-functions`.

     ``amdgpu_ls``                   Used for the AMDPAL vertex shader if tessellation is in
                                     use.

                                     .. TODO::

                                        Describe.

     ``amdgpu_ps``                   Used for Mesa/AMDPAL pixel shaders.

                                     .. TODO::

                                        Describe.

     ``amdgpu_vs``                   Used for the Mesa/AMDPAL last shader stage before
                                     rasterization (vertex shader if tessellation and geometry
                                     are not in use, or otherwise the copy shader if one is
                                     needed).

                                     .. TODO::

                                        Describe.

     =============================== ==========================================================

AMDGPU MCExpr
-------------

As part of the AMDGPU MC layer, AMDGPU provides the following target-specific
``MCExpr``\s.

  .. table:: AMDGPU MCExpr types:
     :name: amdgpu-mcexpr-table

     =================== ================= ========================================================
     MCExpr              Operands          Return value
     =================== ================= ========================================================
     ``max(arg, ...)``   1 or more         Variadic signed operation that returns the maximum
                                           value of all its arguments.

     ``or(arg, ...)``    1 or more         Variadic signed operation that returns the bitwise-or
                                           result of all its arguments.

     =================== ================= ========================================================

Function Resource Usage
-----------------------

A function's resource usage depends on each of its callees' resource usage. The
expressions used to denote resource usage reflect this by propagating each
callee's equivalent expressions. Said expressions are emitted as symbols by the
compiler when compiling to either assembly or object format and should not be
overwritten or redefined.

The following describes all emitted function resource usage symbols:

  .. table:: Function Resource Usage:
     :name: function-usage-table

     ===================================== ========= ========================================= ===============================================================================
     Symbol                                Type      Description                               Example
     ===================================== ========= ========================================= ===============================================================================
     <function_name>.num_vgpr              Integer   Number of VGPRs used by <function_name>,  .set foo.num_vgpr, max(32, bar.num_vgpr, baz.num_vgpr)
                                                     worst case of itself and its callees'
                                                     VGPR use
     <function_name>.num_agpr              Integer   Number of AGPRs used by <function_name>,  .set foo.num_agpr, max(35, bar.num_agpr)
                                                     worst case of itself and its callees'
                                                     AGPR use
     <function_name>.numbered_sgpr         Integer   Number of SGPRs used by <function_name>,  .set foo.numbered_sgpr, 21
                                                     worst case of itself and its callees'
                                                     SGPR use (without any of the implicitly
                                                     used SGPRs)
     <function_name>.private_seg_size      Integer   Total stack size required for             .set foo.private_seg_size, 16+max(bar.private_seg_size, baz.private_seg_size)
                                                     <function_name>, expression is the
                                                     locally used stack size + the worst case
                                                     callee
     <function_name>.uses_vcc              Bool      Whether <function_name>, or any of its    .set foo.uses_vcc, or(0, bar.uses_vcc)
                                                     callees, uses vcc
     <function_name>.uses_flat_scratch     Bool      Whether <function_name>, or any of its    .set foo.uses_flat_scratch, 1
                                                     callees, uses flat scratch or not
     <function_name>.has_dyn_sized_stack   Bool      Whether <function_name>, or any of its    .set foo.has_dyn_sized_stack, 1
                                                     callees, is dynamically sized
     <function_name>.has_recursion         Bool      Whether <function_name>, or any of its    .set foo.has_recursion, 0
                                                     callees, contains recursion
     <function_name>.has_indirect_call     Bool      Whether <function_name>, or any of its    .set foo.has_indirect_call, max(0, bar.has_indirect_call)
                                                     callees, contains an indirect call
     ===================================== ========= ========================================= ===============================================================================

Furthermore, three symbols are additionally emitted describing the compilation
unit's worst case (i.e., maxima) ``num_vgpr``, ``num_agpr``, and
``numbered_sgpr``, which may be referenced and used by the aforementioned
symbolic expressions. These three symbols are ``amdgcn.max_num_vgpr``,
``amdgcn.max_num_agpr``, and ``amdgcn.max_num_sgpr``.

.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
                                - ``ELFABIVERSION_AMDGPU_HSA_V6``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                :ref:`amdgpu-elf-header-e_flags-table-v4-v5`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
     ``ELFABIVERSION_AMDGPU_HSA_V6`` 4
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for the ``r600`` architecture.

  * ``ELFCLASS64`` for the ``amdgcn`` architecture, which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for the *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for the ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for the ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for the ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which
  the code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` specifies the version of the AMD HSA
    runtime ABI for code object V2. It can no longer be emitted by this
    version of LLVM.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` specifies the version of the AMD HSA
    runtime ABI for code object V3. It can no longer be emitted by this
    version of LLVM.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` specifies the version of the AMD HSA
    runtime ABI for code object V4. Specify using the Clang option
    ``-mcode-object-version=4``.

  * ``ELFABIVERSION_AMDGPU_HSA_V5`` specifies the version of the AMD HSA
    runtime ABI for code object V5. Specify using the Clang option
    ``-mcode-object-version=5``. This is the default code object version if
    not specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V6`` specifies the version of the AMD HSA
    runtime ABI for code object V6. Specify using the Clang option
    ``-mcode-object-version=6``.

  * ``ELFABIVERSION_AMDGPU_PAL`` specifies the version of the AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` specifies the version of the AMD Mesa 3D
    runtime ABI.
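As a worked example of the ``e_ident`` layout above, the fields can be decoded
mechanically. The following is an illustrative Python sketch (not part of LLVM);
the example header bytes are synthetic, not taken from a real code object:

```python
# Decode the AMDGPU-relevant e_ident fields of an ELF header, using the
# enumeration values from the tables above.

# Standard e_ident byte indices.
EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION = 4, 5, 7, 8

OSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}

# For the amdhsa OS ABI, EI_ABIVERSION selects the code object version.
HSA_ABIVERSION_TO_COV = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6}

def decode_ident(ident: bytes):
    assert ident[:4] == b"\x7fELF", "not an ELF file"
    osabi = OSABI_NAMES[ident[EI_OSABI]]
    cov = None
    if osabi == "ELFOSABI_AMDGPU_HSA":
        cov = HSA_ABIVERSION_TO_COV[ident[EI_ABIVERSION]]
    return osabi, cov

# Synthetic e_ident for an amdhsa code object V5: ELFCLASS64, ELFDATA2LSB,
# EV_CURRENT, ELFOSABI_AMDGPU_HSA (64), ELFABIVERSION_AMDGPU_HSA_V5 (3).
ident = b"\x7fELF" + bytes([2, 1, 1, 64, 3]) + bytes(7)
print(decode_ident(ident))  # ('ELFOSABI_AMDGPU_HSA', 5)
```

In practice the same information is reported by ``llvm-readelf``; the sketch
only shows how the bytes map to the enumeration tables.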

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler, as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker, as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 and above (see
  :ref:`amdgpu-elf-header-e_flags-table-v3`,
  :ref:`amdgpu-elf-header-e_flags-table-v4-v5` and
  :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`).

``e_entry``
  The entry point is 0, as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is
                                                 enabled for all code
                                                 contained in the code object.
                                                 If the processor
                                                 does not support the
                                                 ``xnack`` target
                                                 feature then must
                                                 be 0.
                                                 See
                                                 :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
                                                 handler is enabled for all
                                                 code contained in the code
                                                 object.
If the processor 2053 does not support a trap 2054 handler then must be 0. 2055 See 2056 :ref:`amdgpu-target-features`. 2057 ===================================== ===== ============================= 2058 2059 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3 2060 :name: amdgpu-elf-header-e_flags-table-v3 2061 2062 ================================= ===== ============================= 2063 Name Value Description 2064 ================================= ===== ============================= 2065 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection 2066 mask for 2067 ``EF_AMDGPU_MACH_xxx`` values 2068 defined in 2069 :ref:`amdgpu-ef-amdgpu-mach-table`. 2070 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack`` 2071 target feature is 2072 enabled for all code 2073 contained in the code object. 2074 If the processor 2075 does not support the 2076 ``xnack`` target 2077 feature then must 2078 be 0. 2079 See 2080 :ref:`amdgpu-target-features`. 2081 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc`` 2082 target feature is 2083 enabled for all code 2084 contained in the code object. 2085 If the processor 2086 does not support the 2087 ``sramecc`` target 2088 feature then must 2089 be 0. 2090 See 2091 :ref:`amdgpu-target-features`. 2092 ================================= ===== ============================= 2093 2094 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and V5 2095 :name: amdgpu-elf-header-e_flags-table-v4-v5 2096 2097 ============================================ ===== =================================== 2098 Name Value Description 2099 ============================================ ===== =================================== 2100 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection 2101 mask for 2102 ``EF_AMDGPU_MACH_xxx`` values 2103 defined in 2104 :ref:`amdgpu-ef-amdgpu-mach-table`. 2105 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for 2106 ``EF_AMDGPU_FEATURE_XNACK_*_V4`` 2107 values. 
2108 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported. 2109 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value. 2110 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled. 2111 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled. 2112 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for 2113 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4`` 2114 values. 2115 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported. 2116 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value. 2117 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled, 2118 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled. 2119 ============================================ ===== =================================== 2120 2121 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V6 and After 2122 :name: amdgpu-elf-header-e_flags-table-v6-onwards 2123 2124 ============================================ ========== ========================================= 2125 Name Value Description 2126 ============================================ ========== ========================================= 2127 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection 2128 mask for 2129 ``EF_AMDGPU_MACH_xxx`` values 2130 defined in 2131 :ref:`amdgpu-ef-amdgpu-mach-table`. 2132 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for 2133 ``EF_AMDGPU_FEATURE_XNACK_*_V4`` 2134 values. 2135 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported. 2136 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value. 2137 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled. 2138 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled. 2139 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for 2140 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4`` 2141 values. 2142 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported. 2143 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value. 
2144 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled, 2145 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled. 2146 ``EF_AMDGPU_GENERIC_VERSION_V`` 0xff000000 Generic code object version selection 2147 mask. This is a value between 1 and 255, 2148 stored in the most significant byte 2149 of EFLAGS. 2150 See :ref:`amdgpu-generic-processor-versioning` 2151 ============================================ ========== ========================================= 2152 2153 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 2154 :name: amdgpu-ef-amdgpu-mach-table 2155 2156 ========================================== ========== ============================= 2157 Name Value Description (see 2158 :ref:`amdgpu-processor-table`) 2159 ========================================== ========== ============================= 2160 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 2161 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 2162 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 2163 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 2164 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 2165 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 2166 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 2167 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 2168 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 2169 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 2170 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 2171 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 2172 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 2173 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 2174 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 2175 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 2176 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 2177 *reserved* 0x011 - Reserved for ``r600`` 2178 0x01f architecture processors. 
2179 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 2180 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 2181 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 2182 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 2183 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 2184 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 2185 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 2186 *reserved* 0x027 Reserved. 2187 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 2188 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 2189 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 2190 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 2191 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 2192 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 2193 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 2194 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 2195 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` 2196 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` 2197 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c`` 2198 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` 2199 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` 2200 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` 2201 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030`` 2202 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031`` 2203 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032`` 2204 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033`` 2205 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602`` 2206 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705`` 2207 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805`` 2208 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035`` 2209 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034`` 2210 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a`` 2211 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940`` 2212 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100`` 2213 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013`` 2214 ``EF_AMDGPU_MACH_AMDGCN_GFX1150`` 0x043 ``gfx1150`` 2215 
``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103`` 2216 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036`` 2217 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101`` 2218 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102`` 2219 ``EF_AMDGPU_MACH_AMDGCN_GFX1200`` 0x048 ``gfx1200`` 2220 *reserved* 0x049 Reserved. 2221 ``EF_AMDGPU_MACH_AMDGCN_GFX1151`` 0x04a ``gfx1151`` 2222 ``EF_AMDGPU_MACH_AMDGCN_GFX941`` 0x04b ``gfx941`` 2223 ``EF_AMDGPU_MACH_AMDGCN_GFX942`` 0x04c ``gfx942`` 2224 *reserved* 0x04d Reserved. 2225 ``EF_AMDGPU_MACH_AMDGCN_GFX1201`` 0x04e ``gfx1201`` 2226 ``EF_AMDGPU_MACH_AMDGCN_GFX950`` 0x04f ``gfx950`` 2227 *reserved* 0x050 Reserved. 2228 ``EF_AMDGPU_MACH_AMDGCN_GFX9_GENERIC`` 0x051 ``gfx9-generic`` 2229 ``EF_AMDGPU_MACH_AMDGCN_GFX10_1_GENERIC`` 0x052 ``gfx10-1-generic`` 2230 ``EF_AMDGPU_MACH_AMDGCN_GFX10_3_GENERIC`` 0x053 ``gfx10-3-generic`` 2231 ``EF_AMDGPU_MACH_AMDGCN_GFX11_GENERIC`` 0x054 ``gfx11-generic`` 2232 ``EF_AMDGPU_MACH_AMDGCN_GFX1152`` 0x055 ``gfx1152``. 2233 *reserved* 0x056 Reserved. 2234 *reserved* 0x057 Reserved. 2235 ``EF_AMDGPU_MACH_AMDGCN_GFX1153`` 0x058 ``gfx1153``. 2236 ``EF_AMDGPU_MACH_AMDGCN_GFX12_GENERIC`` 0x059 ``gfx12-generic`` 2237 ``EF_AMDGPU_MACH_AMDGCN_GFX9_4_GENERIC`` 0x05f ``gfx9-4-generic`` 2238 ========================================== ========== ============================= 2239 2240Sections 2241-------- 2242 2243An AMDGPU target ELF code object has the standard ELF sections which include: 2244 2245 .. 
table:: AMDGPU ELF Sections
    :name: amdgpu-elf-sections-table

    ================== ================ =================================
    Name               Type             Attributes
    ================== ================ =================================
    ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
    ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
    ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
    ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
    ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
    ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
    ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
    ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
    ``.note``          ``SHT_NOTE``     *none*
    ``.rela``\ *name*  ``SHT_RELA``     *none*
    ``.rela.dyn``      ``SHT_RELA``     *none*
    ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
    ``.shstrtab``      ``SHT_STRTAB``   *none*
    ``.strtab``        ``SHT_STRTAB``   *none*
    ``.symtab``        ``SHT_SYMTAB``   *none*
    ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
    ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.
  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-onwards`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is
4-byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate 4-byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 generation is no longer supported by this version of LLVM.

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

 ..
table:: AMDGPU Code Object V2 ELF Note Records 2333 :name: amdgpu-elf-note-records-v2-table 2334 2335 ===== ===================================== ====================================== 2336 Name Type Description 2337 ===== ===================================== ====================================== 2338 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version. 2339 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL 2340 Finalizer and not the LLVM compiler. 2341 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version. 2342 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in 2343 YAML [YAML]_ textual format. 2344 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name. 2345 ===== ===================================== ====================================== 2346 2347.. 2348 2349 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 2350 :name: amdgpu-elf-note-record-enumeration-values-v2-table 2351 2352 ===================================== ===== 2353 Name Value 2354 ===================================== ===== 2355 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1 2356 ``NT_AMD_HSA_HSAIL`` 2 2357 ``NT_AMD_HSA_ISA_VERSION`` 3 2358 *reserved* 4-9 2359 ``NT_AMD_HSA_METADATA`` 10 2360 ``NT_AMD_HSA_ISA_NAME`` 11 2361 ===================================== ===== 2362 2363``NT_AMD_HSA_CODE_OBJECT_VERSION`` 2364 Specifies the code object version number. The description field has the 2365 following layout: 2366 2367 .. code:: c 2368 2369 struct amdgpu_hsa_note_code_object_version_s { 2370 uint32_t major_version; 2371 uint32_t minor_version; 2372 }; 2373 2374 The ``major_version`` has a value less than or equal to 2. 2375 2376``NT_AMD_HSA_HSAIL`` 2377 Specifies the HSAIL properties used by the HSAIL Finalizer. The description 2378 field has the following layout: 2379 2380 .. 
code:: c

      struct amdgpu_hsa_note_hsail_s {
        uint32_t hsail_major_version;
        uint32_t hsail_minor_version;
        uint8_t profile;
        uint8_t machine_model;
        uint8_t default_float_round;
      };

``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following
  layout:

  .. code:: c

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      char vendor_and_architecture_name[1];
    };

  ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.

  This note record is used by the HSA runtime loader.

  Code object V2 only supports a limited number of processors and has fixed
  settings for target features. See
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
  processors and the corresponding target ID. In the table the note record ISA
  name is a concatenation of the vendor name, architecture name, major, minor,
  and stepping separated by a ":".

  The target ID column shows the processor name and fixed target features used
  by the LLVM compiler. The LLVM compiler does not generate a
  ``NT_AMD_HSA_HSAIL`` note record.

  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
``NT_AMD_HSA_ISA_NAME``
  Specifies the target ISA name as a non-NUL terminated string.

  This note record is not used by the HSA runtime loader.

  See the ``NT_AMD_HSA_ISA_VERSION`` note record description for code object
  V2's limited processor support and fixed target feature settings.

  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not append it and instead
  used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.

``NT_AMD_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on
  HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
  object metadata string.

 ..
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings 2453 :name: amdgpu-elf-note-record-supported_processors-v2-table 2454 2455 ===================== ========================== 2456 Note Record ISA Name Target ID 2457 ===================== ========================== 2458 ``AMD:AMDGPU:6:0:0`` ``gfx600`` 2459 ``AMD:AMDGPU:6:0:1`` ``gfx601`` 2460 ``AMD:AMDGPU:6:0:2`` ``gfx602`` 2461 ``AMD:AMDGPU:7:0:0`` ``gfx700`` 2462 ``AMD:AMDGPU:7:0:1`` ``gfx701`` 2463 ``AMD:AMDGPU:7:0:2`` ``gfx702`` 2464 ``AMD:AMDGPU:7:0:3`` ``gfx703`` 2465 ``AMD:AMDGPU:7:0:4`` ``gfx704`` 2466 ``AMD:AMDGPU:7:0:5`` ``gfx705`` 2467 ``AMD:AMDGPU:8:0:0`` ``gfx802`` 2468 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+`` 2469 ``AMD:AMDGPU:8:0:2`` ``gfx802`` 2470 ``AMD:AMDGPU:8:0:3`` ``gfx803`` 2471 ``AMD:AMDGPU:8:0:4`` ``gfx803`` 2472 ``AMD:AMDGPU:8:0:5`` ``gfx805`` 2473 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+`` 2474 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-`` 2475 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+`` 2476 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-`` 2477 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+`` 2478 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-`` 2479 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+`` 2480 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-`` 2481 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+`` 2482 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-`` 2483 ===================== ========================== 2484 2485.. _amdgpu-note-records-v3-onwards: 2486 2487Code Object V3 and Above Note Records 2488~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2489 2490The AMDGPU backend code object uses the following ELF note record in the 2491``.note`` section when compiling for code object V3 and above. 2492 2493The note record vendor field is "AMDGPU". 2494 2495Additional note records may be present, but any which are not documented here 2496are deprecated and should not be used. 2497 2498 .. 
table:: AMDGPU Code Object V3 and Above ELF Note Records 2499 :name: amdgpu-elf-note-records-table-v3-onwards 2500 2501 ======== ============================== ====================================== 2502 Name Type Description 2503 ======== ============================== ====================================== 2504 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 2505 binary format. 2506 "AMDGPU" ``NT_AMDGPU_KFD_CORE_STATE`` Snapshot of runtime, agent and queues 2507 state for use in core dump. See 2508 :ref:`amdgpu_corefile_note`. 2509 ======== ============================== ====================================== 2510 2511.. 2512 2513 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values 2514 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards 2515 2516 ============================== ===== 2517 Name Value 2518 ============================== ===== 2519 *reserved* 0-31 2520 ``NT_AMDGPU_METADATA`` 32 2521 ``NT_AMDGPU_KFD_CORE_STATE`` 33 2522 ============================== ===== 2523 2524``NT_AMDGPU_METADATA`` 2525 Specifies extensible metadata associated with an AMDGPU code object. It is 2526 encoded as a map in the Message Pack [MsgPack]_ binary data format. See 2527 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, 2528 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and 2529 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the 2530 ``amdhsa`` OS. 2531 2532.. _amdgpu-symbols: 2533 2534Symbols 2535------- 2536 2537Symbols include the following: 2538 2539 .. 
table:: AMDGPU ELF Symbols
    :name: amdgpu-elf-symbols-table

    ===================== ================== ==================== ======================
    Name                  Type               Section              Description
    ===================== ================== ==================== ======================
    *link-name*           ``STT_OBJECT``     - ``.data``          Global variable
                                             - ``.rodata``
                                             - ``.bss``
    *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``        Kernel descriptor
    *link-name*           ``STT_FUNC``       - ``.text``          Kernel entry point
    *link-name*           ``STT_OBJECT``     - ``SHN_AMDGPU_LDS`` Global variable in LDS
    ===================== ================== ==================== ======================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor-specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO::

     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of
  the kernel descriptor that is used in the AQL dispatch packet used to invoke
  the kernel, not the kernel entry point. The layout of the HSA kernel
  descriptor is defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

..
_amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records for
AMDHSA or ``Elf64_Rel`` relocation records for Mesa/AMDPAL. The supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field. If
  the addend field is smaller than 64 bits then it is zero-extended to 64 bits
  for use in the calculations below. (In practice this only affects ``_HI``
  relocation types on Mesa/AMDPAL, where the addend comes from the 32-bit field
  but the result of the calculation depends on the high part of the full 64-bit
  address.)

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.
2631 2632The following relocation types are supported: 2633 2634 .. table:: AMDGPU ELF Relocation Records 2635 :name: amdgpu-elf-relocation-records-table 2636 2637 ========================== ======= ===== ========== ============================== 2638 Relocation Type Kind Value Field Calculation 2639 ========================== ======= ===== ========== ============================== 2640 ``R_AMDGPU_NONE`` 0 *none* *none* 2641 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF 2642 Dynamic 2643 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 2644 Dynamic 2645 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A 2646 Dynamic 2647 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P 2648 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P 2649 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A 2650 Dynamic 2651 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P 2652 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 2653 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 2654 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF 2655 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 2656 *reserved* 12 2657 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A 2658 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4 2659 ========================== ======= ===== ========== ============================== 2660 2661``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by 2662the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. 2663 2664There is no current OS loader support for 32-bit programs and so 2665``R_AMDGPU_ABS32`` is not used. 2666 2667.. 
_amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object
from which the code object was loaded as a textual Uniform Resource Identifier
(URI). Note that the code object is the in-memory, loaded and relocated form of
the ELF shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or
  "0X", and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.
2708 2709**process_id** 2710 Is the identity of the process owning the memory. For Linux it is the C 2711 unsigned integral decimal literal for the process ID (PID). 2712 2713For example: 2714 2715.. code:: 2716 2717 file:///dir1/dir2/file1 2718 file:///dir3/dir4/file2#offset=0x2000&size=3000 2719 memory://1234#offset=0x20000&size=3000 2720 2721.. _amdgpu-dwarf-debug-information: 2722 2723DWARF Debug Information 2724======================= 2725 2726.. warning:: 2727 2728 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that 2729 is not currently fully implemented and is subject to change. 2730 2731AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see 2732:ref:`amdgpu-elf-code-object`) which contain information that maps the code 2733object executable code and data to the source language constructs. It can be 2734used by tools such as debuggers and profilers. It uses features defined in 2735:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in 2736DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension. 2737 2738This section defines the AMDGPU target architecture specific DWARF mappings. 2739 2740.. _amdgpu-dwarf-register-identifier: 2741 2742Register Identifier 2743------------------- 2744 2745This section defines the AMDGPU target architecture register numbers used in 2746DWARF operation expressions (see DWARF Version 5 section 2.5 and 2747:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information 2748instructions (see DWARF Version 5 section 6.4 and 2749:ref:`amdgpu-dwarf-call-frame-information`). 2750 2751A single code object can contain code for kernels that have different wavefront 2752sizes. The vector registers and some scalar registers are based on the wavefront 2753size. AMDGPU defines distinct DWARF registers for each wavefront size. 
This 2754simplifies the consumer of the DWARF so that each register has a fixed size, 2755rather than being dynamic according to the wavefront size mode. Similarly, 2756distinct DWARF registers are defined for those registers that vary in size 2757according to the process address size. This allows a consumer to treat a 2758specific AMDGPU processor as a single architecture regardless of how it is 2759configured at run time. The compiler explicitly specifies the DWARF registers 2760that match the mode in which the code it is generating will be executed. 2761 2762DWARF registers are encoded as numbers, which are mapped to architecture 2763registers. The mapping for AMDGPU is defined in 2764:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same 2765mapping. 2766 2767.. table:: AMDGPU DWARF Register Mapping 2768 :name: amdgpu-dwarf-register-mapping-table 2769 2770 ============== ================= ======== ================================== 2771 DWARF Register AMDGPU Register Bit Size Description 2772 ============== ================= ======== ================================== 2773 0 PC_32 32 Program Counter (PC) when 2774 executing in a 32-bit process 2775 address space. Used in the CFI to 2776 describe the PC of the calling 2777 frame. 2778 1 EXEC_MASK_32 32 Execution Mask Register when 2779 executing in wavefront 32 mode. 2780 2-15 *Reserved* *Reserved for highly accessed 2781 registers using DWARF shortcut.* 2782 16 PC_64 64 Program Counter (PC) when 2783 executing in a 64-bit process 2784 address space. Used in the CFI to 2785 describe the PC of the calling 2786 frame. 2787 17 EXEC_MASK_64 64 Execution Mask Register when 2788 executing in wavefront 64 mode. 2789 18-31 *Reserved* *Reserved for highly accessed 2790 registers using DWARF shortcut.* 2791 32-95 SGPR0-SGPR63 32 Scalar General Purpose 2792 Registers. 2793 96-127 *Reserved* *Reserved for frequently accessed 2794 registers using DWARF 1-byte ULEB.* 2795 128 STATUS 32 Status Register. 
2796 129-511 *Reserved* *Reserved for future Scalar 2797 Architectural Registers.* 2798 512 VCC_32 32 Vector Condition Code Register 2799 when executing in wavefront 32 2800 mode. 2801 513-767 *Reserved* *Reserved for future Vector 2802 Architectural Registers when 2803 executing in wavefront 32 mode.* 2804 768 VCC_64 64 Vector Condition Code Register 2805 when executing in wavefront 64 2806 mode. 2807 769-1023 *Reserved* *Reserved for future Vector 2808 Architectural Registers when 2809 executing in wavefront 64 mode.* 2810 1024-1087 *Reserved* *Reserved for padding.* 2811 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers. 2812 1130-1535 *Reserved* *Reserved for future Scalar 2813 General Purpose Registers.* 2814 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers 2815 when executing in wavefront 32 2816 mode. 2817 1792-2047 *Reserved* *Reserved for future Vector 2818 General Purpose Registers when 2819 executing in wavefront 32 mode.* 2820 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers 2821 when executing in wavefront 32 2822 mode. 2823 2304-2559 *Reserved* *Reserved for future Vector 2824 Accumulation Registers when 2825 executing in wavefront 32 mode.* 2826 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers 2827 when executing in wavefront 64 2828 mode. 2829 2816-3071 *Reserved* *Reserved for future Vector 2830 General Purpose Registers when 2831 executing in wavefront 64 mode.* 2832 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers 2833 when executing in wavefront 64 2834 mode. 2835 3328-3583 *Reserved* *Reserved for future Vector 2836 Accumulation Registers when 2837 executing in wavefront 64 mode.* 2838 ============== ================= ======== ================================== 2839 2840The vector registers are represented as the full size for the wavefront. 
They 2841are organized as consecutive dwords (32-bits), one per lane, with the dword at 2842the least significant bit position corresponding to lane 0 and so forth. DWARF 2843location expressions involving the ``DW_OP_LLVM_offset`` and 2844``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector 2845register corresponding to the lane that is executing the current thread of 2846execution in languages that are implemented using a SIMD or SIMT execution 2847model. 2848 2849If the wavefront size is 32 lanes then the wavefront 32 mode register 2850definitions are used. If the wavefront size is 64 lanes then the wavefront 64 2851mode register definitions are used. Some AMDGPU targets support executing in 2852both wavefront 32 and wavefront 64 mode. The register definitions corresponding 2853to the wavefront mode of the generated code will be used. 2854 2855If code is generated to execute in a 32-bit process address space, then the 285632-bit process address space register definitions are used. If code is generated 2857to execute in a 64-bit process address space, then the 64-bit process address 2858space register definitions are used. The ``amdgcn`` target only supports the 285964-bit process address space. 2860 2861.. _amdgpu-dwarf-memory-space-identifier: 2862 2863Memory Space Identifier 2864----------------------- 2865 2866The DWARF memory space represents the source language memory space. See DWARF 2867Version 5 section 2.12 which is updated by the *DWARF Extensions For 2868Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`. 2869 2870The DWARF memory space mapping used for AMDGPU is defined in 2871:ref:`amdgpu-dwarf-memory-space-mapping-table`. 2872 2873.. 
table:: AMDGPU DWARF Memory Space Mapping 2874 :name: amdgpu-dwarf-memory-space-mapping-table 2875 2876 =========================== ====== ================= 2877 DWARF AMDGPU 2878 ---------------------------------- ----------------- 2879 Memory Space Name Value Memory Space 2880 =========================== ====== ================= 2881 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat) 2882 ``DW_MSPACE_LLVM_global`` 0x0001 Global 2883 ``DW_MSPACE_LLVM_constant`` 0x0002 Global 2884 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS) 2885 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch) 2886 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS) 2887 =========================== ====== ================= 2888 2889The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous 2890Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used. 2891 2892In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is 2893available for use for the AMD extension for access to the hardware GDS memory 2894which is scratchpad memory allocated per device. 2895 2896For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the 2897default memory space of ``DW_MSPACE_LLVM_none`` is used. 2898 2899See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU 2900mapping of DWARF memory spaces to DWARF address spaces, including address size 2901and NULL value. 2902 2903.. _amdgpu-dwarf-address-space-identifier: 2904 2905Address Space Identifier 2906------------------------ 2907 2908DWARF address spaces correspond to target architecture specific linear 2909addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions 2910For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`. 2911 2912The DWARF address space mapping used for AMDGPU is defined in 2913:ref:`amdgpu-dwarf-address-space-mapping-table`. 2914 2915.. 
table:: AMDGPU DWARF Address Space Mapping 2916 :name: amdgpu-dwarf-address-space-mapping-table 2917 2918 ======================================= ===== ======= ======== ===================== ======================= 2919 DWARF AMDGPU Notes 2920 --------------------------------------- ----- ---------------- --------------------- ----------------------- 2921 Address Space Name Value Address Bit Size LLVM IR Address Space 2922 --------------------------------------- ----- ------- -------- --------------------- ----------------------- 2923 .. 64-bit 32-bit 2924 process process 2925 address address 2926 space space 2927 ======================================= ===== ======= ======== ===================== ======================= 2928 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space* 2929 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat) 2930 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS) 2931 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS) 2932 *Reserved* 0x04 2933 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane* 2934 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront* 2935 ======================================= ===== ======= ======== ===================== ======================= 2936 2937See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address 2938spaces including address size and NULL value. 2939 2940The ``DW_ASPACE_LLVM_none`` address space is the default target architecture 2941address space used in DWARF operations that do not specify an address space. It 2942therefore has to map to the global address space so that the ``DW_OP_addr*`` and 2943related operations can refer to addresses in the program code. 2944 2945The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to 2946specify the flat address space. 
If the address corresponds to an address in
the local address space, then it corresponds to the wavefront that is executing
the focused thread of execution. If the address corresponds to an address in
the private address space, then it corresponds to the lane that is executing
the focused thread of execution for languages that are implemented using a SIMD
or SIMT execution model.

.. note::

  CUDA-like languages such as HIP that do not have address spaces in the
  language type system, but do allow variables to be allocated in different
  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
  address space in the DWARF expression operations as the default address space
  is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is
executing the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
to specify the private address space corresponding to the lane that is
executing the focused thread of execution for languages that are implemented
using a SIMD or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
to specify the unswizzled private address space corresponding to the wavefront
that is executing the focused thread of execution. The wavefront view of
private memory is the per wavefront unswizzled backing memory layout defined in
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
location for the backing memory of the wavefront (namely the address is not
offset by ``wavefront-scratch-base``).
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the
start of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read
a complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and onto which a source language maps
its threads of execution. The DWARF lane identifier is pushed by the
``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage which includes
memory and registers. When accessing storage on AMDGPU, bytes are ordered with
least significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
unwinding vector registers that are spilled under the execution mask to memory:
the zero-single location description is the vector register, and the one-single
location description is the spilled memory location description. The
``DW_OP_LLVM_form_aspace_address`` operation is used to specify the address
space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.

.. _amdgpu-dwarf-base-type-conversions:

Base Type Conversions
---------------------

For AMDGPU expressions, ``DW_OP_convert`` may be used to convert between
``DW_ATE_address``-encoded base types in different address spaces.

Conversions are defined as in :ref:`amdgpu-address-spaces` when all relevant
conditions described there are met, and otherwise result in an evaluation
error.
.. note::

  For a target which does not support a particular address space, converting to
  or from that address space is always an evaluation error.

  For targets which support the generic address space, converting from
  ``DW_ASPACE_AMDGPU_generic`` to ``DW_ASPACE_LLVM_none`` is defined when the
  generic address is in the global address space. The conversion requires no
  change to the literal value of the address.

  Converting from ``DW_ASPACE_AMDGPU_generic`` to any of
  ``DW_ASPACE_AMDGPU_local``, ``DW_ASPACE_AMDGPU_private_wave`` or
  ``DW_ASPACE_AMDGPU_private_lane`` is defined when the relevant hardware
  support is present, any required hardware setup has been completed, and the
  generic address is in the corresponding address space. Conversion to
  ``DW_ASPACE_AMDGPU_private_lane`` additionally requires the context to
  include the active lane.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
which are updated by *DWARF Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-low-level-information` and
:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.

.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this
is the program location in the subprogram at which execution of the lane is
conceptually positioned.
If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the
debugger may repeatedly unwind the stack and inspect the
``DW_AT_LLVM_lane_pc`` of the calling subprogram until it finds a
non-undefined location. Conceptually the lane only has the call frames for
which it has a non-undefined ``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to
true. First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and taking the
logical ``AND`` of that with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
value it had at the beginning of the region. This is shown below. Other
approaches are possible, but the basic concept is the same.
.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This
can be done by defining an artificial variable for the lane PC. The DWARF
location list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information
entry.

A DWARF procedure is defined for each well nested structured control flow
region which provides the conceptual lane program location for a lane if it is
not active (namely it is divergent). The DWARF operation expression for each
region conceptually inherits the value of the immediately enclosing region and
modifies it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start
of the region for the ``THEN`` region since it is executed first. For the
``ELSE`` region the divergent program location is at the end of the
``IF/THEN/ELSE`` region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the
result by inserting the current program location for each lane that the
``EXEC`` mask indicates is active.
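Conceptually, evaluating these expressions produces a vector with one PC per
lane: lanes that the ``EXEC`` mask marks active receive the current program
location, and the remaining lanes keep the divergent location computed by the
enclosing region's DWARF procedure. The following Python sketch models this
``DW_OP_LLVM_select_bit_piece`` style merge; the helper names and the 64-lane
wavefront size are illustrative assumptions, not an LLVM or DWARF API.

```python
WAVEFRONT_SIZE = 64  # assumed wave64 mode, matching the example above

def select_bit_piece(mask, one_elems, zero_elems):
    # Model of DW_OP_LLVM_select_bit_piece: element i comes from
    # one_elems when bit i of mask is set, otherwise from zero_elems.
    return [one_elems[i] if (mask >> i) & 1 else zero_elems[i]
            for i in range(WAVEFRONT_SIZE)]

def lane_pcs(exec_mask, current_pc, divergent_lane_pcs):
    # Extend the current PC to one element per lane (as DW_OP_LLVM_extend
    # does), then merge with the divergent lane PCs under the execution
    # mask.
    return select_bit_piece(exec_mask,
                            [current_pc] * WAVEFRONT_SIZE,
                            divergent_lane_pcs)
```

For example, with only lanes 0 and 1 active, ``lane_pcs(0x3, pc, divergent)``
yields ``pc`` for lanes 0 and 1 and the divergent entries for all other lanes.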
By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the
DWARF operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    b;
    %3 = EXEC;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
    %4 = c2;
  $lex_1_1_start:
    EXEC = %3 & %4;
  $lex_1_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    c;
    EXEC = ~EXEC & %3;
  $lex_1_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    d;
    EXEC = %3;
  $lex_1_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC
elements that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution
mask artificial variables to only update the lanes that are active on entry to
the region. All other lanes retain the value of the enclosing region where
they were last active. If they were not active on entry to the subprogram,
then they will have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of
the loop. Any lanes active will be in the loop, and any lanes not active must
have exited the loop.

An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.
.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC``
mask, update it to enable the necessary lanes, perform the operations, and
then restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask.
The active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation
string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit. The version number
conforms to [SEMVER]_.

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
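Unwinding a vector register spilled to scratch relies on the private lane to
private wavefront address conversion given in
:ref:`amdgpu-dwarf-address-space-identifier`. That arithmetic can be sketched
in Python as follows; the function name is hypothetical, and the lane id and
wavefront size are values supplied by the debugger's context.

```python
def private_lane_to_wave(private_address_lane, lane_id, wavefront_size):
    # DW_ASPACE_AMDGPU_private_lane -> DW_ASPACE_AMDGPU_private_wave,
    # following the formula in the Address Space Identifier section:
    # the swizzled layout interleaves one dword per lane.
    return ((private_address_lane // 4) * wavefront_size * 4
            + lane_id * 4
            + private_address_lane % 4)

# For a dword-aligned lane address and lane 0, this reduces to
# private_address_lane * wavefront_size.
assert private_lane_to_wave(16, 0, 64) == 16 * 64
```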
For AMDGPU, the Common Information Entry (CIE) fields have the following
values:

1. ``augmentation`` string contains the following null-terminated UTF-8
   string:

   ::

     [amd:v0.0]

   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE and in the FDEs that use it. The version number
   conforms to [SEMVER]_.

2. ``address_size`` for the ``Global`` address space is defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4. ``code_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

5. ``data_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
   for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7. ``initial_instructions``: since a subprogram X with fewer registers can be
   called from subprogram Y that has more allocated, X will not change any of
   the extra registers as it cannot access them. Therefore, the default rule
   for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.

Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
For AMDGPU the lookup by name section header table has the following values:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This is different from the DWARF Version 5 definition, which requires the
    first 4 characters to be the vendor ID. However, it is consistent with the
    other augmentation strings and does allow multiple vendor contributions.
    On the other hand, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table has the following
values:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and
:ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to
0. The instruction set must be obtained from the ELF file header ``e_flags``
field in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
.. TODO::

  Should the ``isa`` state machine register be used to indicate if the code is
  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values
(see DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX11 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX11 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.
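Because ``minimum_instruction_length`` is 4 and
``maximum_operations_per_instruction`` is 1 for GFX9-GFX11, line number
program operations advance the address in 4-byte units. A small illustrative
calculation (a hypothetical helper for exposition, not an LLVM API):

```python
MINIMUM_INSTRUCTION_LENGTH = 4       # GFX9-GFX11
MAXIMUM_OPERATIONS_PER_INSTRUCTION = 1

def advance_pc(address, operation_advance):
    # DWARF Version 5 section 6.2.5.1: with one operation per
    # instruction, DW_LNS_advance_pc moves the address forward by
    # operation_advance * minimum_instruction_length bytes.
    return address + operation_advance * MINIMUM_INSTRUCTION_LENGTH

assert advance_pc(0x1000, 3) == 0x100c
```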
32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described
in DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple
OS (see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the
code objects executed on HSA [HSA]_ compatible runtimes (see
:ref:`amdgpu-os`). The encoding and semantics of this metadata depend on the
code object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).
It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries.
For example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::

  Code object V2 generation is no longer supported by this version of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note
record (see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

  Is the string null terminated? It probably should not be if YAML allows it
  to contain null characters, otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call.

                                         ``N``
                                           A 32-bit integer equal to the
                                           number of arguments of the printf
                                           function call minus 1.

                                         ``S[i]`` (where i = 0, 1, ..., N-1)
                                           32-bit integers for the size in
                                           bytes of the i-th FormatString
                                           argument of the printf function
                                           call.

                                         ``FormatString``
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel descriptor
                                                ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the
                       2 integers                 major version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related
                                                to the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all
                         3 integers               values must be >=1 and the
                                                  dispatch work-group size X,
                                                  Y, Z must correspond to the
                                                  specified values. Defaults to
                                                  0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or
                                                  vector type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's
                                                  address to it, which is used
                                                  for device side enqueueing.
                                                  Only available for device
                                                  side enqueued kernels.
     =================== ============== ========= ==============================

..

table:: AMDHSA Code Object V2 Kernel Argument Metadata Map 3776 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table 3777 3778 ================= ============== ========= ================================ 3779 String Key Value Type Required? Description 3780 ================= ============== ========= ================================ 3781 "Name" string Kernel argument name. 3782 "TypeName" string Kernel argument type name. 3783 "Size" integer Required Kernel argument size in bytes. 3784 "Align" integer Required Kernel argument alignment in 3785 bytes. Must be a power of two. 3786 "ValueKind" string Required Kernel argument kind that 3787 specifies how to set up the 3788 corresponding argument. 3789 Values include: 3790 3791 "ByValue" 3792 The argument is copied 3793 directly into the kernarg. 3794 3795 "GlobalBuffer" 3796 A global address space pointer 3797 to the buffer data is passed 3798 in the kernarg. 3799 3800 "DynamicSharedPointer" 3801 A group address space pointer 3802 to dynamically allocated LDS 3803 is passed in the kernarg. 3804 3805 "Sampler" 3806 A global address space 3807 pointer to a S# is passed in 3808 the kernarg. 3809 3810 "Image" 3811 A global address space 3812 pointer to a T# is passed in 3813 the kernarg. 3814 3815 "Pipe" 3816 A global address space pointer 3817 to an OpenCL pipe is passed in 3818 the kernarg. 3819 3820 "Queue" 3821 A global address space pointer 3822 to an OpenCL device enqueue 3823 queue is passed in the 3824 kernarg. 3825 3826 "HiddenGlobalOffsetX" 3827 The OpenCL grid dispatch 3828 global offset for the X 3829 dimension is passed in the 3830 kernarg. 3831 3832 "HiddenGlobalOffsetY" 3833 The OpenCL grid dispatch 3834 global offset for the Y 3835 dimension is passed in the 3836 kernarg. 3837 3838 "HiddenGlobalOffsetZ" 3839 The OpenCL grid dispatch 3840 global offset for the Z 3841 dimension is passed in the 3842 kernarg. 3843 3844 "HiddenNone" 3845 An argument that is not used 3846 by the kernel. 
Space needs to 3847 be left for it, but it does 3848 not need to be set up. 3849 3850 "HiddenPrintfBuffer" 3851 A global address space pointer 3852 to the runtime printf buffer 3853 is passed in kernarg. Mutually 3854 exclusive with 3855 "HiddenHostcallBuffer". 3856 3857 "HiddenHostcallBuffer" 3858 A global address space pointer 3859 to the runtime hostcall buffer 3860 is passed in kernarg. Mutually 3861 exclusive with 3862 "HiddenPrintfBuffer". 3863 3864 "HiddenDefaultQueue" 3865 A global address space pointer 3866 to the OpenCL device enqueue 3867 queue that should be used by 3868 the kernel by default is 3869 passed in the kernarg. 3870 3871 "HiddenCompletionAction" 3872 A global address space pointer 3873 to help link enqueued kernels into 3874 the ancestor tree for determining 3875 when the parent kernel has finished. 3876 3877 "HiddenMultiGridSyncArg" 3878 A global address space pointer for 3879 multi-grid synchronization is 3880 passed in the kernarg. 3881 3882 "ValueType" string Unused and deprecated. This should no longer 3883 be emitted, but is accepted for compatibility. 3884 3885 3886 "PointeeAlign" integer Alignment in bytes of pointee 3887 type for pointer type kernel 3888 argument. Must be a power 3889 of 2. Only present if 3890 "ValueKind" is 3891 "DynamicSharedPointer". 3892 "AddrSpaceQual" string Kernel argument address space 3893 qualifier. Only present if 3894 "ValueKind" is "GlobalBuffer" or 3895 "DynamicSharedPointer". Values 3896 are: 3897 3898 - "Private" 3899 - "Global" 3900 - "Constant" 3901 - "Local" 3902 - "Generic" 3903 - "Region" 3904 3905 .. TODO:: 3906 3907 Is GlobalBuffer only Global 3908 or Constant? Is 3909 DynamicSharedPointer always 3910 Local? Can HCC allow Generic? 3911 How can Private or Region 3912 ever happen? 3913 3914 "AccQual" string Kernel argument access 3915 qualifier. Only present if 3916 "ValueKind" is "Image" or 3917 "Pipe". Values 3918 are: 3919 3920 - "ReadOnly" 3921 - "WriteOnly" 3922 - "ReadWrite" 3923 3924 .. 
TODO:: 3925 3926 Does this apply to 3927 GlobalBuffer? 3928 3929 "ActualAccQual" string The actual memory accesses 3930 performed by the kernel on the 3931 kernel argument. Only present if 3932 "ValueKind" is "GlobalBuffer", 3933 "Image", or "Pipe". This may be 3934 more restrictive than indicated 3935 by "AccQual" to reflect what the 3936 kernel actual does. If not 3937 present then the runtime must 3938 assume what is implied by 3939 "AccQual" and "IsConst". Values 3940 are: 3941 3942 - "ReadOnly" 3943 - "WriteOnly" 3944 - "ReadWrite" 3945 3946 "IsConst" boolean Indicates if the kernel argument 3947 is const qualified. Only present 3948 if "ValueKind" is 3949 "GlobalBuffer". 3950 3951 "IsRestrict" boolean Indicates if the kernel argument 3952 is restrict qualified. Only 3953 present if "ValueKind" is 3954 "GlobalBuffer". 3955 3956 "IsVolatile" boolean Indicates if the kernel argument 3957 is volatile qualified. Only 3958 present if "ValueKind" is 3959 "GlobalBuffer". 3960 3961 "IsPipe" boolean Indicates if the kernel argument 3962 is pipe qualified. Only present 3963 if "ValueKind" is "Pipe". 3964 3965 .. TODO:: 3966 3967 Can GlobalBuffer be pipe 3968 qualified? 3969 3970 ================= ============== ========= ================================ 3971 3972.. 3973 3974 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map 3975 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table 3976 3977 ============================ ============== ========= ===================== 3978 String Key Value Type Required? Description 3979 ============================ ============== ========= ===================== 3980 "KernargSegmentSize" integer Required The size in bytes of 3981 the kernarg segment 3982 that holds the values 3983 of the arguments to 3984 the kernel. 3985 "GroupSegmentFixedSize" integer Required The amount of group 3986 segment memory 3987 required by a 3988 work-group in 3989 bytes. 
This does not 3990 include any 3991 dynamically allocated 3992 group segment memory 3993 that may be added 3994 when the kernel is 3995 dispatched. 3996 "PrivateSegmentFixedSize" integer Required The amount of fixed 3997 private address space 3998 memory required for a 3999 work-item in 4000 bytes. If the kernel 4001 uses a dynamic call 4002 stack then additional 4003 space must be added 4004 to this value for the 4005 call stack. 4006 "KernargSegmentAlign" integer Required The maximum byte 4007 alignment of 4008 arguments in the 4009 kernarg segment. Must 4010 be a power of 2. 4011 "WavefrontSize" integer Required Wavefront size. Must 4012 be a power of 2. 4013 "NumSGPRs" integer Required Number of scalar 4014 registers used by a 4015 wavefront for 4016 GFX6-GFX11. This 4017 includes the special 4018 SGPRs for VCC, Flat 4019 Scratch (GFX7-GFX10) 4020 and XNACK (for 4021 GFX8-GFX10). It does 4022 not include the 16 4023 SGPR added if a trap 4024 handler is 4025 enabled. It is not 4026 rounded up to the 4027 allocation 4028 granularity. 4029 "NumVGPRs" integer Required Number of vector 4030 registers used by 4031 each work-item for 4032 GFX6-GFX11 4033 "MaxFlatWorkGroupSize" integer Required Maximum flat 4034 work-group size 4035 supported by the 4036 kernel in work-items. 4037 Must be >=1 and 4038 consistent with 4039 ReqdWorkGroupSize if 4040 not 0, 0, 0. 4041 "NumSpilledSGPRs" integer Number of stores from 4042 a scalar register to 4043 a register allocator 4044 created spill 4045 location. 4046 "NumSpilledVGPRs" integer Number of stores from 4047 a vector register to 4048 a register allocator 4049 created spill 4050 location. 4051 ============================ ============== ========= ===================== 4052 4053.. _amdgpu-amdhsa-code-object-metadata-v3: 4054 4055Code Object V3 Metadata 4056+++++++++++++++++++++++ 4057 4058.. warning:: 4059 Code object V3 generation is no longer supported by this version of LLVM. 
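The colon-separated printf metadata string format is the same in the V2 "Printf" key and the V3 ``amdhsa.printf`` key. Because only the first ``2 + N`` colons act as separators, the format string itself may contain colons. A minimal parsing sketch (the function name is illustrative, not part of any runtime API):

```python
def parse_printf_metadata(entry: str) -> dict:
    """Unpack an AMDGPU printf metadata string of the form
    ID:N:S[0]:S[1]:...:S[N-1]:FormatString."""
    ident, n, rest = entry.split(":", 2)
    n = int(n)
    sizes = []
    for _ in range(n):
        # Peel off one size field; everything after the last size
        # field is the format string, colons included.
        size, rest = rest.split(":", 1)
        sizes.append(int(size))
    return {"id": int(ident), "arg_sizes": sizes, "format": rest}

# A call printf("%d:%d = %s\n", a, b, s) with two int arguments and a
# string argument could be encoded as "2:3:4:4:8:%d:%d = %s\n".
```

Note that for ``N`` equal to 0 the remainder after the second colon is the format string itself.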
Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of printf function call
                                                  minus 1

                                                ``S[i]`` (where i = 0, 1, ... , N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                FormatString
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.

     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  The OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
                                                                  private address space
                                                                  memory required for a
                                                                  work-item in
                                                                  bytes. If the kernel
                                                                  uses a dynamic call
                                                                  stack then additional
                                                                  space must be added
                                                                  to this value for the
                                                                  call stack.
     ".kernarg_segment_align"            integer        Required  The maximum byte
                                                                  alignment of
                                                                  arguments in the
                                                                  kernarg segment. Must
                                                                  be a power of 2.
     ".wavefront_size"                   integer        Required  Wavefront size. Must
                                                                  be a power of 2.
     ".sgpr_count"                       integer        Required  Number of scalar
                                                                  registers required by a
                                                                  wavefront for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly. This
                                                                  includes the special
                                                                  SGPRs for VCC, Flat
                                                                  Scratch (GFX7-GFX9)
                                                                  and XNACK (for
                                                                  GFX8-GFX9). It does
                                                                  not include the 16
                                                                  SGPR added if a trap
                                                                  handler is
                                                                  enabled. It is not
                                                                  rounded up to the
                                                                  allocation
                                                                  granularity.
     ".vgpr_count"                       integer        Required  Number of vector
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly.
     ".agpr_count"                       integer        Required  Number of accumulator
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX90A, GFX908.
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
                                                                  work-group size
                                                                  supported by the
                                                                  kernel in work-items.
                                                                  Must be >=1 and
                                                                  consistent with
                                                                  ReqdWorkGroupSize if
                                                                  not 0, 0, 0.
     ".sgpr_spill_count"                 integer                  Number of stores from
                                                                  a scalar register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".vgpr_spill_count"                 integer                  Number of stores from
                                                                  a vector register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".kind"                             string                   The kind of the kernel
                                                                  with the following
                                                                  values:

                                                                  "normal"
                                                                    Regular kernels.

                                                                  "init"
                                                                    These kernels must be
                                                                    invoked after loading
                                                                    the containing code
                                                                    object and must
                                                                    complete before any
                                                                    normal and fini
                                                                    kernels in the same
                                                                    code object are
                                                                    invoked.

                                                                  "fini"
                                                                    These kernels must be
                                                                    invoked before
                                                                    unloading the
                                                                    containing code object
                                                                    and after all init and
                                                                    normal kernels in the
                                                                    same code object have
                                                                    been invoked and
                                                                    completed.

                                                                  If omitted, "normal" is
                                                                  assumed.
     ".max_num_work_groups_{x,y,z}"      integer                  The max number of
                                                                  launched work-groups
                                                                  in the X, Y, and Z
                                                                  dimensions. Each number
                                                                  must be >=1.
     =================================== ============== ========= ================================

..

  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the kernarg.

                                                     "global_buffer"
                                                       A global address space pointer
                                                       to the buffer data is passed
                                                       in the kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space pointer
                                                       to dynamically allocated LDS
                                                       is passed in the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed in
                                                       the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed in
                                                       the kernarg.

                                                     "pipe"
                                                       A global address space pointer
                                                       to an OpenCL pipe is passed in
                                                       the kernarg.

                                                     "queue"
                                                       A global address space pointer
                                                       to an OpenCL device enqueue
                                                       queue is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in the
                                                       kernarg.
                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_none"
                                                       An argument that is not used
                                                       by the kernel. Space needs to
                                                       be left for it, but it does
                                                       not need to be set up.

                                                     "hidden_printf_buffer"
                                                       A global address space pointer
                                                       to the runtime printf buffer
                                                       is passed in kernarg. Mutually
                                                       exclusive with
                                                       "hidden_hostcall_buffer"
                                                       before Code Object V5.

                                                     "hidden_hostcall_buffer"
                                                       A global address space pointer
                                                       to the runtime hostcall buffer
                                                       is passed in kernarg. Mutually
                                                       exclusive with
                                                       "hidden_printf_buffer"
                                                       before Code Object V5.

                                                     "hidden_default_queue"
                                                       A global address space pointer
                                                       to the OpenCL device enqueue
                                                       queue that should be used by
                                                       the kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space pointer
                                                       to help link enqueued kernels
                                                       into the ancestor tree for
                                                       determining when the parent
                                                       kernel has finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space pointer
                                                       for multi-grid synchronization
                                                       is passed in the kernarg.

     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be emitted,
                                                     but is accepted for
                                                     compatibility.

     ".pointee_align"       integer                  Alignment in bytes of pointee
                                                     type for pointer type kernel
                                                     argument. Must be a power
                                                     of 2. Only present if
                                                     ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address space
                                                     qualifier. Only present if
                                                     ".value_kind" is "global_buffer"
                                                     or "dynamic_shared_pointer".
                                                     Values are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                        Is "global_buffer" only
                                                        "global" or "constant"? Is
                                                        "dynamic_shared_pointer"
                                                        always "local"? Can HCC
                                                        allow "generic"? How can
                                                        "private" or "region"
                                                        ever happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                        Does this apply to
                                                        "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on the
                                                     kernel argument. Only present if
                                                     ".value_kind" is "global_buffer",
                                                     "image", or "pipe". This may be
                                                     more restrictive than indicated
                                                     by ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const". Values
                                                     are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel argument
                                                     is const qualified. Only present
                                                     if ".value_kind" is
                                                     "global_buffer".

     ".is_restrict"         boolean                  Indicates if the kernel argument
                                                     is restrict qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_volatile"         boolean                  Indicates if the kernel argument
                                                     is volatile qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_pipe"             boolean                  Indicates if the kernel argument
                                                     is pipe qualified. Only present
                                                     if ".value_kind" is "pipe".

                                                     .. TODO::

                                                        Can "global_buffer" be pipe
                                                        qualified?

     ====================== ============== ========= ================================

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V4 is not the default code object version emitted by this version
  of LLVM.
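The ``.size`` and ``.offset`` keys in the V3 kernel argument metadata are enough to sanity-check a kernarg segment layout. A hedged sketch, assuming natural alignment derived from the argument size (the metadata format itself only requires ``.offset`` to be a multiple of the argument's alignment, which it does not record directly):

```python
def check_kernarg_layout(args):
    """Check that each argument's .offset is naturally aligned and that
    arguments do not overlap; returns the minimum kernarg segment size.
    Natural alignment (size rounded up to a power of 2, capped at 8
    bytes) is an assumption of this sketch, not a rule of the format."""
    end = 0
    for arg in args:
        size, offset = arg[".size"], arg[".offset"]
        align = 1
        while align < min(size, 8):
            align *= 2
        if offset % align != 0:
            raise ValueError(f"argument at offset {offset} not {align}-byte aligned")
        if offset < end:
            raise ValueError(f"argument at offset {offset} overlaps previous one")
        end = offset + size
    return end
```

The returned value is a lower bound that the corresponding ``.kernarg_segment_size`` key would have to satisfy.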
Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

.. _amdgpu-amdhsa-code-object-metadata-v5:

Code Object V5 Metadata
+++++++++++++++++++++++

Code object V5 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
:ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.

  .. table:: AMDHSA Code Object V5 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 2.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5

     ============================= ============= ========== =======================================
     String Key                    Value Type    Required?  Description
     ============================= ============= ========== =======================================
     ".uses_dynamic_stack"         boolean                  Indicates if the generated machine code
                                                            is using a dynamically sized stack.
     ".workgroup_processor_mode"   boolean                  (GFX10+) Controls ENABLE_WGP_MODE in
                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ============================= ============= ========== =======================================

..

  .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table

     =========================== ============== ========= ==============================
     String Key                  Value Type     Required? Description
     =========================== ============== ========= ==============================
     ".uniform_work_group_size"  integer                  Indicates if the kernel
                                                          requires that each dimension
                                                          of global size is a multiple
                                                          of the corresponding dimension
                                                          of work-group size. A value of
                                                          1 implies true and a value of
                                                          0 implies false. Metadata is
                                                          only emitted when the value
                                                          is 1.
     =========================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument. The
                                                     values are the same as in code
                                                     object V3 metadata (see
                                                     :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
                                                     with the following additions:

                                                     "hidden_block_count_x"
                                                       The grid dispatch work-group
                                                       count for the X dimension is
                                                       passed in the kernarg. Some
                                                       languages, such as OpenCL,
                                                       support a last work-group in
                                                       each dimension being partial.
                                                       This count only includes the
                                                       non-partial work-group count.
                                                       This is not the same as the
                                                       value in the AQL dispatch
                                                       packet, which has the grid
                                                       size in work-items.

                                                     "hidden_block_count_y"
                                                       The grid dispatch work-group
                                                       count for the Y dimension is
                                                       passed in the kernarg. Some
                                                       languages, such as OpenCL,
                                                       support a last work-group in
                                                       each dimension being partial.
                                                       This count only includes the
                                                       non-partial work-group count.
                                                       This is not the same as the
                                                       value in the AQL dispatch
                                                       packet, which has the grid
                                                       size in work-items. If the
                                                       grid dimensionality is 1,
                                                       then must be 1.

                                                     "hidden_block_count_z"
                                                       The grid dispatch work-group
                                                       count for the Z dimension is
                                                       passed in the kernarg. Some
                                                       languages, such as OpenCL,
                                                       support a last work-group in
                                                       each dimension being partial.
                                                       This count only includes the
                                                       non-partial work-group count.
                                                       This is not the same as the
                                                       value in the AQL dispatch
                                                       packet, which has the grid
                                                       size in work-items. If the
                                                       grid dimensionality is 1 or
                                                       2, then must be 1.
                                                     "hidden_group_size_x"
                                                       The grid dispatch work-group
                                                       size for the X dimension is
                                                       passed in the kernarg. This
                                                       size only applies to the
                                                       non-partial work-groups. This
                                                       is the same value as the AQL
                                                       dispatch packet work-group
                                                       size.

                                                     "hidden_group_size_y"
                                                       The grid dispatch work-group
                                                       size for the Y dimension is
                                                       passed in the kernarg. This
                                                       size only applies to the
                                                       non-partial work-groups. This
                                                       is the same value as the AQL
                                                       dispatch packet work-group
                                                       size. If the grid
                                                       dimensionality is 1, then
                                                       must be 1.

                                                     "hidden_group_size_z"
                                                       The grid dispatch work-group
                                                       size for the Z dimension is
                                                       passed in the kernarg. This
                                                       size only applies to the
                                                       non-partial work-groups. This
                                                       is the same value as the AQL
                                                       dispatch packet work-group
                                                       size. If the grid
                                                       dimensionality is 1 or 2,
                                                       then must be 1.

                                                     "hidden_remainder_x"
                                                       The grid dispatch work-group
                                                       size of the partial work-group
                                                       of the X dimension, if it
                                                       exists. Must be zero if a
                                                       partial work-group does not
                                                       exist in the X dimension.

                                                     "hidden_remainder_y"
                                                       The grid dispatch work-group
                                                       size of the partial work-group
                                                       of the Y dimension, if it
                                                       exists. Must be zero if a
                                                       partial work-group does not
                                                       exist in the Y dimension.

                                                     "hidden_remainder_z"
                                                       The grid dispatch work-group
                                                       size of the partial work-group
                                                       of the Z dimension, if it
                                                       exists. Must be zero if a
                                                       partial work-group does not
                                                       exist in the Z dimension.

                                                     "hidden_grid_dims"
                                                       The grid dispatch
                                                       dimensionality. This is the
                                                       same value as the AQL
                                                       dispatch packet
                                                       dimensionality. Must be a
                                                       value between 1 and 3.

                                                     "hidden_heap_v1"
                                                       A global address space
                                                       pointer to an initialized
                                                       memory buffer that conforms
                                                       to the requirements of the
                                                       malloc/free device library V1
                                                       version implementation.

                                                     "hidden_dynamic_lds_size"
                                                       Size of the dynamically
                                                       allocated LDS memory is
                                                       passed in the kernarg.
4686 4687 "hidden_private_base" 4688 The high 32 bits of the flat addressing private aperture base. 4689 Only used by GFX8 to allow conversion between private segment 4690 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4691 4692 "hidden_shared_base" 4693 The high 32 bits of the flat addressing shared aperture base. 4694 Only used by GFX8 to allow conversion between shared segment 4695 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4696 4697 "hidden_queue_ptr" 4698 A global memory address space pointer to the ROCm runtime 4699 ``struct amd_queue_t`` structure for the HSA queue of the 4700 associated dispatch AQL packet. It is only required for pre-GFX9 4701 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`). 4702 4703 ====================== ============== ========= ================================ 4704 4705.. 4706 4707Kernel Dispatch 4708~~~~~~~~~~~~~~~ 4709 4710The HSA architected queuing language (AQL) defines a user space memory interface 4711that can be used to control the dispatch of kernels, in an agent independent 4712way. An agent can have zero or more AQL queues created for it using an HSA 4713compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which 4714are 64 bytes) can be placed. See the *HSA Platform System Architecture 4715Specification* [HSA]_ for the AQL queue mechanics and packet layouts. 4716 4717The packet processor of a kernel agent is responsible for detecting and 4718dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the 4719packet processor is implemented by the hardware command processor (CP), 4720asynchronous dispatch controller (ADC) and shader processor input controller 4721(SPI). 4722 4723An HSA compatible runtime can be used to allocate an AQL queue object. It uses 4724the kernel mode driver to initialize and register the AQL queue with CP. 4725 4726To dispatch a kernel the following actions are performed. 
This can occur in the 4727CPU host program, or from an HSA kernel executing on a GPU. 4728 47291. A pointer to an AQL queue for the kernel agent on which the kernel is to be 4730 executed is obtained. 47312. A pointer to the kernel descriptor (see 4732 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained. 4733 It must be for a kernel that is contained in a code object that was loaded 4734 by an HSA compatible runtime on the kernel agent with which the AQL queue is 4735 associated. 47363. Space is allocated for the kernel arguments using the HSA compatible runtime 4737 allocator for a memory region with the kernarg property for the kernel agent 4738 that will execute the kernel. It must be at least 16-byte aligned. 47394. Kernel argument values are assigned to the kernel argument memory 4740 allocation. The layout is defined in the *HSA Programmer's Language 4741 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the 4742 kernel argument memory in the same way constant memory is accessed. (Note 4743 that the HSA specification allows an implementation to copy the kernel 4744 argument contents to another location that is accessed by the kernel.) 47455. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible 4746 runtime api uses 64-bit atomic operations to reserve space in the AQL queue 4747 for the packet. The packet must be set up, and the final write must use an 4748 atomic store release to set the packet kind to ensure the packet contents are 4749 visible to the kernel agent. AQL defines a doorbell signal mechanism to 4750 notify the kernel agent that the AQL queue has been updated. These rules, and 4751 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA 4752 System Architecture Specification* [HSA]_. 47536. 
A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values,
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if it is not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  ..
table:: AMDHSA Memory Spaces 4780 :name: amdgpu-amdhsa-memory-spaces-table 4781 4782 ================= =========== ======== ======= ================== 4783 Memory Space Name HSA Segment Hardware Address NULL Value 4784 Name Name Size 4785 ================= =========== ======== ======= ================== 4786 Private private scratch 32 0x00000000 4787 Local group LDS 32 0xFFFFFFFF 4788 Global global global 64 0x0000000000000000 4789 Constant constant *same as 64 0x0000000000000000 4790 global* 4791 Generic flat flat 64 0x0000000000000000 4792 Region N/A GDS 32 *not implemented 4793 for AMDHSA* 4794 ================= =========== ======== ======= ================== 4795 4796The global and constant memory spaces both use global virtual addresses, which 4797are the same virtual address space used by the CPU. However, some virtual 4798addresses may only be accessible to the CPU, some only accessible by the GPU, 4799and some by both. 4800 4801Using the constant memory space indicates that the data will not change during 4802the execution of the kernel. This allows scalar read instructions to be 4803used. The vector and scalar L1 caches are invalidated of volatile data before 4804each kernel dispatch execution to allow constant memory to change values between 4805kernel dispatches. 4806 4807The local memory space uses the hardware Local Data Store (LDS) which is 4808automatically allocated when the hardware creates work-groups of wavefronts, and 4809freed when all the wavefronts of a work-group have terminated. The data store 4810(DS) instructions can be used to access it. 4811 4812The private memory space uses the hardware scratch memory support. If the kernel 4813uses scratch, then the hardware allocates memory that is accessed using 4814wavefront lane dword (4 byte) interleaving. 
The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per-wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX11.

The generic address space uses the hardware flat address support available in
GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures) that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address, the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with the
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For
GFX9-GFX11 the aperture base addresses are directly available as the inline
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In
64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to
2^32, which makes it easier to convert between flat and segment addresses.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP, such as those managing the allocation of scratch
memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.
4888 4889Code Object V3 Kernel Descriptor 4890++++++++++++++++++++++++++++++++ 4891 4892CP microcode requires the Kernel descriptor to be allocated on 64-byte 4893alignment. 4894 4895The fields used by CP for code objects before V3 also match those specified in 4896:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 4897 4898 .. table:: Code Object V3 Kernel Descriptor 4899 :name: amdgpu-amdhsa-kernel-descriptor-v3-table 4900 4901 ======= ======= =============================== ============================ 4902 Bits Size Field Name Description 4903 ======= ======= =============================== ============================ 4904 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local 4905 address space memory 4906 required for a work-group 4907 in bytes. This does not 4908 include any dynamically 4909 allocated local address 4910 space memory that may be 4911 added when the kernel is 4912 dispatched. 4913 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed 4914 private address space 4915 memory required for a 4916 work-item in bytes. When 4917 this cannot be predicted, 4918 code object v4 and older 4919 sets this value to be 4920 higher than the minimum 4921 requirement. 4922 95:64 4 bytes KERNARG_SIZE The size of the kernarg 4923 memory pointed to by the 4924 AQL dispatch packet. The 4925 kernarg memory is used to 4926 pass arguments to the 4927 kernel. 4928 4929 * If the kernarg pointer in 4930 the dispatch packet is NULL 4931 then there are no kernel 4932 arguments. 4933 * If the kernarg pointer in 4934 the dispatch packet is 4935 not NULL and this value 4936 is 0 then the kernarg 4937 memory size is 4938 unspecified. 4939 * If the kernarg pointer in 4940 the dispatch packet is 4941 not NULL and this value 4942 is not 0 then the value 4943 specifies the kernarg 4944 memory size in bytes. It 4945 is recommended to provide 4946 a value as it may be used 4947 by CP to optimize making 4948 the kernarg memory 4949 visible to the kernel 4950 code. 
4951 4952 127:96 4 bytes Reserved, must be 0. 4953 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly 4954 negative) from base 4955 address of kernel 4956 descriptor to kernel's 4957 entry point instruction 4958 which must be 256 byte 4959 aligned. 4960 351:192 20 Reserved, must be 0. 4961 bytes 4962 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9 4963 Reserved, must be 0. 4964 GFX90A, GFX940 4965 Compute Shader (CS) 4966 program settings used by 4967 CP to set up 4968 ``COMPUTE_PGM_RSRC3`` 4969 configuration 4970 register. See 4971 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 4972 GFX10-GFX11 4973 Compute Shader (CS) 4974 program settings used by 4975 CP to set up 4976 ``COMPUTE_PGM_RSRC3`` 4977 configuration 4978 register. See 4979 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`. 4980 GFX12 4981 Compute Shader (CS) 4982 program settings used by 4983 CP to set up 4984 ``COMPUTE_PGM_RSRC3`` 4985 configuration 4986 register. See 4987 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table`. 4988 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS) 4989 program settings used by 4990 CP to set up 4991 ``COMPUTE_PGM_RSRC1`` 4992 configuration 4993 register. See 4994 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`. 4995 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS) 4996 program settings used by 4997 CP to set up 4998 ``COMPUTE_PGM_RSRC2`` 4999 configuration 5000 register. See 5001 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`. 5002 458:448 7 bits *See separate bits below.* Enable the setup of the 5003 SGPR user data registers 5004 (see 5005 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 5006 5007 The total number of SGPR 5008 user data registers 5009 requested must not exceed 5010 16 and match value in 5011 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. 5012 Any requests beyond 16 5013 will be ignored. 
5014 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties* 5015 _BUFFER column of 5016 :ref:`amdgpu-processor-table` 5017 specifies *Architected flat 5018 scratch* then not supported 5019 and must be 0, 5020 >449 1 bit ENABLE_SGPR_DISPATCH_PTR 5021 >450 1 bit ENABLE_SGPR_QUEUE_PTR 5022 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR 5023 >452 1 bit ENABLE_SGPR_DISPATCH_ID 5024 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties* 5025 column of 5026 :ref:`amdgpu-processor-table` 5027 specifies *Architected flat 5028 scratch* then not supported 5029 and must be 0, 5030 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT 5031 _SIZE 5032 457:455 3 bits Reserved, must be 0. 5033 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9 5034 Reserved, must be 0. 5035 GFX10-GFX11 5036 - If 0 execute in 5037 wavefront size 64 mode. 5038 - If 1 execute in 5039 native wavefront size 5040 32 mode. 5041 459 1 bit USES_DYNAMIC_STACK Indicates if the generated 5042 machine code is using a 5043 dynamically sized stack. 5044 This is only set in code 5045 object v5 and later. 5046 463:460 4 bits Reserved, must be 0. 5047 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9 5048 - Reserved, must be 0. 5049 GFX90A, GFX940 5050 - The number of dwords from 5051 the kernarg segment to preload 5052 into User SGPRs before kernel 5053 execution. (see 5054 :ref:`amdgpu-amdhsa-kernarg-preload`). 5055 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9 5056 - Reserved, must be 0. 5057 GFX90A, GFX940 5058 - An offset in dwords into the 5059 kernarg segment to begin 5060 preloading data into User 5061 SGPRs. (see 5062 :ref:`amdgpu-amdhsa-kernarg-preload`). 5063 511:480 4 bytes Reserved, must be 0. 5064 512 **Total size 64 bytes.** 5065 ======= ==================================================================== 5066 5067.. 5068 5069 .. 
table:: compute_pgm_rsrc1 for GFX6-GFX12 5070 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table 5071 5072 ======= ======= =============================== =========================================================================== 5073 Bits Size Field Name Description 5074 ======= ======= =============================== =========================================================================== 5075 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register 5076 blocks used by each work-item; 5077 granularity is device 5078 specific: 5079 5080 GFX6-GFX9 5081 - vgprs_used 0..256 5082 - max(0, ceil(vgprs_used / 4) - 1) 5083 GFX90A, GFX940 5084 - vgprs_used 0..512 5085 - vgprs_used = align(arch_vgprs, 4) 5086 + acc_vgprs 5087 - max(0, ceil(vgprs_used / 8) - 1) 5088 GFX10-GFX12 (wavefront size 64) 5089 - max_vgpr 1..256 5090 - max(0, ceil(vgprs_used / 4) - 1) 5091 GFX10-GFX12 (wavefront size 32) 5092 - max_vgpr 1..256 5093 - max(0, ceil(vgprs_used / 8) - 1) 5094 5095 Where vgprs_used is defined 5096 as the highest VGPR number 5097 explicitly referenced plus 5098 one. 5099 5100 Used by CP to set up 5101 ``COMPUTE_PGM_RSRC1.VGPRS``. 5102 5103 The 5104 :ref:`amdgpu-assembler` 5105 calculates this 5106 automatically for the 5107 selected processor from 5108 values provided to the 5109 `.amdhsa_kernel` directive 5110 by the 5111 `.amdhsa_next_free_vgpr` 5112 nested directive (see 5113 :ref:`amdhsa-kernel-directives-table`). 5114 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register 5115 blocks used by a wavefront; 5116 granularity is device 5117 specific: 5118 5119 GFX6-GFX8 5120 - sgprs_used 0..112 5121 - max(0, ceil(sgprs_used / 8) - 1) 5122 GFX9 5123 - sgprs_used 0..112 5124 - 2 * max(0, ceil(sgprs_used / 16) - 1) 5125 GFX10-GFX12 5126 Reserved, must be 0. 5127 (128 SGPRs always 5128 allocated.) 

                                                    Where sgprs_used is
                                                    defined as the highest
                                                    SGPR number explicitly
                                                    referenced plus one, plus
                                                    a target specific number
                                                    of additional special
                                                    SGPRs for VCC,
                                                    FLAT_SCRATCH (GFX7+) and
                                                    XNACK_MASK (GFX8+), and
                                                    any additional
                                                    target specific
                                                    limitations. It does not
                                                    include the 16 SGPRs added
                                                    if a trap handler is
                                                    enabled.

                                                    The target specific
                                                    limitations and special
                                                    SGPR layout are defined in
                                                    the hardware
                                                    documentation, which can
                                                    be found in the
                                                    :ref:`amdgpu-processors`
                                                    table.

                                                    Used by CP to set up
                                                    ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                    The
                                                    :ref:`amdgpu-assembler`
                                                    calculates this
                                                    automatically for the
                                                    selected processor from
                                                    values provided to the
                                                    `.amdhsa_kernel` directive
                                                    by the
                                                    `.amdhsa_next_free_sgpr`
                                                    and `.amdhsa_reserve_*`
                                                    nested directives (see
                                                    :ref:`amdhsa-kernel-directives-table`).
    11:10   2 bits  PRIORITY                        Must be 0.

                                                    Start executing wavefront
                                                    at the specified priority.

                                                    CP is responsible for
                                                    filling in
                                                    ``COMPUTE_PGM_RSRC1.PRIORITY``.
    13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                    with the specified rounding
                                                    mode for single precision
                                                    (32-bit) floating point
                                                    operations.

                                                    Floating point rounding
                                                    mode values are defined in
                                                    :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                    Used by CP to set up
                                                    ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
    15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                    with the specified rounding
                                                    mode for half/double
                                                    precision (16-bit and
                                                    64-bit) floating point
                                                    operations.

                                                    Floating point rounding
                                                    mode values are defined in
                                                    :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
5201 5202 Used by CP to set up 5203 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 5204 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution 5205 with specified denorm mode 5206 for single (32 5207 bit) floating point 5208 precision floating point 5209 operations. 5210 5211 Floating point denorm mode 5212 values are defined in 5213 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 5214 5215 Used by CP to set up 5216 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 5217 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution 5218 with specified denorm mode 5219 for half/double (16 5220 and 64-bit) floating point 5221 precision floating point 5222 operations. 5223 5224 Floating point denorm mode 5225 values are defined in 5226 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 5227 5228 Used by CP to set up 5229 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 5230 20 1 bit PRIV Must be 0. 5231 5232 Start executing wavefront 5233 in privilege trap handler 5234 mode. 5235 5236 CP is responsible for 5237 filling in 5238 ``COMPUTE_PGM_RSRC1.PRIV``. 5239 21 1 bit ENABLE_DX10_CLAMP GFX9-GFX11 5240 Wavefront starts execution 5241 with DX10 clamp mode 5242 enabled. Used by the vector 5243 ALU to force DX10 style 5244 treatment of NaN's (when 5245 set, clamp NaN to zero, 5246 otherwise pass NaN 5247 through). 5248 5249 Used by CP to set up 5250 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. 5251 WG_RR_EN GFX12 5252 If 1, wavefronts are scheduled 5253 in a round-robin fashion with 5254 respect to the other wavefronts 5255 of the SIMD. Otherwise, wavefronts 5256 are scheduled in oldest age order. 5257 5258 CP is responsible for filling in 5259 ``COMPUTE_PGM_RSRC1.WG_RR_EN``. 5260 22 1 bit DEBUG_MODE Must be 0. 5261 5262 Start executing wavefront 5263 in single step mode. 5264 5265 CP is responsible for 5266 filling in 5267 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. 5268 23 1 bit ENABLE_IEEE_MODE GFX9-GFX11 5269 Wavefront starts execution 5270 with IEEE mode 5271 enabled. 
Floating point
                                                    opcodes that support
                                                    exception flag gathering
                                                    will quiet and propagate
                                                    signaling-NaN inputs per
                                                    IEEE 754-2008. Min_dx10 and
                                                    max_dx10 become IEEE
                                                    754-2008 compliant due to
                                                    signaling-NaN propagation
                                                    and quieting.

                                                    Used by CP to set up
                                                    ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
                    DISABLE_PERF                    GFX12
                                                      Reserved. Must be 0.
    24      1 bit   BULKY                           Must be 0.

                                                    Only one work-group allowed
                                                    to execute on a compute
                                                    unit.

                                                    CP is responsible for
                                                    filling in
                                                    ``COMPUTE_PGM_RSRC1.BULKY``.
    25      1 bit   CDBG_USER                       Must be 0.

                                                    Flag that can be used to
                                                    control debugging code.

                                                    CP is responsible for
                                                    filling in
                                                    ``COMPUTE_PGM_RSRC1.CDBG_USER``.
    26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                      Reserved, must be 0.
                                                    GFX9-GFX12
                                                      Wavefront starts execution
                                                      with the specified fp16
                                                      overflow mode.

                                                      - If 0, fp16 overflow generates
                                                        +/-INF values.
                                                      - If 1, fp16 overflow that is the
                                                        result of a +/-INF input value
                                                        or divide by 0 produces a +/-INF,
                                                        otherwise clamps the computed
                                                        overflow to +/-MAX_FP16 as
                                                        appropriate.

                                                      Used by CP to set up
                                                      ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
    28:27   2 bits                                  Reserved, must be 0.
    29      1 bit   WGP_MODE                        GFX6-GFX9
                                                      Reserved, must be 0.
                                                    GFX10-GFX12
                                                      - If 0 execute work-groups in
                                                        CU wavefront execution mode.
                                                      - If 1 execute work-groups in
                                                        WGP wavefront execution mode.

                                                      See :ref:`amdgpu-amdhsa-memory-model`.

                                                      Used by CP to set up
                                                      ``COMPUTE_PGM_RSRC1.WGP_MODE``.
    30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                      Reserved, must be 0.
                                                    GFX10-GFX12
                                                      Controls the behavior of the
                                                      s_waitcnt's vmcnt and vscnt
                                                      counters.
5340 5341 - If 0 vmcnt reports completion 5342 of load and atomic with return 5343 out of order with sample 5344 instructions, and the vscnt 5345 reports the completion of 5346 store and atomic without 5347 return in order. 5348 - If 1 vmcnt reports completion 5349 of load, atomic with return 5350 and sample instructions in 5351 order, and the vscnt reports 5352 the completion of store and 5353 atomic without return in order. 5354 5355 Used by CP to set up 5356 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``. 5357 31 1 bit FWD_PROGRESS GFX6-GFX9 5358 Reserved, must be 0. 5359 GFX10-GFX12 5360 - If 0 execute SIMD wavefronts 5361 using oldest first policy. 5362 - If 1 execute SIMD wavefronts to 5363 ensure wavefronts will make some 5364 forward progress. 5365 5366 Used by CP to set up 5367 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. 5368 32 **Total size 4 bytes** 5369 ======= =================================================================================================================== 5370 5371.. 5372 5373 .. table:: compute_pgm_rsrc2 for GFX6-GFX12 5374 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table 5375 5376 ======= ======= =============================== =========================================================================== 5377 Bits Size Field Name Description 5378 ======= ======= =============================== =========================================================================== 5379 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the 5380 private segment. 5381 * If the *Target Properties* 5382 column of 5383 :ref:`amdgpu-processor-table` 5384 does not specify 5385 *Architected flat 5386 scratch* then enable the 5387 setup of the SGPR 5388 wavefront scratch offset 5389 system register (see 5390 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 
5391 * If the *Target Properties* 5392 column of 5393 :ref:`amdgpu-processor-table` 5394 specifies *Architected 5395 flat scratch* then enable 5396 the setup of the 5397 FLAT_SCRATCH register 5398 pair (see 5399 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 5400 5401 Used by CP to set up 5402 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. 5403 5:1 5 bits USER_SGPR_COUNT The total number of SGPR 5404 user data 5405 registers requested. This 5406 number must be greater than 5407 or equal to the number of user 5408 data registers enabled. 5409 5410 Used by CP to set up 5411 ``COMPUTE_PGM_RSRC2.USER_SGPR``. 5412 6 1 bit ENABLE_TRAP_HANDLER GFX6-GFX11 5413 Must be 0. 5414 5415 This bit represents 5416 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, 5417 which is set by the CP if 5418 the runtime has installed a 5419 trap handler. 5420 GFX12 5421 Reserved, must be 0. 5422 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 5423 system SGPR register for 5424 the work-group id in the X 5425 dimension (see 5426 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 5427 5428 Used by CP to set up 5429 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 5430 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 5431 system SGPR register for 5432 the work-group id in the Y 5433 dimension (see 5434 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 5435 5436 Used by CP to set up 5437 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 5438 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 5439 system SGPR register for 5440 the work-group id in the Z 5441 dimension (see 5442 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 5443 5444 Used by CP to set up 5445 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 5446 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 5447 system SGPR register for 5448 work-group information (see 5449 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 5450 5451 Used by CP to set up 5452 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 
    12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                    VGPR system registers used
                                                    for the work-item ID.
                                                    :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                    defines the values.

                                                    Used by CP to set up
                                                    ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
    13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                    Wavefront starts execution
                                                    with address watch
                                                    exceptions enabled, which
                                                    are generated when L1 has
                                                    witnessed a thread access
                                                    an *address of
                                                    interest*.

                                                    CP is responsible for
                                                    filling in the address
                                                    watch bit in
                                                    ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                    according to what the
                                                    runtime requests.
    14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                    Wavefront starts execution
                                                    with memory violation
                                                    exceptions enabled, which
                                                    are generated when a memory
                                                    violation has occurred for
                                                    this wavefront from
                                                    L1 or LDS
                                                    (write-to-read-only-memory,
                                                    mis-aligned atomic, LDS
                                                    address out of range,
                                                    illegal address, etc.).

                                                    CP sets the memory
                                                    violation bit in
                                                    ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                    according to what the
                                                    runtime requests.
    23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                    CP uses the rounded value
                                                    from the dispatch packet,
                                                    not this value, as the
                                                    dispatch may contain
                                                    dynamically allocated group
                                                    segment memory. CP writes
                                                    directly to
                                                    ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                    Amount of group segment
                                                    (LDS) to allocate for each
                                                    work-group. Granularity is
                                                    device specific:

                                                    GFX6
                                                      roundup(lds-size / (64 * 4))
                                                    GFX7-GFX11
                                                      roundup(lds-size / (128 * 4))
                                                    GFX950
                                                      roundup(lds-size / (320 * 4))

    24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
            _INVALID_OPERATION                      with the specified exceptions
                                                    enabled.

                                                    Used by CP to set up
                                                    ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                    (set from bits 0..6).

                                                    IEEE 754 FP Invalid
                                                    Operation
    25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
            _SOURCE                                 input operands is a
                                                    denormal number
    26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
            _DIVISION_BY_ZERO                       Zero
    27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
            _OVERFLOW
    28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
            _UNDERFLOW
    29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
            _INEXACT
    30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
            _ZERO                                   (rcp_iflag_f32 instruction
                                                    only)
    31      1 bit   RESERVED                        Reserved, must be 0.
    32              **Total size 4 bytes.**
    ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15      Reserved, must be 0.
5566 bits 5567 32 **Total size 4 bytes.** 5568 ======= =================================================================================================================== 5569 5570.. 5571 5572 .. table:: compute_pgm_rsrc3 for GFX10-GFX11 5573 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table 5574 5575 ======= ======= =============================== =========================================================================== 5576 Bits Size Field Name Description 5577 ======= ======= =============================== =========================================================================== 5578 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For 5579 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity 5580 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does 5581 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0. 5582 9:4 6 bits INST_PREF_SIZE GFX10 5583 Reserved, must be 0. 5584 GFX11 5585 Number of instruction bytes to prefetch, starting at the kernel's entry 5586 point instruction, before wavefront starts execution. The value is 0..63 5587 with a granularity of 128 bytes. 5588 10 1 bit TRAP_ON_START GFX10 5589 Reserved, must be 0. 5590 GFX11 5591 Must be 0. 5592 5593 If 1, wavefront starts execution by trapping into the trap handler. 5594 5595 CP is responsible for filling in the trap on start bit in 5596 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime 5597 requests. 5598 11 1 bit TRAP_ON_END GFX10 5599 Reserved, must be 0. 5600 GFX11 5601 Must be 0. 5602 5603 If 1, wavefront execution terminates by trapping into the trap handler. 5604 5605 CP is responsible for filling in the trap on end bit in 5606 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests. 5607 30:12 19 bits Reserved, must be 0. 5608 31 1 bit IMAGE_OP GFX10 5609 Reserved, must be 0. 5610 GFX11 5611 If 1, the kernel execution contains image instructions. 
                                                     If executed as
                                                     part of a graphics pipeline, image read instructions will stall waiting
                                                     for any necessary ``WAIT_SYNC`` fence to be performed in order to
                                                     indicate that earlier pipeline stages have completed writing to the
                                                     image.

                                                     Not used for compute kernels that are not part of a graphics pipeline and
                                                     must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX12
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  RESERVED                        Reserved, must be 0.
     11:4    8 bits  INST_PREF_SIZE                  Number of instruction bytes to prefetch, starting at the kernel's entry
                                                     point instruction, before wavefront starts execution. The value is 0..255
                                                     with a granularity of 128 bytes.
     12      1 bit   RESERVED                        Reserved, must be 0.
     13      1 bit   GLG_EN                          If 1, group launch guarantee will be enabled for this dispatch.
     30:14   17 bits RESERVED                        Reserved, must be 0.
     31      1 bit   IMAGE_OP                        If 1, the kernel execution contains image instructions. If executed as
                                                     part of a graphics pipeline, image read instructions will stall waiting
                                                     for any necessary ``WAIT_SYNC`` fence to be performed in order to
                                                     indicate that earlier pipeline stages have completed writing to the
                                                     image.

                                                     Not used for compute kernels that are not part of a graphics pipeline and
                                                     must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================


  .. table:: Extended FLT_ROUNDS Enumeration Values
     :name: amdgpu-rounding-mode-enumeration-values-table

     +------------------------+---------------+-------------------+--------------------+----------+
     |                        | F32 NEAR_EVEN | F32 PLUS_INFINITY | F32 MINUS_INFINITY | F32 ZERO |
     +------------------------+---------------+-------------------+--------------------+----------+
     | F64/F16 NEAR_EVEN      | 1             | 11                | 14                 | 17       |
     +------------------------+---------------+-------------------+--------------------+----------+
     | F64/F16 PLUS_INFINITY  | 8             | 2                 | 15                 | 18       |
     +------------------------+---------------+-------------------+--------------------+----------+
     | F64/F16 MINUS_INFINITY | 9             | 12                | 3                  | 19       |
     +------------------------+---------------+-------------------+--------------------+----------+
     | F64/F16 ZERO           | 10            | 13                | 16                 | 0        |
     +------------------------+---------------+-------------------+--------------------+----------+

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ====================================
     Enumeration Name                       Value Description
     ====================================== ===== ====================================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ====================================

  Denormal flushing is sign respecting, i.e. the behavior expected by
  ``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
  ``"denormal-fp-math"="positive-zero"``.

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.
The order of the SGPR registers is defined, but the compiler can specify which
ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1, etc.; disabled registers do not have
an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
                _segment_buffer)
     then       Dispatch Ptr               2      64-bit address of AQL dispatch
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                  actually executing.
     then       Queue Ptr                  2      64-bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                  the dispatch packet was
                                                  queued.
     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the kernel
                                                  dispatch packet.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.
     then       Dispatch Id                2      64-bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      See
                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
                _init)
     then       Private Segment Size       1      The 32-bit byte size of a
                (enable_sgpr_private              single work-item's memory
                _segment_size)                    allocation. This is the
                                                  value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded up
                                                  by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the same
                                                  value as the second SGPR of
                                                  Flat Scratch Init. However, it
                                                  may be needed for GFX9-GFX11 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
     then       Preloaded Kernargs         N/A    See
                (kernarg_preload_spec             :ref:`amdgpu-amdhsa-kernarg-preload`.
                _length)
     then       Work-Group Id X            1      32-bit work-group id in X
                (enable_sgpr_workgroup_id         dimension of grid for
                _X)                               wavefront.
     then       Work-Group Id Y            1      32-bit work-group id in Y
                (enable_sgpr_workgroup_id         dimension of grid for
                _Y)                               wavefront.
     then       Work-Group Id Z            1      32-bit work-group id in Z
                (enable_sgpr_workgroup_id         dimension of grid for
                _Z)                               wavefront.
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
                (enable_sgpr_workgroup            ordered_append_term[10:0],
                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch` and
                _segment_wavefront_offset)        :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
     ========== ========================== ====== ==============================

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1, etc.; disabled registers do not have a
VGPR number.

There are different methods used for the VGPR initial state:

* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies otherwise, a separate VGPR register is used per work-item ID. The
  VGPR register initial state for this method is defined in
  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of the VGPR0 register is
  used for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.

  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32-bit work-item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32-bit work-item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32-bit work-item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================

..

  .. table:: Register Layout for Packed Work-Item ID Method
     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table

     ======= ======= ================ =========================================
     Bits    Size    Field Name       Description
     ======= ======= ================ =========================================
     0:9     10 bits Work-Item Id X   Work-item id in X
                                      dimension of work-group for
                                      wavefront lane.

                                      Always initialized.

     10:19   10 bits Work-Item Id Y   Work-item id in Y
                                      dimension of work-group for
                                      wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      0, otherwise set to 0.
     20:29   10 bits Work-Item Id Z   Work-item id in Z
                                      dimension of work-group for
                                      wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      1, otherwise set to 0.
     30:31   2 bits                   Reserved, set to 0.
     ======= ======= ================ =========================================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why
   its value cannot be included with the flat scratch init value which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5. Flat Scratch register pair initialization is described in
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
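For targets using the packed method, the three work-item IDs can be recovered
from the initial VGPR0 value with simple shifts and masks. The following is an
illustrative sketch of the decoding implied by the layout table above (the
helper name is hypothetical, not part of any API):

```python
def unpack_workitem_ids(vgpr0: int) -> tuple[int, int, int]:
    """Decode packed work-item IDs from the initial VGPR0 value.

    Per the packed layout table: bits 0:9 hold the X id, bits 10:19 the
    Y id, and bits 20:29 the Z id; each field is 10 bits wide and bits
    30:31 are reserved.
    """
    x = vgpr0 & 0x3FF
    y = (vgpr0 >> 10) & 0x3FF
    z = (vgpr0 >> 20) & 0x3FF
    return x, y, z


# Example: X=5, Y=2, Z=1 packed into a single 32-bit value.
packed = 5 | (2 << 10) | (1 << 20)
assert unpack_workitem_ids(packed) == (5, 2, 1)
```

Note that when `enable_vgpr_workitem_id` does not request the Y or Z fields,
the hardware sets those bits to 0 and the same decoding remains valid.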
The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
instructions (GFX9-GFX11).

If buffer operations are used, then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC for
  APU and NC for dGPU).

.. _amdgpu-amdhsa-kernarg-preload:

Preloaded Kernel Arguments
++++++++++++++++++++++++++

On hardware that supports this feature, kernel arguments can be preloaded into
User SGPRs, up to the maximum number of User SGPRs available. The allocation of
Preload SGPRs occurs directly after the last enabled non-kernarg preload User
SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

The data preloaded is copied from the kernarg segment; the amount of data is
determined by the value specified in the kernarg_preload_spec_length field of
the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
number of SGPRs receiving preloaded kernarg data corresponds with the value
given by kernarg_preload_spec_length. The preloading starts at the dword offset
within the kernarg segment, which is specified by the
kernarg_preload_spec_offset field.

If the kernarg_preload_spec_length is non-zero, the CP firmware will append an
additional 256 bytes to the kernel_code_entry_byte_offset. This addition
facilitates the incorporation of a prologue to the kernel entry to handle cases
where code designed for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware, the 256 bytes at the
start of the kernel entry will be skipped.
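The entry-point adjustment described above can be modeled as a small sketch.
This is illustrative only, derived from the paragraph above; the function name
is hypothetical and not part of any API:

```python
# Bytes CP firmware appends when kernarg preloading is requested, so a
# backwards-compatibility prologue can run on firmware without preload
# support (compatible firmware skips these bytes).
BACKWARDS_COMPAT_PROLOGUE_SIZE = 256


def effective_entry_offset(kernel_code_entry_byte_offset: int,
                           kernarg_preload_spec_length: int) -> int:
    """Model the entry offset used when kernarg preloading is requested."""
    if kernarg_preload_spec_length != 0:
        return kernel_code_entry_byte_offset + BACKWARDS_COMPAT_PROLOGUE_SIZE
    return kernel_code_entry_byte_offset


# No preloading requested: the entry offset is unchanged.
assert effective_entry_offset(0x1000, 0) == 0x1000
# Preloading requested: 256 bytes are appended.
assert effective_entry_offset(0x1000, 4) == 0x1000 + 256
```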
With code object V5 and later, hidden kernel arguments that are normally
accessed through the Implicit Argument Ptr may be preloaded into User SGPRs.
These arguments are added to the kernel function signature and are marked with
the attributes "inreg" and "amdgpu-hidden-argument" (see
:ref:`amdgpu-llvm-ir-attributes-table`).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

The compiler performs initialization in the kernel prologue depending on the
target and information about things like stack usage in the kernel and called
functions. Some of this initialization requires the compiler to request certain
User and System SGPRs be present in the
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
:ref:`amdgpu-amdhsa-kernel-descriptor`.

.. _amdgpu-amdhsa-kernel-prolog-cfi:

CFI
+++

1. The CFI return address is undefined.

2. The CFI CFA is defined using an expression which evaluates to a location
   description that comprises one memory location description for the
   ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.

.. _amdgpu-amdhsa-kernel-prolog-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. Total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible value of LDS for the given target (0x7FFF for GFX6 and
  0xFFFF for GFX7-GFX8).
GFX9-GFX11
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:

Stack Pointer
+++++++++++++

If the kernel has function calls, it must set up the ABI stack pointer described
in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.

.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:

Frame Pointer
+++++++++++++

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required then all uses of the frame
pointer are replaced with immediate ``0`` offsets.

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

Flat Scratch
++++++++++++

There are different methods used for initializing flat scratch:

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Does not support generic address space*:

  Flat scratch is not supported and there is no flat scratch register pair.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
  Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime. (The Scratch Segment Buffer base address
     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with Private
     segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
     SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
     checks that the value in the kernel dispatch packet Private Segment Byte
     Size is not larger, and requests the runtime to increase the queue's
     scratch size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte Size
     field and rounds up to a multiple of DWORD. Having CP load it once avoids
     loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
     in flat memory instructions.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with Private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Architected flat scratch*:

  If ENABLE_PRIVATE_SEGMENT is enabled in
  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table` then the FLAT_SCRATCH
  register pair will be initialized to the 64-bit address of the base of scratch
  backing memory being managed by SPI for the queue executing the kernel
  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
  flat scratch base in flat memory instructions.

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
*Architected flat scratch* then a Private Segment Buffer is not supported.
Instead the flat SCRATCH instructions are used.

Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by the
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to workaround the bug.

  .. note::

    This approach of using a tentative scratch V# and shifting the register
    numbers if used avoids having to perform register allocation a second
    time if the tentative V# is eliminated. This is more efficient and
    avoids the problem that the second register allocation may perform
    spilling which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve performance.
Only the instructions related to the memory model are given; additional
``s_waitcnt`` instructions are required to ensure registers are defined before
being used. These may be able to be combined with the memory model ``s_waitcnt``
instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model which has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``memfence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.
Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores, atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

The code sequences used to implement the memory model are defined in the
following sections:

* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx942`
* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
* :ref:`amdgpu-amdhsa-memory-model-gfx12`

.. _amdgpu-fence-as:

Fence and Address Spaces
++++++++++++++++++++++++

LLVM fences do not have address space information, thus fence
codegen usually needs to conservatively synchronize all address spaces.

In the case of OpenCL, where fences only need to synchronize
user-specified address spaces, this can result in extra unnecessary waits.
For instance, a fence that is supposed to only synchronize local memory will
also have to wait on all global memory operations, which is unnecessary.

:doc:`Memory Model Relaxation Annotations <MemoryModelRelaxationAnnotations>` can
be used as an optimization hint for fences to solve this problem.
The AMDGPU backend recognizes the following tags on fences:

- ``amdgpu-as:local`` - fence only the local address space
- ``amdgpu-as:global`` - fence only the global address space

.. note::

  As an optimization hint, those tags are not guaranteed to survive until
  code generation. Optimizations are free to drop the tags to allow for
  better code optimization, at the cost of synchronizing additional address
  spaces.

.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:

Memory Model GFX6-GFX9
++++++++++++++++++++++

For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. An ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. An ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile).
Since the private address space is 6375only accessed by a single thread, and is always write-before-read, there is 6376never a need to invalidate these entries from the L1 cache. Hence all cache 6377invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 6378 6379The code sequences used to implement the memory model for GFX6-GFX9 are defined 6380in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. 6381 6382 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 6383 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 6384 6385 ============ ============ ============== ========== ================================ 6386 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 6387 Ordering Sync Scope Address GFX6-GFX9 6388 Space 6389 ============ ============ ============== ========== ================================ 6390 **Non-Atomic** 6391 ------------------------------------------------------------------------------------ 6392 load *none* *none* - global - !volatile & !nontemporal 6393 - generic 6394 - private 1. buffer/global/flat_load 6395 - constant 6396 - !volatile & nontemporal 6397 6398 1. buffer/global/flat_load 6399 glc=1 slc=1 6400 6401 - volatile 6402 6403 1. buffer/global/flat_load 6404 glc=1 6405 2. s_waitcnt vmcnt(0) 6406 6407 - Must happen before 6408 any following volatile 6409 global/generic 6410 load/store. 6411 - Ensures that 6412 volatile 6413 operations to 6414 different 6415 addresses will not 6416 be reordered by 6417 hardware. 6418 6419 load *none* *none* - local 1. ds_load 6420 store *none* *none* - global - !volatile & !nontemporal 6421 - generic 6422 - private 1. buffer/global/flat_store 6423 - constant 6424 - !volatile & nontemporal 6425 6426 1. buffer/global/flat_store 6427 glc=1 slc=1 6428 6429 - volatile 6430 6431 1. buffer/global/flat_store 6432 2. s_waitcnt vmcnt(0) 6433 6434 - Must happen before 6435 any following volatile 6436 global/generic 6437 load/store. 
6438 - Ensures that 6439 volatile 6440 operations to 6441 different 6442 addresses will not 6443 be reordered by 6444 hardware. 6445 6446 store *none* *none* - local 1. ds_store 6447 **Unordered Atomic** 6448 ------------------------------------------------------------------------------------ 6449 load atomic unordered *any* *any* *Same as non-atomic*. 6450 store atomic unordered *any* *any* *Same as non-atomic*. 6451 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 6452 **Monotonic Atomic** 6453 ------------------------------------------------------------------------------------ 6454 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load 6455 - wavefront - local 6456 - workgroup - generic 6457 load atomic monotonic - agent - global 1. buffer/global/flat_load 6458 - system - generic glc=1 6459 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 6460 - wavefront - generic 6461 - workgroup 6462 - agent 6463 - system 6464 store atomic monotonic - singlethread - local 1. ds_store 6465 - wavefront 6466 - workgroup 6467 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 6468 - wavefront - generic 6469 - workgroup 6470 - agent 6471 - system 6472 atomicrmw monotonic - singlethread - local 1. ds_atomic 6473 - wavefront 6474 - workgroup 6475 **Acquire Atomic** 6476 ------------------------------------------------------------------------------------ 6477 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 6478 - wavefront - local 6479 - generic 6480 load atomic acquire - workgroup - global 1. buffer/global_load 6481 load atomic acquire - workgroup - local 1. ds/flat_load 6482 - generic 2. s_waitcnt lgkmcnt(0) 6483 6484 - If OpenCL, omit. 6485 - Must happen before 6486 any following 6487 global/generic 6488 load/load 6489 atomic/store/store 6490 atomic/atomicrmw. 
6491 - Ensures any 6492 following global 6493 data read is no 6494 older than a local load 6495 atomic value being 6496 acquired. 6497 6498 load atomic acquire - agent - global 1. buffer/global_load 6499 - system glc=1 6500 2. s_waitcnt vmcnt(0) 6501 6502 - Must happen before 6503 following 6504 buffer_wbinvl1_vol. 6505 - Ensures the load 6506 has completed 6507 before invalidating 6508 the cache. 6509 6510 3. buffer_wbinvl1_vol 6511 6512 - Must happen before 6513 any following 6514 global/generic 6515 load/load 6516 atomic/atomicrmw. 6517 - Ensures that 6518 following 6519 loads will not see 6520 stale global data. 6521 6522 load atomic acquire - agent - generic 1. flat_load glc=1 6523 - system 2. s_waitcnt vmcnt(0) & 6524 lgkmcnt(0) 6525 6526 - If OpenCL omit 6527 lgkmcnt(0). 6528 - Must happen before 6529 following 6530 buffer_wbinvl1_vol. 6531 - Ensures the flat_load 6532 has completed 6533 before invalidating 6534 the cache. 6535 6536 3. buffer_wbinvl1_vol 6537 6538 - Must happen before 6539 any following 6540 global/generic 6541 load/load 6542 atomic/atomicrmw. 6543 - Ensures that 6544 following loads 6545 will not see stale 6546 global data. 6547 6548 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 6549 - wavefront - local 6550 - generic 6551 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 6552 atomicrmw acquire - workgroup - local 1. ds/flat_atomic 6553 - generic 2. s_waitcnt lgkmcnt(0) 6554 6555 - If OpenCL, omit. 6556 - Must happen before 6557 any following 6558 global/generic 6559 load/load 6560 atomic/store/store 6561 atomic/atomicrmw. 6562 - Ensures any 6563 following global 6564 data read is no 6565 older than a local 6566 atomicrmw value 6567 being acquired. 6568 6569 atomicrmw acquire - agent - global 1. buffer/global_atomic 6570 - system 2. s_waitcnt vmcnt(0) 6571 6572 - Must happen before 6573 following 6574 buffer_wbinvl1_vol. 
6575 - Ensures the 6576 atomicrmw has 6577 completed before 6578 invalidating the 6579 cache. 6580 6581 3. buffer_wbinvl1_vol 6582 6583 - Must happen before 6584 any following 6585 global/generic 6586 load/load 6587 atomic/atomicrmw. 6588 - Ensures that 6589 following loads 6590 will not see stale 6591 global data. 6592 6593 atomicrmw acquire - agent - generic 1. flat_atomic 6594 - system 2. s_waitcnt vmcnt(0) & 6595 lgkmcnt(0) 6596 6597 - If OpenCL, omit 6598 lgkmcnt(0). 6599 - Must happen before 6600 following 6601 buffer_wbinvl1_vol. 6602 - Ensures the 6603 atomicrmw has 6604 completed before 6605 invalidating the 6606 cache. 6607 6608 3. buffer_wbinvl1_vol 6609 6610 - Must happen before 6611 any following 6612 global/generic 6613 load/load 6614 atomic/atomicrmw. 6615 - Ensures that 6616 following loads 6617 will not see stale 6618 global data. 6619 6620 fence acquire - singlethread *none* *none* 6621 - wavefront 6622 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 6623 6624 - If OpenCL and 6625 address space is 6626 not generic, omit. 6627 - See :ref:`amdgpu-fence-as` for 6628 more details on fencing specific 6629 address spaces. 6630 - Must happen after 6631 any preceding 6632 local/generic load 6633 atomic/atomicrmw 6634 with an equal or 6635 wider sync scope 6636 and memory ordering 6637 stronger than 6638 unordered (this is 6639 termed the 6640 fence-paired-atomic). 6641 - Must happen before 6642 any following 6643 global/generic 6644 load/load 6645 atomic/store/store 6646 atomic/atomicrmw. 6647 - Ensures any 6648 following global 6649 data read is no 6650 older than the 6651 value read by the 6652 fence-paired-atomic. 6653 6654 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 6655 - system vmcnt(0) 6656 6657 - If OpenCL and 6658 address space is 6659 not generic, omit 6660 lgkmcnt(0). 6661 - See :ref:`amdgpu-fence-as` for 6662 more details on fencing specific 6663 address spaces. 
6664 - Could be split into 6665 separate s_waitcnt 6666 vmcnt(0) and 6667 s_waitcnt 6668 lgkmcnt(0) to allow 6669 them to be 6670 independently moved 6671 according to the 6672 following rules. 6673 - s_waitcnt vmcnt(0) 6674 must happen after 6675 any preceding 6676 global/generic load 6677 atomic/atomicrmw 6678 with an equal or 6679 wider sync scope 6680 and memory ordering 6681 stronger than 6682 unordered (this is 6683 termed the 6684 fence-paired-atomic). 6685 - s_waitcnt lgkmcnt(0) 6686 must happen after 6687 any preceding 6688 local/generic load 6689 atomic/atomicrmw 6690 with an equal or 6691 wider sync scope 6692 and memory ordering 6693 stronger than 6694 unordered (this is 6695 termed the 6696 fence-paired-atomic). 6697 - Must happen before 6698 the following 6699 buffer_wbinvl1_vol. 6700 - Ensures that the 6701 fence-paired atomic 6702 has completed 6703 before invalidating 6704 the 6705 cache. Therefore 6706 any following 6707 locations read must 6708 be no older than 6709 the value read by 6710 the 6711 fence-paired-atomic. 6712 6713 2. buffer_wbinvl1_vol 6714 6715 - Must happen before any 6716 following global/generic 6717 load/load 6718 atomic/store/store 6719 atomic/atomicrmw. 6720 - Ensures that 6721 following loads 6722 will not see stale 6723 global data. 6724 6725 **Release Atomic** 6726 ------------------------------------------------------------------------------------ 6727 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 6728 - wavefront - local 6729 - generic 6730 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 6731 - generic 6732 - If OpenCL, omit. 6733 - Must happen after 6734 any preceding 6735 local/generic 6736 load/store/load 6737 atomic/store 6738 atomic/atomicrmw. 6739 - Must happen before 6740 the following 6741 store. 6742 - Ensures that all 6743 memory operations 6744 to local have 6745 completed before 6746 performing the 6747 store that is being 6748 released. 6749 6750 2. 
buffer/global/flat_store 6751 store atomic release - workgroup - local 1. ds_store 6752 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 6753 - system - generic vmcnt(0) 6754 6755 - If OpenCL and 6756 address space is 6757 not generic, omit 6758 lgkmcnt(0). 6759 - Could be split into 6760 separate s_waitcnt 6761 vmcnt(0) and 6762 s_waitcnt 6763 lgkmcnt(0) to allow 6764 them to be 6765 independently moved 6766 according to the 6767 following rules. 6768 - s_waitcnt vmcnt(0) 6769 must happen after 6770 any preceding 6771 global/generic 6772 load/store/load 6773 atomic/store 6774 atomic/atomicrmw. 6775 - s_waitcnt lgkmcnt(0) 6776 must happen after 6777 any preceding 6778 local/generic 6779 load/store/load 6780 atomic/store 6781 atomic/atomicrmw. 6782 - Must happen before 6783 the following 6784 store. 6785 - Ensures that all 6786 memory operations 6787 to memory have 6788 completed before 6789 performing the 6790 store that is being 6791 released. 6792 6793 2. buffer/global/flat_store 6794 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 6795 - wavefront - local 6796 - generic 6797 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 6798 - generic 6799 - If OpenCL, omit. 6800 - Must happen after 6801 any preceding 6802 local/generic 6803 load/store/load 6804 atomic/store 6805 atomic/atomicrmw. 6806 - Must happen before 6807 the following 6808 atomicrmw. 6809 - Ensures that all 6810 memory operations 6811 to local have 6812 completed before 6813 performing the 6814 atomicrmw that is 6815 being released. 6816 6817 2. buffer/global/flat_atomic 6818 atomicrmw release - workgroup - local 1. ds_atomic 6819 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 6820 - system - generic vmcnt(0) 6821 6822 - If OpenCL, omit 6823 lgkmcnt(0). 
6824 - Could be split into 6825 separate s_waitcnt 6826 vmcnt(0) and 6827 s_waitcnt 6828 lgkmcnt(0) to allow 6829 them to be 6830 independently moved 6831 according to the 6832 following rules. 6833 - s_waitcnt vmcnt(0) 6834 must happen after 6835 any preceding 6836 global/generic 6837 load/store/load 6838 atomic/store 6839 atomic/atomicrmw. 6840 - s_waitcnt lgkmcnt(0) 6841 must happen after 6842 any preceding 6843 local/generic 6844 load/store/load 6845 atomic/store 6846 atomic/atomicrmw. 6847 - Must happen before 6848 the following 6849 atomicrmw. 6850 - Ensures that all 6851 memory operations 6852 to global and local 6853 have completed 6854 before performing 6855 the atomicrmw that 6856 is being released. 6857 6858 2. buffer/global/flat_atomic 6859 fence release - singlethread *none* *none* 6860 - wavefront 6861 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 6862 6863 - If OpenCL and 6864 address space is 6865 not generic, omit. 6866 - See :ref:`amdgpu-fence-as` for 6867 more details on fencing specific 6868 address spaces. 6869 - Must happen after 6870 any preceding 6871 local/generic 6872 load/load 6873 atomic/store/store 6874 atomic/atomicrmw. 6875 - Must happen before 6876 any following store 6877 atomic/atomicrmw 6878 with an equal or 6879 wider sync scope 6880 and memory ordering 6881 stronger than 6882 unordered (this is 6883 termed the 6884 fence-paired-atomic). 6885 - Ensures that all 6886 memory operations 6887 to local have 6888 completed before 6889 performing the 6890 following 6891 fence-paired-atomic. 6892 6893 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 6894 - system vmcnt(0) 6895 6896 - If OpenCL and 6897 address space is 6898 not generic, omit 6899 lgkmcnt(0). 6900 - If OpenCL and 6901 address space is 6902 local, omit 6903 vmcnt(0). 6904 - See :ref:`amdgpu-fence-as` for 6905 more details on fencing specific 6906 address spaces. 
6907 - Could be split into 6908 separate s_waitcnt 6909 vmcnt(0) and 6910 s_waitcnt 6911 lgkmcnt(0) to allow 6912 them to be 6913 independently moved 6914 according to the 6915 following rules. 6916 - s_waitcnt vmcnt(0) 6917 must happen after 6918 any preceding 6919 global/generic 6920 load/store/load 6921 atomic/store 6922 atomic/atomicrmw. 6923 - s_waitcnt lgkmcnt(0) 6924 must happen after 6925 any preceding 6926 local/generic 6927 load/store/load 6928 atomic/store 6929 atomic/atomicrmw. 6930 - Must happen before 6931 any following store 6932 atomic/atomicrmw 6933 with an equal or 6934 wider sync scope 6935 and memory ordering 6936 stronger than 6937 unordered (this is 6938 termed the 6939 fence-paired-atomic). 6940 - Ensures that all 6941 memory operations 6942 have 6943 completed before 6944 performing the 6945 following 6946 fence-paired-atomic. 6947 6948 **Acquire-Release Atomic** 6949 ------------------------------------------------------------------------------------ 6950 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 6951 - wavefront - local 6952 - generic 6953 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 6954 6955 - If OpenCL, omit. 6956 - Must happen after 6957 any preceding 6958 local/generic 6959 load/store/load 6960 atomic/store 6961 atomic/atomicrmw. 6962 - Must happen before 6963 the following 6964 atomicrmw. 6965 - Ensures that all 6966 memory operations 6967 to local have 6968 completed before 6969 performing the 6970 atomicrmw that is 6971 being released. 6972 6973 2. buffer/global_atomic 6974 6975 atomicrmw acq_rel - workgroup - local 1. ds_atomic 6976 2. s_waitcnt lgkmcnt(0) 6977 6978 - If OpenCL, omit. 6979 - Must happen before 6980 any following 6981 global/generic 6982 load/load 6983 atomic/store/store 6984 atomic/atomicrmw. 6985 - Ensures any 6986 following global 6987 data read is no 6988 older than the local load 6989 atomic value being 6990 acquired. 
6991 6992 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 6993 6994 - If OpenCL, omit. 6995 - Must happen after 6996 any preceding 6997 local/generic 6998 load/store/load 6999 atomic/store 7000 atomic/atomicrmw. 7001 - Must happen before 7002 the following 7003 atomicrmw. 7004 - Ensures that all 7005 memory operations 7006 to local have 7007 completed before 7008 performing the 7009 atomicrmw that is 7010 being released. 7011 7012 2. flat_atomic 7013 3. s_waitcnt lgkmcnt(0) 7014 7015 - If OpenCL, omit. 7016 - Must happen before 7017 any following 7018 global/generic 7019 load/load 7020 atomic/store/store 7021 atomic/atomicrmw. 7022 - Ensures any 7023 following global 7024 data read is no 7025 older than a local load 7026 atomic value being 7027 acquired. 7028 7029 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 7030 - system vmcnt(0) 7031 7032 - If OpenCL, omit 7033 lgkmcnt(0). 7034 - Could be split into 7035 separate s_waitcnt 7036 vmcnt(0) and 7037 s_waitcnt 7038 lgkmcnt(0) to allow 7039 them to be 7040 independently moved 7041 according to the 7042 following rules. 7043 - s_waitcnt vmcnt(0) 7044 must happen after 7045 any preceding 7046 global/generic 7047 load/store/load 7048 atomic/store 7049 atomic/atomicrmw. 7050 - s_waitcnt lgkmcnt(0) 7051 must happen after 7052 any preceding 7053 local/generic 7054 load/store/load 7055 atomic/store 7056 atomic/atomicrmw. 7057 - Must happen before 7058 the following 7059 atomicrmw. 7060 - Ensures that all 7061 memory operations 7062 to global have 7063 completed before 7064 performing the 7065 atomicrmw that is 7066 being released. 7067 7068 2. buffer/global_atomic 7069 3. s_waitcnt vmcnt(0) 7070 7071 - Must happen before 7072 following 7073 buffer_wbinvl1_vol. 7074 - Ensures the 7075 atomicrmw has 7076 completed before 7077 invalidating the 7078 cache. 7079 7080 4. buffer_wbinvl1_vol 7081 7082 - Must happen before 7083 any following 7084 global/generic 7085 load/load 7086 atomic/atomicrmw. 
7087 - Ensures that 7088 following loads 7089 will not see stale 7090 global data. 7091 7092 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 7093 - system vmcnt(0) 7094 7095 - If OpenCL, omit 7096 lgkmcnt(0). 7097 - Could be split into 7098 separate s_waitcnt 7099 vmcnt(0) and 7100 s_waitcnt 7101 lgkmcnt(0) to allow 7102 them to be 7103 independently moved 7104 according to the 7105 following rules. 7106 - s_waitcnt vmcnt(0) 7107 must happen after 7108 any preceding 7109 global/generic 7110 load/store/load 7111 atomic/store 7112 atomic/atomicrmw. 7113 - s_waitcnt lgkmcnt(0) 7114 must happen after 7115 any preceding 7116 local/generic 7117 load/store/load 7118 atomic/store 7119 atomic/atomicrmw. 7120 - Must happen before 7121 the following 7122 atomicrmw. 7123 - Ensures that all 7124 memory operations 7125 to global have 7126 completed before 7127 performing the 7128 atomicrmw that is 7129 being released. 7130 7131 2. flat_atomic 7132 3. s_waitcnt vmcnt(0) & 7133 lgkmcnt(0) 7134 7135 - If OpenCL, omit 7136 lgkmcnt(0). 7137 - Must happen before 7138 following 7139 buffer_wbinvl1_vol. 7140 - Ensures the 7141 atomicrmw has 7142 completed before 7143 invalidating the 7144 cache. 7145 7146 4. buffer_wbinvl1_vol 7147 7148 - Must happen before 7149 any following 7150 global/generic 7151 load/load 7152 atomic/atomicrmw. 7153 - Ensures that 7154 following loads 7155 will not see stale 7156 global data. 7157 7158 fence acq_rel - singlethread *none* *none* 7159 - wavefront 7160 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 7161 7162 - If OpenCL and 7163 address space is 7164 not generic, omit. 7165 - However, 7166 since LLVM 7167 currently has no 7168 address space on 7169 the fence need to 7170 conservatively 7171 always generate 7172 (see comment for 7173 previous fence). 7174 - Must happen after 7175 any preceding 7176 local/generic 7177 load/load 7178 atomic/store/store 7179 atomic/atomicrmw. 
7180 - Must happen before 7181 any following 7182 global/generic 7183 load/load 7184 atomic/store/store 7185 atomic/atomicrmw. 7186 - Ensures that all 7187 memory operations 7188 to local have 7189 completed before 7190 performing any 7191 following global 7192 memory operations. 7193 - Ensures that the 7194 preceding 7195 local/generic load 7196 atomic/atomicrmw 7197 with an equal or 7198 wider sync scope 7199 and memory ordering 7200 stronger than 7201 unordered (this is 7202 termed the 7203 acquire-fence-paired-atomic) 7204 has completed 7205 before following 7206 global memory 7207 operations. This 7208 satisfies the 7209 requirements of 7210 acquire. 7211 - Ensures that all 7212 previous memory 7213 operations have 7214 completed before a 7215 following 7216 local/generic store 7217 atomic/atomicrmw 7218 with an equal or 7219 wider sync scope 7220 and memory ordering 7221 stronger than 7222 unordered (this is 7223 termed the 7224 release-fence-paired-atomic). 7225 This satisfies the 7226 requirements of 7227 release. 7228 7229 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 7230 - system vmcnt(0) 7231 7232 - If OpenCL and 7233 address space is 7234 not generic, omit 7235 lgkmcnt(0). 7236 - See :ref:`amdgpu-fence-as` for 7237 more details on fencing specific 7238 address spaces. 7239 - Could be split into 7240 separate s_waitcnt 7241 vmcnt(0) and 7242 s_waitcnt 7243 lgkmcnt(0) to allow 7244 them to be 7245 independently moved 7246 according to the 7247 following rules. 7248 - s_waitcnt vmcnt(0) 7249 must happen after 7250 any preceding 7251 global/generic 7252 load/store/load 7253 atomic/store 7254 atomic/atomicrmw. 7255 - s_waitcnt lgkmcnt(0) 7256 must happen after 7257 any preceding 7258 local/generic 7259 load/store/load 7260 atomic/store 7261 atomic/atomicrmw. 7262 - Must happen before 7263 the following 7264 buffer_wbinvl1_vol. 
7265 - Ensures that the 7266 preceding 7267 global/local/generic 7268 load 7269 atomic/atomicrmw 7270 with an equal or 7271 wider sync scope 7272 and memory ordering 7273 stronger than 7274 unordered (this is 7275 termed the 7276 acquire-fence-paired-atomic) 7277 has completed 7278 before invalidating 7279 the cache. This 7280 satisfies the 7281 requirements of 7282 acquire. 7283 - Ensures that all 7284 previous memory 7285 operations have 7286 completed before a 7287 following 7288 global/local/generic 7289 store 7290 atomic/atomicrmw 7291 with an equal or 7292 wider sync scope 7293 and memory ordering 7294 stronger than 7295 unordered (this is 7296 termed the 7297 release-fence-paired-atomic). 7298 This satisfies the 7299 requirements of 7300 release. 7301 7302 2. buffer_wbinvl1_vol 7303 7304 - Must happen before 7305 any following 7306 global/generic 7307 load/load 7308 atomic/store/store 7309 atomic/atomicrmw. 7310 - Ensures that 7311 following loads 7312 will not see stale 7313 global data. This 7314 satisfies the 7315 requirements of 7316 acquire. 7317 7318 **Sequential Consistent Atomic** 7319 ------------------------------------------------------------------------------------ 7320 load atomic seq_cst - singlethread - global *Same as corresponding 7321 - wavefront - local load atomic acquire, 7322 - generic except must generate 7323 all instructions even 7324 for OpenCL.* 7325 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 7326 - generic 7327 7328 - Must 7329 happen after 7330 preceding 7331 local/generic load 7332 atomic/store 7333 atomic/atomicrmw 7334 with memory 7335 ordering of seq_cst 7336 and with equal or 7337 wider sync scope. 7338 (Note that seq_cst 7339 fences have their 7340 own s_waitcnt 7341 lgkmcnt(0) and so do 7342 not need to be 7343 considered.) 
7344 - Ensures any 7345 preceding 7346 sequential 7347 consistent local 7348 memory instructions 7349 have completed 7350 before executing 7351 this sequentially 7352 consistent 7353 instruction. This 7354 prevents reordering 7355 a seq_cst store 7356 followed by a 7357 seq_cst load. (Note 7358 that seq_cst is 7359 stronger than 7360 acquire/release as 7361 the reordering of 7362 load acquire 7363 followed by a store 7364 release is 7365 prevented by the 7366 s_waitcnt of 7367 the release, but 7368 there is nothing 7369 preventing a store 7370 release followed by 7371 load acquire from 7372 completing out of 7373 order. The s_waitcnt 7374 could be placed after 7375 seq_store or before 7376 the seq_load. We 7377 choose the load to 7378 make the s_waitcnt be 7379 as late as possible 7380 so that the store 7381 may have already 7382 completed.) 7383 7384 2. *Following 7385 instructions same as 7386 corresponding load 7387 atomic acquire, 7388 except must generate 7389 all instructions even 7390 for OpenCL.* 7391 load atomic seq_cst - workgroup - local *Same as corresponding 7392 load atomic acquire, 7393 except must generate 7394 all instructions even 7395 for OpenCL.* 7396 7397 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 7398 - system - generic vmcnt(0) 7399 7400 - Could be split into 7401 separate s_waitcnt 7402 vmcnt(0) 7403 and s_waitcnt 7404 lgkmcnt(0) to allow 7405 them to be 7406 independently moved 7407 according to the 7408 following rules. 7409 - s_waitcnt lgkmcnt(0) 7410 must happen after 7411 preceding 7412 global/generic load 7413 atomic/store 7414 atomic/atomicrmw 7415 with memory 7416 ordering of seq_cst 7417 and with equal or 7418 wider sync scope. 7419 (Note that seq_cst 7420 fences have their 7421 own s_waitcnt 7422 lgkmcnt(0) and so do 7423 not need to be 7424 considered.) 
7425 - s_waitcnt vmcnt(0) 7426 must happen after 7427 preceding 7428 global/generic load 7429 atomic/store 7430 atomic/atomicrmw 7431 with memory 7432 ordering of seq_cst 7433 and with equal or 7434 wider sync scope. 7435 (Note that seq_cst 7436 fences have their 7437 own s_waitcnt 7438 vmcnt(0) and so do 7439 not need to be 7440 considered.) 7441 - Ensures any 7442 preceding 7443 sequential 7444 consistent global 7445 memory instructions 7446 have completed 7447 before executing 7448 this sequentially 7449 consistent 7450 instruction. This 7451 prevents reordering 7452 a seq_cst store 7453 followed by a 7454 seq_cst load. (Note 7455 that seq_cst is 7456 stronger than 7457 acquire/release as 7458 the reordering of 7459 load acquire 7460 followed by a store 7461 release is 7462 prevented by the 7463 s_waitcnt of 7464 the release, but 7465 there is nothing 7466 preventing a store 7467 release followed by 7468 load acquire from 7469 completing out of 7470 order. The s_waitcnt 7471 could be placed after 7472 seq_store or before 7473 the seq_load. We 7474 choose the load to 7475 make the s_waitcnt be 7476 as late as possible 7477 so that the store 7478 may have already 7479 completed.) 7480 7481 2. 
*Following 7482 instructions same as 7483 corresponding load 7484 atomic acquire, 7485 except must generate 7486 all instructions even 7487 for OpenCL.* 7488 store atomic seq_cst - singlethread - global *Same as corresponding 7489 - wavefront - local store atomic release, 7490 - workgroup - generic except must generate 7491 - agent all instructions even 7492 - system for OpenCL.* 7493 atomicrmw seq_cst - singlethread - global *Same as corresponding 7494 - wavefront - local atomicrmw acq_rel, 7495 - workgroup - generic except must generate 7496 - agent all instructions even 7497 - system for OpenCL.* 7498 fence seq_cst - singlethread *none* *Same as corresponding 7499 - wavefront fence acq_rel, 7500 - workgroup except must generate 7501 - agent all instructions even 7502 - system for OpenCL.* 7503 ============ ============ ============== ========== ================================ 7504 7505.. _amdgpu-amdhsa-memory-model-gfx90a: 7506 7507Memory Model GFX90A 7508+++++++++++++++++++ 7509 7510For GFX90A: 7511 7512* Each agent has multiple shader arrays (SA). 7513* Each SA has multiple compute units (CU). 7514* Each CU has multiple SIMDs that execute wavefronts. 7515* The wavefronts for a single work-group are executed in the same CU but may be 7516 executed by different SIMDs. The exception is when in tgsplit execution mode 7517 when the wavefronts may be executed by different SIMDs in different CUs. 7518* Each CU has a single LDS memory shared by the wavefronts of the work-groups 7519 executing on it. The exception is when in tgsplit execution mode when no LDS 7520 is allocated as wavefronts of the same work-group can be in different CUs. 7521* All LDS operations of a CU are performed as wavefront wide operations in a 7522 global order and involve no caching. Completion is reported to a wavefront in 7523 execution order. 7524* The LDS memory has multiple request queues shared by the SIMDs of a 7525 CU. 
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they access
  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.

  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.

  * The L2 cache has independent channels to service disjoint ranges of virtual
    addresses.
  * Each CU has a separate request queue per channel. Therefore, the vector and
    scalar memory operations performed by wavefronts executing in different
    work-groups (which may be executing on different CUs), or the same
    work-group if executing in tgsplit mode, of an agent can be reordered
    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
    synchronization between vector memory operations of different CUs. It
    ensures a previous vector memory operation has completed before executing a
    subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by:
    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
      the cache probe caused by coherent requests. Coherent requests are caused
      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
      XGMI, and by PCIe requests that are configured to be coherent requests.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
    * To ensure coherence of local and remote memory writes of work-groups in
      different agents a ``buffer_wbl2`` is required.
      It will writeback dirty L2
      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
      (used for remote coarse grain memory). Note that MTYPE CC (used for local
      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
      remote fine grain memory) bypasses the L2, so both will never result in
      dirty L2 cache lines.
    * To ensure coherence of local and remote memory reads of work-groups in
      different agents a ``buffer_invl2`` is required. It will invalidate L2
      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse grain memory) cause local reads to be invalidated by remote writes
      with the PTE C-bit so these cache lines are not invalidated. Note that
      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
      never result in L2 cache lines that need to be invalidated.

    * PCIe access from the GPU to the CPU memory is kept coherent by using the
      MTYPE UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time.
If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX90A
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX90A
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1
                                                           2. s_waitcnt vmcnt(0)

                                                              - Must happen before
                                                                any following volatile
                                                                global/generic
                                                                load/store.
                                                              - Ensures that
                                                                volatile
                                                                operations to
                                                                different
                                                                addresses will not
                                                                be reordered by
                                                                hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vmcnt(0)

                                                              - Must happen before
                                                                any following volatile
                                                                global/generic
                                                                load/store.
                                                              - Ensures that
                                                                volatile
                                                                operations to
                                                                different
                                                                addresses will not
                                                                be reordered by
                                                                hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     glc=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following buffer_wbinvl1_vol.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local load
                                                              atomic value being
                                                              acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol and any
                                                              following global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local load
                                                              atomic value being
                                                              acquired.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the flat_load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     load atomic  acquire      - system       - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the flat_load
                                                              has completed
                                                              before invalidating
                                                              the caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local
                                                              atomicrmw value
                                                              being acquired.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local
                                                              atomicrmw value
                                                              being acquired.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - system       - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load
                                                              atomic/
                                                              atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the
                                                              value read by the
                                                              fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              fence-paired atomic
                                                              has completed
                                                              before invalidating
                                                              the cache.
                                                              Therefore
                                                              any following
                                                              locations read must
                                                              be no older than
                                                              the value read by
                                                              the
                                                              fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before
                                                              the following buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              fence-paired atomic
                                                              has completed
                                                              before invalidating
                                                              the cache. Therefore
                                                              any following
                                                              locations read must
                                                              be no older than
                                                              the value read by
                                                              the
                                                              fence-paired-atomic.

                                                         2. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              memory operations
                                                              to memory have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. buffer/global/flat_store
     store atomic release      - system       - global   1. buffer_wbl2
                                              - generic
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              memory operations
                                                              to memory and the L2
                                                              writeback have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         3. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global and local
                                                              have completed
                                                              before performing
                                                              the atomicrmw that
                                                              is being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - system       - global   1. buffer_wbl2
                                              - generic
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to memory and the L2
                                                              writeback have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is being
                                                              released.

                                                         3. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     fence        release      - system       *none*     1. buffer_wbl2

                                                            - If OpenCL and
                                                              address space is
                                                              local, omit.
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the
                                                              atomicrmw value
                                                              being acquired.

                                                         4. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local load
                                                              atomic value being
                                                              acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
8945 - s_waitcnt vmcnt(0) 8946 must happen after 8947 any preceding 8948 global/generic load/store/ 8949 load atomic/store atomic/ 8950 atomicrmw. 8951 - s_waitcnt lgkmcnt(0) 8952 must happen after 8953 any preceding 8954 local/generic 8955 load/store/load 8956 atomic/store 8957 atomic/atomicrmw. 8958 - Must happen before 8959 the following 8960 atomicrmw. 8961 - Ensures that all 8962 memory operations 8963 have 8964 completed before 8965 performing the 8966 atomicrmw that is 8967 being released. 8968 8969 2. flat_atomic 8970 3. s_waitcnt lgkmcnt(0) & 8971 vmcnt(0) 8972 8973 - If not TgSplit execution 8974 mode, omit vmcnt(0). 8975 - If OpenCL, omit 8976 lgkmcnt(0). 8977 - Must happen before 8978 the following 8979 buffer_wbinvl1_vol and 8980 any following 8981 global/generic 8982 load/load 8983 atomic/store/store 8984 atomic/atomicrmw. 8985 - Ensures any 8986 following global 8987 data read is no 8988 older than a local load 8989 atomic value being 8990 acquired. 8991 8992 3. buffer_wbinvl1_vol 8993 8994 - If not TgSplit execution 8995 mode, omit. 8996 - Ensures that 8997 following 8998 loads will not see 8999 stale data. 9000 9001 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 9002 vmcnt(0) 9003 9004 - If TgSplit execution mode, 9005 omit lgkmcnt(0). 9006 - If OpenCL, omit 9007 lgkmcnt(0). 9008 - Could be split into 9009 separate s_waitcnt 9010 vmcnt(0) and 9011 s_waitcnt 9012 lgkmcnt(0) to allow 9013 them to be 9014 independently moved 9015 according to the 9016 following rules. 9017 - s_waitcnt vmcnt(0) 9018 must happen after 9019 any preceding 9020 global/generic 9021 load/store/load 9022 atomic/store 9023 atomic/atomicrmw. 9024 - s_waitcnt lgkmcnt(0) 9025 must happen after 9026 any preceding 9027 local/generic 9028 load/store/load 9029 atomic/store 9030 atomic/atomicrmw. 9031 - Must happen before 9032 the following 9033 atomicrmw. 
9034 - Ensures that all 9035 memory operations 9036 to global have 9037 completed before 9038 performing the 9039 atomicrmw that is 9040 being released. 9041 9042 2. buffer/global_atomic 9043 3. s_waitcnt vmcnt(0) 9044 9045 - Must happen before 9046 following 9047 buffer_wbinvl1_vol. 9048 - Ensures the 9049 atomicrmw has 9050 completed before 9051 invalidating the 9052 cache. 9053 9054 4. buffer_wbinvl1_vol 9055 9056 - Must happen before 9057 any following 9058 global/generic 9059 load/load 9060 atomic/atomicrmw. 9061 - Ensures that 9062 following loads 9063 will not see stale 9064 global data. 9065 9066 atomicrmw acq_rel - system - global 1. buffer_wbl2 9067 9068 - Must happen before 9069 following s_waitcnt. 9070 - Performs L2 writeback to 9071 ensure previous 9072 global/generic 9073 store/atomicrmw are 9074 visible at system scope. 9075 9076 2. s_waitcnt lgkmcnt(0) & 9077 vmcnt(0) 9078 9079 - If TgSplit execution mode, 9080 omit lgkmcnt(0). 9081 - If OpenCL, omit 9082 lgkmcnt(0). 9083 - Could be split into 9084 separate s_waitcnt 9085 vmcnt(0) and 9086 s_waitcnt 9087 lgkmcnt(0) to allow 9088 them to be 9089 independently moved 9090 according to the 9091 following rules. 9092 - s_waitcnt vmcnt(0) 9093 must happen after 9094 any preceding 9095 global/generic 9096 load/store/load 9097 atomic/store 9098 atomic/atomicrmw. 9099 - s_waitcnt lgkmcnt(0) 9100 must happen after 9101 any preceding 9102 local/generic 9103 load/store/load 9104 atomic/store 9105 atomic/atomicrmw. 9106 - Must happen before 9107 the following 9108 atomicrmw. 9109 - Ensures that all 9110 memory operations 9111 to global and L2 writeback 9112 have completed before 9113 performing the 9114 atomicrmw that is 9115 being released. 9116 9117 3. buffer/global_atomic 9118 4. s_waitcnt vmcnt(0) 9119 9120 - Must happen before 9121 following buffer_invl2 and 9122 buffer_wbinvl1_vol. 9123 - Ensures the 9124 atomicrmw has 9125 completed before 9126 invalidating the 9127 caches. 9128 9129 5. 
buffer_invl2; 9130 buffer_wbinvl1_vol 9131 9132 - Must happen before 9133 any following 9134 global/generic 9135 load/load 9136 atomic/atomicrmw. 9137 - Ensures that 9138 following 9139 loads will not see 9140 stale L1 global data, 9141 nor see stale L2 MTYPE 9142 NC global data. 9143 MTYPE RW and CC memory will 9144 never be stale in L2 due to 9145 the memory probes. 9146 9147 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 9148 vmcnt(0) 9149 9150 - If TgSplit execution mode, 9151 omit lgkmcnt(0). 9152 - If OpenCL, omit 9153 lgkmcnt(0). 9154 - Could be split into 9155 separate s_waitcnt 9156 vmcnt(0) and 9157 s_waitcnt 9158 lgkmcnt(0) to allow 9159 them to be 9160 independently moved 9161 according to the 9162 following rules. 9163 - s_waitcnt vmcnt(0) 9164 must happen after 9165 any preceding 9166 global/generic 9167 load/store/load 9168 atomic/store 9169 atomic/atomicrmw. 9170 - s_waitcnt lgkmcnt(0) 9171 must happen after 9172 any preceding 9173 local/generic 9174 load/store/load 9175 atomic/store 9176 atomic/atomicrmw. 9177 - Must happen before 9178 the following 9179 atomicrmw. 9180 - Ensures that all 9181 memory operations 9182 to global have 9183 completed before 9184 performing the 9185 atomicrmw that is 9186 being released. 9187 9188 2. flat_atomic 9189 3. s_waitcnt vmcnt(0) & 9190 lgkmcnt(0) 9191 9192 - If TgSplit execution mode, 9193 omit lgkmcnt(0). 9194 - If OpenCL, omit 9195 lgkmcnt(0). 9196 - Must happen before 9197 following 9198 buffer_wbinvl1_vol. 9199 - Ensures the 9200 atomicrmw has 9201 completed before 9202 invalidating the 9203 cache. 9204 9205 4. buffer_wbinvl1_vol 9206 9207 - Must happen before 9208 any following 9209 global/generic 9210 load/load 9211 atomic/atomicrmw. 9212 - Ensures that 9213 following loads 9214 will not see stale 9215 global data. 9216 9217 atomicrmw acq_rel - system - generic 1. buffer_wbl2 9218 9219 - Must happen before 9220 following s_waitcnt. 
9221 - Performs L2 writeback to 9222 ensure previous 9223 global/generic 9224 store/atomicrmw are 9225 visible at system scope. 9226 9227 2. s_waitcnt lgkmcnt(0) & 9228 vmcnt(0) 9229 9230 - If TgSplit execution mode, 9231 omit lgkmcnt(0). 9232 - If OpenCL, omit 9233 lgkmcnt(0). 9234 - Could be split into 9235 separate s_waitcnt 9236 vmcnt(0) and 9237 s_waitcnt 9238 lgkmcnt(0) to allow 9239 them to be 9240 independently moved 9241 according to the 9242 following rules. 9243 - s_waitcnt vmcnt(0) 9244 must happen after 9245 any preceding 9246 global/generic 9247 load/store/load 9248 atomic/store 9249 atomic/atomicrmw. 9250 - s_waitcnt lgkmcnt(0) 9251 must happen after 9252 any preceding 9253 local/generic 9254 load/store/load 9255 atomic/store 9256 atomic/atomicrmw. 9257 - Must happen before 9258 the following 9259 atomicrmw. 9260 - Ensures that all 9261 memory operations 9262 to global and L2 writeback 9263 have completed before 9264 performing the 9265 atomicrmw that is 9266 being released. 9267 9268 3. flat_atomic 9269 4. s_waitcnt vmcnt(0) & 9270 lgkmcnt(0) 9271 9272 - If TgSplit execution mode, 9273 omit lgkmcnt(0). 9274 - If OpenCL, omit 9275 lgkmcnt(0). 9276 - Must happen before 9277 following buffer_invl2 and 9278 buffer_wbinvl1_vol. 9279 - Ensures the 9280 atomicrmw has 9281 completed before 9282 invalidating the 9283 caches. 9284 9285 5. buffer_invl2; 9286 buffer_wbinvl1_vol 9287 9288 - Must happen before 9289 any following 9290 global/generic 9291 load/load 9292 atomic/atomicrmw. 9293 - Ensures that 9294 following 9295 loads will not see 9296 stale L1 global data, 9297 nor see stale L2 MTYPE 9298 NC global data. 9299 MTYPE RW and CC memory will 9300 never be stale in L2 due to 9301 the memory probes. 9302 9303 fence acq_rel - singlethread *none* *none* 9304 - wavefront 9305 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 9306 9307 - Use lgkmcnt(0) if not 9308 TgSplit execution mode 9309 and vmcnt(0) if TgSplit 9310 execution mode. 
9311 - If OpenCL and 9312 address space is 9313 not generic, omit 9314 lgkmcnt(0). 9315 - If OpenCL and 9316 address space is 9317 local, omit 9318 vmcnt(0). 9319 - However, 9320 since LLVM 9321 currently has no 9322 address space on 9323 the fence need to 9324 conservatively 9325 always generate 9326 (see comment for 9327 previous fence). 9328 - s_waitcnt vmcnt(0) 9329 must happen after 9330 any preceding 9331 global/generic 9332 load/store/ 9333 load atomic/store atomic/ 9334 atomicrmw. 9335 - s_waitcnt lgkmcnt(0) 9336 must happen after 9337 any preceding 9338 local/generic 9339 load/load 9340 atomic/store/store 9341 atomic/atomicrmw. 9342 - Must happen before 9343 any following 9344 global/generic 9345 load/load 9346 atomic/store/store 9347 atomic/atomicrmw. 9348 - Ensures that all 9349 memory operations 9350 have 9351 completed before 9352 performing any 9353 following global 9354 memory operations. 9355 - Ensures that the 9356 preceding 9357 local/generic load 9358 atomic/atomicrmw 9359 with an equal or 9360 wider sync scope 9361 and memory ordering 9362 stronger than 9363 unordered (this is 9364 termed the 9365 acquire-fence-paired-atomic) 9366 has completed 9367 before following 9368 global memory 9369 operations. This 9370 satisfies the 9371 requirements of 9372 acquire. 9373 - Ensures that all 9374 previous memory 9375 operations have 9376 completed before a 9377 following 9378 local/generic store 9379 atomic/atomicrmw 9380 with an equal or 9381 wider sync scope 9382 and memory ordering 9383 stronger than 9384 unordered (this is 9385 termed the 9386 release-fence-paired-atomic). 9387 This satisfies the 9388 requirements of 9389 release. 9390 - Must happen before 9391 the following 9392 buffer_wbinvl1_vol. 9393 - Ensures that the 9394 acquire-fence-paired 9395 atomic has completed 9396 before invalidating 9397 the 9398 cache. 
Therefore 9399 any following 9400 locations read must 9401 be no older than 9402 the value read by 9403 the 9404 acquire-fence-paired-atomic. 9405 9406 2. buffer_wbinvl1_vol 9407 9408 - If not TgSplit execution 9409 mode, omit. 9410 - Ensures that 9411 following 9412 loads will not see 9413 stale data. 9414 9415 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 9416 vmcnt(0) 9417 9418 - If TgSplit execution mode, 9419 omit lgkmcnt(0). 9420 - If OpenCL and 9421 address space is 9422 not generic, omit 9423 lgkmcnt(0). 9424 - See :ref:`amdgpu-fence-as` for 9425 more details on fencing specific 9426 address spaces. 9427 - Could be split into 9428 separate s_waitcnt 9429 vmcnt(0) and 9430 s_waitcnt 9431 lgkmcnt(0) to allow 9432 them to be 9433 independently moved 9434 according to the 9435 following rules. 9436 - s_waitcnt vmcnt(0) 9437 must happen after 9438 any preceding 9439 global/generic 9440 load/store/load 9441 atomic/store 9442 atomic/atomicrmw. 9443 - s_waitcnt lgkmcnt(0) 9444 must happen after 9445 any preceding 9446 local/generic 9447 load/store/load 9448 atomic/store 9449 atomic/atomicrmw. 9450 - Must happen before 9451 the following 9452 buffer_wbinvl1_vol. 9453 - Ensures that the 9454 preceding 9455 global/local/generic 9456 load 9457 atomic/atomicrmw 9458 with an equal or 9459 wider sync scope 9460 and memory ordering 9461 stronger than 9462 unordered (this is 9463 termed the 9464 acquire-fence-paired-atomic) 9465 has completed 9466 before invalidating 9467 the cache. This 9468 satisfies the 9469 requirements of 9470 acquire. 9471 - Ensures that all 9472 previous memory 9473 operations have 9474 completed before a 9475 following 9476 global/local/generic 9477 store 9478 atomic/atomicrmw 9479 with an equal or 9480 wider sync scope 9481 and memory ordering 9482 stronger than 9483 unordered (this is 9484 termed the 9485 release-fence-paired-atomic). 9486 This satisfies the 9487 requirements of 9488 release. 9489 9490 2. 
buffer_wbinvl1_vol 9491 9492 - Must happen before 9493 any following 9494 global/generic 9495 load/load 9496 atomic/store/store 9497 atomic/atomicrmw. 9498 - Ensures that 9499 following loads 9500 will not see stale 9501 global data. This 9502 satisfies the 9503 requirements of 9504 acquire. 9505 9506 fence acq_rel - system *none* 1. buffer_wbl2 9507 9508 - If OpenCL and 9509 address space is 9510 local, omit. 9511 - Must happen before 9512 following s_waitcnt. 9513 - Performs L2 writeback to 9514 ensure previous 9515 global/generic 9516 store/atomicrmw are 9517 visible at system scope. 9518 9519 2. s_waitcnt lgkmcnt(0) & 9520 vmcnt(0) 9521 9522 - If TgSplit execution mode, 9523 omit lgkmcnt(0). 9524 - If OpenCL and 9525 address space is 9526 not generic, omit 9527 lgkmcnt(0). 9528 - See :ref:`amdgpu-fence-as` for 9529 more details on fencing specific 9530 address spaces. 9531 - Could be split into 9532 separate s_waitcnt 9533 vmcnt(0) and 9534 s_waitcnt 9535 lgkmcnt(0) to allow 9536 them to be 9537 independently moved 9538 according to the 9539 following rules. 9540 - s_waitcnt vmcnt(0) 9541 must happen after 9542 any preceding 9543 global/generic 9544 load/store/load 9545 atomic/store 9546 atomic/atomicrmw. 9547 - s_waitcnt lgkmcnt(0) 9548 must happen after 9549 any preceding 9550 local/generic 9551 load/store/load 9552 atomic/store 9553 atomic/atomicrmw. 9554 - Must happen before 9555 the following buffer_invl2 and 9556 buffer_wbinvl1_vol. 9557 - Ensures that the 9558 preceding 9559 global/local/generic 9560 load 9561 atomic/atomicrmw 9562 with an equal or 9563 wider sync scope 9564 and memory ordering 9565 stronger than 9566 unordered (this is 9567 termed the 9568 acquire-fence-paired-atomic) 9569 has completed 9570 before invalidating 9571 the cache. This 9572 satisfies the 9573 requirements of 9574 acquire. 
9575 - Ensures that all 9576 previous memory 9577 operations have 9578 completed before a 9579 following 9580 global/local/generic 9581 store 9582 atomic/atomicrmw 9583 with an equal or 9584 wider sync scope 9585 and memory ordering 9586 stronger than 9587 unordered (this is 9588 termed the 9589 release-fence-paired-atomic). 9590 This satisfies the 9591 requirements of 9592 release. 9593 9594 3. buffer_invl2; 9595 buffer_wbinvl1_vol 9596 9597 - Must happen before 9598 any following 9599 global/generic 9600 load/load 9601 atomic/store/store 9602 atomic/atomicrmw. 9603 - Ensures that 9604 following 9605 loads will not see 9606 stale L1 global data, 9607 nor see stale L2 MTYPE 9608 NC global data. 9609 MTYPE RW and CC memory will 9610 never be stale in L2 due to 9611 the memory probes. 9612 9613 **Sequential Consistent Atomic** 9614 ------------------------------------------------------------------------------------ 9615 load atomic seq_cst - singlethread - global *Same as corresponding 9616 - wavefront - local load atomic acquire, 9617 - generic except must generate 9618 all instructions even 9619 for OpenCL.* 9620 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 9621 - generic 9622 - Use lgkmcnt(0) if not 9623 TgSplit execution mode 9624 and vmcnt(0) if TgSplit 9625 execution mode. 9626 - s_waitcnt lgkmcnt(0) must 9627 happen after 9628 preceding 9629 local/generic load 9630 atomic/store 9631 atomic/atomicrmw 9632 with memory 9633 ordering of seq_cst 9634 and with equal or 9635 wider sync scope. 9636 (Note that seq_cst 9637 fences have their 9638 own s_waitcnt 9639 lgkmcnt(0) and so do 9640 not need to be 9641 considered.) 9642 - s_waitcnt vmcnt(0) 9643 must happen after 9644 preceding 9645 global/generic load 9646 atomic/store 9647 atomic/atomicrmw 9648 with memory 9649 ordering of seq_cst 9650 and with equal or 9651 wider sync scope. 
9652 (Note that seq_cst 9653 fences have their 9654 own s_waitcnt 9655 vmcnt(0) and so do 9656 not need to be 9657 considered.) 9658 - Ensures any 9659 preceding 9660 sequential 9661 consistent global/local 9662 memory instructions 9663 have completed 9664 before executing 9665 this sequentially 9666 consistent 9667 instruction. This 9668 prevents reordering 9669 a seq_cst store 9670 followed by a 9671 seq_cst load. (Note 9672 that seq_cst is 9673 stronger than 9674 acquire/release as 9675 the reordering of 9676 load acquire 9677 followed by a store 9678 release is 9679 prevented by the 9680 s_waitcnt of 9681 the release, but 9682 there is nothing 9683 preventing a store 9684 release followed by 9685 load acquire from 9686 completing out of 9687 order. The s_waitcnt 9688 could be placed after 9689 seq_store or before 9690 the seq_load. We 9691 choose the load to 9692 make the s_waitcnt be 9693 as late as possible 9694 so that the store 9695 may have already 9696 completed.) 9697 9698 2. *Following 9699 instructions same as 9700 corresponding load 9701 atomic acquire, 9702 except must generate 9703 all instructions even 9704 for OpenCL.* 9705 load atomic seq_cst - workgroup - local *If TgSplit execution mode, 9706 local address space cannot 9707 be used.* 9708 9709 *Same as corresponding 9710 load atomic acquire, 9711 except must generate 9712 all instructions even 9713 for OpenCL.* 9714 9715 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 9716 - system - generic vmcnt(0) 9717 9718 - If TgSplit execution mode, 9719 omit lgkmcnt(0). 9720 - Could be split into 9721 separate s_waitcnt 9722 vmcnt(0) 9723 and s_waitcnt 9724 lgkmcnt(0) to allow 9725 them to be 9726 independently moved 9727 according to the 9728 following rules. 9729 - s_waitcnt lgkmcnt(0) 9730 must happen after 9731 preceding 9732 global/generic load 9733 atomic/store 9734 atomic/atomicrmw 9735 with memory 9736 ordering of seq_cst 9737 and with equal or 9738 wider sync scope. 
9739 (Note that seq_cst 9740 fences have their 9741 own s_waitcnt 9742 lgkmcnt(0) and so do 9743 not need to be 9744 considered.) 9745 - s_waitcnt vmcnt(0) 9746 must happen after 9747 preceding 9748 global/generic load 9749 atomic/store 9750 atomic/atomicrmw 9751 with memory 9752 ordering of seq_cst 9753 and with equal or 9754 wider sync scope. 9755 (Note that seq_cst 9756 fences have their 9757 own s_waitcnt 9758 vmcnt(0) and so do 9759 not need to be 9760 considered.) 9761 - Ensures any 9762 preceding 9763 sequential 9764 consistent global 9765 memory instructions 9766 have completed 9767 before executing 9768 this sequentially 9769 consistent 9770 instruction. This 9771 prevents reordering 9772 a seq_cst store 9773 followed by a 9774 seq_cst load. (Note 9775 that seq_cst is 9776 stronger than 9777 acquire/release as 9778 the reordering of 9779 load acquire 9780 followed by a store 9781 release is 9782 prevented by the 9783 s_waitcnt of 9784 the release, but 9785 there is nothing 9786 preventing a store 9787 release followed by 9788 load acquire from 9789 completing out of 9790 order. The s_waitcnt 9791 could be placed after 9792 seq_store or before 9793 the seq_load. We 9794 choose the load to 9795 make the s_waitcnt be 9796 as late as possible 9797 so that the store 9798 may have already 9799 completed.) 9800 9801 2. 
                                                             *Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.*

     store atomic seq_cst      - singlethread - global    *Same as corresponding store atomic release, except must generate all instructions even for OpenCL.*
                               - wavefront    - local
                               - workgroup    - generic
                               - agent
                               - system
     atomicrmw    seq_cst      - singlethread - global    *Same as corresponding atomicrmw acq_rel, except must generate all instructions even for OpenCL.*
                               - wavefront    - local
                               - workgroup    - generic
                               - agent
                               - system
     fence        seq_cst      - singlethread *none*     *Same as corresponding fence acq_rel, except must generate all instructions even for OpenCL.*
                               - wavefront
                               - workgroup
                               - agent
                               - system
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx942:

Memory Model GFX942
+++++++++++++++++++

For GFX942:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is when in tgsplit execution mode
  when the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is when in tgsplit execution mode when no LDS
  is allocated as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a CU.
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they
  access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.
  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_inv sc0`` is required which will invalidate
    the L1 cache.
  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
    between wavefronts executing in different work-groups as they may be
    executing on different CUs.
  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
    Therefore, they do not use the sc0 bit for coherence and instead use it to
    indicate if the instruction returns the original value being updated. They
    do use sc1 to indicate system or agent scope coherence.

* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache.

  * The gfx942 can be configured as a number of smaller agents with each having
    a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
    larger agents with groups of CUs on each agent each sharing separate L2
    caches.
  * The L2 cache has independent channels to service disjoint ranges of virtual
    addresses.
  * Each CU has a separate request queue per channel for its associated L2.
    Therefore, the vector and scalar memory operations performed by wavefronts
    executing with different L1 caches and the same L2 cache can be reordered
    relative to each other.
  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
    vector memory operations of different CUs. It ensures a previous vector
    memory operation has completed before executing a subsequent vector memory
    or LDS operation and so can be used to meet the requirements of acquire and
    release.
  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
    the PTE C-bit set for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
      the cache probe caused by the PTE C-bit.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter.
    * To ensure coherence of local memory writes of CUs with different L1
      caches in the same agent a ``buffer_wbl2`` is required. It does nothing
      if the agent is configured to have a single L2, or will writeback dirty
      L2 cache lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory writes of CUs in different agents a
      ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
    * To ensure coherence of local memory reads of CUs with different L1 caches
      in the same agent a ``buffer_inv sc1`` is required. It does nothing if
      the agent is configured to have a single L2, or will invalidate non-local
      L2 cache lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory reads of CUs in different agents a
      ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
      lines if configured to have multiple L2 caches.

  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
    UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time.
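
The acquire and release requirements that these cache-maintenance and
wait-count sequences implement are the hardware counterpart of language-level
fence synchronization (the "fence-paired-atomic" pattern used throughout the
code-sequence tables). A minimal host-side C++ sketch of the same pairing,
purely illustrative and not AMDGPU-specific:

.. code-block:: c++

  #include <atomic>
  #include <cassert>
  #include <thread>

  int payload = 0;               // plain, non-atomic data
  std::atomic<bool> flag{false}; // the fence-paired-atomic

  void producer() {
    payload = 42;
    // Analogous to the wait counts issued before the paired store:
    // all prior writes must be complete before the flag becomes visible.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
  }

  void consumer() {
    while (!flag.load(std::memory_order_relaxed)) { }
    // Analogous to the cache invalidate issued after the paired load:
    // no read after this point may return data older than the flag.
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(payload == 42);
  }

  int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
  }

The release fence plays the role of the ``s_waitcnt`` before the paired store,
and the acquire fence plays the role of the cache invalidate after the paired
load.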
If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX940, GFX941, GFX942
are defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX940, GFX941, GFX942
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX940, GFX941, GFX942
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                          - !volatile & nontemporal

                                                            1. buffer/global/flat_load
                                                               nt=1

                                                          - volatile

                                                            1. buffer/global/flat_load
                                                               sc0=1 sc1=1
                                                            2. s_waitcnt vmcnt(0)

                                                               - Must happen before any following volatile global/generic load/store.
                                                               - Ensures that volatile operations to different addresses will not be reordered by hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. GFX940, GFX941
                                              - constant        buffer/global/flat_store
                                                                  sc0=1 sc1=1
                                                               GFX942
                                                                  buffer/global/flat_store

                                                          - !volatile & nontemporal

                                                            1. GFX940, GFX941
                                                                 buffer/global/flat_store
                                                                   nt=1 sc0=1 sc1=1
                                                               GFX942
                                                                 buffer/global/flat_store
                                                                   nt=1

                                                          - volatile

                                                            1. buffer/global/flat_store
                                                               sc0=1 sc1=1
                                                            2. s_waitcnt vmcnt(0)

                                                               - Must happen before any following volatile global/generic load/store.
                                                               - Ensures that volatile operations to different addresses will not be reordered by hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     sc0=1
     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode, local address space cannot be used.*
                               - wavefront
                               - workgroup               1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     sc1=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     sc0=1 sc1=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
                                              - generic     sc0=1
     store atomic monotonic    - agent        - global   1. buffer/global/flat_store
                                              - generic     sc1=1
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic     sc0=1 sc1=1
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode, local address space cannot be used.*
                               - wavefront
                               - workgroup               1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic     sc1=1
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode, local address space cannot be used.*
                               - wavefront
                               - workgroup               1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution mode, omit.
                                                            - Must happen before the following buffer_inv.

                                                         3. buffer_inv sc0=1

                                                            - If not TgSplit execution mode, omit.
                                                            - Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode, local address space cannot be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures any following global data read is no older than the local load atomic value being acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit execution mode and vmcnt(0) if TgSplit execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before the following buffer_inv and any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures any following global data read is no older than a local load atomic value being acquired.

                                                         3. buffer_inv sc0=1

                                                            - If not TgSplit execution mode, omit.
                                                            - Ensures that following loads will not see stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before following buffer_inv.
                                                            - Ensures the load has completed before invalidating the cache.

                                                         3. buffer_inv sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            sc0=1 sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before following buffer_inv.
                                                            - Ensures the load has completed before invalidating the cache.

                                                         3. buffer_inv sc0=1 sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale MTYPE NC global data. MTYPE RW and CC memory will never be stale due to the memory probes.

     load atomic  acquire      - agent        - generic  1. flat_load sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before following buffer_inv.
                                                            - Ensures the flat_load has completed before invalidating the cache.

                                                         3. buffer_inv sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale global data.

     load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
10248 - Must happen before 10249 the following 10250 buffer_inv. 10251 - Ensures the flat_load 10252 has completed 10253 before invalidating 10254 the caches. 10255 10256 3. buffer_inv sc0=1 sc1=1 10257 10258 - Must happen before 10259 any following 10260 global/generic 10261 load/load 10262 atomic/atomicrmw. 10263 - Ensures that 10264 following 10265 loads will not see 10266 stale MTYPE NC global data. 10267 MTYPE RW and CC memory will 10268 never be stale due to the 10269 memory probes. 10270 10271 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 10272 - wavefront - generic 10273 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 10274 - wavefront local address space cannot 10275 be used.* 10276 10277 1. ds_atomic 10278 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 10279 2. s_waitcnt vmcnt(0) 10280 10281 - If not TgSplit execution 10282 mode, omit. 10283 - Must happen before the 10284 following buffer_inv. 10285 - Ensures the atomicrmw 10286 has completed 10287 before invalidating 10288 the cache. 10289 10290 3. buffer_inv sc0=1 10291 10292 - If not TgSplit execution 10293 mode, omit. 10294 - Must happen before 10295 any following 10296 global/generic 10297 load/load 10298 atomic/atomicrmw. 10299 - Ensures that 10300 following loads 10301 will not see stale 10302 global data. 10303 10304 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 10305 local address space cannot 10306 be used.* 10307 10308 1. ds_atomic 10309 2. s_waitcnt lgkmcnt(0) 10310 10311 - If OpenCL, omit. 10312 - Must happen before 10313 any following 10314 global/generic 10315 load/load 10316 atomic/store/store 10317 atomic/atomicrmw. 10318 - Ensures any 10319 following global 10320 data read is no 10321 older than the local 10322 atomicrmw value 10323 being acquired. 10324 10325 atomicrmw acquire - workgroup - generic 1. flat_atomic 10326 2. 
s_waitcnt lgkm/vmcnt(0) 10327 10328 - Use lgkmcnt(0) if not 10329 TgSplit execution mode 10330 and vmcnt(0) if TgSplit 10331 execution mode. 10332 - If OpenCL, omit lgkmcnt(0). 10333 - Must happen before 10334 the following 10335 buffer_inv and 10336 any following 10337 global/generic 10338 load/load 10339 atomic/store/store 10340 atomic/atomicrmw. 10341 - Ensures any 10342 following global 10343 data read is no 10344 older than a local 10345 atomicrmw value 10346 being acquired. 10347 10348 3. buffer_inv sc0=1 10349 10350 - If not TgSplit execution 10351 mode, omit. 10352 - Ensures that 10353 following 10354 loads will not see 10355 stale data. 10356 10357 atomicrmw acquire - agent - global 1. buffer/global_atomic 10358 2. s_waitcnt vmcnt(0) 10359 10360 - Must happen before 10361 following 10362 buffer_inv. 10363 - Ensures the 10364 atomicrmw has 10365 completed before 10366 invalidating the 10367 cache. 10368 10369 3. buffer_inv sc1=1 10370 10371 - Must happen before 10372 any following 10373 global/generic 10374 load/load 10375 atomic/atomicrmw. 10376 - Ensures that 10377 following loads 10378 will not see stale 10379 global data. 10380 10381 atomicrmw acquire - system - global 1. buffer/global_atomic 10382 sc1=1 10383 2. s_waitcnt vmcnt(0) 10384 10385 - Must happen before 10386 following 10387 buffer_inv. 10388 - Ensures the 10389 atomicrmw has 10390 completed before 10391 invalidating the 10392 caches. 10393 10394 3. buffer_inv sc0=1 sc1=1 10395 10396 - Must happen before 10397 any following 10398 global/generic 10399 load/load 10400 atomic/atomicrmw. 10401 - Ensures that 10402 following 10403 loads will not see 10404 stale MTYPE NC global data. 10405 MTYPE RW and CC memory will 10406 never be stale due to the 10407 memory probes. 10408 10409 atomicrmw acquire - agent - generic 1. flat_atomic 10410 2. s_waitcnt vmcnt(0) & 10411 lgkmcnt(0) 10412 10413 - If TgSplit execution mode, 10414 omit lgkmcnt(0). 10415 - If OpenCL, omit 10416 lgkmcnt(0). 
10417 - Must happen before 10418 following 10419 buffer_inv. 10420 - Ensures the 10421 atomicrmw has 10422 completed before 10423 invalidating the 10424 cache. 10425 10426 3. buffer_inv sc1=1 10427 10428 - Must happen before 10429 any following 10430 global/generic 10431 load/load 10432 atomic/atomicrmw. 10433 - Ensures that 10434 following loads 10435 will not see stale 10436 global data. 10437 10438 atomicrmw acquire - system - generic 1. flat_atomic sc1=1 10439 2. s_waitcnt vmcnt(0) & 10440 lgkmcnt(0) 10441 10442 - If TgSplit execution mode, 10443 omit lgkmcnt(0). 10444 - If OpenCL, omit 10445 lgkmcnt(0). 10446 - Must happen before 10447 following 10448 buffer_inv. 10449 - Ensures the 10450 atomicrmw has 10451 completed before 10452 invalidating the 10453 caches. 10454 10455 3. buffer_inv sc0=1 sc1=1 10456 10457 - Must happen before 10458 any following 10459 global/generic 10460 load/load 10461 atomic/atomicrmw. 10462 - Ensures that 10463 following 10464 loads will not see 10465 stale MTYPE NC global data. 10466 MTYPE RW and CC memory will 10467 never be stale due to the 10468 memory probes. 10469 10470 fence acquire - singlethread *none* *none* 10471 - wavefront 10472 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 10473 10474 - Use lgkmcnt(0) if not 10475 TgSplit execution mode 10476 and vmcnt(0) if TgSplit 10477 execution mode. 10478 - If OpenCL and 10479 address space is 10480 not generic, omit 10481 lgkmcnt(0). 10482 - If OpenCL and 10483 address space is 10484 local, omit 10485 vmcnt(0). 10486 - See :ref:`amdgpu-fence-as` for 10487 more details on fencing specific 10488 address spaces. 10489 - s_waitcnt vmcnt(0) 10490 must happen after 10491 any preceding 10492 global/generic load 10493 atomic/ 10494 atomicrmw 10495 with an equal or 10496 wider sync scope 10497 and memory ordering 10498 stronger than 10499 unordered (this is 10500 termed the 10501 fence-paired-atomic). 
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the
                                                             value read by the
                                                             fence-paired-atomic.

                                                         2. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - See :ref:`amdgpu-fence-as` for
                                                             more details on fencing specific
                                                             address spaces.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
10586 - Ensures that the 10587 fence-paired atomic 10588 has completed 10589 before invalidating 10590 the 10591 cache. Therefore 10592 any following 10593 locations read must 10594 be no older than 10595 the value read by 10596 the 10597 fence-paired-atomic. 10598 10599 2. buffer_inv sc1=1 10600 10601 - Must happen before any 10602 following global/generic 10603 load/load 10604 atomic/store/store 10605 atomic/atomicrmw. 10606 - Ensures that 10607 following loads 10608 will not see stale 10609 global data. 10610 10611 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 10612 vmcnt(0) 10613 10614 - If TgSplit execution mode, 10615 omit lgkmcnt(0). 10616 - If OpenCL and 10617 address space is 10618 not generic, omit 10619 lgkmcnt(0). 10620 - See :ref:`amdgpu-fence-as` for 10621 more details on fencing specific 10622 address spaces. 10623 - Could be split into 10624 separate s_waitcnt 10625 vmcnt(0) and 10626 s_waitcnt 10627 lgkmcnt(0) to allow 10628 them to be 10629 independently moved 10630 according to the 10631 following rules. 10632 - s_waitcnt vmcnt(0) 10633 must happen after 10634 any preceding 10635 global/generic load 10636 atomic/atomicrmw 10637 with an equal or 10638 wider sync scope 10639 and memory ordering 10640 stronger than 10641 unordered (this is 10642 termed the 10643 fence-paired-atomic). 10644 - s_waitcnt lgkmcnt(0) 10645 must happen after 10646 any preceding 10647 local/generic load 10648 atomic/atomicrmw 10649 with an equal or 10650 wider sync scope 10651 and memory ordering 10652 stronger than 10653 unordered (this is 10654 termed the 10655 fence-paired-atomic). 10656 - Must happen before 10657 the following 10658 buffer_inv. 10659 - Ensures that the 10660 fence-paired atomic 10661 has completed 10662 before invalidating 10663 the 10664 cache. Therefore 10665 any following 10666 locations read must 10667 be no older than 10668 the value read by 10669 the 10670 fence-paired-atomic. 10671 10672 2. 
buffer_inv sc0=1 sc1=1 10673 10674 - Must happen before any 10675 following global/generic 10676 load/load 10677 atomic/store/store 10678 atomic/atomicrmw. 10679 - Ensures that 10680 following loads 10681 will not see stale 10682 global data. 10683 10684 **Release Atomic** 10685 ------------------------------------------------------------------------------------ 10686 store atomic release - singlethread - global 1. GFX940, GFX941 10687 - wavefront - generic buffer/global/flat_store 10688 sc0=1 sc1=1 10689 GFX942 10690 buffer/global/flat_store 10691 10692 store atomic release - singlethread - local *If TgSplit execution mode, 10693 - wavefront local address space cannot 10694 be used.* 10695 10696 1. ds_store 10697 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 10698 - generic 10699 - Use lgkmcnt(0) if not 10700 TgSplit execution mode 10701 and vmcnt(0) if TgSplit 10702 execution mode. 10703 - If OpenCL, omit lgkmcnt(0). 10704 - s_waitcnt vmcnt(0) 10705 must happen after 10706 any preceding 10707 global/generic load/store/ 10708 load atomic/store atomic/ 10709 atomicrmw. 10710 - s_waitcnt lgkmcnt(0) 10711 must happen after 10712 any preceding 10713 local/generic 10714 load/store/load 10715 atomic/store 10716 atomic/atomicrmw. 10717 - Must happen before 10718 the following 10719 store. 10720 - Ensures that all 10721 memory operations 10722 have 10723 completed before 10724 performing the 10725 store that is being 10726 released. 10727 10728 2. GFX940, GFX941 10729 buffer/global/flat_store 10730 sc0=1 sc1=1 10731 GFX942 10732 buffer/global/flat_store 10733 sc0=1 10734 store atomic release - workgroup - local *If TgSplit execution mode, 10735 local address space cannot 10736 be used.* 10737 10738 1. ds_store 10739 store atomic release - agent - global 1. buffer_wbl2 sc1=1 10740 - generic 10741 - Must happen before 10742 following s_waitcnt. 
10743 - Performs L2 writeback to 10744 ensure previous 10745 global/generic 10746 store/atomicrmw are 10747 visible at agent scope. 10748 10749 2. s_waitcnt lgkmcnt(0) & 10750 vmcnt(0) 10751 10752 - If TgSplit execution mode, 10753 omit lgkmcnt(0). 10754 - If OpenCL and 10755 address space is 10756 not generic, omit 10757 lgkmcnt(0). 10758 - Could be split into 10759 separate s_waitcnt 10760 vmcnt(0) and 10761 s_waitcnt 10762 lgkmcnt(0) to allow 10763 them to be 10764 independently moved 10765 according to the 10766 following rules. 10767 - s_waitcnt vmcnt(0) 10768 must happen after 10769 any preceding 10770 global/generic 10771 load/store/load 10772 atomic/store 10773 atomic/atomicrmw. 10774 - s_waitcnt lgkmcnt(0) 10775 must happen after 10776 any preceding 10777 local/generic 10778 load/store/load 10779 atomic/store 10780 atomic/atomicrmw. 10781 - Must happen before 10782 the following 10783 store. 10784 - Ensures that all 10785 memory operations 10786 to memory have 10787 completed before 10788 performing the 10789 store that is being 10790 released. 10791 10792 3. GFX940, GFX941 10793 buffer/global/flat_store 10794 sc0=1 sc1=1 10795 GFX942 10796 buffer/global/flat_store 10797 sc1=1 10798 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1 10799 - generic 10800 - Must happen before 10801 following s_waitcnt. 10802 - Performs L2 writeback to 10803 ensure previous 10804 global/generic 10805 store/atomicrmw are 10806 visible at system scope. 10807 10808 2. s_waitcnt lgkmcnt(0) & 10809 vmcnt(0) 10810 10811 - If TgSplit execution mode, 10812 omit lgkmcnt(0). 10813 - If OpenCL and 10814 address space is 10815 not generic, omit 10816 lgkmcnt(0). 10817 - Could be split into 10818 separate s_waitcnt 10819 vmcnt(0) and 10820 s_waitcnt 10821 lgkmcnt(0) to allow 10822 them to be 10823 independently moved 10824 according to the 10825 following rules. 
10826 - s_waitcnt vmcnt(0) 10827 must happen after any 10828 preceding 10829 global/generic 10830 load/store/load 10831 atomic/store 10832 atomic/atomicrmw. 10833 - s_waitcnt lgkmcnt(0) 10834 must happen after any 10835 preceding 10836 local/generic 10837 load/store/load 10838 atomic/store 10839 atomic/atomicrmw. 10840 - Must happen before 10841 the following 10842 store. 10843 - Ensures that all 10844 memory operations 10845 to memory and the L2 10846 writeback have 10847 completed before 10848 performing the 10849 store that is being 10850 released. 10851 10852 3. buffer/global/flat_store 10853 sc0=1 sc1=1 10854 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 10855 - wavefront - generic 10856 atomicrmw release - singlethread - local *If TgSplit execution mode, 10857 - wavefront local address space cannot 10858 be used.* 10859 10860 1. ds_atomic 10861 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 10862 - generic 10863 - Use lgkmcnt(0) if not 10864 TgSplit execution mode 10865 and vmcnt(0) if TgSplit 10866 execution mode. 10867 - If OpenCL, omit 10868 lgkmcnt(0). 10869 - s_waitcnt vmcnt(0) 10870 must happen after 10871 any preceding 10872 global/generic load/store/ 10873 load atomic/store atomic/ 10874 atomicrmw. 10875 - s_waitcnt lgkmcnt(0) 10876 must happen after 10877 any preceding 10878 local/generic 10879 load/store/load 10880 atomic/store 10881 atomic/atomicrmw. 10882 - Must happen before 10883 the following 10884 atomicrmw. 10885 - Ensures that all 10886 memory operations 10887 have 10888 completed before 10889 performing the 10890 atomicrmw that is 10891 being released. 10892 10893 2. buffer/global/flat_atomic sc0=1 10894 atomicrmw release - workgroup - local *If TgSplit execution mode, 10895 local address space cannot 10896 be used.* 10897 10898 1. ds_atomic 10899 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1 10900 - generic 10901 - Must happen before 10902 following s_waitcnt. 
10903 - Performs L2 writeback to 10904 ensure previous 10905 global/generic 10906 store/atomicrmw are 10907 visible at agent scope. 10908 10909 2. s_waitcnt lgkmcnt(0) & 10910 vmcnt(0) 10911 10912 - If TgSplit execution mode, 10913 omit lgkmcnt(0). 10914 - If OpenCL, omit 10915 lgkmcnt(0). 10916 - Could be split into 10917 separate s_waitcnt 10918 vmcnt(0) and 10919 s_waitcnt 10920 lgkmcnt(0) to allow 10921 them to be 10922 independently moved 10923 according to the 10924 following rules. 10925 - s_waitcnt vmcnt(0) 10926 must happen after 10927 any preceding 10928 global/generic 10929 load/store/load 10930 atomic/store 10931 atomic/atomicrmw. 10932 - s_waitcnt lgkmcnt(0) 10933 must happen after 10934 any preceding 10935 local/generic 10936 load/store/load 10937 atomic/store 10938 atomic/atomicrmw. 10939 - Must happen before 10940 the following 10941 atomicrmw. 10942 - Ensures that all 10943 memory operations 10944 to global and local 10945 have completed 10946 before performing 10947 the atomicrmw that 10948 is being released. 10949 10950 3. buffer/global/flat_atomic sc1=1 10951 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1 10952 - generic 10953 - Must happen before 10954 following s_waitcnt. 10955 - Performs L2 writeback to 10956 ensure previous 10957 global/generic 10958 store/atomicrmw are 10959 visible at system scope. 10960 10961 2. s_waitcnt lgkmcnt(0) & 10962 vmcnt(0) 10963 10964 - If TgSplit execution mode, 10965 omit lgkmcnt(0). 10966 - If OpenCL, omit 10967 lgkmcnt(0). 10968 - Could be split into 10969 separate s_waitcnt 10970 vmcnt(0) and 10971 s_waitcnt 10972 lgkmcnt(0) to allow 10973 them to be 10974 independently moved 10975 according to the 10976 following rules. 10977 - s_waitcnt vmcnt(0) 10978 must happen after 10979 any preceding 10980 global/generic 10981 load/store/load 10982 atomic/store 10983 atomic/atomicrmw. 
10984 - s_waitcnt lgkmcnt(0) 10985 must happen after 10986 any preceding 10987 local/generic 10988 load/store/load 10989 atomic/store 10990 atomic/atomicrmw. 10991 - Must happen before 10992 the following 10993 atomicrmw. 10994 - Ensures that all 10995 memory operations 10996 to memory and the L2 10997 writeback have 10998 completed before 10999 performing the 11000 store that is being 11001 released. 11002 11003 3. buffer/global/flat_atomic 11004 sc0=1 sc1=1 11005 fence release - singlethread *none* *none* 11006 - wavefront 11007 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 11008 11009 - Use lgkmcnt(0) if not 11010 TgSplit execution mode 11011 and vmcnt(0) if TgSplit 11012 execution mode. 11013 - If OpenCL and 11014 address space is 11015 not generic, omit 11016 lgkmcnt(0). 11017 - If OpenCL and 11018 address space is 11019 local, omit 11020 vmcnt(0). 11021 - See :ref:`amdgpu-fence-as` for 11022 more details on fencing specific 11023 address spaces. 11024 - s_waitcnt vmcnt(0) 11025 must happen after 11026 any preceding 11027 global/generic 11028 load/store/ 11029 load atomic/store atomic/ 11030 atomicrmw. 11031 - s_waitcnt lgkmcnt(0) 11032 must happen after 11033 any preceding 11034 local/generic 11035 load/load 11036 atomic/store/store 11037 atomic/atomicrmw. 11038 - Must happen before 11039 any following store 11040 atomic/atomicrmw 11041 with an equal or 11042 wider sync scope 11043 and memory ordering 11044 stronger than 11045 unordered (this is 11046 termed the 11047 fence-paired-atomic). 11048 - Ensures that all 11049 memory operations 11050 have 11051 completed before 11052 performing the 11053 following 11054 fence-paired-atomic. 11055 11056 fence release - agent *none* 1. buffer_wbl2 sc1=1 11057 11058 - If OpenCL and 11059 address space is 11060 local, omit. 11061 - Must happen before 11062 following s_waitcnt. 11063 - Performs L2 writeback to 11064 ensure previous 11065 global/generic 11066 store/atomicrmw are 11067 visible at agent scope. 
11068 11069 2. s_waitcnt lgkmcnt(0) & 11070 vmcnt(0) 11071 11072 - If TgSplit execution mode, 11073 omit lgkmcnt(0). 11074 - If OpenCL and 11075 address space is 11076 not generic, omit 11077 lgkmcnt(0). 11078 - If OpenCL and 11079 address space is 11080 local, omit 11081 vmcnt(0). 11082 - See :ref:`amdgpu-fence-as` for 11083 more details on fencing specific 11084 address spaces. 11085 - Could be split into 11086 separate s_waitcnt 11087 vmcnt(0) and 11088 s_waitcnt 11089 lgkmcnt(0) to allow 11090 them to be 11091 independently moved 11092 according to the 11093 following rules. 11094 - s_waitcnt vmcnt(0) 11095 must happen after 11096 any preceding 11097 global/generic 11098 load/store/load 11099 atomic/store 11100 atomic/atomicrmw. 11101 - s_waitcnt lgkmcnt(0) 11102 must happen after 11103 any preceding 11104 local/generic 11105 load/store/load 11106 atomic/store 11107 atomic/atomicrmw. 11108 - Must happen before 11109 any following store 11110 atomic/atomicrmw 11111 with an equal or 11112 wider sync scope 11113 and memory ordering 11114 stronger than 11115 unordered (this is 11116 termed the 11117 fence-paired-atomic). 11118 - Ensures that all 11119 memory operations 11120 have 11121 completed before 11122 performing the 11123 following 11124 fence-paired-atomic. 11125 11126 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1 11127 11128 - Must happen before 11129 following s_waitcnt. 11130 - Performs L2 writeback to 11131 ensure previous 11132 global/generic 11133 store/atomicrmw are 11134 visible at system scope. 11135 11136 2. s_waitcnt lgkmcnt(0) & 11137 vmcnt(0) 11138 11139 - If TgSplit execution mode, 11140 omit lgkmcnt(0). 11141 - If OpenCL and 11142 address space is 11143 not generic, omit 11144 lgkmcnt(0). 11145 - If OpenCL and 11146 address space is 11147 local, omit 11148 vmcnt(0). 11149 - See :ref:`amdgpu-fence-as` for 11150 more details on fencing specific 11151 address spaces. 
11152 - Could be split into 11153 separate s_waitcnt 11154 vmcnt(0) and 11155 s_waitcnt 11156 lgkmcnt(0) to allow 11157 them to be 11158 independently moved 11159 according to the 11160 following rules. 11161 - s_waitcnt vmcnt(0) 11162 must happen after 11163 any preceding 11164 global/generic 11165 load/store/load 11166 atomic/store 11167 atomic/atomicrmw. 11168 - s_waitcnt lgkmcnt(0) 11169 must happen after 11170 any preceding 11171 local/generic 11172 load/store/load 11173 atomic/store 11174 atomic/atomicrmw. 11175 - Must happen before 11176 any following store 11177 atomic/atomicrmw 11178 with an equal or 11179 wider sync scope 11180 and memory ordering 11181 stronger than 11182 unordered (this is 11183 termed the 11184 fence-paired-atomic). 11185 - Ensures that all 11186 memory operations 11187 have 11188 completed before 11189 performing the 11190 following 11191 fence-paired-atomic. 11192 11193 **Acquire-Release Atomic** 11194 ------------------------------------------------------------------------------------ 11195 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic 11196 - wavefront - generic 11197 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode, 11198 - wavefront local address space cannot 11199 be used.* 11200 11201 1. ds_atomic 11202 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 11203 11204 - Use lgkmcnt(0) if not 11205 TgSplit execution mode 11206 and vmcnt(0) if TgSplit 11207 execution mode. 11208 - If OpenCL, omit 11209 lgkmcnt(0). 11210 - Must happen after 11211 any preceding 11212 local/generic 11213 load/store/load 11214 atomic/store 11215 atomic/atomicrmw. 11216 - s_waitcnt vmcnt(0) 11217 must happen after 11218 any preceding 11219 global/generic load/store/ 11220 load atomic/store atomic/ 11221 atomicrmw. 11222 - s_waitcnt lgkmcnt(0) 11223 must happen after 11224 any preceding 11225 local/generic 11226 load/store/load 11227 atomic/store 11228 atomic/atomicrmw. 
11229 - Must happen before 11230 the following 11231 atomicrmw. 11232 - Ensures that all 11233 memory operations 11234 have 11235 completed before 11236 performing the 11237 atomicrmw that is 11238 being released. 11239 11240 2. buffer/global_atomic 11241 3. s_waitcnt vmcnt(0) 11242 11243 - If not TgSplit execution 11244 mode, omit. 11245 - Must happen before 11246 the following 11247 buffer_inv. 11248 - Ensures any 11249 following global 11250 data read is no 11251 older than the 11252 atomicrmw value 11253 being acquired. 11254 11255 4. buffer_inv sc0=1 11256 11257 - If not TgSplit execution 11258 mode, omit. 11259 - Ensures that 11260 following 11261 loads will not see 11262 stale data. 11263 11264 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode, 11265 local address space cannot 11266 be used.* 11267 11268 1. ds_atomic 11269 2. s_waitcnt lgkmcnt(0) 11270 11271 - If OpenCL, omit. 11272 - Must happen before 11273 any following 11274 global/generic 11275 load/load 11276 atomic/store/store 11277 atomic/atomicrmw. 11278 - Ensures any 11279 following global 11280 data read is no 11281 older than the local load 11282 atomic value being 11283 acquired. 11284 11285 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0) 11286 11287 - Use lgkmcnt(0) if not 11288 TgSplit execution mode 11289 and vmcnt(0) if TgSplit 11290 execution mode. 11291 - If OpenCL, omit 11292 lgkmcnt(0). 11293 - s_waitcnt vmcnt(0) 11294 must happen after 11295 any preceding 11296 global/generic load/store/ 11297 load atomic/store atomic/ 11298 atomicrmw. 11299 - s_waitcnt lgkmcnt(0) 11300 must happen after 11301 any preceding 11302 local/generic 11303 load/store/load 11304 atomic/store 11305 atomic/atomicrmw. 11306 - Must happen before 11307 the following 11308 atomicrmw. 11309 - Ensures that all 11310 memory operations 11311 have 11312 completed before 11313 performing the 11314 atomicrmw that is 11315 being released. 11316 11317 2. flat_atomic 11318 3. 
s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. buffer/global_atomic
                                                         4. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
11406 - Ensures the 11407 atomicrmw has 11408 completed before 11409 invalidating the 11410 cache. 11411 11412 5. buffer_inv sc1=1 11413 11414 - Must happen before 11415 any following 11416 global/generic 11417 load/load 11418 atomic/atomicrmw. 11419 - Ensures that 11420 following loads 11421 will not see stale 11422 global data. 11423 11424 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1 11425 11426 - Must happen before 11427 following s_waitcnt. 11428 - Performs L2 writeback to 11429 ensure previous 11430 global/generic 11431 store/atomicrmw are 11432 visible at system scope. 11433 11434 2. s_waitcnt lgkmcnt(0) & 11435 vmcnt(0) 11436 11437 - If TgSplit execution mode, 11438 omit lgkmcnt(0). 11439 - If OpenCL, omit 11440 lgkmcnt(0). 11441 - Could be split into 11442 separate s_waitcnt 11443 vmcnt(0) and 11444 s_waitcnt 11445 lgkmcnt(0) to allow 11446 them to be 11447 independently moved 11448 according to the 11449 following rules. 11450 - s_waitcnt vmcnt(0) 11451 must happen after 11452 any preceding 11453 global/generic 11454 load/store/load 11455 atomic/store 11456 atomic/atomicrmw. 11457 - s_waitcnt lgkmcnt(0) 11458 must happen after 11459 any preceding 11460 local/generic 11461 load/store/load 11462 atomic/store 11463 atomic/atomicrmw. 11464 - Must happen before 11465 the following 11466 atomicrmw. 11467 - Ensures that all 11468 memory operations 11469 to global and L2 writeback 11470 have completed before 11471 performing the 11472 atomicrmw that is 11473 being released. 11474 11475 3. buffer/global_atomic 11476 sc1=1 11477 4. s_waitcnt vmcnt(0) 11478 11479 - Must happen before 11480 following 11481 buffer_inv. 11482 - Ensures the 11483 atomicrmw has 11484 completed before 11485 invalidating the 11486 caches. 11487 11488 5. buffer_inv sc0=1 sc1=1 11489 11490 - Must happen before 11491 any following 11492 global/generic 11493 load/load 11494 atomic/atomicrmw. 
11495 - Ensures that 11496 following loads 11497 will not see stale 11498 MTYPE NC global data. 11499 MTYPE RW and CC memory will 11500 never be stale due to the 11501 memory probes. 11502 11503 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1 11504 11505 - Must happen before 11506 following s_waitcnt. 11507 - Performs L2 writeback to 11508 ensure previous 11509 global/generic 11510 store/atomicrmw are 11511 visible at agent scope. 11512 11513 2. s_waitcnt lgkmcnt(0) & 11514 vmcnt(0) 11515 11516 - If TgSplit execution mode, 11517 omit lgkmcnt(0). 11518 - If OpenCL, omit 11519 lgkmcnt(0). 11520 - Could be split into 11521 separate s_waitcnt 11522 vmcnt(0) and 11523 s_waitcnt 11524 lgkmcnt(0) to allow 11525 them to be 11526 independently moved 11527 according to the 11528 following rules. 11529 - s_waitcnt vmcnt(0) 11530 must happen after 11531 any preceding 11532 global/generic 11533 load/store/load 11534 atomic/store 11535 atomic/atomicrmw. 11536 - s_waitcnt lgkmcnt(0) 11537 must happen after 11538 any preceding 11539 local/generic 11540 load/store/load 11541 atomic/store 11542 atomic/atomicrmw. 11543 - Must happen before 11544 the following 11545 atomicrmw. 11546 - Ensures that all 11547 memory operations 11548 to global have 11549 completed before 11550 performing the 11551 atomicrmw that is 11552 being released. 11553 11554 3. flat_atomic 11555 4. s_waitcnt vmcnt(0) & 11556 lgkmcnt(0) 11557 11558 - If TgSplit execution mode, 11559 omit lgkmcnt(0). 11560 - If OpenCL, omit 11561 lgkmcnt(0). 11562 - Must happen before 11563 following 11564 buffer_inv. 11565 - Ensures the 11566 atomicrmw has 11567 completed before 11568 invalidating the 11569 cache. 11570 11571 5. buffer_inv sc1=1 11572 11573 - Must happen before 11574 any following 11575 global/generic 11576 load/load 11577 atomic/atomicrmw. 11578 - Ensures that 11579 following loads 11580 will not see stale 11581 global data. 11582 11583 atomicrmw acq_rel - system - generic 1. 
buffer_wbl2 sc0=1 sc1=1 11584 11585 - Must happen before 11586 following s_waitcnt. 11587 - Performs L2 writeback to 11588 ensure previous 11589 global/generic 11590 store/atomicrmw are 11591 visible at system scope. 11592 11593 2. s_waitcnt lgkmcnt(0) & 11594 vmcnt(0) 11595 11596 - If TgSplit execution mode, 11597 omit lgkmcnt(0). 11598 - If OpenCL, omit 11599 lgkmcnt(0). 11600 - Could be split into 11601 separate s_waitcnt 11602 vmcnt(0) and 11603 s_waitcnt 11604 lgkmcnt(0) to allow 11605 them to be 11606 independently moved 11607 according to the 11608 following rules. 11609 - s_waitcnt vmcnt(0) 11610 must happen after 11611 any preceding 11612 global/generic 11613 load/store/load 11614 atomic/store 11615 atomic/atomicrmw. 11616 - s_waitcnt lgkmcnt(0) 11617 must happen after 11618 any preceding 11619 local/generic 11620 load/store/load 11621 atomic/store 11622 atomic/atomicrmw. 11623 - Must happen before 11624 the following 11625 atomicrmw. 11626 - Ensures that all 11627 memory operations 11628 to global and L2 writeback 11629 have completed before 11630 performing the 11631 atomicrmw that is 11632 being released. 11633 11634 3. flat_atomic sc1=1 11635 4. s_waitcnt vmcnt(0) & 11636 lgkmcnt(0) 11637 11638 - If TgSplit execution mode, 11639 omit lgkmcnt(0). 11640 - If OpenCL, omit 11641 lgkmcnt(0). 11642 - Must happen before 11643 following 11644 buffer_inv. 11645 - Ensures the 11646 atomicrmw has 11647 completed before 11648 invalidating the 11649 caches. 11650 11651 5. buffer_inv sc0=1 sc1=1 11652 11653 - Must happen before 11654 any following 11655 global/generic 11656 load/load 11657 atomic/atomicrmw. 11658 - Ensures that 11659 following loads 11660 will not see stale 11661 MTYPE NC global data. 11662 MTYPE RW and CC memory will 11663 never be stale due to the 11664 memory probes. 11665 11666 fence acq_rel - singlethread *none* *none* 11667 - wavefront 11668 fence acq_rel - workgroup *none* 1. 
                                                            s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - However,
                                                              since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence, need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing any
                                                              following global
                                                              memory operations.
                                                            - Ensures that the
                                                              preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before following
                                                              global memory
                                                              operations. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              local/generic store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.
                                                            - Must happen before
                                                              the following
                                                              buffer_inv.
                                                            - Ensures that the
                                                              acquire-fence-paired
                                                              atomic has completed
                                                              before invalidating
                                                              the
                                                              cache. Therefore
                                                              any following
                                                              locations read must
                                                              be no older than
                                                              the value read by
                                                              the
                                                              acquire-fence-paired-atomic.

                                                         2. buffer_inv sc0=1

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1

                                                            - If OpenCL and
                                                              address space is
                                                              local, omit.
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              buffer_inv.
                                                            - Ensures that the
                                                              preceding
                                                              global/local/generic
                                                              load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before invalidating
                                                              the cache. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              global/local/generic
                                                              store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.

                                                         3. buffer_inv sc1=1

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                            - If OpenCL and
                                                              address space is
                                                              local, omit.
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              buffer_inv.
                                                            - Ensures that the
                                                              preceding
                                                              global/local/generic
                                                              load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before invalidating
                                                              the cache. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              global/local/generic
                                                              store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.

                                                         3. buffer_inv sc0=1 sc1=1

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              MTYPE NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale due to the
                                                              memory probes.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after
                                                              preceding
                                                              local/generic load
                                                              atomic/store
                                                              atomic/atomicrmw
                                                              with memory
                                                              ordering of seq_cst
                                                              and with equal or
                                                              wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their
                                                              own s_waitcnt
                                                              lgkmcnt(0) and so do
                                                              not need to be
                                                              considered.)
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw
                                                              with memory
                                                              ordering of seq_cst
                                                              and with equal or
                                                              wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their
                                                              own s_waitcnt
                                                              vmcnt(0) and so do
                                                              not need to be
                                                              considered.)
                                                            - Ensures any
                                                              preceding
                                                              sequential
                                                              consistent global/local
                                                              memory instructions
                                                              have completed
                                                              before executing
                                                              this sequentially
                                                              consistent
                                                              instruction. This
                                                              prevents reordering
                                                              a seq_cst store
                                                              followed by a
                                                              seq_cst load. (Note
                                                              that seq_cst is
                                                              stronger than
                                                              acquire/release as
                                                              the reordering of
                                                              load acquire
                                                              followed by a store
                                                              release is
                                                              prevented by the
                                                              s_waitcnt of
                                                              the release, but
                                                              there is nothing
                                                              preventing a store
                                                              release followed by
                                                              load acquire from
                                                              completing out of
                                                              order. The s_waitcnt
                                                              could be placed after
                                                              seq_store or before
                                                              the seq_load. We
                                                              choose the load to
                                                              make the s_waitcnt be
                                                              as late as possible
                                                              so that the store
                                                              may have already
                                                              completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0)
                                                              and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw
                                                              with memory
                                                              ordering of seq_cst
                                                              and with equal or
                                                              wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their
                                                              own s_waitcnt
                                                              lgkmcnt(0) and so do
                                                              not need to be
                                                              considered.)
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw
                                                              with memory
                                                              ordering of seq_cst
                                                              and with equal or
                                                              wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their
                                                              own s_waitcnt
                                                              vmcnt(0) and so do
                                                              not need to be
                                                              considered.)
                                                            - Ensures any
                                                              preceding
                                                              sequential
                                                              consistent global
                                                              memory instructions
                                                              have completed
                                                              before executing
                                                              this sequentially
                                                              consistent
                                                              instruction. This
                                                              prevents reordering
                                                              a seq_cst store
                                                              followed by a
                                                              seq_cst load. (Note
                                                              that seq_cst is
                                                              stronger than
                                                              acquire/release as
                                                              the reordering of
                                                              load acquire
                                                              followed by a store
                                                              release is
                                                              prevented by the
                                                              s_waitcnt of
                                                              the release, but
                                                              there is nothing
                                                              preventing a store
                                                              release followed by
                                                              load acquire from
                                                              completing out of
                                                              order. The s_waitcnt
                                                              could be placed after
                                                              seq_store or before
                                                              the seq_load. We
                                                              choose the load to
                                                              make the s_waitcnt be
                                                              as late as possible
                                                              so that the store
                                                              may have already
                                                              completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10-gfx11:

Memory Model GFX10-GFX11
++++++++++++++++++++++++

For GFX10-GFX11:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP.
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU
  memory. The MALL cache is fully coherent with GPU memory and has no impact on
  system coherence. All agents (GPU and CPU) access GPU memory through the MALL
  cache.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX10-GFX11 are
defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.

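As an illustration of how the code-sequence tables are read, the following LLVM
IR fragment (a hypothetical example, not taken from an existing source) performs
an agent-scope acquire load from the global address space. Per the ``load atomic
acquire`` entry for the ``- agent`` sync scope and ``- global`` address space,
on GFX10 it is expected to lower to a load with ``glc=1 dlc=1``, an ``s_waitcnt
vmcnt(0)``, and the ``buffer_gl1_inv``/``buffer_gl0_inv`` cache invalidations:

.. code-block:: llvm

  ; Hypothetical kernel fragment: agent-scope acquire load from global memory.
  ; Expected GFX10 lowering per the code sequence table:
  ;   global_load_dword v0, v[0:1], off glc dlc
  ;   s_waitcnt vmcnt(0)
  ;   buffer_gl1_inv
  ;   buffer_gl0_inv
  define i32 @acquire_load(ptr addrspace(1) %p) {
  entry:
    %v = load atomic i32, ptr addrspace(1) %p syncscope("agent") acquire, align 4
    ret i32 %v
  }
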
  .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10-GFX11
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private     1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                            1. buffer/global/flat_load
                                                               slc=1 dlc=1

                                                               - If GFX10, omit dlc=1.

                                                         - volatile

                                                            1. buffer/global/flat_load
                                                               glc=1 dlc=1

                                                            2. s_waitcnt vmcnt(0)

                                                               - Must happen before
                                                                 any following volatile
                                                                 global/generic
                                                                 load/store.
                                                               - Ensures that
                                                                 volatile
                                                                 operations to
                                                                 different
                                                                 addresses will not
                                                                 be reordered by
                                                                 hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private     1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                            1. buffer/global/flat_store
                                                               glc=1 slc=1 dlc=1

                                                               - If GFX10, omit dlc=1.

                                                         - volatile

                                                            1. buffer/global/flat_store
                                                               dlc=1

                                                               - If GFX10, omit dlc=1.

                                                            2. s_waitcnt vscnt(0)

                                                               - Must happen before
                                                                 any following volatile
                                                                 global/generic
                                                                 load/store.
                                                               - Ensures that
                                                                 volatile
                                                                 operations to
                                                                 different
                                                                 addresses will not
                                                                 be reordered by
                                                                 hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                            - If CU wavefront execution
                                                              mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1 dlc=1

                                                            - If GFX11, omit dlc=1.

     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                            - If CU wavefront execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Must happen before
                                                              the following buffer_gl0_inv
                                                              and before any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              the following buffer_gl0_inv
                                                              and before any following
                                                              global/generic load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local load
                                                              atomic value being
                                                              acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - If OpenCL, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                            - If CU wavefront execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv and any
                                                              following global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local load
                                                              atomic value being
                                                              acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1 dlc=1

                                                            - If GFX11, omit dlc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_gl*_inv.
                                                            - Ensures the load
                                                              has completed
                                                              before invalidating
                                                              the caches.

                                                         3. buffer_gl1_inv;
                                                            buffer_gl0_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system
                                                            - If GFX11, omit dlc=1.

                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_gl*_inv.
                                                            - Ensures the flat_load
                                                              has completed
                                                              before invalidating
                                                              the caches.

                                                         3. buffer_gl1_inv;
                                                            buffer_gl0_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vm/vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              the following buffer_gl0_inv
                                                              and before any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local
                                                              atomicrmw value
                                                              being acquired.

                                                         3. buffer_gl0_inv

                                                            - If OpenCL omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vm/vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vm/vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local
                                                              atomicrmw value
                                                              being acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vm/vscnt(0)

                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              following
                                                              buffer_gl*_inv.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_gl1_inv;
                                                            buffer_gl0_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              following
                                                              buffer_gl*_inv.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_gl1_inv;
                                                            buffer_gl0_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - See :ref:`amdgpu-fence-as` for
                                                              more details on fencing specific
                                                              address spaces.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load
                                                              atomic/
                                                              atomicrmw-with-return-value
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              atomicrmw-no-return-value
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
12801 - Must happen before 12802 the following 12803 buffer_gl0_inv. 12804 - Ensures that the 12805 fence-paired atomic 12806 has completed 12807 before invalidating 12808 the 12809 cache. Therefore 12810 any following 12811 locations read must 12812 be no older than 12813 the value read by 12814 the 12815 fence-paired-atomic. 12816 12817 3. buffer_gl0_inv 12818 12819 - If CU wavefront execution 12820 mode, omit. 12821 - Ensures that 12822 following 12823 loads will not see 12824 stale data. 12825 12826 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 12827 - system vmcnt(0) & vscnt(0) 12828 12829 - If OpenCL and 12830 address space is 12831 not generic, omit 12832 lgkmcnt(0). 12833 - If OpenCL and 12834 address space is 12835 local, omit 12836 vmcnt(0) and vscnt(0). 12837 - See :ref:`amdgpu-fence-as` for 12838 more details on fencing specific 12839 address spaces. 12840 - Could be split into 12841 separate s_waitcnt 12842 vmcnt(0), s_waitcnt 12843 vscnt(0) and s_waitcnt 12844 lgkmcnt(0) to allow 12845 them to be 12846 independently moved 12847 according to the 12848 following rules. 12849 - s_waitcnt vmcnt(0) 12850 must happen after 12851 any preceding 12852 global/generic load 12853 atomic/ 12854 atomicrmw-with-return-value 12855 with an equal or 12856 wider sync scope 12857 and memory ordering 12858 stronger than 12859 unordered (this is 12860 termed the 12861 fence-paired-atomic). 12862 - s_waitcnt vscnt(0) 12863 must happen after 12864 any preceding 12865 global/generic 12866 atomicrmw-no-return-value 12867 with an equal or 12868 wider sync scope 12869 and memory ordering 12870 stronger than 12871 unordered (this is 12872 termed the 12873 fence-paired-atomic). 12874 - s_waitcnt lgkmcnt(0) 12875 must happen after 12876 any preceding 12877 local/generic load 12878 atomic/atomicrmw 12879 with an equal or 12880 wider sync scope 12881 and memory ordering 12882 stronger than 12883 unordered (this is 12884 termed the 12885 fence-paired-atomic). 
12886 - Must happen before 12887 the following 12888 buffer_gl*_inv. 12889 - Ensures that the 12890 fence-paired atomic 12891 has completed 12892 before invalidating 12893 the 12894 caches. Therefore 12895 any following 12896 locations read must 12897 be no older than 12898 the value read by 12899 the 12900 fence-paired-atomic. 12901 12902 2. buffer_gl1_inv; 12903 buffer_gl0_inv 12904 12905 - Must happen before any 12906 following global/generic 12907 load/load 12908 atomic/store/store 12909 atomic/atomicrmw. 12910 - Ensures that 12911 following loads 12912 will not see stale 12913 global data. 12914 12915 **Release Atomic** 12916 ------------------------------------------------------------------------------------ 12917 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 12918 - wavefront - local 12919 - generic 12920 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12921 - generic vmcnt(0) & vscnt(0) 12922 12923 - If CU wavefront execution 12924 mode, omit vmcnt(0) and 12925 vscnt(0). 12926 - If OpenCL, omit 12927 lgkmcnt(0). 12928 - Could be split into 12929 separate s_waitcnt 12930 vmcnt(0), s_waitcnt 12931 vscnt(0) and s_waitcnt 12932 lgkmcnt(0) to allow 12933 them to be 12934 independently moved 12935 according to the 12936 following rules. 12937 - s_waitcnt vmcnt(0) 12938 must happen after 12939 any preceding 12940 global/generic load/load 12941 atomic/ 12942 atomicrmw-with-return-value. 12943 - s_waitcnt vscnt(0) 12944 must happen after 12945 any preceding 12946 global/generic 12947 store/store 12948 atomic/ 12949 atomicrmw-no-return-value. 12950 - s_waitcnt lgkmcnt(0) 12951 must happen after 12952 any preceding 12953 local/generic 12954 load/store/load 12955 atomic/store 12956 atomic/atomicrmw. 12957 - Must happen before 12958 the following 12959 store. 12960 - Ensures that all 12961 memory operations 12962 have 12963 completed before 12964 performing the 12965 store that is being 12966 released. 12967 12968 2. 
buffer/global/flat_store 12969 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 12970 12971 - If CU wavefront execution 12972 mode, omit. 12973 - If OpenCL, omit. 12974 - Could be split into 12975 separate s_waitcnt 12976 vmcnt(0) and s_waitcnt 12977 vscnt(0) to allow 12978 them to be 12979 independently moved 12980 according to the 12981 following rules. 12982 - s_waitcnt vmcnt(0) 12983 must happen after 12984 any preceding 12985 global/generic load/load 12986 atomic/ 12987 atomicrmw-with-return-value. 12988 - s_waitcnt vscnt(0) 12989 must happen after 12990 any preceding 12991 global/generic 12992 store/store atomic/ 12993 atomicrmw-no-return-value. 12994 - Must happen before 12995 the following 12996 store. 12997 - Ensures that all 12998 global memory 12999 operations have 13000 completed before 13001 performing the 13002 store that is being 13003 released. 13004 13005 2. ds_store 13006 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 13007 - system - generic vmcnt(0) & vscnt(0) 13008 13009 - If OpenCL and 13010 address space is 13011 not generic, omit 13012 lgkmcnt(0). 13013 - Could be split into 13014 separate s_waitcnt 13015 vmcnt(0), s_waitcnt vscnt(0) 13016 and s_waitcnt 13017 lgkmcnt(0) to allow 13018 them to be 13019 independently moved 13020 according to the 13021 following rules. 13022 - s_waitcnt vmcnt(0) 13023 must happen after 13024 any preceding 13025 global/generic 13026 load/load 13027 atomic/ 13028 atomicrmw-with-return-value. 13029 - s_waitcnt vscnt(0) 13030 must happen after 13031 any preceding 13032 global/generic 13033 store/store atomic/ 13034 atomicrmw-no-return-value. 13035 - s_waitcnt lgkmcnt(0) 13036 must happen after 13037 any preceding 13038 local/generic 13039 load/store/load 13040 atomic/store 13041 atomic/atomicrmw. 13042 - Must happen before 13043 the following 13044 store. 
13045 - Ensures that all 13046 memory operations 13047 have 13048 completed before 13049 performing the 13050 store that is being 13051 released. 13052 13053 2. buffer/global/flat_store 13054 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 13055 - wavefront - local 13056 - generic 13057 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 13058 - generic vmcnt(0) & vscnt(0) 13059 13060 - If CU wavefront execution 13061 mode, omit vmcnt(0) and 13062 vscnt(0). 13063 - If OpenCL, omit lgkmcnt(0). 13064 - Could be split into 13065 separate s_waitcnt 13066 vmcnt(0), s_waitcnt 13067 vscnt(0) and s_waitcnt 13068 lgkmcnt(0) to allow 13069 them to be 13070 independently moved 13071 according to the 13072 following rules. 13073 - s_waitcnt vmcnt(0) 13074 must happen after 13075 any preceding 13076 global/generic load/load 13077 atomic/ 13078 atomicrmw-with-return-value. 13079 - s_waitcnt vscnt(0) 13080 must happen after 13081 any preceding 13082 global/generic 13083 store/store 13084 atomic/ 13085 atomicrmw-no-return-value. 13086 - s_waitcnt lgkmcnt(0) 13087 must happen after 13088 any preceding 13089 local/generic 13090 load/store/load 13091 atomic/store 13092 atomic/atomicrmw. 13093 - Must happen before 13094 the following 13095 atomicrmw. 13096 - Ensures that all 13097 memory operations 13098 have 13099 completed before 13100 performing the 13101 atomicrmw that is 13102 being released. 13103 13104 2. buffer/global/flat_atomic 13105 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 13106 13107 - If CU wavefront execution 13108 mode, omit. 13109 - If OpenCL, omit. 13110 - Could be split into 13111 separate s_waitcnt 13112 vmcnt(0) and s_waitcnt 13113 vscnt(0) to allow 13114 them to be 13115 independently moved 13116 according to the 13117 following rules. 13118 - s_waitcnt vmcnt(0) 13119 must happen after 13120 any preceding 13121 global/generic load/load 13122 atomic/ 13123 atomicrmw-with-return-value. 
13124 - s_waitcnt vscnt(0) 13125 must happen after 13126 any preceding 13127 global/generic 13128 store/store atomic/ 13129 atomicrmw-no-return-value. 13130 - Must happen before 13131 the following 13132 store. 13133 - Ensures that all 13134 global memory 13135 operations have 13136 completed before 13137 performing the 13138 store that is being 13139 released. 13140 13141 2. ds_atomic 13142 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 13143 - system - generic vmcnt(0) & vscnt(0) 13144 13145 - If OpenCL, omit 13146 lgkmcnt(0). 13147 - Could be split into 13148 separate s_waitcnt 13149 vmcnt(0), s_waitcnt 13150 vscnt(0) and s_waitcnt 13151 lgkmcnt(0) to allow 13152 them to be 13153 independently moved 13154 according to the 13155 following rules. 13156 - s_waitcnt vmcnt(0) 13157 must happen after 13158 any preceding 13159 global/generic 13160 load/load atomic/ 13161 atomicrmw-with-return-value. 13162 - s_waitcnt vscnt(0) 13163 must happen after 13164 any preceding 13165 global/generic 13166 store/store atomic/ 13167 atomicrmw-no-return-value. 13168 - s_waitcnt lgkmcnt(0) 13169 must happen after 13170 any preceding 13171 local/generic 13172 load/store/load 13173 atomic/store 13174 atomic/atomicrmw. 13175 - Must happen before 13176 the following 13177 atomicrmw. 13178 - Ensures that all 13179 memory operations 13180 to global and local 13181 have completed 13182 before performing 13183 the atomicrmw that 13184 is being released. 13185 13186 2. buffer/global/flat_atomic 13187 fence release - singlethread *none* *none* 13188 - wavefront 13189 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 13190 vmcnt(0) & vscnt(0) 13191 13192 - If CU wavefront execution 13193 mode, omit vmcnt(0) and 13194 vscnt(0). 13195 - If OpenCL and 13196 address space is 13197 not generic, omit 13198 lgkmcnt(0). 13199 - If OpenCL and 13200 address space is 13201 local, omit 13202 vmcnt(0) and vscnt(0). 
13203 - See :ref:`amdgpu-fence-as` for 13204 more details on fencing specific 13205 address spaces. 13206 - Could be split into 13207 separate s_waitcnt 13208 vmcnt(0), s_waitcnt 13209 vscnt(0) and s_waitcnt 13210 lgkmcnt(0) to allow 13211 them to be 13212 independently moved 13213 according to the 13214 following rules. 13215 - s_waitcnt vmcnt(0) 13216 must happen after 13217 any preceding 13218 global/generic 13219 load/load 13220 atomic/ 13221 atomicrmw-with-return-value. 13222 - s_waitcnt vscnt(0) 13223 must happen after 13224 any preceding 13225 global/generic 13226 store/store atomic/ 13227 atomicrmw-no-return-value. 13228 - s_waitcnt lgkmcnt(0) 13229 must happen after 13230 any preceding 13231 local/generic 13232 load/store/load 13233 atomic/store atomic/ 13234 atomicrmw. 13235 - Must happen before 13236 any following store 13237 atomic/atomicrmw 13238 with an equal or 13239 wider sync scope 13240 and memory ordering 13241 stronger than 13242 unordered (this is 13243 termed the 13244 fence-paired-atomic). 13245 - Ensures that all 13246 memory operations 13247 have 13248 completed before 13249 performing the 13250 following 13251 fence-paired-atomic. 13252 13253 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 13254 - system vmcnt(0) & vscnt(0) 13255 13256 - If OpenCL and 13257 address space is 13258 not generic, omit 13259 lgkmcnt(0). 13260 - If OpenCL and 13261 address space is 13262 local, omit 13263 vmcnt(0) and vscnt(0). 13264 - See :ref:`amdgpu-fence-as` for 13265 more details on fencing specific 13266 address spaces. 13267 - Could be split into 13268 separate s_waitcnt 13269 vmcnt(0), s_waitcnt 13270 vscnt(0) and s_waitcnt 13271 lgkmcnt(0) to allow 13272 them to be 13273 independently moved 13274 according to the 13275 following rules. 13276 - s_waitcnt vmcnt(0) 13277 must happen after 13278 any preceding 13279 global/generic 13280 load/load atomic/ 13281 atomicrmw-with-return-value. 
13282 - s_waitcnt vscnt(0) 13283 must happen after 13284 any preceding 13285 global/generic 13286 store/store atomic/ 13287 atomicrmw-no-return-value. 13288 - s_waitcnt lgkmcnt(0) 13289 must happen after 13290 any preceding 13291 local/generic 13292 load/store/load 13293 atomic/store 13294 atomic/atomicrmw. 13295 - Must happen before 13296 any following store 13297 atomic/atomicrmw 13298 with an equal or 13299 wider sync scope 13300 and memory ordering 13301 stronger than 13302 unordered (this is 13303 termed the 13304 fence-paired-atomic). 13305 - Ensures that all 13306 memory operations 13307 have 13308 completed before 13309 performing the 13310 following 13311 fence-paired-atomic. 13312 13313 **Acquire-Release Atomic** 13314 ------------------------------------------------------------------------------------ 13315 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 13316 - wavefront - local 13317 - generic 13318 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) & 13319 vmcnt(0) & vscnt(0) 13320 13321 - If CU wavefront execution 13322 mode, omit vmcnt(0) and 13323 vscnt(0). 13324 - If OpenCL, omit 13325 lgkmcnt(0). 13326 - Must happen after 13327 any preceding 13328 local/generic 13329 load/store/load 13330 atomic/store 13331 atomic/atomicrmw. 13332 - Could be split into 13333 separate s_waitcnt 13334 vmcnt(0), s_waitcnt 13335 vscnt(0), and s_waitcnt 13336 lgkmcnt(0) to allow 13337 them to be 13338 independently moved 13339 according to the 13340 following rules. 13341 - s_waitcnt vmcnt(0) 13342 must happen after 13343 any preceding 13344 global/generic load/load 13345 atomic/ 13346 atomicrmw-with-return-value. 13347 - s_waitcnt vscnt(0) 13348 must happen after 13349 any preceding 13350 global/generic 13351 store/store 13352 atomic/ 13353 atomicrmw-no-return-value. 13354 - s_waitcnt lgkmcnt(0) 13355 must happen after 13356 any preceding 13357 local/generic 13358 load/store/load 13359 atomic/store 13360 atomic/atomicrmw. 
13361 - Must happen before 13362 the following 13363 atomicrmw. 13364 - Ensures that all 13365 memory operations 13366 have 13367 completed before 13368 performing the 13369 atomicrmw that is 13370 being released. 13371 13372 2. buffer/global_atomic 13373 3. s_waitcnt vm/vscnt(0) 13374 13375 - If CU wavefront execution 13376 mode, omit. 13377 - Use vmcnt(0) if atomic with 13378 return and vscnt(0) if 13379 atomic with no-return. 13380 - Must happen before 13381 the following 13382 buffer_gl0_inv. 13383 - Ensures any 13384 following global 13385 data read is no 13386 older than the 13387 atomicrmw value 13388 being acquired. 13389 13390 4. buffer_gl0_inv 13391 13392 - If CU wavefront execution 13393 mode, omit. 13394 - Ensures that 13395 following 13396 loads will not see 13397 stale data. 13398 13399 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 13400 13401 - If CU wavefront execution 13402 mode, omit. 13403 - If OpenCL, omit. 13404 - Could be split into 13405 separate s_waitcnt 13406 vmcnt(0) and s_waitcnt 13407 vscnt(0) to allow 13408 them to be 13409 independently moved 13410 according to the 13411 following rules. 13412 - s_waitcnt vmcnt(0) 13413 must happen after 13414 any preceding 13415 global/generic load/load 13416 atomic/ 13417 atomicrmw-with-return-value. 13418 - s_waitcnt vscnt(0) 13419 must happen after 13420 any preceding 13421 global/generic 13422 store/store atomic/ 13423 atomicrmw-no-return-value. 13424 - Must happen before 13425 the following 13426 store. 13427 - Ensures that all 13428 global memory 13429 operations have 13430 completed before 13431 performing the 13432 store that is being 13433 released. 13434 13435 2. ds_atomic 13436 3. s_waitcnt lgkmcnt(0) 13437 13438 - If OpenCL, omit. 13439 - Must happen before 13440 the following 13441 buffer_gl0_inv. 13442 - Ensures any 13443 following global 13444 data read is no 13445 older than the local load 13446 atomic value being 13447 acquired. 13448 13449 4. 
buffer_gl0_inv 13450 13451 - If CU wavefront execution 13452 mode, omit. 13453 - If OpenCL omit. 13454 - Ensures that 13455 following 13456 loads will not see 13457 stale data. 13458 13459 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) & 13460 vmcnt(0) & vscnt(0) 13461 13462 - If CU wavefront execution 13463 mode, omit vmcnt(0) and 13464 vscnt(0). 13465 - If OpenCL, omit lgkmcnt(0). 13466 - Could be split into 13467 separate s_waitcnt 13468 vmcnt(0), s_waitcnt 13469 vscnt(0) and s_waitcnt 13470 lgkmcnt(0) to allow 13471 them to be 13472 independently moved 13473 according to the 13474 following rules. 13475 - s_waitcnt vmcnt(0) 13476 must happen after 13477 any preceding 13478 global/generic load/load 13479 atomic/ 13480 atomicrmw-with-return-value. 13481 - s_waitcnt vscnt(0) 13482 must happen after 13483 any preceding 13484 global/generic 13485 store/store 13486 atomic/ 13487 atomicrmw-no-return-value. 13488 - s_waitcnt lgkmcnt(0) 13489 must happen after 13490 any preceding 13491 local/generic 13492 load/store/load 13493 atomic/store 13494 atomic/atomicrmw. 13495 - Must happen before 13496 the following 13497 atomicrmw. 13498 - Ensures that all 13499 memory operations 13500 have 13501 completed before 13502 performing the 13503 atomicrmw that is 13504 being released. 13505 13506 2. flat_atomic 13507 3. s_waitcnt lgkmcnt(0) & 13508 vmcnt(0) & vscnt(0) 13509 13510 - If CU wavefront execution 13511 mode, omit vmcnt(0) and 13512 vscnt(0). 13513 - If OpenCL, omit lgkmcnt(0). 13514 - Must happen before 13515 the following 13516 buffer_gl0_inv. 13517 - Ensures any 13518 following global 13519 data read is no 13520 older than the load 13521 atomic value being 13522 acquired. 13523 13524 3. buffer_gl0_inv 13525 13526 - If CU wavefront execution 13527 mode, omit. 13528 - Ensures that 13529 following 13530 loads will not see 13531 stale data. 13532 13533 atomicrmw acq_rel - agent - global 1. 
s_waitcnt lgkmcnt(0) & 13534 - system vmcnt(0) & vscnt(0) 13535 13536 - If OpenCL, omit 13537 lgkmcnt(0). 13538 - Could be split into 13539 separate s_waitcnt 13540 vmcnt(0), s_waitcnt 13541 vscnt(0) and s_waitcnt 13542 lgkmcnt(0) to allow 13543 them to be 13544 independently moved 13545 according to the 13546 following rules. 13547 - s_waitcnt vmcnt(0) 13548 must happen after 13549 any preceding 13550 global/generic 13551 load/load atomic/ 13552 atomicrmw-with-return-value. 13553 - s_waitcnt vscnt(0) 13554 must happen after 13555 any preceding 13556 global/generic 13557 store/store atomic/ 13558 atomicrmw-no-return-value. 13559 - s_waitcnt lgkmcnt(0) 13560 must happen after 13561 any preceding 13562 local/generic 13563 load/store/load 13564 atomic/store 13565 atomic/atomicrmw. 13566 - Must happen before 13567 the following 13568 atomicrmw. 13569 - Ensures that all 13570 memory operations 13571 to global have 13572 completed before 13573 performing the 13574 atomicrmw that is 13575 being released. 13576 13577 2. buffer/global_atomic 13578 3. s_waitcnt vm/vscnt(0) 13579 13580 - Use vmcnt(0) if atomic with 13581 return and vscnt(0) if 13582 atomic with no-return. 13583 - Must happen before 13584 following 13585 buffer_gl*_inv. 13586 - Ensures the 13587 atomicrmw has 13588 completed before 13589 invalidating the 13590 caches. 13591 13592 4. buffer_gl1_inv; 13593 buffer_gl0_inv 13594 13595 - Must happen before 13596 any following 13597 global/generic 13598 load/load 13599 atomic/atomicrmw. 13600 - Ensures that 13601 following loads 13602 will not see stale 13603 global data. 13604 13605 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 13606 - system vmcnt(0) & vscnt(0) 13607 13608 - If OpenCL, omit 13609 lgkmcnt(0). 13610 - Could be split into 13611 separate s_waitcnt 13612 vmcnt(0), s_waitcnt 13613 vscnt(0), and s_waitcnt 13614 lgkmcnt(0) to allow 13615 them to be 13616 independently moved 13617 according to the 13618 following rules. 
13619 - s_waitcnt vmcnt(0) 13620 must happen after 13621 any preceding 13622 global/generic 13623 load/load atomic 13624 atomicrmw-with-return-value. 13625 - s_waitcnt vscnt(0) 13626 must happen after 13627 any preceding 13628 global/generic 13629 store/store atomic/ 13630 atomicrmw-no-return-value. 13631 - s_waitcnt lgkmcnt(0) 13632 must happen after 13633 any preceding 13634 local/generic 13635 load/store/load 13636 atomic/store 13637 atomic/atomicrmw. 13638 - Must happen before 13639 the following 13640 atomicrmw. 13641 - Ensures that all 13642 memory operations 13643 have 13644 completed before 13645 performing the 13646 atomicrmw that is 13647 being released. 13648 13649 2. flat_atomic 13650 3. s_waitcnt vm/vscnt(0) & 13651 lgkmcnt(0) 13652 13653 - If OpenCL, omit 13654 lgkmcnt(0). 13655 - Use vmcnt(0) if atomic with 13656 return and vscnt(0) if 13657 atomic with no-return. 13658 - Must happen before 13659 following 13660 buffer_gl*_inv. 13661 - Ensures the 13662 atomicrmw has 13663 completed before 13664 invalidating the 13665 caches. 13666 13667 4. buffer_gl1_inv; 13668 buffer_gl0_inv 13669 13670 - Must happen before 13671 any following 13672 global/generic 13673 load/load 13674 atomic/atomicrmw. 13675 - Ensures that 13676 following loads 13677 will not see stale 13678 global data. 13679 13680 fence acq_rel - singlethread *none* *none* 13681 - wavefront 13682 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 13683 vmcnt(0) & vscnt(0) 13684 13685 - If CU wavefront execution 13686 mode, omit vmcnt(0) and 13687 vscnt(0). 13688 - If OpenCL and 13689 address space is 13690 not generic, omit 13691 lgkmcnt(0). 13692 - If OpenCL and 13693 address space is 13694 local, omit 13695 vmcnt(0) and vscnt(0). 13696 - However, 13697 since LLVM 13698 currently has no 13699 address space on 13700 the fence need to 13701 conservatively 13702 always generate 13703 (see comment for 13704 previous fence). 
13705 - Could be split into 13706 separate s_waitcnt 13707 vmcnt(0), s_waitcnt 13708 vscnt(0) and s_waitcnt 13709 lgkmcnt(0) to allow 13710 them to be 13711 independently moved 13712 according to the 13713 following rules. 13714 - s_waitcnt vmcnt(0) 13715 must happen after 13716 any preceding 13717 global/generic 13718 load/load 13719 atomic/ 13720 atomicrmw-with-return-value. 13721 - s_waitcnt vscnt(0) 13722 must happen after 13723 any preceding 13724 global/generic 13725 store/store atomic/ 13726 atomicrmw-no-return-value. 13727 - s_waitcnt lgkmcnt(0) 13728 must happen after 13729 any preceding 13730 local/generic 13731 load/store/load 13732 atomic/store atomic/ 13733 atomicrmw. 13734 - Must happen before 13735 any following 13736 global/generic 13737 load/load 13738 atomic/store/store 13739 atomic/atomicrmw. 13740 - Ensures that all 13741 memory operations 13742 have 13743 completed before 13744 performing any 13745 following global 13746 memory operations. 13747 - Ensures that the 13748 preceding 13749 local/generic load 13750 atomic/atomicrmw 13751 with an equal or 13752 wider sync scope 13753 and memory ordering 13754 stronger than 13755 unordered (this is 13756 termed the 13757 acquire-fence-paired-atomic) 13758 has completed 13759 before following 13760 global memory 13761 operations. This 13762 satisfies the 13763 requirements of 13764 acquire. 13765 - Ensures that all 13766 previous memory 13767 operations have 13768 completed before a 13769 following 13770 local/generic store 13771 atomic/atomicrmw 13772 with an equal or 13773 wider sync scope 13774 and memory ordering 13775 stronger than 13776 unordered (this is 13777 termed the 13778 release-fence-paired-atomic). 13779 This satisfies the 13780 requirements of 13781 release. 13782 - Must happen before 13783 the following 13784 buffer_gl0_inv. 13785 - Ensures that the 13786 acquire-fence-paired 13787 atomic has completed 13788 before invalidating 13789 the 13790 cache. 
Therefore 13791 any following 13792 locations read must 13793 be no older than 13794 the value read by 13795 the 13796 acquire-fence-paired-atomic. 13797 13798 3. buffer_gl0_inv 13799 13800 - If CU wavefront execution 13801 mode, omit. 13802 - Ensures that 13803 following 13804 loads will not see 13805 stale data. 13806 13807 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 13808 - system vmcnt(0) & vscnt(0) 13809 13810 - If OpenCL and 13811 address space is 13812 not generic, omit 13813 lgkmcnt(0). 13814 - If OpenCL and 13815 address space is 13816 local, omit 13817 vmcnt(0) and vscnt(0). 13818 - See :ref:`amdgpu-fence-as` for 13819 more details on fencing specific 13820 address spaces. 13821 - Could be split into 13822 separate s_waitcnt 13823 vmcnt(0), s_waitcnt 13824 vscnt(0) and s_waitcnt 13825 lgkmcnt(0) to allow 13826 them to be 13827 independently moved 13828 according to the 13829 following rules. 13830 - s_waitcnt vmcnt(0) 13831 must happen after 13832 any preceding 13833 global/generic 13834 load/load 13835 atomic/ 13836 atomicrmw-with-return-value. 13837 - s_waitcnt vscnt(0) 13838 must happen after 13839 any preceding 13840 global/generic 13841 store/store atomic/ 13842 atomicrmw-no-return-value. 13843 - s_waitcnt lgkmcnt(0) 13844 must happen after 13845 any preceding 13846 local/generic 13847 load/store/load 13848 atomic/store 13849 atomic/atomicrmw. 13850 - Must happen before 13851 the following 13852 buffer_gl*_inv. 13853 - Ensures that the 13854 preceding 13855 global/local/generic 13856 load 13857 atomic/atomicrmw 13858 with an equal or 13859 wider sync scope 13860 and memory ordering 13861 stronger than 13862 unordered (this is 13863 termed the 13864 acquire-fence-paired-atomic) 13865 has completed 13866 before invalidating 13867 the caches. This 13868 satisfies the 13869 requirements of 13870 acquire. 
13871 - Ensures that all 13872 previous memory 13873 operations have 13874 completed before a 13875 following 13876 global/local/generic 13877 store 13878 atomic/atomicrmw 13879 with an equal or 13880 wider sync scope 13881 and memory ordering 13882 stronger than 13883 unordered (this is 13884 termed the 13885 release-fence-paired-atomic). 13886 This satisfies the 13887 requirements of 13888 release. 13889 13890 2. buffer_gl1_inv; 13891 buffer_gl0_inv 13892 13893 - Must happen before 13894 any following 13895 global/generic 13896 load/load 13897 atomic/store/store 13898 atomic/atomicrmw. 13899 - Ensures that 13900 following loads 13901 will not see stale 13902 global data. This 13903 satisfies the 13904 requirements of 13905 acquire. 13906 13907 **Sequential Consistent Atomic** 13908 ------------------------------------------------------------------------------------ 13909 load atomic seq_cst - singlethread - global *Same as corresponding 13910 - wavefront - local load atomic acquire, 13911 - generic except must generate 13912 all instructions even 13913 for OpenCL.* 13914 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & 13915 - generic vmcnt(0) & vscnt(0) 13916 13917 - If CU wavefront execution 13918 mode, omit vmcnt(0) and 13919 vscnt(0). 13920 - Could be split into 13921 separate s_waitcnt 13922 vmcnt(0), s_waitcnt 13923 vscnt(0), and s_waitcnt 13924 lgkmcnt(0) to allow 13925 them to be 13926 independently moved 13927 according to the 13928 following rules. 13929 - s_waitcnt lgkmcnt(0) must 13930 happen after 13931 preceding 13932 local/generic load 13933 atomic/store 13934 atomic/atomicrmw 13935 with memory 13936 ordering of seq_cst 13937 and with equal or 13938 wider sync scope. 13939 (Note that seq_cst 13940 fences have their 13941 own s_waitcnt 13942 lgkmcnt(0) and so do 13943 not need to be 13944 considered.) 
13945 - s_waitcnt vmcnt(0) 13946 must happen after 13947 preceding 13948 global/generic load 13949 atomic/ 13950 atomicrmw-with-return-value 13951 with memory 13952 ordering of seq_cst 13953 and with equal or 13954 wider sync scope. 13955 (Note that seq_cst 13956 fences have their 13957 own s_waitcnt 13958 vmcnt(0) and so do 13959 not need to be 13960 considered.) 13961 - s_waitcnt vscnt(0) 13962 Must happen after 13963 preceding 13964 global/generic store 13965 atomic/ 13966 atomicrmw-no-return-value 13967 with memory 13968 ordering of seq_cst 13969 and with equal or 13970 wider sync scope. 13971 (Note that seq_cst 13972 fences have their 13973 own s_waitcnt 13974 vscnt(0) and so do 13975 not need to be 13976 considered.) 13977 - Ensures any 13978 preceding 13979 sequential 13980 consistent global/local 13981 memory instructions 13982 have completed 13983 before executing 13984 this sequentially 13985 consistent 13986 instruction. This 13987 prevents reordering 13988 a seq_cst store 13989 followed by a 13990 seq_cst load. (Note 13991 that seq_cst is 13992 stronger than 13993 acquire/release as 13994 the reordering of 13995 load acquire 13996 followed by a store 13997 release is 13998 prevented by the 13999 s_waitcnt of 14000 the release, but 14001 there is nothing 14002 preventing a store 14003 release followed by 14004 load acquire from 14005 completing out of 14006 order. The s_waitcnt 14007 could be placed after 14008 seq_store or before 14009 the seq_load. We 14010 choose the load to 14011 make the s_waitcnt be 14012 as late as possible 14013 so that the store 14014 may have already 14015 completed.) 14016 14017 2. *Following 14018 instructions same as 14019 corresponding load 14020 atomic acquire, 14021 except must generate 14022 all instructions even 14023 for OpenCL.* 14024 load atomic seq_cst - workgroup - local 14025 14026 1. s_waitcnt vmcnt(0) & vscnt(0) 14027 14028 - If CU wavefront execution 14029 mode, omit. 
14030 - Could be split into 14031 separate s_waitcnt 14032 vmcnt(0) and s_waitcnt 14033 vscnt(0) to allow 14034 them to be 14035 independently moved 14036 according to the 14037 following rules. 14038 - s_waitcnt vmcnt(0) 14039 Must happen after 14040 preceding 14041 global/generic load 14042 atomic/ 14043 atomicrmw-with-return-value 14044 with memory 14045 ordering of seq_cst 14046 and with equal or 14047 wider sync scope. 14048 (Note that seq_cst 14049 fences have their 14050 own s_waitcnt 14051 vmcnt(0) and so do 14052 not need to be 14053 considered.) 14054 - s_waitcnt vscnt(0) 14055 Must happen after 14056 preceding 14057 global/generic store 14058 atomic/ 14059 atomicrmw-no-return-value 14060 with memory 14061 ordering of seq_cst 14062 and with equal or 14063 wider sync scope. 14064 (Note that seq_cst 14065 fences have their 14066 own s_waitcnt 14067 vscnt(0) and so do 14068 not need to be 14069 considered.) 14070 - Ensures any 14071 preceding 14072 sequential 14073 consistent global 14074 memory instructions 14075 have completed 14076 before executing 14077 this sequentially 14078 consistent 14079 instruction. This 14080 prevents reordering 14081 a seq_cst store 14082 followed by a 14083 seq_cst load. (Note 14084 that seq_cst is 14085 stronger than 14086 acquire/release as 14087 the reordering of 14088 load acquire 14089 followed by a store 14090 release is 14091 prevented by the 14092 s_waitcnt of 14093 the release, but 14094 there is nothing 14095 preventing a store 14096 release followed by 14097 load acquire from 14098 completing out of 14099 order. The s_waitcnt 14100 could be placed after 14101 seq_store or before 14102 the seq_load. We 14103 choose the load to 14104 make the s_waitcnt be 14105 as late as possible 14106 so that the store 14107 may have already 14108 completed.) 14109 14110 2. 
*Following 14111 instructions same as 14112 corresponding load 14113 atomic acquire, 14114 except must generate 14115 all instructions even 14116 for OpenCL.* 14117 14118 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 14119 - system - generic vmcnt(0) & vscnt(0) 14120 14121 - Could be split into 14122 separate s_waitcnt 14123 vmcnt(0), s_waitcnt 14124 vscnt(0) and s_waitcnt 14125 lgkmcnt(0) to allow 14126 them to be 14127 independently moved 14128 according to the 14129 following rules. 14130 - s_waitcnt lgkmcnt(0) 14131 must happen after 14132 preceding 14133 local load 14134 atomic/store 14135 atomic/atomicrmw 14136 with memory 14137 ordering of seq_cst 14138 and with equal or 14139 wider sync scope. 14140 (Note that seq_cst 14141 fences have their 14142 own s_waitcnt 14143 lgkmcnt(0) and so do 14144 not need to be 14145 considered.) 14146 - s_waitcnt vmcnt(0) 14147 must happen after 14148 preceding 14149 global/generic load 14150 atomic/ 14151 atomicrmw-with-return-value 14152 with memory 14153 ordering of seq_cst 14154 and with equal or 14155 wider sync scope. 14156 (Note that seq_cst 14157 fences have their 14158 own s_waitcnt 14159 vmcnt(0) and so do 14160 not need to be 14161 considered.) 14162 - s_waitcnt vscnt(0) 14163 Must happen after 14164 preceding 14165 global/generic store 14166 atomic/ 14167 atomicrmw-no-return-value 14168 with memory 14169 ordering of seq_cst 14170 and with equal or 14171 wider sync scope. 14172 (Note that seq_cst 14173 fences have their 14174 own s_waitcnt 14175 vscnt(0) and so do 14176 not need to be 14177 considered.) 14178 - Ensures any 14179 preceding 14180 sequential 14181 consistent global 14182 memory instructions 14183 have completed 14184 before executing 14185 this sequentially 14186 consistent 14187 instruction. This 14188 prevents reordering 14189 a seq_cst store 14190 followed by a 14191 seq_cst load. 
                                                         (Note
                                                         that seq_cst is
                                                         stronger than
                                                         acquire/release as
                                                         the reordering of
                                                         load acquire
                                                         followed by a store
                                                         release is
                                                         prevented by the
                                                         s_waitcnt of
                                                         the release, but
                                                         there is nothing
                                                         preventing a store
                                                         release followed by
                                                         load acquire from
                                                         completing out of
                                                         order. The s_waitcnt
                                                         could be placed after
                                                         seq_store or before
                                                         the seq_load. We
                                                         choose the load to
                                                         make the s_waitcnt be
                                                         as late as possible
                                                         so that the store
                                                         may have already
                                                         completed.)

                                                       2. *Following
                                                          instructions same as
                                                          corresponding load
                                                          atomic acquire,
                                                          except must generate
                                                          all instructions even
                                                          for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================


.. _amdgpu-amdhsa-memory-model-gfx12:

Memory Model GFX12
++++++++++++++++++++++++

For GFX12:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP.

  * In CU wavefront execution mode the wavefronts may be executed by different
    SIMDs in the same CU.
  * In WGP wavefront execution mode the wavefronts may be executed by different
    SIMDs in different CUs in the same WGP.

* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_wait_dscnt 0x0``
  is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Vector memory operations are divided into different types. Completion of a
  vector memory operation is reported to a wavefront in-order within a type,
  but may be out of order between types. The types of vector memory operations
  (and their associated ``s_wait`` instructions) are:

  * LDS: ``s_wait_dscnt``
  * Load (global, scratch, flat, buffer and image): ``s_wait_loadcnt``
  * Store (global, scratch, flat, buffer and image): ``s_wait_storecnt``
  * Sample and Gather4: ``s_wait_samplecnt``
  * BVH: ``s_wait_bvhcnt``

* Vector and scalar memory instructions contain a ``SCOPE`` field with values
  corresponding to each cache level. The ``SCOPE`` determines whether a cache
  can complete an operation locally or whether it needs to forward the operation
  to the next cache level. The ``SCOPE`` values are:

  * ``SCOPE_CU``: Compute Unit (NOTE: not affected by CU/WGP mode)
  * ``SCOPE_SE``: Shader Engine
  * ``SCOPE_DEV``: Device/Agent
  * ``SCOPE_SYS``: System

* When a memory operation with a given ``SCOPE`` reaches a cache with a smaller
  ``SCOPE`` value, it is forwarded to the next level of cache.
* When a memory operation with a given ``SCOPE`` reaches a cache with a
  ``SCOPE`` value greater than or equal to its own, the operation can proceed:

  * Reads can hit in the cache.
  * Writes can happen in this cache and the transaction is acknowledged
    from this level of cache.
  * RMW operations can be done locally.

* ``global_inv``, ``global_wb`` and ``global_wbinv`` instructions are used to
  invalidate, write-back and write-back+invalidate caches. The affected
  cache(s) are controlled by the ``SCOPE:`` of the instruction.
* ``global_inv`` invalidates caches whose scope is strictly smaller than the
  instruction's. The invalidation requests cannot be reordered with pending or
  upcoming memory operations.
* ``global_wb`` is a writeback operation that additionally ensures previous
  memory operations done at a lower scope level have reached the ``SCOPE:``
  of the ``global_wb``.

  * ``global_wb`` can be omitted for scopes other than ``SCOPE_SYS`` in
    gfx120x.

* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. To achieve coherence between wavefronts executing in the same
  work-group:

  * In CU wavefront execution mode, no special action is required.
  * In WGP wavefront execution mode, a ``global_inv scope:SCOPE_SE`` is required
    as wavefronts may be executing on SIMDs of different CUs that access
    different L0s.

* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 buffer shared by all WGPs on
  the same SA. The L1 buffer acts as a bridge to L2 for clients within an SA.
* The L1 buffers have independent quadrants to service disjoint ranges of
  virtual addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. Some or all of the wait instructions below are required to ensure
  synchronization between vector memory operations of different wavefronts. They
  ensure a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.

  * ``s_wait_loadcnt 0x0``
  * ``s_wait_samplecnt 0x0``
  * ``s_wait_bvhcnt 0x0``
  * ``s_wait_storecnt 0x0``

* The L1 buffers use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. Some or all of the wait instructions below
  are required to ensure synchronization between vector memory operations of
  different SAs. They ensure a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.

  * ``s_wait_loadcnt 0x0``
  * ``s_wait_samplecnt 0x0``
  * ``s_wait_bvhcnt 0x0``
  * ``s_wait_storecnt 0x0``

* The L2 cache can be kept coherent with other agents, or ranges of virtual
  addresses can be set up to bypass it to ensure system coherence.
* A memory attached last level (MALL) cache exists for GPU memory. The MALL
  cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

For kernarg backing memory:

* CP invalidates caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from L0.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0, which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX12 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.

The mapping of LLVM IR syncscope to GFX12 instruction ``scope`` operands is
defined in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
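As an illustration of how the scope mapping composes with the code sequences,
consider a small LLVM IR fragment (the function and value names here are
hypothetical). Per the scopes table, the workgroup-scope acquire load is emitted
with ``scope:SCOPE_SE`` in WGP wavefront execution mode and with no scope
operand in CU wavefront execution mode, while the agent-scope atomicrmw uses
``scope:SCOPE_DEV`` in either mode:

.. code-block:: llvm

  define i32 @example(ptr addrspace(1) %p, ptr addrspace(1) %q) {
    ; Workgroup-scope acquire load from global memory.
    %v = load atomic i32, ptr addrspace(1) %p syncscope("workgroup") acquire, align 4
    ; Agent-scope monotonic read-modify-write on global memory.
    %old = atomicrmw add ptr addrspace(1) %q, i32 1 syncscope("agent") monotonic
    %sum = add i32 %v, %old
    ret i32 %sum
  }
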

The scopes table applies if and only if it is directly referenced by an entry in
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`, and it only
applies to the instruction in the code sequence that references it.

  .. table:: AMDHSA Memory Model Code Sequences GFX12 - Instruction Scopes
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table

     =================== =================== ===================
     LLVM syncscope      CU wavefront        WGP wavefront
                         execution           execution
                         mode                mode
     =================== =================== ===================
     *none*              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     system              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     agent               ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
     workgroup           *none*              ``scope:SCOPE_SE``
     wavefront           *none*              *none*
     singlethread        *none*              *none*
     one-as              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     system-one-as       ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
     agent-one-as        ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
     workgroup-one-as    *none*              ``scope:SCOPE_SE``
     wavefront-one-as    *none*              *none*
     singlethread-one-as *none*              *none*
     =================== =================== ===================

  .. table:: AMDHSA Memory Model Code Sequences GFX12
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX12
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1.
buffer/global/flat_load 14465 ``th:TH_LOAD_NT`` 14466 14467 - volatile 14468 14469 1. buffer/global/flat_load 14470 ``scope:SCOPE_SYS`` 14471 14472 2. ``s_wait_loadcnt 0x0`` 14473 14474 - Must happen before 14475 any following volatile 14476 global/generic 14477 load/store. 14478 - Ensures that 14479 volatile 14480 operations to 14481 different 14482 addresses will not 14483 be reordered by 14484 hardware. 14485 14486 load *none* *none* - local 1. ds_load 14487 store *none* *none* - global - !volatile & !nontemporal 14488 - generic 14489 - private 1. buffer/global/flat_store 14490 - constant 14491 - !volatile & nontemporal 14492 14493 1. buffer/global/flat_store 14494 ``th:TH_STORE_NT`` 14495 14496 - volatile 14497 14498 1. buffer/global/flat_store 14499 ``scope:SCOPE_SYS`` 14500 14501 2. ``s_wait_storecnt 0x0`` 14502 14503 - Must happen before 14504 any following volatile 14505 global/generic 14506 load/store. 14507 - Ensures that 14508 volatile 14509 operations to 14510 different 14511 addresses will not 14512 be reordered by 14513 hardware. 14514 14515 store *none* *none* - local 1. ds_store 14516 **Unordered Atomic** 14517 ------------------------------------------------------------------------------------ 14518 load atomic unordered *any* *any* *Same as non-atomic*. 14519 store atomic unordered *any* *any* *Same as non-atomic*. 14520 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 14521 **Monotonic Atomic** 14522 ------------------------------------------------------------------------------------ 14523 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 14524 - wavefront - generic 14525 - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14526 - agent 14527 - system 14528 load atomic monotonic - singlethread - local 1. ds_load 14529 - wavefront 14530 - workgroup 14531 store atomic monotonic - singlethread - global 1. 
buffer/global/flat_store 14532 - wavefront - generic 14533 - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14534 - agent 14535 - system 14536 store atomic monotonic - singlethread - local 1. ds_store 14537 - wavefront 14538 - workgroup 14539 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 14540 - wavefront - generic 14541 - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14542 - agent 14543 - system 14544 atomicrmw monotonic - singlethread - local 1. ds_atomic 14545 - wavefront 14546 - workgroup 14547 **Acquire Atomic** 14548 ------------------------------------------------------------------------------------ 14549 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 14550 - wavefront - local 14551 - generic 14552 load atomic acquire - workgroup - global 1. buffer/global_load ``scope:SCOPE_SE`` 14553 14554 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14555 14556 2. ``s_wait_loadcnt 0x0`` 14557 14558 - If CU wavefront execution 14559 mode, omit. 14560 - Must happen before 14561 the following ``global_inv`` 14562 and before any following 14563 global/generic 14564 load/load 14565 atomic/store/store 14566 atomic/atomicrmw. 14567 14568 3. ``global_inv scope:SCOPE_SE`` 14569 14570 - If CU wavefront execution 14571 mode, omit. 14572 - Ensures that 14573 following 14574 loads will not see 14575 stale data. 14576 14577 load atomic acquire - workgroup - local 1. ds_load 14578 2. ``s_wait_dscnt 0x0`` 14579 14580 - If OpenCL, omit. 14581 - Must happen before 14582 the following ``global_inv`` 14583 and before any following 14584 global/generic load/load 14585 atomic/store/store 14586 atomic/atomicrmw. 14587 - Ensures any 14588 following global 14589 data read is no 14590 older than the local load 14591 atomic value being 14592 acquired. 14593 14594 3. 
``global_inv scope:SCOPE_SE`` 14595 14596 - If OpenCL or CU wavefront 14597 execution mode, omit. 14598 - Ensures that 14599 following 14600 loads will not see 14601 stale data. 14602 14603 load atomic acquire - workgroup - generic 1. flat_load 14604 14605 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14606 14607 2. | ``s_wait_loadcnt 0x0`` 14608 | ``s_wait_dscnt 0x0`` 14609 | **CU wavefront execution mode:** 14610 | ``s_wait_dscnt 0x0`` 14611 14612 - If OpenCL, omit ``s_wait_dscnt 0x0`` 14613 - Must happen before 14614 the following 14615 ``global_inv`` and any 14616 following global/generic 14617 load/load 14618 atomic/store/store 14619 atomic/atomicrmw. 14620 - Ensures any 14621 following global 14622 data read is no 14623 older than a local load 14624 atomic value being 14625 acquired. 14626 14627 3. ``global_inv scope:SCOPE_SE`` 14628 14629 - If CU wavefront execution 14630 mode, omit. 14631 - Ensures that 14632 following 14633 loads will not see 14634 stale data. 14635 14636 load atomic acquire - agent - global 1. buffer/global_load 14637 - system 14638 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14639 14640 2. ``s_wait_loadcnt 0x0`` 14641 14642 - Must happen before 14643 following 14644 ``global_inv``. 14645 - Ensures the load 14646 has completed 14647 before invalidating 14648 the caches. 14649 14650 3. ``global_inv`` 14651 14652 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14653 - Must happen before 14654 any following 14655 global/generic 14656 load/load 14657 atomic/atomicrmw. 14658 - Ensures that 14659 following 14660 loads will not see 14661 stale global data. 14662 14663 load atomic acquire - agent - generic 1. flat_load 14664 - system 14665 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14666 14667 2. 
| ``s_wait_loadcnt 0x0`` 14668 | ``s_wait_dscnt 0x0`` 14669 14670 - If OpenCL, omit ``s_wait_dscnt 0x0`` 14671 - Must happen before 14672 following 14673 ``global_inv``. 14674 - Ensures the flat_load 14675 has completed 14676 before invalidating 14677 the caches. 14678 14679 3. ``global_inv`` 14680 14681 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14682 - Must happen before 14683 any following 14684 global/generic 14685 load/load 14686 atomic/atomicrmw. 14687 - Ensures that 14688 following loads 14689 will not see stale 14690 global data. 14691 14692 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 14693 - wavefront - local 14694 - generic 14695 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 14696 14697 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14698 - If atomic with return, 14699 use ``th:TH_ATOMIC_RETURN`` 14700 14701 2. | **Atomic with return:** 14702 | ``s_wait_loadcnt 0x0`` 14703 | **Atomic without return:** 14704 | ``s_wait_storecnt 0x0`` 14705 14706 - If CU wavefront execution 14707 mode, omit. 14708 - Must happen before 14709 the following ``global_inv`` 14710 and before any following 14711 global/generic 14712 load/load 14713 atomic/store/store 14714 atomic/atomicrmw. 14715 14716 3. ``global_inv scope:SCOPE_SE`` 14717 14718 - If CU wavefront execution 14719 mode, omit. 14720 - Ensures that 14721 following 14722 loads will not see 14723 stale data. 14724 14725 atomicrmw acquire - workgroup - local 1. ds_atomic 14726 2. ``s_wait_dscnt 0x0`` 14727 14728 - If OpenCL, omit. 14729 - Must happen before 14730 the following 14731 ``global_inv``. 14732 - Ensures any 14733 following global 14734 data read is no 14735 older than the local 14736 atomicrmw value 14737 being acquired. 14738 14739 3. ``global_inv scope:SCOPE_SE`` 14740 14741 - If OpenCL omit. 14742 - If CU wavefront execution 14743 mode, omit. 
14744 - Ensures that 14745 following 14746 loads will not see 14747 stale data. 14748 14749 atomicrmw acquire - workgroup - generic 1. flat_atomic 14750 14751 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14752 - If atomic with return, 14753 use ``th:TH_ATOMIC_RETURN`` 14754 14755 2. | **Atomic with return:** 14756 | ``s_wait_loadcnt 0x0`` 14757 | ``s_wait_dscnt 0x0`` 14758 | **Atomic without return:** 14759 | ``s_wait_storecnt 0x0`` 14760 | ``s_wait_dscnt 0x0`` 14761 14762 - If CU wavefront execution mode, 14763 omit all for atomics without 14764 return, and only emit 14765 ``s_wait_dscnt 0x0`` for atomics 14766 with return. 14767 - If OpenCL, omit ``s_wait_dscnt 0x0`` 14768 - Must happen before 14769 the following 14770 ``global_inv``. 14771 - Ensures any 14772 following global 14773 data read is no 14774 older than a local 14775 atomicrmw value 14776 being acquired. 14777 14778 3. ``global_inv scope:SCOPE_SE`` 14779 14780 - If CU wavefront execution 14781 mode, omit. 14782 - Ensures that 14783 following 14784 loads will not see 14785 stale data. 14786 14787 atomicrmw acquire - agent - global 1. buffer/global_atomic 14788 - system 14789 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14790 - If atomic with return, 14791 use ``th:TH_ATOMIC_RETURN`` 14792 14793 2. | **Atomic with return:** 14794 | ``s_wait_loadcnt 0x0`` 14795 | **Atomic without return:** 14796 | ``s_wait_storecnt 0x0`` 14797 14798 - Must happen before 14799 following ``global_inv``. 14800 - Ensures the 14801 atomicrmw has 14802 completed before 14803 invalidating the 14804 caches. 14805 14806 3. ``global_inv`` 14807 14808 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14809 - Must happen before 14810 any following 14811 global/generic 14812 load/load 14813 atomic/atomicrmw. 14814 - Ensures that 14815 following loads 14816 will not see stale 14817 global data. 14818 14819 atomicrmw acquire - agent - generic 1. 
flat_atomic 14820 - system 14821 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14822 - If atomic with return, 14823 use ``th:TH_ATOMIC_RETURN`` 14824 14825 2. | **Atomic with return:** 14826 | ``s_wait_loadcnt 0x0`` 14827 | ``s_wait_dscnt 0x0`` 14828 | **Atomic without return:** 14829 | ``s_wait_storecnt 0x0`` 14830 | ``s_wait_dscnt 0x0`` 14831 14832 - If OpenCL, omit dscnt 14833 - Must happen before 14834 following 14835 global_inv 14836 - Ensures the 14837 atomicrmw has 14838 completed before 14839 invalidating the 14840 caches. 14841 14842 3. ``global_inv`` 14843 14844 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 14845 - Must happen before 14846 any following 14847 global/generic 14848 load/load 14849 atomic/atomicrmw. 14850 - Ensures that 14851 following loads 14852 will not see stale 14853 global data. 14854 14855 fence acquire - singlethread *none* *none* 14856 - wavefront 14857 fence acquire - workgroup *none* 1. | ``s_wait_storecnt 0x0`` 14858 | ``s_wait_loadcnt 0x0`` 14859 | ``s_wait_dscnt 0x0`` 14860 | **CU wavefront execution mode:** 14861 | ``s_wait_dscnt 0x0`` 14862 14863 - If OpenCL, omit ``s_wait_dscnt 0x0`` 14864 - If OpenCL and address space is local, 14865 omit all. 14866 - See :ref:`amdgpu-fence-as` for 14867 more details on fencing specific 14868 address spaces. 14869 - Note: we don't have to use 14870 ``s_wait_samplecnt 0x0`` or 14871 ``s_wait_bvhcnt 0x0`` because 14872 there are no atomic sample or 14873 BVH instructions that the fence 14874 could pair with. 14875 - The waits can be 14876 independently moved 14877 according to the 14878 following rules: 14879 - ``s_wait_loadcnt 0x0`` 14880 must happen after 14881 any preceding 14882 global/generic load 14883 atomic/ 14884 atomicrmw-with-return-value 14885 with an equal or 14886 wider sync scope 14887 and memory ordering 14888 stronger than 14889 unordered (this is 14890 termed the 14891 fence-paired-atomic). 
14892 - ``s_wait_storecnt 0x0`` 14893 must happen after 14894 any preceding 14895 global/generic 14896 atomicrmw-no-return-value 14897 with an equal or 14898 wider sync scope 14899 and memory ordering 14900 stronger than 14901 unordered (this is 14902 termed the 14903 fence-paired-atomic). 14904 - ``s_wait_dscnt 0x0`` 14905 must happen after 14906 any preceding 14907 local/generic load 14908 atomic/atomicrmw 14909 with an equal or 14910 wider sync scope 14911 and memory ordering 14912 stronger than 14913 unordered (this is 14914 termed the 14915 fence-paired-atomic). 14916 - Must happen before 14917 the following 14918 ``global_inv``. 14919 - Ensures that the 14920 fence-paired atomic 14921 has completed 14922 before invalidating 14923 the 14924 cache. Therefore 14925 any following 14926 locations read must 14927 be no older than 14928 the value read by 14929 the 14930 fence-paired-atomic. 14931 14932 2. ``global_inv scope:SCOPE_SE`` 14933 14934 - If CU wavefront execution 14935 mode, omit. 14936 - Ensures that 14937 following 14938 loads will not see 14939 stale data. 14940 14941 fence acquire - agent *none* 1. | ``s_wait_storecnt 0x0`` 14942 | ``s_wait_loadcnt 0x0`` 14943 | ``s_wait_dscnt 0x0`` 14944 14945 - If OpenCL, omit ``s_wait_dscnt 0x0``. 14946 - If OpenCL and address space is 14947 local, omit all. 14948 - See :ref:`amdgpu-fence-as` for 14949 more details on fencing specific 14950 address spaces. 14951 - Note: we don't have to use 14952 ``s_wait_samplecnt 0x0`` or 14953 ``s_wait_bvhcnt 0x0`` because 14954 there are no atomic sample or 14955 BVH instructions that the fence 14956 could pair with. 
14957 - The waits can be 14958 independently moved 14959 according to the 14960 following rules: 14961 - ``s_wait_loadcnt 0x0`` 14962 must happen after 14963 any preceding 14964 global/generic load 14965 atomic/ 14966 atomicrmw-with-return-value 14967 with an equal or 14968 wider sync scope 14969 and memory ordering 14970 stronger than 14971 unordered (this is 14972 termed the 14973 fence-paired-atomic). 14974 - ``s_wait_storecnt 0x0`` 14975 must happen after 14976 any preceding 14977 global/generic 14978 atomicrmw-no-return-value 14979 with an equal or 14980 wider sync scope 14981 and memory ordering 14982 stronger than 14983 unordered (this is 14984 termed the 14985 fence-paired-atomic). 14986 - ``s_wait_dscnt 0x0`` 14987 must happen after 14988 any preceding 14989 local/generic load 14990 atomic/atomicrmw 14991 with an equal or 14992 wider sync scope 14993 and memory ordering 14994 stronger than 14995 unordered (this is 14996 termed the 14997 fence-paired-atomic). 14998 - Must happen before 14999 the following 15000 ``global_inv`` 15001 - Ensures that the 15002 fence-paired atomic 15003 has completed 15004 before invalidating the 15005 caches. Therefore 15006 any following 15007 locations read must 15008 be no older than 15009 the value read by 15010 the 15011 fence-paired-atomic. 15012 15013 2. ``global_inv`` 15014 15015 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15016 - Ensures that 15017 following 15018 loads will not see 15019 stale data. 15020 15021 **Release Atomic** 15022 ------------------------------------------------------------------------------------ 15023 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 15024 - wavefront - local 15025 - generic 15026 store atomic release - workgroup - global 1. 
| ``s_wait_bvhcnt 0x0`` 15027 | ``s_wait_samplecnt 0x0`` 15028 | ``s_wait_storecnt 0x0`` 15029 | ``s_wait_loadcnt 0x0`` 15030 | ``s_wait_dscnt 0x0`` 15031 | **CU wavefront execution mode:** 15032 | ``s_wait_dscnt 0x0`` 15033 15034 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15035 - The waits can be 15036 independently moved 15037 according to the 15038 following rules: 15039 - ``s_wait_loadcnt 0x0``, 15040 ``s_wait_samplecnt 0x0`` and 15041 ``s_wait_bvhcnt 0x0`` 15042 must happen after 15043 any preceding 15044 global/generic load/load 15045 atomic/ 15046 atomicrmw-with-return-value. 15047 - ``s_wait_storecnt 0x0`` 15048 must happen after 15049 any preceding 15050 global/generic 15051 store/store 15052 atomic/ 15053 atomicrmw-no-return-value. 15054 - ``s_wait_dscnt 0x0`` 15055 must happen after 15056 any preceding 15057 local/generic 15058 load/store/load 15059 atomic/store 15060 atomic/atomicrmw. 15061 - Ensures that all 15062 memory operations 15063 have 15064 completed before 15065 performing the 15066 store that is being 15067 released. 15068 15069 2. buffer/global/flat_store 15070 15071 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15072 15073 store atomic release - workgroup - local 1. | ``s_wait_bvhcnt 0x0`` 15074 | ``s_wait_samplecnt 0x0`` 15075 | ``s_wait_storecnt 0x0`` 15076 | ``s_wait_loadcnt 0x0`` 15077 | ``s_wait_dscnt 0x0`` 15078 | **CU wavefront execution mode:** 15079 | ``s_wait_dscnt 0x0`` 15080 15081 - If OpenCL, omit. 15082 - The waits can be 15083 independently moved 15084 according to the 15085 following rules: 15086 - ``s_wait_loadcnt 0x0``, 15087 ``s_wait_samplecnt 0x0`` and 15088 ``s_wait_bvhcnt 0x0`` 15089 must happen after 15090 any preceding 15091 global/generic load/load 15092 atomic/ 15093 atomicrmw-with-return-value. 15094 - ``s_wait_storecnt 0x0`` 15095 must happen after 15096 any preceding 15097 global/generic 15098 store/store 15099 atomic/ 15100 atomicrmw-no-return-value.
15101 - Must happen before the 15102 following store. 15103 - Ensures that all 15104 global memory 15105 operations have 15106 completed before 15107 performing the 15108 store that is being 15109 released. 15110 15111 2. ds_store 15112 store atomic release - agent - global 1. ``global_wb scope:SCOPE_SYS`` 15113 - system - generic 15114 - If agent scope, omit. 15115 15116 2. | ``s_wait_bvhcnt 0x0`` 15117 | ``s_wait_samplecnt 0x0`` 15118 | ``s_wait_storecnt 0x0`` 15119 | ``s_wait_loadcnt 0x0`` 15120 | ``s_wait_dscnt 0x0`` 15121 15122 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15123 - The waits can be 15124 independently moved 15125 according to the 15126 following rules: 15127 - ``s_wait_loadcnt 0x0``, 15128 ``s_wait_samplecnt 0x0`` and 15129 ``s_wait_bvhcnt 0x0`` 15130 must happen after 15131 any preceding 15132 global/generic 15133 load/load 15134 atomic/ 15135 atomicrmw-with-return-value. 15136 - ``s_wait_storecnt 0x0`` 15137 must happen after 15138 ``global_wb`` if present, or 15139 any preceding 15140 global/generic 15141 store/store 15142 atomic/ 15143 atomicrmw-no-return-value. 15144 - ``s_wait_dscnt 0x0`` 15145 must happen after 15146 any preceding 15147 local/generic 15148 load/store/load 15149 atomic/store 15150 atomic/atomicrmw. 15151 - Must happen before the 15152 following store. 15153 - Ensures that all 15154 memory operations 15155 have 15156 completed before 15157 performing the 15158 store that is being 15159 released. 15160 15161 3. buffer/global/flat_store 15162 15163 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15164 15165 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 15166 - wavefront - local 15167 - generic 15168 atomicrmw release - workgroup - global 1.
| ``s_wait_bvhcnt 0x0`` 15169 - generic | ``s_wait_samplecnt 0x0`` 15170 | ``s_wait_storecnt 0x0`` 15171 | ``s_wait_loadcnt 0x0`` 15172 | ``s_wait_dscnt 0x0`` 15173 | **CU wavefront execution mode:** 15174 | ``s_wait_dscnt 0x0`` 15175 15176 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15177 - If OpenCL and CU wavefront 15178 execution mode, omit all. 15179 - The waits can be 15180 independently moved 15181 according to the 15182 following rules: 15183 - ``s_wait_loadcnt 0x0``, 15184 ``s_wait_samplecnt 0x0`` and 15185 ``s_wait_bvhcnt 0x0`` 15186 must happen after 15187 any preceding 15188 global/generic load/load 15189 atomic/ 15190 atomicrmw-with-return-value. 15191 - ``s_wait_storecnt 0x0`` 15192 must happen after 15193 any preceding 15194 global/generic 15195 store/store 15196 atomic/ 15197 atomicrmw-no-return-value. 15198 - ``s_wait_dscnt 0x0`` 15199 must happen after 15200 any preceding 15201 local/generic 15202 load/store/load 15203 atomic/store 15204 atomic/atomicrmw. 15205 - Must happen before the 15206 following atomic. 15207 - Ensures that all 15208 memory operations 15209 have 15210 completed before 15211 performing the 15212 atomicrmw that is 15213 being released. 15214 15215 2. buffer/global/flat_atomic 15216 15217 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15218 15219 atomicrmw release - workgroup - local 1. | ``s_wait_bvhcnt 0x0`` 15220 | ``s_wait_samplecnt 0x0`` 15221 | ``s_wait_storecnt 0x0`` 15222 | ``s_wait_loadcnt 0x0`` 15223 | ``s_wait_dscnt 0x0`` 15224 | **CU wavefront execution mode:** 15225 | ``s_wait_dscnt 0x0`` 15226 15227 - If OpenCL, omit all. 15228 - The waits can be 15229 independently moved 15230 according to the 15231 following rules: 15232 - ``s_wait_loadcnt 0x0``, 15233 ``s_wait_samplecnt 0x0`` and 15234 ``s_wait_bvhcnt 0x0`` 15235 must happen after 15236 any preceding 15237 global/generic load/load 15238 atomic/ 15239 atomicrmw-with-return-value. 
15240 - ``s_wait_storecnt 0x0`` 15241 must happen after 15242 any preceding 15243 global/generic 15244 store/store 15245 atomic/ 15246 atomicrmw-no-return-value. 15247 - Must happen before the 15248 following atomic. 15249 - Ensures that all 15250 global memory 15251 operations have 15252 completed before 15253 performing the 15254 store that is being 15255 released. 15256 15257 2. ds_atomic 15258 atomicrmw release - agent - global 1. ``global_wb scope:SCOPE_SYS`` 15259 - system - generic 15260 - If agent scope, omit. 15261 15262 2. | ``s_wait_bvhcnt 0x0`` 15263 | ``s_wait_samplecnt 0x0`` 15264 | ``s_wait_storecnt 0x0`` 15265 | ``s_wait_loadcnt 0x0`` 15266 | ``s_wait_dscnt 0x0`` 15267 15268 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15269 - The waits can be 15270 independently moved 15271 according to the 15272 following rules: 15273 - ``s_wait_loadcnt 0x0``, 15274 ``s_wait_samplecnt 0x0`` and 15275 ``s_wait_bvhcnt 0x0`` 15276 must happen after 15277 any preceding 15278 global/generic 15279 load/load atomic/ 15280 atomicrmw-with-return-value. 15281 - ``s_wait_storecnt 0x0`` 15282 must happen after 15283 ``global_wb`` if present, or 15284 any preceding 15285 global/generic 15286 store/store 15287 atomic/ 15288 atomicrmw-no-return-value. 15289 - ``s_wait_dscnt 0x0`` 15290 must happen after 15291 any preceding 15292 local/generic 15293 load/store/load 15294 atomic/store 15295 atomic/atomicrmw. 15296 - Must happen before the 15297 following atomic. 15298 - Ensures that all 15299 memory operations 15300 to global and local 15301 have completed 15302 before performing 15303 the atomicrmw that 15304 is being released. 15305 15306 3. buffer/global/flat_atomic 15307 15308 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15309 15310 fence release - singlethread *none* *none* 15311 - wavefront 15312 fence release - workgroup *none* 1. 
| ``s_wait_bvhcnt 0x0`` 15313 | ``s_wait_samplecnt 0x0`` 15314 | ``s_wait_storecnt 0x0`` 15315 | ``s_wait_loadcnt 0x0`` 15316 | ``s_wait_dscnt 0x0`` 15317 | **CU wavefront execution mode:** 15318 | ``s_wait_dscnt 0x0`` 15319 15320 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15321 - If OpenCL and 15322 address space is 15323 local, omit all. 15324 - See :ref:`amdgpu-fence-as` for 15325 more details on fencing specific 15326 address spaces. 15327 - The waits can be 15328 independently moved 15329 according to the 15330 following rules: 15331 - ``s_wait_loadcnt 0x0``, 15332 ``s_wait_samplecnt 0x0`` and 15333 ``s_wait_bvhcnt 0x0`` 15334 must happen after 15335 any preceding 15336 global/generic 15337 load/load 15338 atomic/ 15339 atomicrmw-with-return-value. 15340 - ``s_wait_storecnt 0x0`` 15341 must happen after 15342 any preceding 15343 global/generic 15344 store/store 15345 atomic/ 15346 atomicrmw-no-return-value. 15347 - ``s_wait_dscnt 0x0`` 15348 must happen after 15349 any preceding 15350 local/generic 15351 load/store/load 15352 atomic/store atomic/ 15353 atomicrmw. 15354 - Must happen before 15355 any following store 15356 atomic/atomicrmw 15357 with an equal or 15358 wider sync scope 15359 and memory ordering 15360 stronger than 15361 unordered (this is 15362 termed the 15363 fence-paired-atomic). 15364 - Ensures that all 15365 memory operations 15366 have 15367 completed before 15368 performing the 15369 following 15370 fence-paired-atomic. 15371 15372 fence release - agent *none* 1. ``global_wb scope:SCOPE_SYS`` 15373 - system 15374 - If agent scope, omit. 15375 15376 2. | ``s_wait_bvhcnt 0x0`` 15377 | ``s_wait_samplecnt 0x0`` 15378 | ``s_wait_storecnt 0x0`` 15379 | ``s_wait_loadcnt 0x0`` 15380 | ``s_wait_dscnt 0x0`` 15381 | **OpenCL:** 15382 | ``s_wait_bvhcnt 0x0`` 15383 | ``s_wait_samplecnt 0x0`` 15384 | ``s_wait_storecnt 0x0`` 15385 | ``s_wait_loadcnt 0x0`` 15386 15387 - If OpenCL, omit ``s_wait_dscnt 0x0``.
15388 - If OpenCL and address space is local, 15389 omit all. 15390 - See :ref:`amdgpu-fence-as` for 15391 more details on fencing specific 15392 address spaces. 15393 - The waits can be 15394 independently moved 15395 according to the 15396 following rules: 15397 - ``s_wait_loadcnt 0x0``, 15398 ``s_wait_samplecnt 0x0`` and 15399 ``s_wait_bvhcnt 0x0`` 15400 must happen after 15401 any preceding 15402 global/generic 15403 load/load atomic/ 15404 atomicrmw-with-return-value. 15405 - ``s_wait_storecnt 0x0`` 15406 must happen after 15407 ``global_wb`` if present, or 15408 any preceding 15409 global/generic 15410 store/store 15411 atomic/ 15412 atomicrmw-no-return-value. 15413 - ``s_wait_dscnt 0x0`` 15414 must happen after 15415 any preceding 15416 local/generic 15417 load/store/load 15418 atomic/store 15419 atomic/atomicrmw. 15420 - Must happen before 15421 any following store 15422 atomic/atomicrmw 15423 with an equal or 15424 wider sync scope 15425 and memory ordering 15426 stronger than 15427 unordered (this is 15428 termed the 15429 fence-paired-atomic). 15430 - Ensures that all 15431 memory operations 15432 have 15433 completed before 15434 performing the 15435 following 15436 fence-paired-atomic. 15437 15438 **Acquire-Release Atomic** 15439 ------------------------------------------------------------------------------------ 15440 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 15441 - wavefront - local 15442 - generic 15443 atomicrmw acq_rel - workgroup - global 1. | ``s_wait_bvhcnt 0x0`` 15444 | ``s_wait_samplecnt 0x0`` 15445 | ``s_wait_storecnt 0x0`` 15446 | ``s_wait_loadcnt 0x0`` 15447 | ``s_wait_dscnt 0x0`` 15448 | **CU wavefront execution mode:** 15449 | ``s_wait_dscnt 0x0`` 15450 15451 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15452 - Must happen after 15453 any preceding 15454 local/generic 15455 load/store/load 15456 atomic/store 15457 atomic/atomicrmw. 
15458 - The waits can be 15459 independently moved 15460 according to the 15461 following rules: 15462 - ``s_wait_loadcnt 0x0``, 15463 ``s_wait_samplecnt 0x0`` and 15464 ``s_wait_bvhcnt 0x0`` 15465 must happen after 15466 any preceding 15467 global/generic load/load 15468 atomic/ 15469 atomicrmw-with-return-value. 15470 - ``s_wait_storecnt 0x0`` 15471 must happen after 15472 any preceding 15473 global/generic 15474 store/store 15475 atomic/ 15476 atomicrmw-no-return-value. 15477 - ``s_wait_dscnt 0x0`` 15478 must happen after 15479 any preceding 15480 local/generic 15481 load/store/load 15482 atomic/store 15483 atomic/atomicrmw. 15484 - Must happen before 15485 the following 15486 atomicrmw. 15487 - Ensures that all 15488 memory operations 15489 have 15490 completed before 15491 performing the 15492 atomicrmw that is 15493 being released. 15494 15495 2. buffer/global_atomic 15496 15497 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15498 - If atomic with return, use 15499 ``th:TH_ATOMIC_RETURN``. 15500 15501 3. | **Atomic with return:** 15502 | ``s_wait_loadcnt 0x0`` 15503 | **Atomic without return:** 15504 | ``s_wait_storecnt 0x0`` 15505 15506 - If CU wavefront execution 15507 mode, omit. 15508 - Must happen before 15509 the following 15510 ``global_inv``. 15511 - Ensures any 15512 following global 15513 data read is no 15514 older than the 15515 atomicrmw value 15516 being acquired. 15517 15518 4. ``global_inv scope:SCOPE_SE`` 15519 15520 - If CU wavefront execution 15521 mode, omit. 15522 - Ensures that 15523 following 15524 loads will not see 15525 stale data. 15526 15527 atomicrmw acq_rel - workgroup - local 1. | ``s_wait_bvhcnt 0x0`` 15528 | ``s_wait_samplecnt 0x0`` 15529 | ``s_wait_storecnt 0x0`` 15530 | ``s_wait_loadcnt 0x0`` 15531 | ``s_wait_dscnt 0x0`` 15532 | **CU wavefront execution mode:** 15533 | ``s_wait_dscnt 0x0`` 15534 15535 - If OpenCL, omit.
15536 - The waits can be 15537 independently moved 15538 according to the 15539 following rules: 15540 - ``s_wait_loadcnt 0x0``, 15541 ``s_wait_samplecnt 0x0`` and 15542 ``s_wait_bvhcnt 0x0`` 15543 must happen after 15544 any preceding 15545 global/generic load/load 15546 atomic/ 15547 atomicrmw-with-return-value. 15548 - ``s_wait_storecnt 0x0`` 15549 must happen after 15550 any preceding 15551 global/generic 15552 store/store 15553 atomic/ 15554 atomicrmw-no-return-value. 15555 - Must happen before 15556 the following 15557 store. 15558 - Ensures that all 15559 global memory 15560 operations have 15561 completed before 15562 performing the 15563 store that is being 15564 released. 15565 15566 2. ds_atomic 15567 3. ``s_wait_dscnt 0x0`` 15568 15569 - If OpenCL, omit. 15570 - Must happen before 15571 the following 15572 ``global_inv``. 15573 - Ensures any 15574 following global 15575 data read is no 15576 older than the local load 15577 atomic value being 15578 acquired. 15579 15580 4. ``global_inv scope:SCOPE_SE`` 15581 15582 - If CU wavefront execution 15583 mode, omit. 15584 - If OpenCL, omit. 15585 - Ensures that 15586 following 15587 loads will not see 15588 stale data. 15589 15590 atomicrmw acq_rel - workgroup - generic 1. | ``s_wait_bvhcnt 0x0`` 15591 | ``s_wait_samplecnt 0x0`` 15592 | ``s_wait_storecnt 0x0`` 15593 | ``s_wait_loadcnt 0x0`` 15594 | ``s_wait_dscnt 0x0`` 15595 | **CU wavefront execution mode:** 15596 | ``s_wait_dscnt 0x0`` 15597 15598 - If OpenCL, omit ``s_wait_dscnt 0x0``. 15599 - The waits can be 15600 independently moved 15601 according to the 15602 following rules: 15603 - ``s_wait_loadcnt 0x0``, 15604 ``s_wait_samplecnt 0x0`` and 15605 ``s_wait_bvhcnt 0x0`` 15606 must happen after 15607 any preceding 15608 global/generic load/load 15609 atomic/ 15610 atomicrmw-with-return-value. 15611 - ``s_wait_storecnt 0x0`` 15612 must happen after 15613 any preceding 15614 global/generic 15615 store/store 15616 atomic/ 15617 atomicrmw-no-return-value.
15618 - ``s_wait_dscnt 0x0`` 15619 must happen after 15620 any preceding 15621 local/generic 15622 load/store/load 15623 atomic/store 15624 atomic/atomicrmw. 15625 - Must happen before 15626 the following 15627 atomicrmw. 15628 - Ensures that all 15629 memory operations 15630 have 15631 completed before 15632 performing the 15633 atomicrmw that is 15634 being released. 15635 15636 2. flat_atomic 15637 15638 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15639 - If atomic with return, 15640 use ``th:TH_ATOMIC_RETURN``. 15641 15642 3. | **Atomic without return:** 15643 | ``s_wait_dscnt 0x0`` 15644 | ``s_wait_storecnt 0x0`` 15645 | **Atomic with return:** 15646 | ``s_wait_loadcnt 0x0`` 15647 | ``s_wait_dscnt 0x0`` 15648 | **CU wavefront execution mode:** 15649 | ``s_wait_dscnt 0x0`` 15650 15651 - If OpenCL, omit ``s_wait_dscnt 0x0`` 15652 - Must happen before 15653 the following 15654 ``global_inv``. 15655 - Ensures any 15656 following global 15657 data read is no 15658 older than the load 15659 atomic value being 15660 acquired. 15661 15662 4. ``global_inv scope:SCOPE_SE`` 15663 15664 - If CU wavefront execution 15665 mode, omit. 15666 - Ensures that 15667 following 15668 loads will not see 15669 stale data. 15670 15671 atomicrmw acq_rel - agent - global 1. ``global_wb scope:SCOPE_SYS`` 15672 - system 15673 - If agent scope, omit. 15674 15675 2. | ``s_wait_bvhcnt 0x0`` 15676 | ``s_wait_samplecnt 0x0`` 15677 | ``s_wait_storecnt 0x0`` 15678 | ``s_wait_loadcnt 0x0`` 15679 | ``s_wait_dscnt 0x0`` 15680 15681 - If OpenCL, omit 15682 ``s_wait_dscnt 0x0`` 15683 - The waits can be 15684 independently moved 15685 according to the 15686 following rules: 15687 - ``s_wait_loadcnt 0x0``, 15688 ``s_wait_samplecnt 0x0`` and 15689 ``s_wait_bvhcnt 0x0`` 15690 must happen after 15691 any preceding 15692 global/generic 15693 load/load atomic/ 15694 atomicrmw-with-return-value. 
15695 - ``s_wait_storecnt 0x0`` 15696 must happen after 15697 ``global_wb`` if present, or 15698 any preceding 15699 global/generic 15700 store/store 15701 atomic/ 15702 atomicrmw-no-return-value. 15703 - ``s_wait_dscnt 0x0`` 15704 must happen after 15705 any preceding 15706 local/generic 15707 load/store/load 15708 atomic/store 15709 atomic/atomicrmw. 15710 - Must happen before 15711 the following 15712 atomicrmw. 15713 - Ensures that all 15714 memory operations 15715 to global have 15716 completed before 15717 performing the 15718 atomicrmw that is 15719 being released. 15720 15721 3. buffer/global_atomic 15722 15723 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15724 - If atomic with return, use 15725 ``th:TH_ATOMIC_RETURN``. 15726 15727 4. | **Atomic with return:** 15728 | ``s_wait_loadcnt 0x0`` 15729 | **Atomic without return:** 15730 | ``s_wait_storecnt 0x0`` 15731 15732 - Must happen before 15733 following 15734 ``global_inv``. 15735 - Ensures the 15736 atomicrmw has 15737 completed before 15738 invalidating the 15739 caches. 15740 15741 5. ``global_inv`` 15742 15743 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15744 - Must happen before 15745 any following 15746 global/generic 15747 load/load 15748 atomic/atomicrmw. 15749 - Ensures that 15750 following loads 15751 will not see stale 15752 global data. 15753 15754 atomicrmw acq_rel - agent - generic 1. ``global_wb scope:SCOPE_SYS`` 15755 - system 15756 - If agent scope, omit. 15757 15758 2. 
| ``s_wait_bvhcnt 0x0`` 15759 | ``s_wait_samplecnt 0x0`` 15760 | ``s_wait_storecnt 0x0`` 15761 | ``s_wait_loadcnt 0x0`` 15762 | ``s_wait_dscnt 0x0`` 15763 15764 - If OpenCL, omit 15765 ``s_wait_dscnt 0x0``. 15766 - The waits can be 15767 independently moved 15768 according to the 15769 following rules: 15770 - ``s_wait_loadcnt 0x0``, 15771 ``s_wait_samplecnt 0x0`` and 15772 ``s_wait_bvhcnt 0x0`` 15773 must happen after 15774 any preceding 15775 global/generic 15776 load/load atomic/ 15777 atomicrmw-with-return-value. 15778 - ``s_wait_storecnt 0x0`` 15779 must happen after 15780 ``global_wb`` if present, or 15781 any preceding 15782 global/generic 15783 store/store atomic/ 15784 atomicrmw-no-return-value. 15785 - ``s_wait_dscnt 0x0`` 15786 must happen after 15787 any preceding 15788 local/generic 15789 load/store/load 15790 atomic/store 15791 atomic/atomicrmw. 15792 - Must happen before 15793 the following 15794 atomicrmw. 15795 - Ensures that all 15796 memory operations 15797 have 15798 completed before 15799 performing the 15800 atomicrmw that is 15801 being released. 15802 15803 3. flat_atomic 15804 15805 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15806 - If atomic with return, use 15807 ``th:TH_ATOMIC_RETURN``. 15808 15809 4. | **Atomic with return:** 15810 | ``s_wait_loadcnt 0x0`` 15811 | ``s_wait_dscnt 0x0`` 15812 | **Atomic without return:** 15813 | ``s_wait_storecnt 0x0`` 15814 | ``s_wait_dscnt 0x0`` 15815 15816 15817 - If OpenCL, omit 15818 ``s_wait_dscnt 0x0``. 15819 - Must happen before 15820 following 15821 ``global_inv``. 15822 - Ensures the 15823 atomicrmw has 15824 completed before 15825 invalidating the 15826 caches. 15827 15828 5. ``global_inv`` 15829 15830 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 15831 - Must happen before 15832 any following 15833 global/generic 15834 load/load 15835 atomic/atomicrmw.
15836 - Ensures that 15837 following loads 15838 will not see stale 15839 global data. 15840 15841 fence acq_rel - singlethread *none* *none* 15842 - wavefront 15843 fence acq_rel - workgroup *none* 1. | ``s_wait_bvhcnt 0x0`` 15844 | ``s_wait_samplecnt 0x0`` 15845 | ``s_wait_storecnt 0x0`` 15846 | ``s_wait_loadcnt 0x0`` 15847 | ``s_wait_dscnt 0x0`` 15848 | **CU wavefront execution mode:** 15849 | ``s_wait_dscnt 0x0`` 15850 15851 - If OpenCL and 15852 address space is 15853 not generic, omit 15854 ``s_wait_dscnt 0x0`` 15855 - If OpenCL and 15856 address space is 15857 local, omit 15858 all but ``s_wait_dscnt 0x0``. 15859 - See :ref:`amdgpu-fence-as` for 15860 more details on fencing specific 15861 address spaces. 15862 - The waits can be 15863 independently moved 15864 according to the 15865 following rules: 15866 - ``s_wait_loadcnt 0x0``, 15867 ``s_wait_samplecnt 0x0`` and 15868 ``s_wait_bvhcnt 0x0`` 15869 must happen after 15870 any preceding 15871 global/generic 15872 load/load 15873 atomic/ 15874 atomicrmw-with-return-value. 15875 - ``s_wait_storecnt 0x0`` 15876 must happen after 15877 any preceding 15878 global/generic 15879 store/store atomic/ 15880 atomicrmw-no-return-value. 15881 - ``s_wait_dscnt 0x0`` 15882 must happen after 15883 any preceding 15884 local/generic 15885 load/store/load 15886 atomic/store atomic/ 15887 atomicrmw. 15888 - Must happen before 15889 any following 15890 global/generic 15891 load/load 15892 atomic/store/store 15893 atomic/atomicrmw. 15894 - Ensures that all 15895 memory operations 15896 have 15897 completed before 15898 performing any 15899 following global 15900 memory operations. 15901 - Ensures that the 15902 preceding 15903 local/generic load 15904 atomic/atomicrmw 15905 with an equal or 15906 wider sync scope 15907 and memory ordering 15908 stronger than 15909 unordered (this is 15910 termed the 15911 acquire-fence-paired-atomic) 15912 has completed 15913 before following 15914 global memory 15915 operations. 
This 15916 satisfies the 15917 requirements of 15918 acquire. 15919 - Ensures that all 15920 previous memory 15921 operations have 15922 completed before a 15923 following 15924 local/generic store 15925 atomic/atomicrmw 15926 with an equal or 15927 wider sync scope 15928 and memory ordering 15929 stronger than 15930 unordered (this is 15931 termed the 15932 release-fence-paired-atomic). 15933 This satisfies the 15934 requirements of 15935 release. 15936 - Must happen before 15937 the following 15938 ``global_inv``. 15939 - Ensures that the 15940 acquire-fence-paired 15941 atomic has completed 15942 before invalidating 15943 the 15944 cache. Therefore 15945 any following 15946 locations read must 15947 be no older than 15948 the value read by 15949 the 15950 acquire-fence-paired-atomic. 15951 15952 2. ``global_inv scope:SCOPE_SE`` 15953 15954 - If CU wavefront execution 15955 mode, omit. 15956 - Ensures that 15957 following 15958 loads will not see 15959 stale data. 15960 15961 fence acq_rel - agent *none* 1. ``global_wb scope:SCOPE_SYS`` 15962 - system 15963 - If agent scope, omit. 15964 15965 2. | ``s_wait_bvhcnt 0x0`` 15966 | ``s_wait_samplecnt 0x0`` 15967 | ``s_wait_storecnt 0x0`` 15968 | ``s_wait_loadcnt 0x0`` 15969 | ``s_wait_dscnt 0x0`` 15970 15971 - If OpenCL and 15972 address space is 15973 not generic, omit 15974 ``s_wait_dscnt 0x0`` 15975 - If OpenCL and 15976 address space is 15977 local, omit 15978 all but ``s_wait_dscnt 0x0``. 15979 - See :ref:`amdgpu-fence-as` for 15980 more details on fencing specific 15981 address spaces. 15982 - The waits can be 15983 independently moved 15984 according to the 15985 following rules: 15986 - ``s_wait_loadcnt 0x0``, 15987 ``s_wait_samplecnt 0x0`` and 15988 ``s_wait_bvhcnt 0x0`` 15989 must happen after 15990 any preceding 15991 global/generic 15992 load/load 15993 atomic/ 15994 atomicrmw-with-return-value. 
15995 - ``s_wait_storecnt 0x0`` 15996 must happen after 15997 ``global_wb`` if present, or 15998 any preceding 15999 global/generic 16000 store/store atomic/ 16001 atomicrmw-no-return-value. 16002 - ``s_wait_dscnt 0x0`` 16003 must happen after 16004 any preceding 16005 local/generic 16006 load/store/load 16007 atomic/store 16008 atomic/atomicrmw. 16009 - Must happen before 16010 the following 16011 ``global_inv``. 16012 - Ensures that the 16013 preceding 16014 global/local/generic 16015 load 16016 atomic/atomicrmw 16017 with an equal or 16018 wider sync scope 16019 and memory ordering 16020 stronger than 16021 unordered (this is 16022 termed the 16023 acquire-fence-paired-atomic) 16024 has completed 16025 before invalidating 16026 the caches. This 16027 satisfies the 16028 requirements of 16029 acquire. 16030 - Ensures that all 16031 previous memory 16032 operations have 16033 completed before a 16034 following 16035 global/local/generic 16036 store 16037 atomic/atomicrmw 16038 with an equal or 16039 wider sync scope 16040 and memory ordering 16041 stronger than 16042 unordered (this is 16043 termed the 16044 release-fence-paired-atomic). 16045 This satisfies the 16046 requirements of 16047 release. 16048 16049 3. ``global_inv`` 16050 16051 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`. 16052 - Must happen before 16053 any following 16054 global/generic 16055 load/load 16056 atomic/store/store 16057 atomic/atomicrmw. 16058 - Ensures that 16059 following loads 16060 will not see stale 16061 global data. This 16062 satisfies the 16063 requirements of 16064 acquire.
16065 16066 **Sequential Consistent Atomic** 16067 ------------------------------------------------------------------------------------ 16068 load atomic seq_cst - singlethread - global *Same as corresponding 16069 - wavefront - local load atomic acquire, 16070 - generic except must generate 16071 all instructions even 16072 for OpenCL.* 16073 load atomic seq_cst - workgroup - global 1. | ``s_wait_bvhcnt 0x0`` 16074 - generic | ``s_wait_samplecnt 0x0`` 16075 | ``s_wait_storecnt 0x0`` 16076 | ``s_wait_loadcnt 0x0`` 16077 | ``s_wait_dscnt 0x0`` 16078 | **CU wavefront execution mode:** 16079 | ``s_wait_dscnt 0x0`` 16080 16081 - If OpenCL, omit 16082 ``s_wait_dscnt 0x0`` 16083 - The waits can be 16084 independently moved 16085 according to the 16086 following rules: 16087 - ``s_wait_dscnt 0x0`` must 16088 happen after 16089 preceding 16090 local/generic load 16091 atomic/store 16092 atomic/atomicrmw 16093 with memory 16094 ordering of seq_cst 16095 and with equal or 16096 wider sync scope. 16097 (Note that seq_cst 16098 fences have their 16099 own ``s_wait_dscnt 0x0`` 16100 and so do not need to be 16101 considered.) 16102 - ``s_wait_loadcnt 0x0``\, 16103 ``s_wait_samplecnt 0x0`` and 16104 ``s_wait_bvhcnt 0x0`` 16105 must happen after 16106 preceding 16107 global/generic load 16108 atomic/ 16109 atomicrmw-with-return-value 16110 with memory 16111 ordering of seq_cst 16112 and with equal or 16113 wider sync scope. 16114 (Note that seq_cst 16115 fences have their 16116 own waits and so do 16117 not need to be 16118 considered.) 16119 - ``s_wait_storecnt 0x0`` 16120 Must happen after 16121 preceding 16122 global/generic store 16123 atomic/ 16124 atomicrmw-no-return-value 16125 with memory 16126 ordering of seq_cst 16127 and with equal or 16128 wider sync scope. 16129 (Note that seq_cst 16130 fences have their 16131 own ``s_wait_storecnt 0x0`` 16132 and so do not need to be 16133 considered.) 
16134 - Ensures any 16135 preceding 16136 sequential 16137 consistent global/local 16138 memory instructions 16139 have completed 16140 before executing 16141 this sequentially 16142 consistent 16143 instruction. This 16144 prevents reordering 16145 a seq_cst store 16146 followed by a 16147 seq_cst load. (Note 16148 that seq_cst is 16149 stronger than 16150 acquire/release as 16151 the reordering of 16152 load acquire 16153 followed by a store 16154 release is 16155 prevented by the 16156 ``s_wait``\s of 16157 the release, but 16158 there is nothing 16159 preventing a store 16160 release followed by 16161 load acquire from 16162 completing out of 16163 order. The ``s_wait``\s 16164 could be placed after 16165 seq_store or before 16166 the seq_load. We 16167 choose the load to 16168 make the ``s_wait``\s be 16169 as late as possible 16170 so that the store 16171 may have already 16172 completed.) 16173 16174 2. *Following 16175 instructions same as 16176 corresponding load 16177 atomic acquire, 16178 except must generate 16179 all instructions even 16180 for OpenCL.* 16181 load atomic seq_cst - workgroup - local 1. | ``s_wait_bvhcnt 0x0`` 16182 | ``s_wait_samplecnt 0x0`` 16183 | ``s_wait_storecnt 0x0`` 16184 | ``s_wait_loadcnt 0x0`` 16185 | ``s_wait_dscnt 0x0`` 16186 | **CU wavefront execution mode:** 16187 | ``s_wait_dscnt 0x0`` 16188 16189 - If OpenCL, omit all. 16190 - The waits can be 16191 independently moved 16192 according to the 16193 following rules: 16194 - ``s_wait_loadcnt 0x0``\, 16195 ``s_wait_samplecnt 0x0`` and 16196 ``s_wait_bvhcnt 0x0`` 16197 Must happen after 16198 preceding 16199 global/generic load 16200 atomic/ 16201 atomicrmw-with-return-value 16202 with memory 16203 ordering of seq_cst 16204 and with equal or 16205 wider sync scope. 16206 (Note that seq_cst 16207 fences have their 16208 own ``s_wait``\s and so do 16209 not need to be 16210 considered.) 
16211 - ``s_wait_storecnt 0x0`` 16212 Must happen after 16213 preceding 16214 global/generic store 16215 atomic/ 16216 atomicrmw-no-return-value 16217 with memory 16218 ordering of seq_cst 16219 and with equal or 16220 wider sync scope. 16221 (Note that seq_cst 16222 fences have their 16223 own ``s_wait_storecnt 0x0`` 16224 and so do 16225 not need to be 16226 considered.) 16227 - Ensures any 16228 preceding 16229 sequential 16230 consistent global 16231 memory instructions 16232 have completed 16233 before executing 16234 this sequentially 16235 consistent 16236 instruction. This 16237 prevents reordering 16238 a seq_cst store 16239 followed by a 16240 seq_cst load. (Note 16241 that seq_cst is 16242 stronger than 16243 acquire/release as 16244 the reordering of 16245 load acquire 16246 followed by a store 16247 release is 16248 prevented by the 16249 ``s_wait``\s of 16250 the release, but 16251 there is nothing 16252 preventing a store 16253 release followed by 16254 load acquire from 16255 completing out of 16256 order. The ``s_wait``\s 16257 could be placed after 16258 seq_store or before 16259 the seq_load. We 16260 choose the load to 16261 make the ``s_wait``\s be 16262 as late as possible 16263 so that the store 16264 may have already 16265 completed.) 16266 16267 2. *Following 16268 instructions same as 16269 corresponding load 16270 atomic acquire, 16271 except must generate 16272 all instructions even 16273 for OpenCL.* 16274 16275 load atomic seq_cst - agent - global 1.
| ``s_wait_bvhcnt 0x0`` 16276 - system - generic | ``s_wait_samplecnt 0x0`` 16277 | ``s_wait_storecnt 0x0`` 16278 | ``s_wait_loadcnt 0x0`` 16279 | ``s_wait_dscnt 0x0`` 16280 16281 - If OpenCL, omit 16282 ``s_wait_dscnt 0x0`` 16283 - The waits can be 16284 independently moved 16285 according to the 16286 following rules: 16287 - ``s_wait_dscnt 0x0`` 16288 must happen after 16289 preceding 16290 local load 16291 atomic/store 16292 atomic/atomicrmw 16293 with memory 16294 ordering of seq_cst 16295 and with equal or 16296 wider sync scope. 16297 (Note that seq_cst 16298 fences have their 16299 own ``s_wait_dscnt 0x0`` 16300 and so do 16301 not need to be 16302 considered.) 16303 - ``s_wait_loadcnt 0x0``\, 16304 ``s_wait_samplecnt 0x0`` and 16305 ``s_wait_bvhcnt 0x0`` 16306 must happen after 16307 preceding 16308 global/generic load 16309 atomic/ 16310 atomicrmw-with-return-value 16311 with memory 16312 ordering of seq_cst 16313 and with equal or 16314 wider sync scope. 16315 (Note that seq_cst 16316 fences have their 16317 own ``s_wait``\s and so do 16318 not need to be 16319 considered.) 16320 - ``s_wait_storecnt 0x0`` 16321 Must happen after 16322 preceding 16323 global/generic store 16324 atomic/ 16325 atomicrmw-no-return-value 16326 with memory 16327 ordering of seq_cst 16328 and with equal or 16329 wider sync scope. 16330 (Note that seq_cst 16331 fences have their 16332 own 16333 ``s_wait_storecnt 0x0`` and so do 16334 not need to be 16335 considered.) 16336 - Ensures any 16337 preceding 16338 sequential 16339 consistent global 16340 memory instructions 16341 have completed 16342 before executing 16343 this sequentially 16344 consistent 16345 instruction. This 16346 prevents reordering 16347 a seq_cst store 16348 followed by a 16349 seq_cst load. 
(Note 16350 that seq_cst is 16351 stronger than 16352 acquire/release as 16353 the reordering of 16354 load acquire 16355 followed by a store 16356 release is 16357 prevented by the 16358 ``s_wait``\s of 16359 the release, but 16360 there is nothing 16361 preventing a store 16362 release followed by 16363 load acquire from 16364 completing out of 16365 order. The ``s_wait``\s 16366 could be placed after 16367 seq_store or before 16368 the seq_load. We 16369 choose the load to 16370 make the ``s_wait``\s be 16371 as late as possible 16372 so that the store 16373 may have already 16374 completed.) 16375 16376 2. *Following 16377 instructions same as 16378 corresponding load 16379 atomic acquire, 16380 except must generate 16381 all instructions even 16382 for OpenCL.* 16383 store atomic seq_cst - singlethread - global *Same as corresponding 16384 - wavefront - local store atomic release, 16385 - workgroup - generic except must generate 16386 - agent all instructions even 16387 - system for OpenCL.* 16388 atomicrmw seq_cst - singlethread - global *Same as corresponding 16389 - wavefront - local atomicrmw acq_rel, 16390 - workgroup - generic except must generate 16391 - agent all instructions even 16392 - system for OpenCL.* 16393 fence seq_cst - singlethread *none* *Same as corresponding 16394 - wavefront fence acq_rel, 16395 - workgroup except must generate 16396 - agent all instructions even 16397 - system for OpenCL.* 16398 ============ ============ ============== ========== ================================ 16399 16400.. _amdgpu-amdhsa-trap-handler-abi: 16401 16402Trap Handler ABI 16403~~~~~~~~~~~~~~~~ 16404 16405For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible 16406runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that 16407supports the ``s_trap`` instruction. 
For usage see: 16408 16409- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table` 16410- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table` 16411- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table` 16412 16413 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2 16414 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table 16415 16416 =================== =============== =============== ======================================= 16417 Usage Code Sequence Trap Handler Description 16418 Inputs 16419 =================== =============== =============== ======================================= 16420 reserved ``s_trap 0x00`` Reserved by hardware. 16421 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap`` 16422 ``queue_ptr`` intrinsic (not implemented). 16423 ``VGPR0``: 16424 ``arg`` 16425 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at 16426 ``queue_ptr`` the trap instruction. The associated 16427 queue is signalled to put it into the 16428 error state. When the queue is put in 16429 the error state, the waves executing 16430 dispatches on the queue will be 16431 terminated. 16432 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves 16433 as a no-operation. The trap handler 16434 is entered and immediately returns to 16435 continue execution of the wavefront. 16436 - If the debugger is enabled, causes 16437 the debug trap to be reported by the 16438 debugger and the wavefront is put in 16439 the halt state with the PC at the 16440 instruction. The debugger must 16441 increment the PC and resume the wave. 16442 reserved ``s_trap 0x04`` Reserved. 16443 reserved ``s_trap 0x05`` Reserved. 16444 reserved ``s_trap 0x06`` Reserved. 16445 reserved ``s_trap 0x07`` Reserved. 16446 reserved ``s_trap 0x08`` Reserved. 16447 reserved ``s_trap 0xfe`` Reserved. 16448 reserved ``s_trap 0xff`` Reserved. 
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint
                                                         overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                            breakpoints. Causes wave to be halted
                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible for
                                                                            resuming the wave, including the
                                                                            instruction that the breakpoint
                                                                            overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                         ``queue_ptr``                      the trap instruction. The associated
                                                                            queue is signalled to put it into the
                                                                            error state. When the queue is put in
                                                                            the error state, the waves executing
                                                                            dispatches on the queue will be
                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled then
                                                                              behaves as a no-operation. The trap
                                                                              handler is entered and immediately
                                                                              returns to continue execution of the
                                                                              wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is a work in
  progress and will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value struct?
      - What is ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.
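As a rough sketch of what the swizzled scratch layout means, the following
Python model (illustrative only; the function names and the wave64 assumption
are not from the backend) computes where one lane's private stack bytes land in
the wavefront scratch backing memory, and the unswizzled-to-swizzled offset
conversion used for the stack pointer:

```python
# Illustrative model of swizzled scratch addressing (not backend code).
# Scratch is swizzled with a dword (4-byte) element size and a stride of
# wavefront-size elements, so the lanes' private dwords are interleaved.

WAVEFRONT_SIZE = 64  # assumption: a wave64 target


def backing_offset(swizzled_offset, lane, wavefront_size=WAVEFRONT_SIZE):
    """Byte offset into the wavefront scratch backing memory for one lane's
    dword-aligned private (swizzled) byte offset."""
    assert swizzled_offset % 4 == 0, "model assumes dword-aligned offsets"
    dword_index = swizzled_offset // 4
    # Consecutive private dwords of one lane are wavefront_size * 4 bytes
    # apart in the backing memory; lanes interleave at dword granularity.
    return (dword_index * wavefront_size + lane) * 4


def swizzled_sp(unswizzled_sp, wavefront_size=WAVEFRONT_SIZE):
    """swizzled SP = unswizzled SP / wavefront size."""
    return unswizzled_sp // wavefront_size
```

For example, lane 5's private dword at swizzled offset 0 lands 20 bytes into
the backing memory, since the 64 lanes' dword-0 slots are interleaved first.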

On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 return address (RA). The code address that the function must
   return to when it completes. The value is undefined if the function is *no
   return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

   | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4-byte aligned for the ``r600``
   architecture and 16-byte aligned for the ``amdgcn`` architecture.

   .. note::

     The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
     OpenCL language which has the largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to greater alignment than 16 bytes. After
   these, the function may dynamically allocate space for such things as
   runtime sized ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.
11. Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
    arguments in the C ABI. The callee is responsible for allocating stack
    memory and copying the value of the struct if modified. Note that the
    backend still supports byval for struct arguments.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

   .. note::

     Except for the argument registers, the VGPRs clobbered and the preserved
     registers are intermixed at regular intervals in order to keep a
     similar ratio independent of the number of allocated VGPRs.

   * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in registers,
   and pass as non-decomposed struct as stack argument? Or something else? Is
   the memory location in the caller stack frame, or a stack memory argument and
   so no address is passed as the caller can directly write to the argument
   stack location? But then the stack location is still live after return. If an
   argument stack location, is it the first stack argument or the last one?

Lambda argument types are treated as struct types with an implementation defined
set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.
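The 16-byte decomposition rule above can be sketched as a small classifier.
This is an illustrative model only; the tuple encoding, the helper names, and
the 8-byte pointer size are assumptions for the example, not backend code:

```python
# Illustrative sketch of the struct argument classification described
# above (not the backend's implementation). Structs of 16 bytes or less
# are decomposed recursively into their base type fields; larger structs
# are passed by reference to a caller-allocated copy.

def struct_size(arg):
    """Total size in bytes of ('scalar', size) or ('struct', [members])."""
    kind, payload = arg
    if kind == 'scalar':
        return payload
    return sum(struct_size(m) for m in payload)


def classify(arg):
    """Return the list of arguments actually passed for one source arg."""
    kind, payload = arg
    if kind == 'scalar':
        return [arg]
    if struct_size(arg) <= 16:
        # "direct struct": each base type field becomes its own argument.
        out = []
        for member in payload:
            out.extend(classify(member))
        return out
    # "by-value struct": the caller copies it to the stack and passes a
    # pointer (8 bytes assumed here for a 64-bit address).
    return [('pointer', 8)]
```

For instance, a struct of two 4-byte fields is passed as two separate scalar
arguments, while a 24-byte struct degrades to a single pointer argument.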

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Is recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item ID are packed into a single VGPR with the following
   layout. Only fields actually used by the function are set. The other bits
   are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to Kernarg Segment Ptr to get the
   global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

.. note::

   There are likely some errors and omissions in the following description that
   need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and return
      results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments,
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?

The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires the runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.

4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address space
      of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront size
      (since CFA is always at least dword aligned which matches the scratch
      swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated arguments
      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
      local variables and register spill slots are accessed as positive offsets
      relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill if
   necessary. These are copied back to physical registers at call sites. The
   net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
   emergency spill slot. Buffer instructions are used for stack accesses and
   not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to required alignment.
   - Stack is aligned to max(16, max formal argument alignment)
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments. Difference is
     apparent when have implemented dynamic alignment.
   - If ``SCRATCH`` instruction could allow negative offsets, then can make FP
     be highest address of stack frame and use negative offset for locals.
     Would allow SP to be the same as FP and could support signal-handler-like
     as now have a real SP for the top of the stack.
   - How is ``sret`` passed on the stack? In argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

  The metadata is currently in development and is subject to major
  changes. Only the current version is supported. *When this document
  was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the keys
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:

                                                                     - "VsPs"
                                                                     - "Gs"
                                                                     - "Cs"
                                                                     - "Ngg"
                                                                     - "Tess"
                                                                     - "GsTess"
                                                                     - "NggTess"

     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
                                            2 integers               64 bits is the "stable" portion of the hash, used
                                                                     for e.g. shader replacement lookup. Upper 64 bits
                                                                     is the "unique" portion of the hash, used for
                                                                     e.g. pipeline cache lookup. The value is
                                                                     implementation defined, and can not be relied on
                                                                     between different builds of the compiler.
     ".shaders"                             map                      Per-API shader metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".hardware_stages"                     map                      Per-hardware stage metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".shader_functions"                    map                      Per-shader function metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".registers"                           map            Required  Hardware register configuration. See
                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".user_data_limit"                     integer                  Number of user data entries accessed by this
                                                                     pipeline.
     ".spill_threshold"                     integer                  The user data spill threshold. 0xFFFF for
                                                                     NoUserDataSpilling.
     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
                                                                     viewport array index feature. Pipelines which use
                                                                     this feature can render into all 16 viewports,
                                                                     whereas pipelines which do not use it are
                                                                     restricted to viewport #0.
     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
                                                                     handling data-passing between the ES and GS
                                                                     shader stages. This can be zero if the data is
                                                                     passed using off-chip buffers. This value should
                                                                     be used to program all user-SGPRs which have been
                                                                     marked with "UserDataMapping::EsGsLdsSize"
                                                                     (typically only the GS and VS HW stages will ever
                                                                     have a user-SGPR so marked).
     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
                                                                     (maximum number of threads in a subgroup).
     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
     ".api"                                 string                   Name of the client graphics API.
     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
                                                                     be defined by the driver using the compiler if
                                                                     they want to be able to correlate API-specific
                                                                     information used during creation at a later time.
     ====================================== ============== ========= ===================================================

..

  .. table:: AMDPAL Code Object Shader Map
     :name: amdgpu-amdpal-code-object-shader-map-table


     +-------------+--------------+-------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                        |
     +=============+==============+===================================================================+
     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
     |- ".vertex"  |              |for the definition of the keys included in that map.               |
     |- ".hull"    |              |                                                                   |
     |- ".domain"  |              |                                                                   |
     |- ".geometry"|              |                                                                   |
     |- ".pixel"   |              |                                                                   |
     +-------------+--------------+-------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object API Shader Metadata Map
     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table

     ==================== ============== ========= =====================================================================
     String Key           Value Type     Required? Description
     ==================== ============== ========= =====================================================================
     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
                          2 integers               is implementation defined, and can not be relied on between
                                                   different builds of the compiler.
     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
                          string                   include:

                                                   - ".ls"
                                                   - ".hs"
                                                   - ".es"
                                                   - ".gs"
                                                   - ".vs"
                                                   - ".ps"
                                                   - ".cs"

     ==================== ============== ========= =====================================================================

..

  .. table:: AMDPAL Code Object Hardware Stage Map
     :name: amdgpu-amdpal-code-object-hardware-stage-map-table

     +-------------+--------------+-----------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                            |
     +=============+==============+=======================================================================+
     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
     |- ".hs"      |              |for the definition of the keys included in that map.                   |
     |- ".es"      |              |                                                                       |
     |- ".gs"      |              |                                                                       |
     |- ".vs"      |              |                                                                       |
     |- ".ps"      |              |                                                                       |
     |- ".cs"      |              |                                                                       |
     +-------------+--------------+-----------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table

     ========================== ============== ========= ===============================================================
     String Key                 Value Type     Required? Description
     ========================== ============== ========= ===============================================================
     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
     ".lds_size"                integer                  Local Data Share size in bytes.
     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
     ".vgpr_count"              integer                  Number of VGPRs used.
     ".agpr_count"              integer                  Number of AGPRs used.
     ".sgpr_count"              integer                  Number of SGPRs used.
     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
                                                         directive to instruct the compiler to limit the VGPR usage to
                                                         be less than or equal to the specified value (only set if
                                                         different from HW default).
17135 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW 17136 default). 17137 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only). 17138 3 integers 17139 ".wavefront_size" integer Wavefront size (only set if different from HW default). 17140 ".uses_uavs" boolean The shader reads or writes UAVs. 17141 ".uses_rovs" boolean The shader reads or writes ROVs. 17142 ".writes_uavs" boolean The shader writes to one or more UAVs. 17143 ".writes_depth" boolean The shader writes out a depth value. 17144 ".uses_append_consume" boolean The shader uses append and/or consume operations, either 17145 memory or GDS. 17146 ".uses_prim_id" boolean The shader uses PrimID. 17147 ========================== ============== ========= =============================================================== 17148 17149.. 17150 17151 .. table:: AMDPAL Code Object Shader Function Map 17152 :name: amdgpu-amdpal-code-object-shader-function-map-table 17153 17154 =============== ============== ==================================================================== 17155 String Key Value Type Description 17156 =============== ============== ==================================================================== 17157 *symbol name* map *symbol name* is the ELF symbol name of the shader function code 17158 entry address. The value is the function's metadata. See 17159 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`. 17160 =============== ============== ==================================================================== 17161 17162.. 17163 17164 .. 
table:: AMDPAL Code Object Shader Function Metadata Map 17165 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table 17166 17167 ============================= ============== ================================================================= 17168 String Key Value Type Description 17169 ============================= ============== ================================================================= 17170 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value 17171 2 integers is implementation defined, and can not be relied on between 17172 different builds of the compiler. 17173 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader. 17174 ".lds_size" integer Size in bytes of LDS memory. 17175 ".vgpr_count" integer Number of VGPRs used by the shader. 17176 ".sgpr_count" integer Number of SGPRs used by the shader. 17177 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader. 17178 ".shader_subtype" string Shader subtype/kind. Values include: 17179 17180 - "Unknown" 17181 17182 ============================= ============== ================================================================= 17183 17184.. 17185 17186 .. table:: AMDPAL Code Object Register Map 17187 :name: amdgpu-amdpal-code-object-register-map-table 17188 17189 ========================== ============== ==================================================================== 17190 32-bit Integer Key Value Type Description 17191 ========================== ============== ==================================================================== 17192 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of 17193 a GRBM register (i.e., driver accessible GPU register number, not 17194 shader GPR register number). The driver is required to program each 17195 specified register to the corresponding specified value when 17196 executing this pipeline. 
Typically, the ``reg offsets`` are the 17197 ``uint16_t`` offsets to each register as defined by the hardware 17198 chip headers. The register is set to the provided value. However, a 17199 ``reg offset`` that specifies a user data register (e.g., 17200 COMPUTE_USER_DATA_0) needs special treatment. See 17201 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more 17202 information. 17203 ========================== ============== ==================================================================== 17204 17205.. _amdgpu-amdpal-code-object-user-data-section: 17206 17207User Data 17208+++++++++ 17209 17210Each hardware stage has a set of 32-bit physical SPI *user data registers* 17211(either 16 or 32 based on graphics IP and the stage) which can be 17212written from a command buffer and then loaded into SGPRs when waves are 17213launched via a subsequent dispatch or draw operation. This is the way 17214most arguments are passed from the application/runtime to a hardware 17215shader. 17216 17217PAL abstracts this functionality by exposing a set of 128 *user data 17218entries* per pipeline a client can use to pass arguments from a command 17219buffer to one or more shaders in that pipeline. The ELF code object must 17220specify a mapping from virtualized *user data entries* to physical *user 17221data registers*, and PAL is responsible for implementing that mapping, 17222including spilling overflow *user data entries* to memory if needed. 17223 17224Since the *user data registers* are GRBM-accessible SPI registers, this 17225mapping is actually embedded in the ``.registers`` metadata entry. For 17226most registers, the value in that map is a literal 32-bit value that 17227should be written to the register by the driver. 
However, when the 17228register is a *user data register* (any USER_DATA register e.g., 17229SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells 17230the driver to write either a *user data entry* value or one of several 17231driver-internal values to the register. This encoding is described in 17232the following table: 17233 17234.. note:: 17235 17236 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0, 17237 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must 17238 always be programmed to the address of the GlobalTable, and *user data 17239 register* 1 must always be programmed to the address of the PerShaderTable. 17240 17241.. 17242 17243 .. table:: AMDPAL User Data Mapping 17244 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table 17245 17246 ========== ================= =============================================================================== 17247 Value Name Description 17248 ========== ================= =============================================================================== 17249 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()* 17250 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should 17251 always point to *user data register* 0). 17252 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See 17253 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section` 17254 for more detail (should always point to *user data register* 1). 17255 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See 17256 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for 17257 more detail. 17258 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't 17259 reference the draw index in the vertex shader. 
Only supported by the first 17260 stage in a graphics pipeline. 17261 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in 17262 a graphics pipeline. 17263 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a 17264 graphics pipeline. 17265 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of 17266 a buffer containing the grid dimensions for a Compute dispatch operation. The 17267 high half of the address is stored in the next sequential user-SGPR. Only 17268 supported by compute pipelines. 17269 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS 17270 space used for the ES/GS pseudo-ring-buffer for passing data between shader 17271 stages. 17272 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic 17273 pipeline instancing. 17274 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This 17275 can only appear for one shader stage per pipeline. 17276 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer. 17277 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can 17278 only appear for one shader stage per pipeline. 17279 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can 17280 only appear for one shader stage per pipeline (PS). These replace color targets 17281 and are completely separate from any UAVs used by the shader. This is optional, 17282 and only used by the PS when UAV exports are used to replace color-target 17283 exports to optimize specific shaders. 17284 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by 17285 some NGG pipelines to perform culling. 
This value contains the address of the 17286 first of two consecutive registers which provide the full GPU address. 17287 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine. 17288 ========== ================= =============================================================================== 17289 17290.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section: 17291 17292Per-Shader Table 17293################ 17294 17295Low 32 bits of the GPU address for an optional buffer in the ``.data`` 17296section of the ELF. The high 32 bits of the address match the high 32 bits 17297of the shader's program counter. 17298 17299The buffer can be anything the shader compiler needs it for, and 17300allows each shader to have its own region of the ``.data`` section. 17301Typically, this could be a table of buffer SRD's and the data pointed to 17302by the buffer SRD's, but it could be a flat-address region of memory as 17303well. Its layout and usage are defined by the shader compiler. 17304 17305Each shader's table in the ``.data`` section is referenced by the symbol 17306``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the 17307hardware shader stage the data is for. E.g., 17308``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage. 17309 17310.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section: 17311 17312Spill Table 17313########### 17314 17315It is possible for a hardware shader to need access to more *user data 17316entries* than there are slots available in user data registers for one 17317or more hardware shader stages. In that case, the PAL runtime expects 17318the necessary *user data entries* to be spilled to GPU memory and use 17319one user data register to point to the spilled user data memory. The 17320value of the *user data entry* must then represent the location where 17321a shader expects to read the low 32-bits of the table's GPU virtual 17322address. 
The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Core file format
================

This section describes the format of core files supporting AMDGPU. Core dumps
for an AMDGPU program can come in two flavors: split or unified core files.

The split layout consists of one host core file containing the information to
rebuild the image of the host process, and one AMDGPU core file that contains
the information for the AMDGPU agents used in the process. The AMDGPU core
file consists of:

* A note describing the state of the AMDGPU agents, AMDGPU queues, and AMDGPU
  runtime for the process (see :ref:`amdgpu_corefile_note`).
* A list of load segments containing an image of the AMDGPU agents' memory (see
  :ref:`amdgpu_corefile_memory`).
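The split-layout contents listed above can be inspected mechanically. The
following is a minimal, illustrative sketch (the helper name is ours, not part
of any AMDGPU tool) that tallies the ``PT_NOTE`` and ``PT_LOAD`` segments of an
``ELF64`` core file image, using only the standard ELF64 layout:

```python
import struct

# Standard ELF64 constants and offsets (not AMDGPU-specific).
PT_LOAD, PT_NOTE = 1, 4

def count_core_segments(elf: bytes) -> dict:
    """Tally program header types in an ELF64 core file image."""
    assert elf[:4] == b"\x7fELF", "not an ELF file"
    # e_phoff is at offset 32; e_phentsize/e_phnum are at offsets 54/56.
    e_phoff, = struct.unpack_from("<Q", elf, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 54)
    counts = {"PT_NOTE": 0, "PT_LOAD": 0, "other": 0}
    for i in range(e_phnum):
        # p_type is the first 32-bit field of each program header.
        p_type, = struct.unpack_from("<I", elf, e_phoff + i * e_phentsize)
        if p_type == PT_NOTE:
            counts["PT_NOTE"] += 1
        elif p_type == PT_LOAD:
            counts["PT_LOAD"] += 1
        else:
            counts["other"] += 1
    return counts
```

A well-formed split-layout AMDGPU core file would report at least one
``PT_NOTE`` segment plus one ``PT_LOAD`` segment per mapped memory region.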
The unified core file is the union of all the information contained in
the two files of the split layout (all notes and load segments). It contains
all the information required to reconstruct the image of the process across all
the agents.

Core file header
----------------

An AMDGPU core file is an ``ELF64`` core file. The content of the header
differs between the unified core file layout and the split (AMDGPU) core file
layout.

Split files
~~~~~~~~~~~

In the split files layout, the AMDGPU core file is an ``ELF64`` file with the
header configured as described in :ref:`amdgpu-corefile-headers-table`:

  .. table:: AMDGPU corefile headers
     :name: amdgpu-corefile-headers-table

     ========================== ===================================
     Field                      Value
     ========================== ===================================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64`` (``0x2``)
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB`` (``0x1``)
     ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA`` (``0x40``)
     ``e_type``                 ``ET_CORE`` (``0x4``)
     ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA_5``
     ``e_machine``              ``EM_AMDGPU`` (``0xe0``)
     ========================== ===================================

Unified file
~~~~~~~~~~~~

In the unified core file mode, the ``ELF64`` headers are set to describe
the host architecture and process.

.. _amdgpu_corefile_note:

Core file notes
---------------

An AMDGPU core file must contain one snapshot note in a ``PT_NOTE`` segment.
When using a split core file layout, this note is in the AMDGPU file.
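A consumer of the split layout can recognize the AMDGPU file by checking the
header fields from :ref:`amdgpu-corefile-headers-table` before searching for
the note. A minimal sketch (the helper name is illustrative; the ABI version
byte is deliberately not checked here):

```python
import struct

# Values from the AMDGPU corefile headers table (split layout).
ELFCLASS64, ELFDATA2LSB = 0x2, 0x1
ELFOSABI_AMDGPU_HSA = 0x40
ET_CORE, EM_AMDGPU = 0x4, 0xE0

def is_split_amdgpu_corefile(header: bytes) -> bool:
    """Check e_ident, e_type, and e_machine of a candidate ELF64 file.

    `header` must hold at least the first 20 bytes: e_ident (16 bytes)
    followed by the little-endian 16-bit e_type and e_machine fields.
    """
    if header[:4] != b"\x7fELF":
        return False
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    return (header[4] == ELFCLASS64 and header[5] == ELFDATA2LSB
            and header[7] == ELFOSABI_AMDGPU_HSA
            and e_type == ET_CORE and e_machine == EM_AMDGPU)
```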
17410 17411The note record vendor field is "``AMDGPU``" and the record type is 17412"``NT_AMDGPU_KFD_CORE_STATE``" (see :ref:`amdgpu-note-records-v3-onwards`) 17413 17414The content of the note is defined in table 17415:ref:`amdgpu-core-snapshot-note-layout-table-v1`: 17416 17417 .. table:: AMDGPU snapshot note format V1 17418 :name: amdgpu-core-snapshot-note-layout-table-v1 17419 17420 ================================ ======================================= ======================= ============== =========================== 17421 Field Type Size (bytes) Byte alignment Comment 17422 ================================ ======================================= ======================= ============== =========================== 17423 ``version_major`` ``uint32`` 4 4 ``KFD_IOCTL_MAJOR_VERSION`` 17424 ``version_minor`` ``uint32`` 4 4 ``KFD_IOCTL_MINOR_VERSION`` 17425 ``runtime_info_size`` ``uint64`` 8 8 Must be a multiple of 8 17426 ``n_agents`` ``uint32`` 4 8 17427 ``agent_info_entry_size`` ``uint32`` 4 4 Must be a multiple of 8 17428 ``n_queues`` ``uint32`` 4 8 17429 ``queue_info_entry_size`` ``uint32`` 4 4 Must be a multiple of 8 17430 ``runtime_info`` ``kfd_runtime_info`` ``runtime_info_size`` 8 17431 ``agents_info`` ``kfd_dbg_device_info_entry[n_agents]`` ``n_agents * 8 17432 agent_info_entry_size`` 17433 ``queues_info`` ``kfd_queue_snapshot_entry[n_queues]`` ``n_queues * 17434 queue_info_entry_size`` 8 17435 ================================ ======================================= ======================= ============== =========================== 17436 17437The definition of all the ``kfd_*`` types comes from the 17438``include/uapi/linux/kfd_ioctl.h`` header file from the KFD repository. It is 17439usually installed in ``/usr/include/linux/kfd_ioctl.h``. 
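Given the sizes and alignments above, the fixed fields occupy a contiguous
32-byte prefix of the note descriptor (the stated alignments introduce no
implicit padding). A minimal decoding sketch (the function name is
illustrative):

```python
import struct

# version_major/minor (u32), runtime_info_size (u64), n_agents (u32),
# agent_info_entry_size (u32), n_queues (u32), queue_info_entry_size (u32).
_NOTE_PREFIX = struct.Struct("<IIQIIII")  # 32 bytes, little-endian

def parse_snapshot_note_prefix(desc: bytes) -> dict:
    """Decode the fixed-size prefix of the V1 snapshot note descriptor."""
    fields = ("version_major", "version_minor", "runtime_info_size",
              "n_agents", "agent_info_entry_size", "n_queues",
              "queue_info_entry_size")
    return dict(zip(fields, _NOTE_PREFIX.unpack_from(desc, 0)))
```

The variable-size ``runtime_info``, ``agents_info``, and ``queues_info``
payloads follow this prefix, with sizes given by the decoded fields.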
The version of the
``kfd_ioctl.h`` file used must define values for
``KFD_IOCTL_MAJOR_VERSION`` and ``KFD_IOCTL_MINOR_VERSION`` matching
the values of ``version_major`` and ``version_minor`` from the
note.

.. _amdgpu_corefile_memory:

Memory segments
---------------

An AMDGPU core file must contain an image of the AMDGPU agents' memory in load
segments (of type ``PT_LOAD``). Those segments must correspond to the memory
regions where the content of the agent memory is mapped into the host process
by the ROCr runtime (note that those memory mappings are usually not readable
by the process itself).

When using the split core file layout, those segments must be included in the
AMDGPU core file.

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL, the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC, the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC-based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX11.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated, while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included
in this description.
17531 17532 ============= ============================================= ======================================= 17533 Architecture Core ISA ISA Variants and Extensions 17534 ============= ============================================= ======================================= 17535 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \- 17536 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \- 17537 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>` 17538 17539 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>` 17540 17541 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>` 17542 17543 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>` 17544 17545 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>` 17546 17547 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>` 17548 17549 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>` 17550 17551 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>` 17552 17553 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>` 17554 17555 :doc:`gfx941<AMDGPU/AMDGPUAsmGFX940>` 17556 17557 :doc:`gfx942<AMDGPU/AMDGPUAsmGFX940>` 17558 17559 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>` 17560 17561 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>` 17562 17563 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>` 17564 17565 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>` 17566 17567 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>` 17568 17569 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>` 17570 17571 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>` 17572 17573 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>` 17574 17575 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>` 17576 17577 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>` 17578 17579 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>` 17580 17581 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>` 17582 17583 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>` 17584 17585 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>` 17586 17587 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>` 
17588 ============= ============================================= ======================================= 17589 17590For more information about instructions, their semantics and supported 17591combinations of operands, refer to one of instruction set architecture manuals 17592[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, 17593[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_, 17594[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, 17595[AMD-GCN-GFX940-GFX942-CDNA3]_, [AMD-GCN-GFX10-RDNA1]_, [AMD-GCN-GFX10-RDNA2]_, 17596[AMD-GCN-GFX11-RDNA3]_ and [AMD-GCN-GFX11-RDNA3.5]_. 17597 17598Operands 17599~~~~~~~~ 17600 17601Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`. 17602 17603Modifiers 17604~~~~~~~~~ 17605 17606Detailed description of modifiers may be found 17607:doc:`here<AMDGPUModifierSyntax>`. 17608 17609Instruction Examples 17610~~~~~~~~~~~~~~~~~~~~ 17611 17612DS 17613++ 17614 17615.. code-block:: nasm 17616 17617 ds_add_u32 v2, v4 offset:16 17618 ds_write_src2_b64 v2 offset0:4 offset1:8 17619 ds_cmpst_f32 v2, v4, v6 17620 ds_min_rtn_f64 v[8:9], v2, v[4:5] 17621 17622For full list of supported instructions, refer to "LDS/GDS instructions" in ISA 17623Manual. 17624 17625FLAT 17626++++ 17627 17628.. code-block:: nasm 17629 17630 flat_load_dword v1, v[3:4] 17631 flat_store_dwordx3 v[3:4], v[5:7] 17632 flat_atomic_swap v1, v[3:4], v5 glc 17633 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc 17634 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc 17635 17636For full list of supported instructions, refer to "FLAT instructions" in ISA 17637Manual. 17638 17639MUBUF 17640+++++ 17641 17642.. 
code-block:: nasm 17643 17644 buffer_load_dword v1, off, s[4:7], s1 17645 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe 17646 buffer_store_format_xy v[1:2], off, s[4:7], s1 17647 buffer_wbinvl1 17648 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc 17649 17650For full list of supported instructions, refer to "MUBUF Instructions" in ISA 17651Manual. 17652 17653SMRD/SMEM 17654+++++++++ 17655 17656.. code-block:: nasm 17657 17658 s_load_dword s1, s[2:3], 0xfc 17659 s_load_dwordx8 s[8:15], s[2:3], s4 17660 s_load_dwordx16 s[88:103], s[2:3], s4 17661 s_dcache_inv_vol 17662 s_memtime s[4:5] 17663 17664For full list of supported instructions, refer to "Scalar Memory Operations" in 17665ISA Manual. 17666 17667SOP1 17668++++ 17669 17670.. code-block:: nasm 17671 17672 s_mov_b32 s1, s2 17673 s_mov_b64 s[0:1], 0x80000000 17674 s_cmov_b32 s1, 200 17675 s_wqm_b64 s[2:3], s[4:5] 17676 s_bcnt0_i32_b64 s1, s[2:3] 17677 s_swappc_b64 s[2:3], s[4:5] 17678 s_cbranch_join s[4:5] 17679 17680For full list of supported instructions, refer to "SOP1 Instructions" in ISA 17681Manual. 17682 17683SOP2 17684++++ 17685 17686.. code-block:: nasm 17687 17688 s_add_u32 s1, s2, s3 17689 s_and_b64 s[2:3], s[4:5], s[6:7] 17690 s_cselect_b32 s1, s2, s3 17691 s_andn2_b32 s2, s4, s6 17692 s_lshr_b64 s[2:3], s[4:5], s6 17693 s_ashr_i32 s2, s4, s6 17694 s_bfm_b64 s[2:3], s4, s6 17695 s_bfe_i64 s[2:3], s[4:5], s6 17696 s_cbranch_g_fork s[4:5], s[6:7] 17697 17698For full list of supported instructions, refer to "SOP2 Instructions" in ISA 17699Manual. 17700 17701SOPC 17702++++ 17703 17704.. code-block:: nasm 17705 17706 s_cmp_eq_i32 s1, s2 17707 s_bitcmp1_b32 s1, s2 17708 s_bitcmp0_b64 s[2:3], s4 17709 s_setvskip s3, s5 17710 17711For full list of supported instructions, refer to "SOPC Instructions" in ISA 17712Manual. 17713 17714SOPP 17715++++ 17716 17717.. 
code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For full list of supported instructions, refer to "SOPP Instructions" in ISA
Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically select the optimal encoding based on its
operands. To force a specific encoding, one can add a suffix to the opcode of
the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _e64_dpp for VOP3 with DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

..
code-block:: nasm 17771 17772 v_mov_b32 v0, v0 quad_perm:[0,2,1,1] 17773 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 17774 v_mov_b32 v0, v0 wave_shl:1 17775 v_mov_b32 v0, v0 row_mirror 17776 v_mov_b32 v0, v0 row_bcast:31 17777 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 17778 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 17779 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 17780 17781 17782VOP3_DPP examples (Available on GFX11+): 17783 17784.. code-block:: nasm 17785 17786 v_add_f32_e64_dpp v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7] 17787 v_sqrt_f32_e64_dpp v0, v1 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 17788 v_ldexp_f32 v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7] 17789 17790VOP_SDWA examples: 17791 17792.. code-block:: nasm 17793 17794 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD 17795 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD 17796 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 17797 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 17798 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 17799 17800For full list of supported instructions, refer to "Vector ALU instructions". 17801 17802.. _amdgpu-amdhsa-assembler-predefined-symbols-v2: 17803 17804Code Object V2 Predefined Symbols 17805~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 17806 17807.. warning:: 17808 Code object V2 generation is no longer supported by this version of LLVM. 17809 17810The AMDGPU assembler defines and updates some symbols automatically. These 17811symbols do not affect code generation. 17812 17813.option.machine_version_major 17814+++++++++++++++++++++++++++++ 17815 17816Set to the GFX major generation number of the target being assembled for. For 17817example, when assembling for a "GFX9" target this will be set to the integer 17818value "9". 
The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_minor
+++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_stepping
++++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.kernel.vgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus
one.

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
   Code object V2 generation is no longer supported by this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object.
In assembly source, one can specify them with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified a default value will be used. The
default value for all keys is 0, with the following exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that the wavefront size is specified as a power of two, so
  a value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.

.. _amdgpu-amdhsa-assembler-example-v2:

Code Object V2 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 generation is no longer supported by this version of LLVM.

Here is an example of a minimal assembly source file, defining one HSA kernel:
.. code::
  :number-lines:

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
       enable_sgpr_kernarg_segment_ptr = 1
       is_ptr64 = 1
       compute_pgm_rsrc1_vgprs = 0
       compute_pgm_rsrc1_sgprs = 0
       compute_pgm_rsrc2_user_sgpr = 2
       compute_pgm_rsrc1_wgp_mode = 0
       compute_pgm_rsrc1_mem_ordered = 0
       compute_pgm_rsrc1_fwd_progress = 1
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:

Code Object V3 and Above Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.
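For illustration only, the correspondence between a processor name and these
generation-number symbols can be sketched as follows. This is a hypothetical
helper, not part of the assembler, and it assumes the common
``gfx<major><minor><stepping>`` naming scheme in which minor and stepping are
single (possibly hexadecimal) digits:

```python
def gfx_version(processor: str) -> tuple[int, int, int]:
    """Split a gfx processor name into (major, minor, stepping).

    The last two characters are taken as the minor and stepping digits
    (a stepping may be a hex digit, e.g. the "a" in gfx90a); the leading
    digits form the major generation number.
    """
    digits = processor.removeprefix("gfx")
    major = int(digits[:-2])
    minor = int(digits[-2], 16)
    stepping = int(digits[-1], 16)
    return major, minor, stepping

# e.g. "gfx810" corresponds to major 8, minor 1, stepping 0,
# matching the example values given for these symbols above.
```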
.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the `.amdhsa_next_free_vgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3-onwards:

Code Object V3 and Above Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific.
Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.. _amdgpu-assembler-directive-amdgcn-target:

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.. note::

  The target ID syntax used for code object V2 to V3 for this directive differs
  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-assembler-directive-amdhsa-code-object-version:

.amdhsa_code_object_version <version>
+++++++++++++++++++++++++++++++++++++

Optional directive which declares the code object version to be generated by
the assembler. If not present, a default value will be used.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`.
Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== =================== ============ ===================
     Directive                                                Default             Supported On Description
     ======================================================== =================== ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX12   Controls GROUP_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX12   Controls PRIVATE_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX12   Controls KERNARG_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX12   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX12   Controls ENABLE_SGPR_DISPATCH_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX12   Controls ENABLE_SGPR_QUEUE_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX12   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX12   Controls ENABLE_SGPR_DISPATCH_ID in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX12   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_wavefront_size32``                             Target              GFX10-GFX12  Controls ENABLE_WAVEFRONT_SIZE32 in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (wavefrontsize64)
     ``.amdhsa_uses_dynamic_stack``                           0                   GFX6-GFX12   Controls USES_DYNAMIC_STACK in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_enable_private_segment``                       0                   GFX940,      Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
                                                                                  GFX11-GFX12
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_ID_X in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_ID_Y in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_ID_Z in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX12   Controls ENABLE_SGPR_WORKGROUP_INFO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX12   Controls ENABLE_VGPR_WORKITEM_ID in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`. Possible values are defined in :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX12   Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX12   Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file. Used to calculate ACCUM_OFFSET in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                                                  GFX940
     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX12   Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
                                                              Feature
                                                              Specific
                                                              (xnack)
     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX12   Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX12   Controls FLOAT_ROUND_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX12   Controls FLOAT_DENORM_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX12   Controls FLOAT_DENORM_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX11   Controls ENABLE_DX10_CLAMP in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX11   Controls ENABLE_IEEE_MODE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_round_robin_scheduling``                       0                   GFX12        Controls ENABLE_WG_RR_EN in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX12   Controls FP16_OVFL in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                              Feature             GFX940,
                                                              Specific            GFX11-GFX12
                                                              (tgsplit)
     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10-GFX12  Controls ENABLE_WGP_MODE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (cumode)
     ``.amdhsa_memory_ordered``                               1                   GFX10-GFX12  Controls MEM_ORDERED in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_forward_progress``                             0                   GFX10-GFX12  Controls FWD_PROGRESS in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
     ``.amdhsa_shared_vgpr_count``                            0                   GFX10-GFX11  Controls SHARED_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX12   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
     ``.amdhsa_user_sgpr_kernarg_preload_length``             0                   GFX90A,      Controls KERNARG_PRELOAD_SPEC_LENGTH in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  GFX940
     ``.amdhsa_user_sgpr_kernarg_preload_offset``             0                   GFX90A,      Controls KERNARG_PRELOAD_SPEC_OFFSET in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  GFX940
     ======================================================== =================== ============ ===================

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3-onwards:

Code Object V3 and Above Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  .amdgpu_metadata
  ---
  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name: hello_world
      .symbol: hello_world.kd
      .kernarg_segment_size: 48
      .group_segment_fixed_size: 0
      .private_segment_fixed_size: 0
      .kernarg_segment_align: 4
      .wavefront_size: 64
      .sgpr_count: 2
      .vgpr_count: 3
      .max_flat_workgroup_size: 256
      .args:
        - .size: 8
          .offset: 0
          .value_kind: global_buffer
          .address_space: global
          .actual_access: write_only
  //...
  .end_amdgpu_metadata

This kernel is equivalent to the following HIP program:

.. code::
  :number-lines:

  __global__ void hello_world(float *p) {
    *p = 3.14159f;
  }

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient to
group the function with the kernel that calls it and reset the symbols between
the two connected components:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // gpr tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@lo+4
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

These symbols cannot identify connected components, and so cannot
automatically track the usage for each kernel separately. However, in some
cases careful organization of the kernels and functions in the source file
means there is minimal additional effort required to accurately calculate GPR
usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
.. [AMD-GCN-GFX940-GFX942-CDNA3] `AMD Instinct MI300 Instruction Set Architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
.. [AMD-GCN-GFX11-RDNA3.5] `AMD RDNA 3.5 Instruction Set Architecture <https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna35_instruction_set_architecture.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <https://research.cs.wisc.edu/multifacet/papers/asplos14_hrf.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__