1============================= 2User Guide for NVPTX Back-end 3============================= 4 5.. contents:: 6 :local: 7 :depth: 3 8 9 10Introduction 11============ 12 13To support GPU programming, the NVPTX back-end supports a subset of LLVM IR 14along with a defined set of conventions used to represent GPU programming 15concepts. This document provides an overview of the general usage of the back- 16end, including a description of the conventions used and the set of accepted 17LLVM IR. 18 19.. note:: 20 21 This document assumes a basic familiarity with CUDA and the PTX 22 assembly language. Information about the CUDA Driver API and the PTX assembly 23 language can be found in the `CUDA documentation 24 <http://docs.nvidia.com/cuda/index.html>`_. 25 26 27 28Conventions 29=========== 30 31Marking Functions as Kernels 32---------------------------- 33 34In PTX, there are two types of functions: *device functions*, which are only 35callable by device code, and *kernel functions*, which are callable by host 36code. By default, the back-end will emit device functions. Metadata is used to 37declare a function as a kernel function. This metadata is attached to the 38``nvvm.annotations`` named metadata object, and has the following format: 39 40.. code-block:: text 41 42 !0 = !{<function-ref>, metadata !"kernel", i32 1} 43 44The first parameter is a reference to the kernel function. The following 45example shows a kernel function calling a device function in LLVM IR. The 46function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not. 47 48.. code-block:: llvm 49 50 define float @my_fmad(float %x, float %y, float %z) { 51 %mul = fmul float %x, %y 52 %add = fadd float %mul, %z 53 ret float %add 54 } 55 56 define void @my_kernel(ptr %ptr) { 57 %val = load float, ptr %ptr 58 %ret = call float @my_fmad(float %val, float %val, float %val) 59 store float %ret, ptr %ptr 60 ret void 61 } 62 63 !nvvm.annotations = !{!1} 64 !1 = !{ptr @my_kernel, !"kernel", i32 1} 65 66When compiled, the PTX kernel functions are callable by host-side code. 67 68 69.. _address_spaces: 70 71Address Spaces 72-------------- 73 74The NVPTX back-end uses the following address space mapping: 75 76 ============= ====================== 77 Address Space Memory Space 78 ============= ====================== 79 0 Generic 80 1 Global 81 2 Internal Use 82 3 Shared 83 4 Constant 84 5 Local 85 ============= ====================== 86 87Every global variable and pointer type is assigned to one of these address 88spaces, with 0 being the default address space. Intrinsics are provided which 89can be used to convert pointers between the generic and non-generic address 90spaces. 91 92As an example, the following IR will define an array ``@g`` that resides in 93global device memory. 94 95.. code-block:: llvm 96 97 @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] 98 99LLVM IR functions can read and write to this array, and host-side code can 100copy data to it by name with the CUDA Driver API. 101 102Note that since address space 0 is the generic space, it is illegal to have 103global variables in address space 0. Address space 0 is the default address 104space in LLVM, so the ``addrspace(N)`` annotation is *required* for global 105variables. 106 107 108Triples 109------- 110 111The NVPTX target uses the module triple to select between 32/64-bit code 112generation and the driver-compiler interface to use. The triple architecture 113can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). 
The 114operating system should be one of ``cuda`` or ``nvcl``, which determines the 115interface used by the generated code to communicate with the driver. Most 116users will want to use ``cuda`` as the operating system, which makes the 117generated PTX compatible with the CUDA Driver API. 118 119Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda`` 120 121Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda`` 122 123 124 125.. _nvptx_intrinsics: 126 127NVPTX Intrinsics 128================ 129 130Reading PTX Special Registers 131----------------------------- 132 133'``llvm.nvvm.read.ptx.sreg.*``' 134^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 135 136Syntax: 137""""""" 138 139.. code-block:: llvm 140 141 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() 142 declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() 143 declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() 144 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() 145 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() 146 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() 147 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() 148 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() 149 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() 150 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() 151 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y() 152 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z() 153 declare i32 @llvm.nvvm.read.ptx.sreg.warpsize() 154 155Overview: 156""""""""" 157 158The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX 159special registers, in particular the kernel launch bounds. These registers 160map in the following way to CUDA builtins: 161 162 ============ ===================================== 163 CUDA Builtin PTX Special Register Intrinsic 164 ============ ===================================== 165 ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*`` 166 ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*`` 167 ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*`` 168 ``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*`` 169 ============ ===================================== 170 171 172Barriers 173-------- 174 175'``llvm.nvvm.barrier0``' 176^^^^^^^^^^^^^^^^^^^^^^^^^^^ 177 178Syntax: 179""""""" 180 181.. code-block:: llvm 182 183 declare void @llvm.nvvm.barrier0() 184 185Overview: 186""""""""" 187 188The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0`` 189instruction, equivalent to the ``__syncthreads()`` call in CUDA. 190 191Electing a thread 192----------------- 193 194'``llvm.nvvm.elect.sync``' 195^^^^^^^^^^^^^^^^^^^^^^^^^^ 196 197Syntax: 198""""""" 199 200.. code-block:: llvm 201 202 declare {i32, i1} @llvm.nvvm.elect.sync(i32 %membermask) 203 204Overview: 205""""""""" 206 207The '``@llvm.nvvm.elect.sync``' intrinsic generates the ``elect.sync`` 208PTX instruction, which elects one predicated active leader thread from 209a set of threads specified by ``membermask``. The behavior is undefined 210if the executing thread is not in ``membermask``. The laneid of the 211elected thread is captured in the i32 return value. The i1 return 212value is set to ``True`` for the leader thread and ``False`` for all 213the other threads. Election of a leader thread happens deterministically, 214i.e. the same leader thread is elected for the same ``membermask`` 215every time. For more information, refer PTX ISA 216`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-elect-sync>`_. 
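
A minimal sketch of electing a single leader lane from the full warp (mask
``0xffffffff``, written as ``i32 -1``) and doing work only on that lane might
look like the following; the function and value names are illustrative only:

.. code-block:: llvm

  declare {i32, i1} @llvm.nvvm.elect.sync(i32 %membermask)

  define void @leader_does_store(ptr %out, i32 %val) {
  entry:
    ; All 32 lanes participate; exactly one is elected as the leader.
    %r = call {i32, i1} @llvm.nvvm.elect.sync(i32 -1)
    %is.leader = extractvalue {i32, i1} %r, 1
    br i1 %is.leader, label %leader, label %done

  leader:
    ; Only the elected leader lane performs the store.
    store i32 %val, ptr %out
    br label %done

  done:
    ret void
  }
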
Membar/Fences
-------------

'``llvm.nvvm.fence.proxy.tensormap_generic.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.cta()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.cluster()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.gpu()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.sys()

  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cta(ptr %addr, i32 %size)
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cluster(ptr %addr, i32 %size)
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.gpu(ptr %addr, i32 %size)
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.sys(ptr %addr, i32 %size)

Overview:
"""""""""

The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` intrinsics are
uni-directional fences used to establish ordering between a prior memory access
performed via the generic `proxy
<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>`_
and a subsequent memory access performed via the tensormap proxy.
``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that
synchronizes with an acquire sequence that contains the
``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table
describes the mapping between the LLVM intrinsics and the corresponding PTX
instructions:

  ====================================================== =========================================================
  NVVM Intrinsic                                         PTX Instruction
  ====================================================== =========================================================
  ``@llvm.nvvm.fence.proxy.tensormap_generic.release.*`` ``fence.proxy.tensormap::generic.release.*``
  ``@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*`` ``fence.proxy.tensormap::generic.acquire.* [addr], size``
  ====================================================== =========================================================

The address operand ``addr`` and the operand ``size`` together specify the
memory range ``[addr, addr+size)`` over which the ordering guarantee on memory
accesses across the proxies is provided. The only supported value for the
``size`` operand is ``128``, and it must be an immediate. Generic addressing is
used unconditionally, and the address specified by the operand ``addr`` must
fall within the ``.global`` state space; otherwise, the behavior is undefined.
For more information, see the `PTX ISA
<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.

Address Space Intrinsics
------------------------

'``llvm.nvvm.isspacep.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i1 @llvm.nvvm.isspacep.const(ptr %p)
  declare i1 @llvm.nvvm.isspacep.global(ptr %p)
  declare i1 @llvm.nvvm.isspacep.local(ptr %p)
  declare i1 @llvm.nvvm.isspacep.shared(ptr %p)
  declare i1 @llvm.nvvm.isspacep.shared.cluster(ptr %p)

Overview:
"""""""""

The '``llvm.nvvm.isspacep.*``' intrinsics determine whether the provided generic
pointer references memory which falls within a particular address space.

Semantics:
""""""""""

If the given pointer in the generic address space refers to memory which falls
within the state space of the intrinsic (and therefore could be safely address
space casted to this space), 1 is returned, otherwise 0 is returned.
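
As an illustrative sketch (the function name is hypothetical), a generic
pointer can be tested and then address space cast to the matching non-generic
space:

.. code-block:: llvm

  declare i1 @llvm.nvvm.isspacep.global(ptr %p)

  define float @load_preferring_global(ptr %p) {
  entry:
    ; Returns 1 only if the generic pointer refers to global memory.
    %is.global = call i1 @llvm.nvvm.isspacep.global(ptr %p)
    br i1 %is.global, label %global.path, label %generic.path

  global.path:
    ; The check above makes this address space cast safe.
    %gp = addrspacecast ptr %p to ptr addrspace(1)
    %vg = load float, ptr addrspace(1) %gp, align 4
    ret float %vg

  generic.path:
    %vp = load float, ptr %p, align 4
    ret float %vp
  }
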
Arithmetic Intrinsics
---------------------

'``llvm.nvvm.idp2a.[us].[us]``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.idp2a.s.s(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
  declare i32 @llvm.nvvm.idp2a.s.u(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
  declare i32 @llvm.nvvm.idp2a.u.s(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
  declare i32 @llvm.nvvm.idp2a.u.u(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)


Overview:
"""""""""

The '``llvm.nvvm.idp2a.[us].[us]``' intrinsics perform a 2-element vector dot
product followed by addition. They correspond directly to the ``dp2a`` PTX
instruction.

Semantics:
""""""""""

The 32-bit value in ``%a`` is broken into 2 16-bit values which are extended to
32 bits. For the '``llvm.nvvm.idp2a.u.[us]``' variants zero-extension is used,
while for the '``llvm.nvvm.idp2a.s.[us]``' variants sign-extension is used. Two
bytes are selected from ``%b``: if ``%is.hi`` is true, the most significant
bytes are selected, otherwise the least significant bytes are selected. These
bytes are then extended to 32 bits. For the '``llvm.nvvm.idp2a.[us].u``'
variants zero-extension is used, while for the '``llvm.nvvm.idp2a.[us].s``'
variants sign-extension is used. The dot product of these 2-element vectors is
added to ``%c`` to produce the return value.


'``llvm.nvvm.idp4a.[us].[us]``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.idp4a.s.s(i32 %a, i32 %b, i32 %c)
  declare i32 @llvm.nvvm.idp4a.s.u(i32 %a, i32 %b, i32 %c)
  declare i32 @llvm.nvvm.idp4a.u.s(i32 %a, i32 %b, i32 %c)
  declare i32 @llvm.nvvm.idp4a.u.u(i32 %a, i32 %b, i32 %c)

Overview:
"""""""""

The '``llvm.nvvm.idp4a.[us].[us]``' intrinsics perform a 4-element vector dot
product followed by addition. They correspond directly to the ``dp4a`` PTX
instruction.

Semantics:
""""""""""

Each of the 4 bytes in both ``%a`` and ``%b`` is extended to a 32-bit integer,
forming 2 ``<4 x i32>`` vectors. For ``%a``, zero-extension is used in the
'``llvm.nvvm.idp4a.u.[us]``' variants, while sign-extension is used with the
'``llvm.nvvm.idp4a.s.[us]``' variants. Similarly, for ``%b``, zero-extension is
used in the '``llvm.nvvm.idp4a.[us].u``' variants, while sign-extension is used
with the '``llvm.nvvm.idp4a.[us].s``' variants. The dot product of these
4-element vectors is added to ``%c`` to produce the return value.

Bit Manipulation Intrinsics
---------------------------

'``llvm.nvvm.fshl.clamp.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.fshl.clamp.i32(i32 %hi, i32 %lo, i32 %n)

Overview:
"""""""""

The '``llvm.nvvm.fshl.clamp``' family of intrinsics performs a clamped funnel
shift left. These intrinsics are very similar to '``llvm.fshl``', except that
the shift amount is clamped at the integer width (instead of being reduced
modulo it). Currently, only ``i32`` is supported.

Semantics:
""""""""""

The '``llvm.nvvm.fshl.clamp``' family of intrinsic functions performs a clamped
funnel shift left: the first two values are concatenated as { %hi : %lo } (%hi
is the most significant bits of the wide value), the combined value is shifted
left, and the most significant bits are extracted to produce a result that is
the same size as the original arguments. The shift amount is the minimum of the
value of %n and the bit width of the integer type.
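
As an illustrative sketch, the following IR contrasts the clamped behavior with
the modulo behavior of the generic '``llvm.fshl``' intrinsic for an
out-of-range shift amount (the function name and constants are only for
illustration):

.. code-block:: llvm

  declare i32 @llvm.nvvm.fshl.clamp.i32(i32, i32, i32)
  declare i32 @llvm.fshl.i32(i32, i32, i32)

  define i32 @clamp_vs_modulo() {
    ; %hi = 255, %lo = 16, shift amount = 40 (larger than the bit width).
    ; The clamped variant shifts { %hi : %lo } left by min(40, 32) = 32,
    ; so the extracted high word is simply %lo, i.e. 16.
    %clamped = call i32 @llvm.nvvm.fshl.clamp.i32(i32 255, i32 16, i32 40)
    ; The generic funnel shift instead shifts by 40 % 32 = 8,
    ; giving (255 << 8) | (16 >> 24) = 65280.
    %modulo = call i32 @llvm.fshl.i32(i32 255, i32 16, i32 40)
    %sum = add i32 %clamped, %modulo
    ret i32 %sum
  }
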
'``llvm.nvvm.fshr.clamp.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.fshr.clamp.i32(i32 %hi, i32 %lo, i32 %n)

Overview:
"""""""""

The '``llvm.nvvm.fshr.clamp``' family of intrinsics performs a clamped funnel
shift right. These intrinsics are very similar to '``llvm.fshr``', except that
the shift amount is clamped at the integer width (instead of being reduced
modulo it). Currently, only ``i32`` is supported.

Semantics:
""""""""""

The '``llvm.nvvm.fshr.clamp``' family of intrinsic functions performs a clamped
funnel shift right: the first two values are concatenated as { %hi : %lo } (%hi
is the most significant bits of the wide value), the combined value is shifted
right, and the least significant bits are extracted to produce a result that is
the same size as the original arguments. The shift amount is the minimum of the
value of %n and the bit width of the integer type.

'``llvm.nvvm.flo.u.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.flo.u.i32(i32 %a, i1 %shiftamt)
  declare i32 @llvm.nvvm.flo.u.i64(i64 %a, i1 %shiftamt)

Overview:
"""""""""

The '``llvm.nvvm.flo.u``' family of intrinsics identifies the bit position of
the leading one, returning either its offset from the most or the least
significant bit.

Semantics:
""""""""""

The '``llvm.nvvm.flo.u``' family of intrinsics returns the bit position of the
most significant 1. If %shiftamt is true, the result is the shift amount needed
to left-shift the found bit into the most-significant bit position, otherwise
the result is the shift amount needed to right-shift the found bit into the
least-significant bit position. 0xffffffff is returned if no 1 bit is found.

'``llvm.nvvm.flo.s.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.flo.s.i32(i32 %a, i1 %shiftamt)
  declare i32 @llvm.nvvm.flo.s.i64(i64 %a, i1 %shiftamt)

Overview:
"""""""""

The '``llvm.nvvm.flo.s``' family of intrinsics identifies the bit position of
the leading non-sign bit, returning either its offset from the most or the
least significant bit.

Semantics:
""""""""""

The '``llvm.nvvm.flo.s``' family of intrinsics returns the bit position of the
most significant 0 for negative inputs and the most significant 1 for
non-negative inputs. If %shiftamt is true, the result is the shift amount
needed to left-shift the found bit into the most-significant bit position,
otherwise the result is the shift amount needed to right-shift the found bit
into the least-significant bit position. 0xffffffff is returned if no 1 bit is
found.
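
To make the two result conventions of the ``flo`` intrinsics concrete, here is
a small, illustrative sketch (the function name and constant are hypothetical):

.. code-block:: llvm

  declare i32 @llvm.nvvm.flo.u.i32(i32, i1)

  define i32 @leading_one_demo() {
    ; For the input 0x00010000, the most significant set bit is bit 16.
    ; With %shiftamt = false the result is 16, the right-shift amount that
    ; moves that bit into bit 0.
    %from.lsb = call i32 @llvm.nvvm.flo.u.i32(i32 65536, i1 false)
    ; With %shiftamt = true the result is 15, the left-shift amount that
    ; moves that bit into bit 31.
    %from.msb = call i32 @llvm.nvvm.flo.u.i32(i32 65536, i1 true)
    %sum = add i32 %from.lsb, %from.msb
    ret i32 %sum
  }
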
464 465TMA family of Intrinsics 466------------------------ 467 468'``llvm.nvvm.cp.async.bulk.global.to.shared.cluster``' 469^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 470 471Syntax: 472""""""" 473 474.. code-block:: llvm 475 476 declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src, i32 %size, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch) 477 478Overview: 479""""""""" 480 481The '``@llvm.nvvm.cp.async.bulk.global.to.shared.cluster``' intrinsic 482corresponds to the ``cp.async.bulk.shared::cluster.global.*`` family 483of PTX instructions. These instructions initiate an asynchronous 484copy of bulk data from global memory to shared::cluster memory. 485The 32-bit operand ``%size`` specifies the amount of memory to be 486copied and it must be a multiple of 16. 487 488* The last two arguments to these intrinsics are boolean flags 489 indicating support for cache_hint and/or multicast modifiers. 490 These flag arguments must be compile-time constants. The backend 491 looks through these flags and lowers the intrinsics appropriately. 492 493* The Nth argument (denoted by ``i1 %flag_ch``) when set, indicates 494 a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint`` 495 variant of the PTX instruction. 496 497* The [N-1]th argument (denoted by ``i1 %flag_mc``) when set, indicates 498 the presence of a multicast mask (``i16 %mc``) and generates the PTX 499 instruction with the ``.multicast::cluster`` modifier. 500 501For more information, refer PTX ISA 502`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_. 503 504'``llvm.nvvm.cp.async.bulk.shared.cta.to.global``' 505^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 506 507Syntax: 508""""""" 509 510.. code-block:: llvm 511 512 declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i64 %ch, i1 %flag_ch) 513 514Overview: 515""""""""" 516 517The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.global``' intrinsic 518corresponds to the ``cp.async.bulk.global.shared::cta.*`` set of PTX 519instructions. These instructions initiate an asynchronous copy from 520shared::cta to global memory. The 32-bit operand ``%size`` specifies 521the amount of memory to be copied and it must be a multiple of 16. 522 523* The last argument to these intrinsics is a boolean flag 524 indicating support for cache_hint. This flag argument must 525 be a compile-time constant. When set, it indicates a valid 526 cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint`` 527 variant of the PTX instruction. 528 529For more information, refer PTX ISA 530`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_. 531 532'``llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``' 533^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 534 535Syntax: 536""""""" 537 538.. code-block:: llvm 539 540 declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(3) %src, i32 %size) 541 542Overview: 543""""""""" 544 545The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``' intrinsic 546corresponds to the ``cp.async.bulk.shared::cluster.shared::cta.*`` 547PTX instruction. This instruction initiates an asynchronous copy from 548shared::cta to shared::cluster memory. 
The destination has to be in
the shared memory of a different CTA within the cluster. The 32-bit
operand ``%size`` specifies the amount of memory to be copied and
it must be a multiple of 16.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.

'``llvm.nvvm.cp.async.bulk.prefetch.L2``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.prefetch.L2(ptr addrspace(1) %src, i32 %size, i64 %ch, i1 %flag_ch)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.prefetch.L2``' intrinsic
corresponds to the ``cp.async.bulk.prefetch.L2.*`` family
of PTX instructions. These instructions initiate an asynchronous
prefetch of bulk data from global memory to the L2 cache.
The 32-bit operand ``%size`` specifies the amount of memory to be
prefetched in terms of bytes and it must be a multiple of 16.

* The last argument to these intrinsics is a boolean flag indicating
  support for cache_hint. This flag argument must be a compile-time
  constant. When set, it indicates a valid cache_hint (``i64 %ch``)
  and generates the ``.L2::cache_hint`` variant of the PTX instruction.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch>`_.

'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.2d(..., i32 %d0, i32 %d1, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``' intrinsics
correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous copy of tensor data from
global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
in ``tile`` mode. In tile mode, the multi-dimensional layout of the
source tensor is preserved at the destination. The dimension of the
tensor data ranges from 1d to 5d with the coordinates specified
by the ``i32 %d0 ... i32 %d4`` arguments. A call with both optional
modifiers disabled is sketched after the notes below.

* The last two arguments to these intrinsics are boolean flags
  indicating support for cache_hint and/or multicast modifiers.
  These flag arguments must be compile-time constants. The backend
  looks through these flags and lowers the intrinsics appropriately.

* The Nth argument (denoted by ``i1 %flag_ch``), when set, indicates
  a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

* The [N-1]th argument (denoted by ``i1 %flag_mc``), when set, indicates
  the presence of a multicast mask (``i16 %mc``) and generates the PTX
  instruction with the ``.multicast::cluster`` modifier.
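
Assuming a kernel that already has a destination buffer and an mbarrier object
in shared memory as well as a valid tensor-map pointer, a plain 1D copy with no
multicast and no cache hint might look like the following sketch (the function
and value names are illustrative only):

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(3), ptr addrspace(3), ptr, i32, i16, i64, i1, i1)

  define void @tma_load_1d_tile(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr %tensor_map, i32 %d0) {
    ; Both %flag_mc and %flag_ch are false, so the %mc and %ch operands are
    ; unused and are passed as zero here.
    call void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr %tensor_map, i32 %d0, i16 0, i64 0, i1 false, i1 false)
    ret void
  }
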
622 623For more information, refer PTX ISA 624`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_. 625 626'``llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d``' 627^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 628 629Syntax: 630""""""" 631 632.. code-block:: llvm 633 634 declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch) 635 declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...) 636 declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...) 637 638Overview: 639""""""""" 640 641The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d``' intrinsics 642correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions. 643These instructions initiate an asynchronous copy of tensor data from 644global memory to shared::cluster memory (indicated by the ``g2s`` prefix) 645in ``im2col`` mode. In im2col mode, some dimensions of the source tensor 646are unrolled into a single dimensional column at the destination. In this 647mode, the tensor has to be at least three-dimensional. Along with the tensor 648coordinates, im2col offsets are also specified (denoted by 649``i16 im2col0...i16 %im2col2``). The number of im2col offsets is two less 650than the number of dimensions of the tensor operation. The last two arguments 651to these intrinsics are boolean flags, with the same functionality as described 652in the ``tile`` mode intrinsics above. 653 654For more information, refer PTX ISA 655`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_. 656 657'``llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d``' 658^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 659 660Syntax: 661""""""" 662 663.. code-block:: llvm 664 665 declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 666 declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(..., i32 %d0, i32 %d1, ...) 667 declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...) 668 declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...) 669 declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...) 670 671Overview: 672""""""""" 673 674The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d``' intrinsics 675correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions. 676These instructions initiate an asynchronous copy of tensor data from 677shared::cta to global memory (indicated by the ``s2g`` prefix) 678in ``tile`` mode. The dimension of the tensor data ranges from 1d to 5d 679with the coordinates specified by the ``i32 %d0 ... i32 %d4`` arguments. 680 681* The last argument to these intrinsics is a boolean flag 682 indicating support for cache_hint. This flag argument must 683 be a compile-time constant. When set, it indicates a valid 684 cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint`` 685 variant of the PTX instruction. 
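
For the reverse direction, a minimal sketch of a 2D shared::cta-to-global
tensor copy with the cache hint disabled could look like this (names are
illustrative):

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(ptr addrspace(3), ptr, i32, i32, i64, i1)

  define void @tma_store_2d_tile(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1) {
    ; %flag_ch is false, so the %ch operand is unused and passed as zero.
    call void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i64 0, i1 false)
    ret void
  }
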
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[3-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[3-5]d``' intrinsics
correspond to the ``cp.async.bulk.tensor.[3-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous copy of tensor data from
shared::cta to global memory (indicated by the ``s2g`` prefix)
in ``im2col`` mode. In this mode, the tensor has to be at least
three-dimensional. Unlike the ``g2s`` variants, there are no
im2col_offsets for these intrinsics. The last argument to these
intrinsics is a boolean flag, with the same functionality as
described in the ``s2g.tile`` mode intrinsics above.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.1d(ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.2d(..., i32 %d0, i32 %d1, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.[1-5]d``' intrinsics
correspond to the ``cp.async.bulk.prefetch.tensor.[1-5]d.L2.global*`` set
of PTX instructions. These instructions initiate an asynchronous prefetch
of tensor data from global memory to the L2 cache. In tile mode, the
multi-dimensional layout of the source tensor is preserved at the destination.
The dimension of the tensor data ranges from 1d to 5d with the coordinates
specified by the ``i32 %d0 ... i32 %d4`` arguments.

* The last argument to these intrinsics is a boolean flag
  indicating support for cache_hint. This flag argument must
  be a compile-time constant. When set, it indicates a valid
  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

..
code-block:: llvm 759 760 declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 %ch, i1 %flag_ch) 761 declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...) 762 declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...) 763 764Overview: 765""""""""" 766 767The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``' intrinsics 768correspond to the ``cp.async.bulk.prefetch.tensor.[1-5]d.L2.global*`` set 769of PTX instructions. These instructions initiate an asynchronous prefetch 770of tensor data from global memory to the L2 cache. In im2col mode, some 771dimensions of the source tensor are unrolled into a single dimensional 772column at the destination. In this mode, the tensor has to be at least 773three-dimensional. Along with the tensor coordinates, im2col offsets are 774also specified (denoted by ``i16 im2col0...i16 %im2col2``). The number 775of im2col offsets is two less than the number of dimensions of the tensor 776operation. The last argument to these intrinsics is a boolean flag, with 777the same functionality as described in the ``tile`` mode intrinsics above. 778 779For more information, refer PTX ISA 780`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_. 781 782'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].tile.[1-5]d``' 783^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 784 785Syntax: 786""""""" 787 788.. code-block:: llvm 789 790 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 791 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.min.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 792 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.max.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 793 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.inc.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 794 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.dec.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 795 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.and.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 796 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.or.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 797 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.xor.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch) 798 799 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.2d(..., i32 %d0, i32 %d1, ...) 800 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...) 801 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...) 802 declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...) 803 804Overview: 805""""""""" 806 807The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.[1-5]d``' intrinsics 808correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions. 
These instructions initiate an asynchronous reduction operation of tensor data
in global memory with the tensor data in shared{::cta} memory, using ``tile``
mode. The dimension of the tensor data ranges from 1d to 5d with the
coordinates specified by the ``i32 %d0 ... i32 %d4`` arguments. The supported
reduction operations are {add, min, max, inc, dec, and, or, xor} as described
in the ``tile.1d`` intrinsics.

* The last argument to these intrinsics is a boolean flag
  indicating support for cache_hint. This flag argument must
  be a compile-time constant. When set, it indicates a valid
  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.

'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].im2col.[3-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.[3-5]d``' intrinsics
correspond to the ``cp.reduce.async.bulk.tensor.[3-5]d.*`` set of PTX instructions.
These instructions initiate an asynchronous reduction operation of tensor data
in global memory with the tensor data in shared{::cta} memory, using ``im2col``
mode. In this mode, the tensor has to be at least three-dimensional. The
supported reduction operations are the same as the ones in the tile mode. The
last argument to these intrinsics is a boolean flag, with the same
functionality as described in the ``tile`` mode intrinsics above.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.

Warp Group Intrinsics
---------------------

'``llvm.nvvm.wgmma.fence.sync.aligned``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.fence.sync.aligned()

Overview:
"""""""""

The '``@llvm.nvvm.wgmma.fence.sync.aligned``' intrinsic generates the
``wgmma.fence.sync.aligned`` PTX instruction, which establishes an ordering
between prior accesses to any warpgroup registers and subsequent accesses to
the same registers by a ``wgmma.mma_async`` instruction. A sketch combining
this intrinsic with the commit and wait intrinsics described in the following
sections appears after the list below.

The ``wgmma.fence`` instruction must be issued by all warps of the warpgroup in
the following locations:

* Before the first ``wgmma.mma_async`` operation in a warpgroup.
* Between a register access by a thread in the warpgroup and any
  ``wgmma.mma_async`` instruction that accesses the same registers, except when
  these are accumulator register accesses across multiple ``wgmma.mma_async``
  instructions of the same shape, in which case an ordering guarantee is
  provided by default.
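
The following sketch shows a typical ordering of the fence, commit, and wait
intrinsics around a group of ``wgmma.mma_async`` operations; the
``wgmma.mma_async`` operations themselves are elided and the function name is
illustrative:

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.fence.sync.aligned()
  declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()
  declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg)

  define void @wgmma_ordering_sketch() {
    ; Order prior register accesses before the first wgmma.mma_async.
    call void @llvm.nvvm.wgmma.fence.sync.aligned()

    ; ... issue wgmma.mma_async operations here ...

    ; Batch the outstanding wgmma.mma_async operations into a wgmma-group.
    call void @llvm.nvvm.wgmma.commit_group.sync.aligned()

    ; Wait until no wgmma-groups are pending before reading the accumulators.
    call void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 0)
    ret void
  }
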
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence>`_.

'``llvm.nvvm.wgmma.commit_group.sync.aligned``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()

Overview:
"""""""""

The '``@llvm.nvvm.wgmma.commit_group.sync.aligned``' intrinsic generates the
``wgmma.commit_group.sync.aligned`` PTX instruction, which creates a new
wgmma-group per warpgroup and batches all prior ``wgmma.mma_async``
instructions initiated by the executing warp but not committed to any
wgmma-group into the new wgmma-group. If there are no uncommitted
``wgmma.mma_async`` instructions, then ``wgmma.commit_group`` results in an
empty wgmma-group.

An executing thread can wait for the completion of all ``wgmma.mma_async``
operations in a wgmma-group by using ``wgmma.wait_group``.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group>`_.

'``llvm.nvvm.wgmma.wait_group.sync.aligned``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg N)

Overview:
"""""""""

The '``@llvm.nvvm.wgmma.wait_group.sync.aligned``' intrinsic generates the
``wgmma.wait_group.sync.aligned N`` PTX instruction, which will cause the
executing thread to wait until only ``N`` or fewer of the most recent
wgmma-groups are pending and all the prior wgmma-groups committed by the
executing threads are complete. For example, when ``N`` is 0, the executing
thread waits on all the prior wgmma-groups to complete. Operand ``N`` is an
integer constant.

Accessing the accumulator register or the input register containing the
fragments of matrix A of a ``wgmma.mma_async`` instruction without first
performing a ``wgmma.wait_group`` instruction that waits on a wgmma-group
including that ``wgmma.mma_async`` instruction is undefined behavior.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group>`_.

'``llvm.nvvm.griddepcontrol.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.griddepcontrol.launch_dependents()
  declare void @llvm.nvvm.griddepcontrol.wait()

Overview:
"""""""""

The ``griddepcontrol`` intrinsics allow dependent grids and prerequisite grids,
as defined by the runtime, to control execution in the following way:

The ``griddepcontrol.launch_dependents`` intrinsic signals that the dependents
can be scheduled before the current grid completes. The intrinsic can be
invoked by multiple threads in the current CTA, and repeated invocations of the
intrinsic have no additional side effects beyond those of the first invocation.

The ``griddepcontrol.wait`` intrinsic causes the executing thread to wait until
all prerequisite grids in flight have completed and all the memory operations
from the prerequisite grids are performed and made visible to the current grid.

For more information, refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol>`__.
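
As a rough sketch of how a dependent kernel might use these intrinsics (the
function name and the surrounding structure are illustrative, not prescribed by
the back-end):

.. code-block:: llvm

  declare void @llvm.nvvm.griddepcontrol.wait()
  declare void @llvm.nvvm.griddepcontrol.launch_dependents()

  define void @dependent_kernel_body() {
  entry:
    ; Wait until all prerequisite grids have completed and their memory
    ; operations are visible before consuming the data they produced.
    call void @llvm.nvvm.griddepcontrol.wait()

    ; ... work that reads results of the prerequisite grids ...

    ; Signal that dependent grids may be scheduled before this grid finishes.
    call void @llvm.nvvm.griddepcontrol.launch_dependents()

    ; ... remaining work that the dependent grids do not rely on ...
    ret void
  }
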
Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library for any compiler using the LLVM NVPTX target.
The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:

.. code-block:: c++

  float my_function(float a) {
    if (__nvvm_reflect("FASTMATH"))
      return my_function_fast(a);
    else
      return my_function_precise(a);
  }

The default value for all unspecified reflection parameters is zero.

The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage. The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX. Note that
the ``NVVMReflect`` pass will attempt to remove dead code even without
optimizations, which allows potentially incompatible instructions to be
avoided at all optimization levels by using the ``__CUDA_ARCH`` argument.
For an input IR module ``module.bc``, the following compilation flow is
recommended:

1. Save list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in list from (1)
4. Eliminate all unused internal functions
5. Run ``NVVMReflect`` pass
6. Run standard optimization pipeline

.. note::

  ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
  libdevice functions. It is possible to link two IR modules that have been
  linked against libdevice using different reflection variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

  entry:
    ..
    br i1 true, label %foo, label %bar
  foo:
    ..
  bar:
    ; Dead code
    ..

Therefore, it is recommended that ``NVVMReflect`` is executed early in the
optimization pipeline, before dead-code elimination.

The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
of your pass manager; just use the following code when setting up your pass
manager, and the ``PassBuilder`` will call
``NVPTXTargetMachine::registerPassBuilderCallbacks`` to add the pass to the
pass manager:

..
code-block:: c++ 1050 1051 std::unique_ptr<TargetMachine> TM = ...; 1052 PassBuilder PB(TM); 1053 ModulePassManager MPM; 1054 PB.parsePassPipeline(MPM, ...); 1055 1056Reflection Parameters 1057--------------------- 1058 1059The libdevice library currently uses the following reflection parameters to 1060control code generation: 1061 1062==================== ====================================================== 1063Flag Description 1064==================== ====================================================== 1065``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero 1066==================== ====================================================== 1067 1068The value of this flag is determined by the "nvvm-reflect-ftz" module flag. 1069The following sets the ftz flag to 1. 1070 1071.. code-block:: llvm 1072 1073 !llvm.module.flags = !{!0} 1074 !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1} 1075 1076(``i32 4`` indicates that the value set here overrides the value in another 1077module we link with. See the `LangRef <LangRef.html#module-flags-metadata>` 1078for details.) 1079 1080Executing PTX 1081============= 1082 1083The most common way to execute PTX assembly on a GPU device is to use the CUDA 1084Driver API. This API is a low-level interface to the GPU driver and allows for 1085JIT compilation of PTX code to native GPU machine code. 1086 1087Initializing the Driver API: 1088 1089.. code-block:: c++ 1090 1091 CUdevice device; 1092 CUcontext context; 1093 1094 // Initialize the driver API 1095 cuInit(0); 1096 // Get a handle to the first compute device 1097 cuDeviceGet(&device, 0); 1098 // Create a compute device context 1099 cuCtxCreate(&context, 0, device); 1100 1101JIT compiling a PTX string to a device binary: 1102 1103.. code-block:: c++ 1104 1105 CUmodule module; 1106 CUfunction function; 1107 1108 // JIT compile a null-terminated PTX string 1109 cuModuleLoadData(&module, (void*)PTXString); 1110 1111 // Get a handle to the "myfunction" kernel function 1112 cuModuleGetFunction(&function, module, "myfunction"); 1113 1114For full examples of executing PTX assembly, please see the `CUDA Samples 1115<https://developer.nvidia.com/cuda-downloads>`_ distribution. 1116 1117 1118Common Issues 1119============= 1120 1121ptxas complains of undefined function: __nvvm_reflect 1122----------------------------------------------------- 1123 1124When linking with libdevice, the ``NVVMReflect`` pass must be used. See 1125:ref:`libdevice` for more information. 1126 1127 1128Tutorial: A Simple Compute Kernel 1129================================= 1130 1131To start, let us take a look at a simple compute kernel written directly in 1132LLVM IR. The kernel implements vector addition, where each thread computes one 1133element of the output vector C from the input vectors A and B. To make this 1134easier, we also assume that only a single CTA (thread block) will be launched, 1135and that it will be one dimensional. 1136 1137 1138The Kernel 1139---------- 1140 1141.. code-block:: llvm 1142 1143 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 1144 target triple = "nvptx64-nvidia-cuda" 1145 1146 ; Intrinsic to read X component of thread ID 1147 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 1148 1149 define void @kernel(ptr addrspace(1) %A, 1150 ptr addrspace(1) %B, 1151 ptr addrspace(1) %C) { 1152 entry: 1153 ; What is my ID? 
1154 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 1155 1156 ; Compute pointers into A, B, and C 1157 %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %id 1158 %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %id 1159 %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %id 1160 1161 ; Read A, B 1162 %valA = load float, ptr addrspace(1) %ptrA, align 4 1163 %valB = load float, ptr addrspace(1) %ptrB, align 4 1164 1165 ; Compute C = A + B 1166 %valC = fadd float %valA, %valB 1167 1168 ; Store back to C 1169 store float %valC, ptr addrspace(1) %ptrC, align 4 1170 1171 ret void 1172 } 1173 1174 !nvvm.annotations = !{!0} 1175 !0 = !{ptr @kernel, !"kernel", i32 1} 1176 1177 1178We can use the LLVM ``llc`` tool to directly run the NVPTX code generator: 1179 1180.. code-block:: text 1181 1182 # llc -mcpu=sm_20 kernel.ll -o kernel.ptx 1183 1184 1185.. note:: 1186 1187 If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32`` 1188 in the module data layout string and use ``nvptx-nvidia-cuda`` as the 1189 target triple. 1190 1191 1192The output we get from ``llc`` (as of LLVM 3.4): 1193 1194.. code-block:: text 1195 1196 // 1197 // Generated by LLVM NVPTX Back-End 1198 // 1199 1200 .version 3.1 1201 .target sm_20 1202 .address_size 64 1203 1204 // .globl kernel 1205 // @kernel 1206 .visible .entry kernel( 1207 .param .u64 kernel_param_0, 1208 .param .u64 kernel_param_1, 1209 .param .u64 kernel_param_2 1210 ) 1211 { 1212 .reg .f32 %f<4>; 1213 .reg .s32 %r<2>; 1214 .reg .s64 %rl<8>; 1215 1216 // %bb.0: // %entry 1217 ld.param.u64 %rl1, [kernel_param_0]; 1218 mov.u32 %r1, %tid.x; 1219 mul.wide.s32 %rl2, %r1, 4; 1220 add.s64 %rl3, %rl1, %rl2; 1221 ld.param.u64 %rl4, [kernel_param_1]; 1222 add.s64 %rl5, %rl4, %rl2; 1223 ld.param.u64 %rl6, [kernel_param_2]; 1224 add.s64 %rl7, %rl6, %rl2; 1225 ld.global.f32 %f1, [%rl3]; 1226 ld.global.f32 %f2, [%rl5]; 1227 add.f32 %f3, %f1, %f2; 1228 st.global.f32 [%rl7], %f3; 1229 ret; 1230 } 1231 1232 1233Dissecting the Kernel 1234--------------------- 1235 1236Now let us dissect the LLVM IR that makes up this kernel. 1237 1238Data Layout 1239^^^^^^^^^^^ 1240 1241The data layout string determines the size in bits of common data types, their 1242ABI alignment, and their storage size. For NVPTX, you should use one of the 1243following: 1244 124532-bit PTX: 1246 1247.. code-block:: llvm 1248 1249 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 1250 125164-bit PTX: 1252 1253.. code-block:: llvm 1254 1255 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 1256 1257 1258Target Intrinsics 1259^^^^^^^^^^^^^^^^^ 1260 1261In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to 1262read the X component of the current thread's ID, which corresponds to a read 1263of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of 1264intrinsics. A short list is shown below; please see 1265``include/llvm/IR/IntrinsicsNVVM.td`` for the full list. 
1266 1267 1268================================================ ==================== 1269Intrinsic CUDA Equivalent 1270================================================ ==================== 1271``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z} 1272``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z} 1273``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z} 1274``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z} 1275``void @llvm.nvvm.barrier0()`` __syncthreads() 1276================================================ ==================== 1277 1278 1279Address Spaces 1280^^^^^^^^^^^^^^ 1281 1282You may have noticed that all of the pointer types in the LLVM IR example had 1283an explicit address space specifier. What is address space 1? NVIDIA GPU 1284devices (generally) have four types of memory: 1285 1286- Global: Large, off-chip memory 1287- Shared: Small, on-chip memory shared among all threads in a CTA 1288- Local: Per-thread, private memory 1289- Constant: Read-only memory shared across all threads 1290 1291These different types of memory are represented in LLVM IR as address spaces. 1292There is also a fifth address space used by the NVPTX code generator that 1293corresponds to the "generic" address space. This address space can represent 1294addresses in any other address space (with a few exceptions). This allows 1295users to write IR functions that can load/store memory using the same 1296instructions. Intrinsics are provided to convert pointers between the generic 1297and non-generic address spaces. 1298 1299See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information. 1300 1301 1302Kernel Metadata 1303^^^^^^^^^^^^^^^ 1304 1305In PTX, a function can be either a `kernel` function (callable from the host 1306program), or a `device` function (callable only from GPU code). You can think 1307of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR 1308function as a `kernel` function, we make use of special LLVM metadata. The 1309NVPTX back-end will look for a named metadata node called 1310``nvvm.annotations``. This named metadata must contain a list of metadata that 1311describe the IR. For our purposes, we need to declare a metadata node that 1312assigns the "kernel" attribute to the LLVM IR function that should be emitted 1313as a PTX `kernel` function. These metadata nodes take the form: 1314 1315.. code-block:: text 1316 1317 !{<function ref>, metadata !"kernel", i32 1} 1318 1319For the previous example, we have: 1320 1321.. code-block:: llvm 1322 1323 !nvvm.annotations = !{!0} 1324 !0 = !{ptr @kernel, !"kernel", i32 1} 1325 1326Here, we have a single metadata declaration in ``nvvm.annotations``. This 1327metadata annotates our ``@kernel`` function with the ``kernel`` attribute. 1328 1329 1330Running the Kernel 1331------------------ 1332 1333Generating PTX from LLVM IR is all well and good, but how do we execute it on 1334a real GPU device? The CUDA Driver API provides a convenient mechanism for 1335loading and JIT compiling PTX to a native GPU device, and launching a kernel. 1336The API is similar to OpenCL. A simple example showing how to load and 1337execute our vector addition code is shown below. Note that for brevity this 1338code does not perform much error checking! 1339 1340.. note:: 1341 1342 You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline 1343 compile PTX to machine code (SASS) for a specific GPU architecture. 
Such 1344 binaries can be loaded by the CUDA Driver API in the same way as PTX. This 1345 can be useful for reducing startup time by precompiling the PTX kernels. 1346 1347 1348.. code-block:: c++ 1349 1350 #include <iostream> 1351 #include <fstream> 1352 #include <cassert> 1353 #include "cuda.h" 1354 1355 1356 void checkCudaErrors(CUresult err) { 1357 assert(err == CUDA_SUCCESS); 1358 } 1359 1360 /// main - Program entry point 1361 int main(int argc, char **argv) { 1362 CUdevice device; 1363 CUmodule cudaModule; 1364 CUcontext context; 1365 CUfunction function; 1366 CUlinkState linker; 1367 int devCount; 1368 1369 // CUDA initialization 1370 checkCudaErrors(cuInit(0)); 1371 checkCudaErrors(cuDeviceGetCount(&devCount)); 1372 checkCudaErrors(cuDeviceGet(&device, 0)); 1373 1374 char name[128]; 1375 checkCudaErrors(cuDeviceGetName(name, 128, device)); 1376 std::cout << "Using CUDA Device [0]: " << name << "\n"; 1377 1378 int devMajor, devMinor; 1379 checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device)); 1380 std::cout << "Device Compute Capability: " 1381 << devMajor << "." << devMinor << "\n"; 1382 if (devMajor < 2) { 1383 std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n"; 1384 return 1; 1385 } 1386 1387 std::ifstream t("kernel.ptx"); 1388 if (!t.is_open()) { 1389 std::cerr << "kernel.ptx not found\n"; 1390 return 1; 1391 } 1392 std::string str((std::istreambuf_iterator<char>(t)), 1393 std::istreambuf_iterator<char>()); 1394 1395 // Create driver context 1396 checkCudaErrors(cuCtxCreate(&context, 0, device)); 1397 1398 // Create module for object 1399 checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0)); 1400 1401 // Get kernel function 1402 checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel")); 1403 1404 // Device data 1405 CUdeviceptr devBufferA; 1406 CUdeviceptr devBufferB; 1407 CUdeviceptr devBufferC; 1408 1409 checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16)); 1410 checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16)); 1411 checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16)); 1412 1413 float* hostA = new float[16]; 1414 float* hostB = new float[16]; 1415 float* hostC = new float[16]; 1416 1417 // Populate input 1418 for (unsigned i = 0; i != 16; ++i) { 1419 hostA[i] = (float)i; 1420 hostB[i] = (float)(2*i); 1421 hostC[i] = 0.0f; 1422 } 1423 1424 checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16)); 1425 checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16)); 1426 1427 1428 unsigned blockSizeX = 16; 1429 unsigned blockSizeY = 1; 1430 unsigned blockSizeZ = 1; 1431 unsigned gridSizeX = 1; 1432 unsigned gridSizeY = 1; 1433 unsigned gridSizeZ = 1; 1434 1435 // Kernel parameters 1436 void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC }; 1437 1438 std::cout << "Launching kernel\n"; 1439 1440 // Kernel launch 1441 checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ, 1442 blockSizeX, blockSizeY, blockSizeZ, 1443 0, NULL, KernelParams, NULL)); 1444 1445 // Retrieve device data 1446 checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16)); 1447 1448 1449 std::cout << "Results:\n"; 1450 for (unsigned i = 0; i != 16; ++i) { 1451 std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n"; 1452 } 1453 1454 1455 // Clean up after ourselves 1456 delete [] hostA; 1457 delete [] hostB; 1458 delete [] hostC; 1459 1460 // Clean-up 1461 checkCudaErrors(cuMemFree(devBufferA)); 1462 checkCudaErrors(cuMemFree(devBufferB)); 1463 

Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code). You can think
of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``. This named metadata must contain a list of metadata
nodes that describe the IR. For our purposes, we need to declare a metadata
node that assigns the "kernel" attribute to the LLVM IR function that should
be emitted as a PTX `kernel` function. These metadata nodes take the form:

.. code-block:: text

  !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

  !nvvm.annotations = !{!0}
  !0 = !{ptr @kernel, !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``. This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading PTX, JIT compiling it for the target GPU, and launching kernels. The
API is similar to OpenCL. A simple example showing how to load and execute our
vector addition code is shown below. Note that for brevity this code does not
perform much error checking!

.. note::

  You can also use the ``ptxas`` tool provided by the CUDA Toolkit to compile
  PTX offline to machine code (SASS) for a specific GPU architecture. Such
  binaries can be loaded by the CUDA Driver API in the same way as PTX, which
  can be useful for reducing startup time by precompiling the kernels.

.. code-block:: c++

  #include <iostream>
  #include <fstream>
  #include <iterator>
  #include <string>
  #include <cassert>
  #include "cuda.h"

  void checkCudaErrors(CUresult err) {
    assert(err == CUDA_SUCCESS);
  }

  /// main - Program entry point
  int main(int argc, char **argv) {
    CUdevice    device;
    CUmodule    cudaModule;
    CUcontext   context;
    CUfunction  function;
    CUlinkState linker;
    int         devCount;

    // CUDA initialization
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&devCount));
    checkCudaErrors(cuDeviceGet(&device, 0));

    char name[128];
    checkCudaErrors(cuDeviceGetName(name, 128, device));
    std::cout << "Using CUDA Device [0]: " << name << "\n";

    int devMajor, devMinor;
    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
    std::cout << "Device Compute Capability: "
              << devMajor << "." << devMinor << "\n";
    if (devMajor < 2) {
      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
      return 1;
    }

    // Read the PTX module from disk
    std::ifstream t("kernel.ptx");
    if (!t.is_open()) {
      std::cerr << "kernel.ptx not found\n";
      return 1;
    }
    std::string str((std::istreambuf_iterator<char>(t)),
                    std::istreambuf_iterator<char>());

    // Create driver context
    checkCudaErrors(cuCtxCreate(&context, 0, device));

    // Create module for object
    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

    // Get kernel function
    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

    // Device data
    CUdeviceptr devBufferA;
    CUdeviceptr devBufferB;
    CUdeviceptr devBufferC;

    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

    float* hostA = new float[16];
    float* hostB = new float[16];
    float* hostC = new float[16];

    // Populate input
    for (unsigned i = 0; i != 16; ++i) {
      hostA[i] = (float)i;
      hostB[i] = (float)(2*i);
      hostC[i] = 0.0f;
    }

    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));

    unsigned blockSizeX = 16;
    unsigned blockSizeY = 1;
    unsigned blockSizeZ = 1;
    unsigned gridSizeX  = 1;
    unsigned gridSizeY  = 1;
    unsigned gridSizeZ  = 1;

    // Kernel parameters
    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

    std::cout << "Launching kernel\n";

    // Kernel launch
    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                   blockSizeX, blockSizeY, blockSizeZ,
                                   0, NULL, KernelParams, NULL));

    // Retrieve device data
    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));

    std::cout << "Results:\n";
    for (unsigned i = 0; i != 16; ++i) {
      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
    }

    // Clean up host memory
    delete [] hostA;
    delete [] hostB;
    delete [] hostC;

    // Clean up device resources
    checkCudaErrors(cuMemFree(devBufferA));
    checkCudaErrors(cuMemFree(devBufferB));
    checkCudaErrors(cuMemFree(devBufferC));
    checkCudaErrors(cuModuleUnload(cudaModule));
    checkCudaErrors(cuCtxDestroy(context));

    return 0;
  }


You will need to link with the CUDA driver and specify the path to cuda.h.

.. code-block:: text

  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

  Using CUDA Device [0]: GeForce GTX 680
  Device Compute Capability: 3.0
  Launching kernel
  Results:
  0 + 0 = 0
  1 + 2 = 3
  2 + 4 = 6
  3 + 6 = 9
  4 + 8 = 12
  5 + 10 = 15
  6 + 12 = 18
  7 + 14 = 21
  8 + 16 = 24
  9 + 18 = 27
  10 + 20 = 30
  11 + 22 = 33
  12 + 24 = 36
  13 + 26 = 39
  14 + 28 = 42
  15 + 30 = 45

.. note::

  You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  ; libdevice function
  declare float @__nv_powf(float, float)

  define void @kernel(ptr addrspace(1) %A,
                      ptr addrspace(1) %B,
                      ptr addrspace(1) %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %id
    %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %id
    %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %id

    ; Read A, B
    %valA = load float, ptr addrspace(1) %ptrA, align 4
    %valB = load float, ptr addrspace(1) %ptrB, align 4

    ; Compute C = pow(A, B)
    %valC = call float @__nv_powf(float %valA, float %valB)

    ; Store back to C
    store float %valC, ptr addrspace(1) %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = !{ptr @kernel, !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module

These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

  The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
  any undefined variables will default to zero. It is shown here for
  evaluation purposes.

This gives us the following PTX (excerpt):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl  kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .pred  %p<30>;
    .reg .f32   %f<111>;
    .reg .s32   %r<21>;
    .reg .s64   %rl<8>;

  // %bb.0:                               // %entry
    ld.param.u64    %rl2, [kernel_param_0];
    mov.u32         %r3, %tid.x;
    ld.param.u64    %rl3, [kernel_param_1];
    mul.wide.s32    %rl4, %r3, 4;
    add.s64         %rl5, %rl2, %rl4;
    ld.param.u64    %rl6, [kernel_param_2];
    add.s64         %rl7, %rl3, %rl4;
    add.s64         %rl1, %rl6, %rl4;
    ld.global.f32   %f1, [%rl5];
    ld.global.f32   %f2, [%rl7];
    setp.eq.f32     %p1, %f1, 0f3F800000;
    setp.eq.f32     %p2, %f2, 0f00000000;
    or.pred         %p3, %p1, %p2;
    @%p3 bra        BB0_1;
    bra.uni         BB0_2;
  BB0_1:
    mov.f32         %f110, 0f3F800000;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_2:                                  // %__nv_isnanf.exit.i
    abs.f32         %f4, %f1;
    setp.gtu.f32    %p4, %f4, 0f7F800000;
    @%p4 bra        BB0_4;
  // %bb.3:                               // %__nv_isnanf.exit5.i
    abs.f32         %f5, %f2;
    setp.le.f32     %p5, %f5, 0f7F800000;
    @%p5 bra        BB0_5;
  BB0_4:                                  // %.critedge1.i
    add.f32         %f110, %f1, %f2;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_5:                                  // %__nv_isinff.exit.i

    ...

  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
    mul.f32         %f90, %f107, 0f3FB8AA3B;
    cvt.rzi.f32.f32 %f91, %f90;
    mov.f32         %f92, 0fBF317200;
    fma.rn.f32      %f93, %f91, %f92, %f107;
    mov.f32         %f94, 0fB5BFBE8E;
    fma.rn.f32      %f95, %f91, %f94, %f93;
    mul.f32         %f89, %f95, 0f3FB8AA3B;
    // inline asm
    ex2.approx.ftz.f32 %f88,%f89;
    // inline asm
    add.f32         %f96, %f91, 0f00000000;
    ex2.approx.f32  %f97, %f96;
    mul.f32         %f98, %f88, %f97;
    setp.lt.f32     %p15, %f107, 0fC2D20000;
    selp.f32        %f99, 0f00000000, %f98, %p15;
    setp.gt.f32     %p16, %f107, 0f42D20000;
    selp.f32        %f110, 0f7F800000, %f99, %p16;
    setp.eq.f32     %p17, %f110, 0f7F800000;
    @%p17 bra       BB0_28;
  // %bb.27:
    fma.rn.f32      %f110, %f110, %f108, %f110;
  BB0_28:                                 // %__internal_accurate_powf.exit.i
    setp.lt.f32     %p18, %f1, 0f00000000;
    setp.eq.f32     %p19, %f3, 0f3F800000;
    and.pred        %p20, %p18, %p19;
    @!%p20 bra      BB0_30;
    bra.uni         BB0_29;
  BB0_29:
    mov.b32         %r9, %f110;
    xor.b32         %r10, %r9, -2147483648;
    mov.b32         %f110, %r10;
  BB0_30:                                 // %__nv_powf.exit
    st.global.f32   [%rl1], %f110;
    ret;
  }