xref: /llvm-project/llvm/docs/NVPTXUsage.rst (revision 435609b70c8bbf7bc6b73b04ec8852a9c11376ec)
1=============================
2User Guide for NVPTX Back-end
3=============================
4
5.. contents::
6   :local:
7   :depth: 3
8
9
10Introduction
11============
12
13To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
14along with a defined set of conventions used to represent GPU programming
15concepts. This document provides an overview of the general usage of the back-
16end, including a description of the conventions used and the set of accepted
17LLVM IR.
18
19.. note::
20
21   This document assumes a basic familiarity with CUDA and the PTX
22   assembly language. Information about the CUDA Driver API and the PTX assembly
23   language can be found in the `CUDA documentation
24   <http://docs.nvidia.com/cuda/index.html>`_.
25
26
27
28Conventions
29===========
30
31Marking Functions as Kernels
32----------------------------
33
34In PTX, there are two types of functions: *device functions*, which are only
35callable by device code, and *kernel functions*, which are callable by host
36code. By default, the back-end will emit device functions. Metadata is used to
37declare a function as a kernel function. This metadata is attached to the
38``nvvm.annotations`` named metadata object, and has the following format:
39
40.. code-block:: text
41
42   !0 = !{<function-ref>, metadata !"kernel", i32 1}
43
44The first parameter is a reference to the kernel function. The following
45example shows a kernel function calling a device function in LLVM IR. The
46function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
47
48.. code-block:: llvm
49
50    define float @my_fmad(float %x, float %y, float %z) {
51      %mul = fmul float %x, %y
52      %add = fadd float %mul, %z
53      ret float %add
54    }
55
56    define void @my_kernel(ptr %ptr) {
57      %val = load float, ptr %ptr
58      %ret = call float @my_fmad(float %val, float %val, float %val)
59      store float %ret, ptr %ptr
60      ret void
61    }
62
63    !nvvm.annotations = !{!1}
64    !1 = !{ptr @my_kernel, !"kernel", i32 1}
65
66When compiled, the PTX kernel functions are callable by host-side code.
67
68
69.. _address_spaces:
70
71Address Spaces
72--------------
73
74The NVPTX back-end uses the following address space mapping:
75
76   ============= ======================
77   Address Space Memory Space
78   ============= ======================
79   0             Generic
80   1             Global
81   2             Internal Use
82   3             Shared
83   4             Constant
84   5             Local
85   ============= ======================
86
87Every global variable and pointer type is assigned to one of these address
88spaces, with 0 being the default address space. Intrinsics are provided which
89can be used to convert pointers between the generic and non-generic address
90spaces.
91
92As an example, the following IR will define an array ``@g`` that resides in
93global device memory.
94
95.. code-block:: llvm
96
97    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
98
99LLVM IR functions can read and write to this array, and host-side code can
100copy data to it by name with the CUDA Driver API.
101
102Note that since address space 0 is the generic space, it is illegal to have
103global variables in address space 0.  Address space 0 is the default address
104space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
105variables.
106
107
108Triples
109-------
110
111The NVPTX target uses the module triple to select between 32/64-bit code
112generation and the driver-compiler interface to use. The triple architecture
113can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
114operating system should be one of ``cuda`` or ``nvcl``, which determines the
115interface used by the generated code to communicate with the driver.  Most
116users will want to use ``cuda`` as the operating system, which makes the
117generated PTX compatible with the CUDA Driver API.
118
119Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
120
121Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
122
123
124
125.. _nvptx_intrinsics:
126
127NVPTX Intrinsics
128================
129
130Reading PTX Special Registers
131-----------------------------
132
133'``llvm.nvvm.read.ptx.sreg.*``'
134^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135
136Syntax:
137"""""""
138
139.. code-block:: llvm
140
141    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
142    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
143    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
144    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
145    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
146    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
147    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
148    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
149    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
150    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
151    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
152    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
153    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
154
155Overview:
156"""""""""
157
158The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
159special registers, in particular the kernel launch bounds.  These registers
160map in the following way to CUDA builtins:
161
162   ============ =====================================
163   CUDA Builtin PTX Special Register Intrinsic
164   ============ =====================================
165   ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
166   ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
167   ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
168   ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
169   ============ =====================================
170
171
172Barriers
173--------
174
175'``llvm.nvvm.barrier0``'
176^^^^^^^^^^^^^^^^^^^^^^^^^^^
177
178Syntax:
179"""""""
180
181.. code-block:: llvm
182
183  declare void @llvm.nvvm.barrier0()
184
185Overview:
186"""""""""
187
188The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
189instruction, equivalent to the ``__syncthreads()`` call in CUDA.
190
191Electing a thread
192-----------------
193
194'``llvm.nvvm.elect.sync``'
195^^^^^^^^^^^^^^^^^^^^^^^^^^
196
197Syntax:
198"""""""
199
200.. code-block:: llvm
201
202  declare {i32, i1} @llvm.nvvm.elect.sync(i32 %membermask)
203
204Overview:
205"""""""""
206
207The '``@llvm.nvvm.elect.sync``' intrinsic generates the ``elect.sync``
208PTX instruction, which elects one predicated active leader thread from
209a set of threads specified by ``membermask``. The behavior is undefined
210if the executing thread is not in ``membermask``. The laneid of the
211elected thread is captured in the i32 return value. The i1 return
212value is set to ``True`` for the leader thread and ``False`` for all
213the other threads. Election of a leader thread happens deterministically,
214i.e. the same leader thread is elected for the same ``membermask``
215every time. For more information, refer PTX ISA
216`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-elect-sync>`_.
217
218Membar/Fences
219-------------
220
221'``llvm.nvvm.fence.proxy.tensormap_generic.*``'
222^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
223
224Syntax:
225"""""""
226
227.. code-block:: llvm
228
229  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.cta()
230  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.cluster()
231  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.gpu()
232  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.sys()
233
234  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cta(ptr %addr, i32 %size)
235  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cluster(ptr %addr, i32 %size)
236  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.gpu(ptr %addr, i32 %size)
237  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.sys(ptr %addr, i32 %size)
238
239Overview:
240"""""""""
241
242The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic `proxy<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>_` and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction:
243
244  ====================================================== =========================================================
245  NVVM Intrinsic                                         PTX Instruction
246  ====================================================== =========================================================
247  ``@llvm.nvvm.fence.proxy.tensormap_generic.release.*`` ``fence.proxy.tensormap::generic.release.*``
248  ``@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*`` ``fence.proxy.tensormap::generic.acquire.* [addr], size``
249  ====================================================== =========================================================
250
251The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
252
253Address Space Intrinsics
254------------------------
255
256'``llvm.nvvm.isspacep.*``' Intrinsics
257^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
258
259Syntax:
260"""""""
261
262.. code-block:: llvm
263
264    declare i1 @llvm.nvvm.isspacep.const(ptr %p)
265    declare i1 @llvm.nvvm.isspacep.global(ptr %p)
266    declare i1 @llvm.nvvm.isspacep.local(ptr %p)
267    declare i1 @llvm.nvvm.isspacep.shared(ptr %p)
268    declare i1 @llvm.nvvm.isspacep.shared.cluster(ptr %p)
269
270Overview:
271"""""""""
272
273The '``llvm.nvvm.isspacep.*``' intrinsics determine whether the provided generic
274pointer references memory which falls within a particular address space.
275
276Semantics:
277""""""""""
278
279If the given pointer in the generic address space refers to memory which falls
280within the state space of the intrinsic (and therefore could be safely address
281space casted to this space), 1 is returned, otherwise 0 is returned.
282
283Arithmetic Intrinsics
284---------------------
285
286'``llvm.nvvm.idp2a.[us].[us]``' Intrinsics
287^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
288
289Syntax:
290"""""""
291
292.. code-block:: llvm
293
294    declare i32 @llvm.nvvm.idp2a.s.s(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
295    declare i32 @llvm.nvvm.idp2a.s.u(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
296    declare i32 @llvm.nvvm.idp2a.u.s(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
297    declare i32 @llvm.nvvm.idp2a.u.u(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
298
299
300Overview:
301"""""""""
302
303The '``llvm.nvvm.idp2a.[us].[us]``' intrinsics performs a 2-element vector dot
304product followed by addition. They corresponds directly to the ``dp2a`` PTX
305instruction.
306
307Semantics:
308""""""""""
309
310The 32-bit value in ``%a`` is broken into 2 16-bit values which are extended to
31132 bits. For the '``llvm.nvvm.idp2a.u.[us]``' variants zero-extension is used,
312while for the '``llvm.nvvm.idp2a.s.[us]``' sign-extension is used. Two bytes are
313selected from ``%b``, if ``%is.hi`` is true, the most significant bytes are
314selected, otherwise the least significant bytes are selected. These bytes are
315then extended to 32-bits. For the '``llvm.nvvm.idp2a.[us].u``' variants
316zero-extension is used, while for the '``llvm.nvvm.idp2a.[us].s``'
317sign-extension is used. The dot product of these 2-element vectors is added to
318``%c`` to produce the return.
319
320
321'``llvm.nvvm.idp4a.[us].[us]``' Intrinsics
322^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
323
324Syntax:
325"""""""
326
327.. code-block:: llvm
328
329    declare i32 @llvm.nvvm.idp4a.s.s(i32 %a, i32 %b, i32 %c)
330    declare i32 @llvm.nvvm.idp4a.s.u(i32 %a, i32 %b, i32 %c)
331    declare i32 @llvm.nvvm.idp4a.u.s(i32 %a, i32 %b, i32 %c)
332    declare i32 @llvm.nvvm.idp4a.u.u(i32 %a, i32 %b, i32 %c)
333
334Overview:
335"""""""""
336
337The '``llvm.nvvm.idp4a.[us].[us]``' intrinsics perform a 4-element vector dot
338product followed by addition. They corresponds directly to the ``dp4a`` PTX
339instruction.
340
341Semantics:
342""""""""""
343
344Each of the 4 bytes in both ``%a`` and ``%b`` are extended to 32-bit integers
345forming 2 ``<4 x i32>``. For ``%a``, zero-extension is used in the
346'``llvm.nvvm.idp4a.u.[us]``' variants, while sign-extension is used with
347'``llvm.nvvm.idp4a.s.[us]``' variants. Similarly, for ``%b``, zero-extension is
348used in the '``llvm.nvvm.idp4a.[us].u``' variants, while sign-extension is used
349with '``llvm.nvvm.idp4a.[us].s``' variants. The dot product of these 4-element
350vectors is added to ``%c`` to produce the return.
351
352Bit Manipulation Intrinsics
353---------------------------
354
355'``llvm.nvvm.fshl.clamp.*``' Intrinsic
356^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
357
358Syntax:
359"""""""
360
361.. code-block:: llvm
362
363    declare i32 @llvm.nvvm.fshl.clamp.i32(i32 %hi, i32 %lo, i32 %n)
364
365Overview:
366"""""""""
367
368The '``llvm.nvvm.fshl.clamp``' family of intrinsics performs a clamped funnel
369shift left. These intrinsics are very similar to '``llvm.fshl``', except the
370shift ammont is clamped at the integer width (instead of modulo it). Currently,
371only ``i32`` is supported.
372
373Semantics:
374""""""""""
375
376The '``llvm.nvvm.fshl.clamp``' family of intrinsic functions performs a clamped
377funnel shift left: the first two values are concatenated as { %hi : %lo } (%hi
378is the most significant bits of the wide value), the combined value is shifted
379left, and the most significant bits are extracted to produce a result that is
380the same size as the original arguments. The shift amount is the minimum of the
381value of %n and the bit width of the integer type.
382
383'``llvm.nvvm.fshr.clamp.*``' Intrinsic
384^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
385
386Syntax:
387"""""""
388
389.. code-block:: llvm
390
391    declare i32 @llvm.nvvm.fshr.clamp.i32(i32 %hi, i32 %lo, i32 %n)
392
393Overview:
394"""""""""
395
396The '``llvm.nvvm.fshr.clamp``' family of intrinsics perform a clamped funnel
397shift right. These intrinsics are very similar to '``llvm.fshr``', except the
398shift ammont is clamped at the integer width (instead of modulo it). Currently,
399only ``i32`` is supported.
400
401Semantics:
402""""""""""
403
404The '``llvm.nvvm.fshr.clamp``' family of intrinsic functions performs a clamped
405funnel shift right: the first two values are concatenated as { %hi : %lo } (%hi
406is the most significant bits of the wide value), the combined value is shifted
407right, and the least significant bits are extracted to produce a result that is
408the same size as the original arguments. The shift amount is the minimum of the
409value of %n and the bit width of the integer type.
410
411'``llvm.nvvm.flo.u.*``' Intrinsic
412^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
413
414Syntax:
415"""""""
416
417.. code-block:: llvm
418
419    declare i32 @llvm.nvvm.flo.u.i32(i32 %a, i1 %shiftamt)
420    declare i32 @llvm.nvvm.flo.u.i64(i64 %a, i1 %shiftamt)
421
422Overview:
423"""""""""
424
425The '``llvm.nvvm.flo.u``' family of intrinsics identifies the bit position of the
426leading one, returning either it's offset from the most or least significant bit.
427
428Semantics:
429""""""""""
430
431The '``llvm.nvvm.flo.u``' family of intrinsics returns the bit position of the
432most significant 1. If %shiftamt is true, The result is the shift amount needed
433to left-shift the found bit into the most-significant bit position, otherwise
434the result is the shift amount needed to right-shift the found bit into the
435least-significant bit position. 0xffffffff is returned if no 1 bit is found.
436
437'``llvm.nvvm.flo.s.*``' Intrinsic
438^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
439
440Syntax:
441"""""""
442
443.. code-block:: llvm
444
445    declare i32 @llvm.nvvm.flo.s.i32(i32 %a, i1 %shiftamt)
446    declare i32 @llvm.nvvm.flo.s.i64(i64 %a, i1 %shiftamt)
447
448Overview:
449"""""""""
450
451The '``llvm.nvvm.flo.s``' family of intrinsics identifies the bit position of the
452leading non-sign bit, returning either it's offset from the most or least
453significant bit.
454
455Semantics:
456""""""""""
457
458The '``llvm.nvvm.flo.s``' family of intrinsics returns the bit position of the
459most significant 0 for negative inputs and the most significant 1 for
460non-negative inputs. If %shiftamt is true, The result is the shift amount needed
461to left-shift the found bit into the most-significant bit position, otherwise
462the result is the shift amount needed to right-shift the found bit into the
463least-significant bit position. 0xffffffff is returned if no 1 bit is found.
464
465TMA family of Intrinsics
466------------------------
467
468'``llvm.nvvm.cp.async.bulk.global.to.shared.cluster``'
469^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
470
471Syntax:
472"""""""
473
474.. code-block:: llvm
475
476  declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src, i32 %size, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
477
478Overview:
479"""""""""
480
481The '``@llvm.nvvm.cp.async.bulk.global.to.shared.cluster``' intrinsic
482corresponds to the ``cp.async.bulk.shared::cluster.global.*`` family
483of PTX instructions. These instructions initiate an asynchronous
484copy of bulk data from global memory to shared::cluster memory.
485The 32-bit operand ``%size`` specifies the amount of memory to be
486copied and it must be a multiple of 16.
487
488* The last two arguments to these intrinsics are boolean flags
489  indicating support for cache_hint and/or multicast modifiers.
490  These flag arguments must be compile-time constants. The backend
491  looks through these flags and lowers the intrinsics appropriately.
492
493* The Nth argument (denoted by ``i1 %flag_ch``) when set, indicates
494  a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
495  variant of the PTX instruction.
496
497* The [N-1]th argument (denoted by ``i1 %flag_mc``) when set, indicates
498  the presence of a multicast mask (``i16 %mc``) and generates the PTX
499  instruction with the ``.multicast::cluster`` modifier.
500
501For more information, refer PTX ISA
502`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
503
504'``llvm.nvvm.cp.async.bulk.shared.cta.to.global``'
505^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
506
507Syntax:
508"""""""
509
510.. code-block:: llvm
511
512  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i64 %ch, i1 %flag_ch)
513
514Overview:
515"""""""""
516
517The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.global``' intrinsic
518corresponds to the ``cp.async.bulk.global.shared::cta.*`` set of PTX
519instructions. These instructions initiate an asynchronous copy from
520shared::cta to global memory. The 32-bit operand ``%size`` specifies
521the amount of memory to be copied and it must be a multiple of 16.
522
523* The last argument to these intrinsics is a boolean flag
524  indicating support for cache_hint. This flag argument must
525  be a compile-time constant. When set, it indicates a valid
526  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
527  variant of the PTX instruction.
528
529For more information, refer PTX ISA
530`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
531
532'``llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``'
533^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
534
535Syntax:
536"""""""
537
538.. code-block:: llvm
539
540  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(3) %src, i32 %size)
541
542Overview:
543"""""""""
544
545The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``' intrinsic
546corresponds to the ``cp.async.bulk.shared::cluster.shared::cta.*``
547PTX instruction. This instruction initiates an asynchronous copy from
548shared::cta to shared::cluster memory. The destination has to be in
549the shared memory of a different CTA within the cluster. The 32-bit
550operand ``%size`` specifies the amount of memory to be copied and
551it must be a multiple of 16.
552
553For more information, refer PTX ISA
554`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
555
556'``llvm.nvvm.cp.async.bulk.prefetch.L2``'
557^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
558
559Syntax:
560"""""""
561
562.. code-block:: llvm
563
564  declare void @llvm.nvvm.cp.async.bulk.prefetch.L2(ptr addrspace(1) %src, i32 %size, i64 %ch, i1 %flag_ch)
565
566Overview:
567"""""""""
568
569The '``@llvm.nvvm.cp.async.bulk.prefetch.L2``' intrinsic
570corresponds to the ``cp.async.bulk.prefetch.L2.*`` family
571of PTX instructions. These instructions initiate an asynchronous
572prefetch of bulk data from global memory to the L2 cache.
573The 32-bit operand ``%size`` specifies the amount of memory to be
574prefetched in terms of bytes and it must be a multiple of 16.
575
576* The last argument to these intrinsics is boolean flag indicating
577  support for cache_hint. These flag argument must be compile-time
578  constant. When set, it indicates a valid cache_hint (``i64 %ch``)
579  and generates the ``.L2::cache_hint`` variant of the PTX instruction.
580
581For more information, refer PTX ISA
582`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch>`_.
583
584'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
585^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
586
587Syntax:
588"""""""
589
590.. code-block:: llvm
591
592  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
593  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.2d(..., i32 %d0, i32 %d1, ...)
594  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
595  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
596  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
597
598Overview:
599"""""""""
600
601The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``' intrinsics
602correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
603These instructions initiate an asynchronous copy of tensor data from
604global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
605in ``tile`` mode. In tile mode, the multi-dimensional layout of the
606source tensor is preserved at the destination. The dimension of the
607tensor data ranges from 1d to 5d with the coordinates specified
608by the ``i32 %d0 ... i32 %d4`` arguments.
609
610* The last two arguments to these intrinsics are boolean flags
611  indicating support for cache_hint and/or multicast modifiers.
612  These flag arguments must be compile-time constants. The backend
613  looks through these flags and lowers the intrinsics appropriately.
614
615* The Nth argument (denoted by ``i1 flag_ch``) when set, indicates
616  a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
617  variant of the PTX instruction.
618
619* The [N-1]th argument (denoted by ``i1 flag_mc``) when set, indicates
620  the presence of a multicast mask (``i16 %mc``) and generates the PTX
621  instruction with the ``.multicast::cluster`` modifier.
622
623For more information, refer PTX ISA
624`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
625
626'``llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d``'
627^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
628
629Syntax:
630"""""""
631
632.. code-block:: llvm
633
634  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
635  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
636  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
637
638Overview:
639"""""""""
640
641The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d``' intrinsics
642correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
643These instructions initiate an asynchronous copy of tensor data from
644global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
645in ``im2col`` mode. In im2col mode, some dimensions of the source tensor
646are unrolled into a single dimensional column at the destination. In this
647mode, the tensor has to be at least three-dimensional. Along with the tensor
648coordinates, im2col offsets are also specified (denoted by
649``i16 im2col0...i16 %im2col2``). The number of im2col offsets is two less
650than the number of dimensions of the tensor operation. The last two arguments
651to these intrinsics are boolean flags, with the same functionality as described
652in the ``tile`` mode intrinsics above.
653
654For more information, refer PTX ISA
655`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
656
657'``llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d``'
658^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
659
660Syntax:
661"""""""
662
663.. code-block:: llvm
664
665  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
666  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(..., i32 %d0, i32 %d1, ...)
667  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
668  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
669  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
670
671Overview:
672"""""""""
673
674The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d``' intrinsics
675correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
676These instructions initiate an asynchronous copy of tensor data from
677shared::cta to global memory (indicated by the ``s2g`` prefix)
678in ``tile`` mode. The dimension of the tensor data ranges from 1d to 5d
679with the coordinates specified by the ``i32 %d0 ... i32 %d4`` arguments.
680
681* The last argument to these intrinsics is a boolean flag
682  indicating support for cache_hint. This flag argument must
683  be a compile-time constant. When set, it indicates a valid
684  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
685  variant of the PTX instruction.
686
687For more information, refer PTX ISA
688`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
689
690'``llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[3-5]d``'
691^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
692
693Syntax:
694"""""""
695
696.. code-block:: llvm
697
698  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
699  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
700  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
701
702Overview:
703"""""""""
704
705The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[1-5]d``' intrinsics
706correspond to the ``cp.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
707These instructions initiate an asynchronous copy of tensor data from
708shared::cta to global memory (indicated by the ``s2g`` prefix)
709in ``im2col`` mode. In this mode, the tensor has to be at least
710three-dimensional. Unlike the ``g2s`` variants, there are no
711im2col_offsets for these intrinsics. The last argument to these
712intrinsics is a boolean flag, with the same functionality as
713described in the ``s2g.tile`` mode intrinsics above.
714
715For more information, refer PTX ISA
716`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
717
718'``llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.[1-5]d``'
719^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
720
721Syntax:
722"""""""
723
724.. code-block:: llvm
725
726  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.1d(ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
727  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.2d(..., i32 %d0, i32 %d1, ...)
728  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
729  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
730  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
731
732Overview:
733"""""""""
734
735The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.[1-5]d``' intrinsics
736correspond to the ``cp.async.bulk.prefetch.tensor.[1-5]d.L2.global*`` set
737of PTX instructions. These instructions initiate an asynchronous prefetch
738of tensor data from global memory to the L2 cache. In tile mode, the
739multi-dimensional layout of the source tensor is preserved at the destination.
740The dimension of the tensor data ranges from 1d to 5d with the coordinates
741specified by the ``i32 %d0 ... i32 %d4`` arguments.
742
743* The last argument to these intrinsics is a boolean flag
744  indicating support for cache_hint. This flag argument must
745  be a compile-time constant. When set, it indicates a valid
746  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
747  variant of the PTX instruction.
748
749For more information, refer PTX ISA
750`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
751
752'``llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``'
753^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
754
755Syntax:
756"""""""
757
758.. code-block:: llvm
759
760  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 %ch, i1 %flag_ch)
761  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
762  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
763
764Overview:
765"""""""""
766
767The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``' intrinsics
768correspond to the ``cp.async.bulk.prefetch.tensor.[1-5]d.L2.global*`` set
769of PTX instructions. These instructions initiate an asynchronous prefetch
770of tensor data from global memory to the L2 cache. In im2col mode, some
771dimensions of the source tensor are unrolled into a single dimensional
772column at the destination. In this mode, the tensor has to be at least
773three-dimensional. Along with the tensor coordinates, im2col offsets are
774also specified (denoted by ``i16 im2col0...i16 %im2col2``). The number
775of im2col offsets is two less than the number of dimensions of the tensor
776operation. The last argument to these intrinsics is a boolean flag, with
777the same functionality as described in the ``tile`` mode intrinsics above.
778
779For more information, refer PTX ISA
780`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
781
782'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].tile.[1-5]d``'
783^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
784
785Syntax:
786"""""""
787
788.. code-block:: llvm
789
790  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
791  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.min.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
792  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.max.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
793  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.inc.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
794  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.dec.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
795  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.and.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
796  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.or.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
797  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.xor.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
798
799  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.2d(..., i32 %d0, i32 %d1, ...)
800  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
801  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
802  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
803
804Overview:
805"""""""""
806
807The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.[1-5]d``' intrinsics
808correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
809These instructions initiate an asynchronous reduction operation of tensor data
810in global memory with the tensor data in shared{::cta} memory, using ``tile`` mode.
811The dimension of the tensor data ranges from 1d to 5d with the coordinates
812specified by the ``i32 %d0 ... i32 %d4`` arguments. The supported reduction
813operations are {add, min, max, inc, dec, and, or, xor} as described in the
814``tile.1d`` intrinsics.
815
816* The last argument to these intrinsics is a boolean flag
817  indicating support for cache_hint. This flag argument must
818  be a compile-time constant. When set, it indicates a valid
819  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
820  variant of the PTX instruction.
821
822For more information, refer PTX ISA
823`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
824
825'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].im2col.[3-5]d``'
826^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
827
828Syntax:
829"""""""
830
831.. code-block:: llvm
832
833  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
834  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
835  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
836
837Overview:
838"""""""""
839
840The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.[3-5]d``' intrinsics
841correspond to the ``cp.reduce.async.bulk.tensor.[3-5]d.*`` set of PTX instructions.
842These instructions initiate an asynchronous reduction operation of tensor data
843in global memory with the tensor data in shared{::cta} memory, using ``im2col`` mode.
844In this mode, the tensor has to be at least three-dimensional. The supported reduction
845operations supported are the same as the ones in the tile mode. The last argument to
846these intrinsics is a boolean flag, with the same functionality as described in the
847``tile`` mode intrinsics above.
848
849For more information, refer PTX ISA
850`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
851
852Warp Group Intrinsics
853---------------------
854
855'``llvm.nvvm.wgmma.fence.sync.aligned``'
856^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
857
858Syntax:
859"""""""
860
861.. code-block:: llvm
862
863  declare void @llvm.nvvm.wgmma.fence.sync.aligned()
864
865Overview:
866"""""""""
867
868The '``@llvm.nvvm.wgmma.fence.sync.aligned``' intrinsic generates the
869``wgmma.fence.sync.aligned`` PTX instruction, which establishes an ordering
870between prior accesses to any warpgroup registers and subsequent accesses to
871the same registers by a ``wgmma.mma_async`` instruction.
872
873The ``wgmma.fence`` instruction must be issued by all warps of the warpgroup in
874the following locations:
875
876* Before the first ``wgmma.mma_async`` operation in a warpgroup.
877* Between a register access by a thread in the warpgroup and any
878  ``wgmma.mma_async`` instruction that accesses the same registers, except when
879  these are accumulator register accesses across multiple ``wgmma.mma_async``
880  instructions of the same shape in which case an ordering guarantee is
881  provided by default.
882
883For more information, refer PTX ISA
884`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence>`_.
885
886'``llvm.nvvm.wgmma.commit_group.sync.aligned``'
887^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
888
889Syntax:
890"""""""
891
892.. code-block:: llvm
893
894  declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()
895
896Overview:
897"""""""""
898
899The '``@llvm.nvvm.wgmma.commit_group.sync.aligned``' intrinsic generates the
900``wgmma.commit_group.sync.aligned`` PTX instruction, which creates a new
901wgmma-group per warpgroup and batches all prior ``wgmma.mma_async``
902instructions initiated by the executing warp but not committed to any
903wgmma-group into the new wgmma-group. If there are no uncommitted ``wgmma
904mma_async`` instructions then, ``wgmma.commit_group`` results in an empty
905wgmma-group.
906
907An executing thread can wait for the completion of all ``wgmma.mma_async``
908operations in a wgmma-group by using ``wgmma.wait_group``.
909
910For more information, refer PTX ISA
911`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group>`_.
912
913'``llvm.nvvm.wgmma.wait_group.sync.aligned``'
914^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
915
916Syntax:
917"""""""
918
919.. code-block:: llvm
920
921  declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg N)
922
923Overview:
924"""""""""
925
926The '``@llvm.nvvm.wgmma.wait_group.sync.aligned``' intrinsic generates the
927``wgmma.commit_group.sync.aligned N`` PTX instruction, which will cause the
928executing thread to wait until only ``N`` or fewer of the most recent
929wgmma-groups are pending and all the prior wgmma-groups committed by the
930executing threads are complete. For example, when ``N`` is 0, the executing
931thread waits on all the prior wgmma-groups to complete. Operand ``N`` is an
932integer constant.
933
934Accessing the accumulator register or the input register containing the
935fragments of matrix A of a ``wgmma.mma_async`` instruction without first
936performing a ``wgmma.wait_group`` instruction that waits on a wgmma-group
937including that ``wgmma.mma_async`` instruction is undefined behavior.
938
939For more information, refer PTX ISA
940`<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group>`_.
941
942'``llvm.nvvm.griddepcontrol.*``'
943^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
944
945Syntax:
946"""""""
947
948.. code-block:: llvm
949
950  declare void @llvm.nvvm.griddepcontrol.launch_dependents()
951  declare void @llvm.nvvm.griddepcontrol.wait()
952
953Overview:
954"""""""""
955
956The ``griddepcontrol`` intrinsics allows the dependent grids and prerequisite grids as defined by the runtime, to control execution in the following way:
957
958``griddepcontrol.launch_dependents`` intrinsic signals that the dependents can be scheduled, before the current grid completes. The intrinsic can be invoked by multiple threads in the current CTA and repeated invocations of the intrinsic will have no additional side effects past that of the first invocation.
959
960``griddepcontrol.wait`` intrinsic causes the executing thread to wait until all prerequisite grids in flight have completed and all the memory operations from the prerequisite grids are performed and made visible to the current grid.
961
962For more information, refer
963`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol>`__.
964
965Other Intrinsics
966----------------
967
968For the full set of NVPTX intrinsics, please see the
969``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
970
971
972.. _libdevice:
973
974Linking with Libdevice
975======================
976
977The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
978implements many common mathematical functions. This library can be used as a
979high-performance math library for any compilers using the LLVM NVPTX target.
980The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
981there is a separate version for each compute architecture.
982
983For a list of all math functions implemented in libdevice, see
984`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
985
986To accommodate various math-related compiler flags that can affect code
987generation of libdevice code, the library code depends on a special LLVM IR
988pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
989pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
990with constants based on the defined reflection parameters. Such conditional
991code often follows a pattern:
992
993.. code-block:: c++
994
995  float my_function(float a) {
996    if (__nvvm_reflect("FASTMATH"))
997      return my_function_fast(a);
998    else
999      return my_function_precise(a);
1000  }
1001
1002The default value for all unspecified reflection parameters is zero.
1003
1004The ``NVVMReflect`` pass should be executed early in the optimization
1005pipeline, immediately after the link stage. The ``internalize`` pass is also
1006recommended to remove unused math functions from the resulting PTX. For an
1007input IR module ``module.bc``, the following compilation flow is recommended:
1008
1009The ``NVVMReflect`` pass will attempt to remove dead code even without
1010optimizations. This allows potentially incompatible instructions to be avoided
1011at all optimizations levels by using the ``__CUDA_ARCH`` argument.
1012
10131. Save list of external functions in ``module.bc``
10142. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
10153. Internalize all functions not in list from (1)
10164. Eliminate all unused internal functions
10175. Run ``NVVMReflect`` pass
10186. Run standard optimization pipeline
1019
1020.. note::
1021
1022  ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
1023  libdevice functions. It is possible to link two IR modules that have been
1024  linked against libdevice using different reflection variables.
1025
1026Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
1027often leave behind dead code of the form:
1028
1029.. code-block:: llvm
1030
1031  entry:
1032    ..
1033    br i1 true, label %foo, label %bar
1034  foo:
1035    ..
1036  bar:
1037    ; Dead code
1038    ..
1039
1040Therefore, it is recommended that ``NVVMReflect`` is executed early in the
1041optimization pipeline before dead-code elimination.
1042
1043The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
1044of your pass manager; just use the following code when setting up your pass
1045manager and the PassBuilder will use ``registerPassBuilderCallbacks`` to let
1046NVPTXTargetMachine::registerPassBuilderCallbacks add the pass to the
1047pass manager:
1048
1049.. code-block:: c++
1050
1051    std::unique_ptr<TargetMachine> TM = ...;
1052    PassBuilder PB(TM);
1053    ModulePassManager MPM;
1054    PB.parsePassPipeline(MPM, ...);
1055
1056Reflection Parameters
1057---------------------
1058
1059The libdevice library currently uses the following reflection parameters to
1060control code generation:
1061
1062==================== ======================================================
1063Flag                 Description
1064==================== ======================================================
1065``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
1066==================== ======================================================
1067
1068The value of this flag is determined by the "nvvm-reflect-ftz" module flag.
1069The following sets the ftz flag to 1.
1070
1071.. code-block:: llvm
1072
1073    !llvm.module.flags = !{!0}
1074    !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
1075
1076(``i32 4`` indicates that the value set here overrides the value in another
1077module we link with.  See the `LangRef <LangRef.html#module-flags-metadata>`
1078for details.)
1079
1080Executing PTX
1081=============
1082
1083The most common way to execute PTX assembly on a GPU device is to use the CUDA
1084Driver API. This API is a low-level interface to the GPU driver and allows for
1085JIT compilation of PTX code to native GPU machine code.
1086
1087Initializing the Driver API:
1088
1089.. code-block:: c++
1090
1091    CUdevice device;
1092    CUcontext context;
1093
1094    // Initialize the driver API
1095    cuInit(0);
1096    // Get a handle to the first compute device
1097    cuDeviceGet(&device, 0);
1098    // Create a compute device context
1099    cuCtxCreate(&context, 0, device);
1100
1101JIT compiling a PTX string to a device binary:
1102
1103.. code-block:: c++
1104
1105    CUmodule module;
1106    CUfunction function;
1107
1108    // JIT compile a null-terminated PTX string
1109    cuModuleLoadData(&module, (void*)PTXString);
1110
1111    // Get a handle to the "myfunction" kernel function
1112    cuModuleGetFunction(&function, module, "myfunction");
1113
1114For full examples of executing PTX assembly, please see the `CUDA Samples
1115<https://developer.nvidia.com/cuda-downloads>`_ distribution.
1116
1117
1118Common Issues
1119=============
1120
1121ptxas complains of undefined function: __nvvm_reflect
1122-----------------------------------------------------
1123
1124When linking with libdevice, the ``NVVMReflect`` pass must be used. See
1125:ref:`libdevice` for more information.
1126
1127
1128Tutorial: A Simple Compute Kernel
1129=================================
1130
1131To start, let us take a look at a simple compute kernel written directly in
1132LLVM IR. The kernel implements vector addition, where each thread computes one
1133element of the output vector C from the input vectors A and B.  To make this
1134easier, we also assume that only a single CTA (thread block) will be launched,
1135and that it will be one dimensional.
1136
1137
1138The Kernel
1139----------
1140
1141.. code-block:: llvm
1142
1143  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
1144  target triple = "nvptx64-nvidia-cuda"
1145
1146  ; Intrinsic to read X component of thread ID
1147  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
1148
1149  define void @kernel(ptr addrspace(1) %A,
1150                      ptr addrspace(1) %B,
1151                      ptr addrspace(1) %C) {
1152  entry:
1153    ; What is my ID?
1154    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
1155
1156    ; Compute pointers into A, B, and C
1157    %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %id
1158    %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %id
1159    %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %id
1160
1161    ; Read A, B
1162    %valA = load float, ptr addrspace(1) %ptrA, align 4
1163    %valB = load float, ptr addrspace(1) %ptrB, align 4
1164
1165    ; Compute C = A + B
1166    %valC = fadd float %valA, %valB
1167
1168    ; Store back to C
1169    store float %valC, ptr addrspace(1) %ptrC, align 4
1170
1171    ret void
1172  }
1173
1174  !nvvm.annotations = !{!0}
1175  !0 = !{ptr @kernel, !"kernel", i32 1}
1176
1177
1178We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
1179
1180.. code-block:: text
1181
1182  # llc -mcpu=sm_20 kernel.ll -o kernel.ptx
1183
1184
1185.. note::
1186
1187  If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
1188  in the module data layout string and use ``nvptx-nvidia-cuda`` as the
1189  target triple.
1190
1191
1192The output we get from ``llc`` (as of LLVM 3.4):
1193
1194.. code-block:: text
1195
1196  //
1197  // Generated by LLVM NVPTX Back-End
1198  //
1199
1200  .version 3.1
1201  .target sm_20
1202  .address_size 64
1203
1204    // .globl kernel
1205                                          // @kernel
1206  .visible .entry kernel(
1207    .param .u64 kernel_param_0,
1208    .param .u64 kernel_param_1,
1209    .param .u64 kernel_param_2
1210  )
1211  {
1212    .reg .f32   %f<4>;
1213    .reg .s32   %r<2>;
1214    .reg .s64   %rl<8>;
1215
1216  // %bb.0:                                // %entry
1217    ld.param.u64    %rl1, [kernel_param_0];
1218    mov.u32         %r1, %tid.x;
1219    mul.wide.s32    %rl2, %r1, 4;
1220    add.s64         %rl3, %rl1, %rl2;
1221    ld.param.u64    %rl4, [kernel_param_1];
1222    add.s64         %rl5, %rl4, %rl2;
1223    ld.param.u64    %rl6, [kernel_param_2];
1224    add.s64         %rl7, %rl6, %rl2;
1225    ld.global.f32   %f1, [%rl3];
1226    ld.global.f32   %f2, [%rl5];
1227    add.f32         %f3, %f1, %f2;
1228    st.global.f32   [%rl7], %f3;
1229    ret;
1230  }
1231
1232
1233Dissecting the Kernel
1234---------------------
1235
1236Now let us dissect the LLVM IR that makes up this kernel.
1237
1238Data Layout
1239^^^^^^^^^^^
1240
1241The data layout string determines the size in bits of common data types, their
1242ABI alignment, and their storage size.  For NVPTX, you should use one of the
1243following:
1244
124532-bit PTX:
1246
1247.. code-block:: llvm
1248
1249  target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
1250
125164-bit PTX:
1252
1253.. code-block:: llvm
1254
1255  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
1256
1257
1258Target Intrinsics
1259^^^^^^^^^^^^^^^^^
1260
1261In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
1262read the X component of the current thread's ID, which corresponds to a read
1263of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
1264intrinsics.  A short list is shown below; please see
1265``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
1266
1267
1268================================================ ====================
1269Intrinsic                                        CUDA Equivalent
1270================================================ ====================
1271``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
1272``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
1273``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
1274``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
1275``void @llvm.nvvm.barrier0()``                   __syncthreads()
1276================================================ ====================
1277
1278
1279Address Spaces
1280^^^^^^^^^^^^^^
1281
1282You may have noticed that all of the pointer types in the LLVM IR example had
1283an explicit address space specifier. What is address space 1? NVIDIA GPU
1284devices (generally) have four types of memory:
1285
1286- Global: Large, off-chip memory
1287- Shared: Small, on-chip memory shared among all threads in a CTA
1288- Local: Per-thread, private memory
1289- Constant: Read-only memory shared across all threads
1290
1291These different types of memory are represented in LLVM IR as address spaces.
1292There is also a fifth address space used by the NVPTX code generator that
1293corresponds to the "generic" address space.  This address space can represent
1294addresses in any other address space (with a few exceptions).  This allows
1295users to write IR functions that can load/store memory using the same
1296instructions. Intrinsics are provided to convert pointers between the generic
1297and non-generic address spaces.
1298
1299See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
1300
1301
1302Kernel Metadata
1303^^^^^^^^^^^^^^^
1304
1305In PTX, a function can be either a `kernel` function (callable from the host
1306program), or a `device` function (callable only from GPU code). You can think
1307of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
1308function as a `kernel` function, we make use of special LLVM metadata. The
1309NVPTX back-end will look for a named metadata node called
1310``nvvm.annotations``. This named metadata must contain a list of metadata that
1311describe the IR. For our purposes, we need to declare a metadata node that
1312assigns the "kernel" attribute to the LLVM IR function that should be emitted
1313as a PTX `kernel` function. These metadata nodes take the form:
1314
1315.. code-block:: text
1316
1317  !{<function ref>, metadata !"kernel", i32 1}
1318
1319For the previous example, we have:
1320
1321.. code-block:: llvm
1322
1323  !nvvm.annotations = !{!0}
1324  !0 = !{ptr @kernel, !"kernel", i32 1}
1325
1326Here, we have a single metadata declaration in ``nvvm.annotations``. This
1327metadata annotates our ``@kernel`` function with the ``kernel`` attribute.
1328
1329
1330Running the Kernel
1331------------------
1332
1333Generating PTX from LLVM IR is all well and good, but how do we execute it on
1334a real GPU device? The CUDA Driver API provides a convenient mechanism for
1335loading and JIT compiling PTX to a native GPU device, and launching a kernel.
1336The API is similar to OpenCL.  A simple example showing how to load and
1337execute our vector addition code is shown below. Note that for brevity this
1338code does not perform much error checking!
1339
1340.. note::
1341
1342  You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
1343  compile PTX to machine code (SASS) for a specific GPU architecture. Such
1344  binaries can be loaded by the CUDA Driver API in the same way as PTX. This
1345  can be useful for reducing startup time by precompiling the PTX kernels.
1346
1347
1348.. code-block:: c++
1349
1350  #include <iostream>
1351  #include <fstream>
1352  #include <cassert>
1353  #include "cuda.h"


  void checkCudaErrors(CUresult err) {
    assert(err == CUDA_SUCCESS);
  }

  /// main - Program entry point
  int main(int argc, char **argv) {
    CUdevice    device;
    CUmodule    cudaModule;
    CUcontext   context;
    CUfunction  function;
    CUlinkState linker;
    int         devCount;

    // CUDA initialization
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&devCount));
    checkCudaErrors(cuDeviceGet(&device, 0));

    char name[128];
    checkCudaErrors(cuDeviceGetName(name, 128, device));
    std::cout << "Using CUDA Device [0]: " << name << "\n";

    int devMajor, devMinor;
    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
    std::cout << "Device Compute Capability: "
              << devMajor << "." << devMinor << "\n";
    if (devMajor < 2) {
      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
      return 1;
    }

    std::ifstream t("kernel.ptx");
    if (!t.is_open()) {
      std::cerr << "kernel.ptx not found\n";
      return 1;
    }
    std::string str((std::istreambuf_iterator<char>(t)),
                      std::istreambuf_iterator<char>());

    // Create driver context
    checkCudaErrors(cuCtxCreate(&context, 0, device));

    // Create module for object
    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

    // Get kernel function
    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

    // Device data
    CUdeviceptr devBufferA;
    CUdeviceptr devBufferB;
    CUdeviceptr devBufferC;

    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

    float* hostA = new float[16];
    float* hostB = new float[16];
    float* hostC = new float[16];

    // Populate input
    for (unsigned i = 0; i != 16; ++i) {
      hostA[i] = (float)i;
      hostB[i] = (float)(2*i);
      hostC[i] = 0.0f;
    }

    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));


    unsigned blockSizeX = 16;
    unsigned blockSizeY = 1;
    unsigned blockSizeZ = 1;
    unsigned gridSizeX  = 1;
    unsigned gridSizeY  = 1;
    unsigned gridSizeZ  = 1;

    // Kernel parameters
    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

    std::cout << "Launching kernel\n";

    // Kernel launch
    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                   blockSizeX, blockSizeY, blockSizeZ,
                                   0, NULL, KernelParams, NULL));

    // Retrieve device data
    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));


    std::cout << "Results:\n";
    for (unsigned i = 0; i != 16; ++i) {
      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
    }


    // Clean up after ourselves
    delete [] hostA;
    delete [] hostB;
    delete [] hostC;

    // Clean-up
    checkCudaErrors(cuMemFree(devBufferA));
    checkCudaErrors(cuMemFree(devBufferB));
    checkCudaErrors(cuMemFree(devBufferC));
    checkCudaErrors(cuModuleUnload(cudaModule));
    checkCudaErrors(cuCtxDestroy(context));

    return 0;
  }


You will need to link against the CUDA driver library and specify the include
path for ``cuda.h``.

.. code-block:: text

  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

  Using CUDA Device [0]: GeForce GTX 680
  Device Compute Capability: 3.0
  Launching kernel
  Results:
  0 + 0 = 0
  1 + 2 = 3
  2 + 4 = 6
  3 + 6 = 9
  4 + 8 = 12
  5 + 10 = 15
  6 + 12 = 18
  7 + 14 = 21
  8 + 16 = 24
  9 + 18 = 27
  10 + 20 = 30
  11 + 22 = 33
  12 + 24 = 36
  13 + 26 = 39
  14 + 28 = 42
  15 + 30 = 45

.. note::

  You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  ; libdevice function
  declare float @__nv_powf(float, float)

  define void @kernel(ptr addrspace(1) %A,
                      ptr addrspace(1) %B,
                      ptr addrspace(1) %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %id
    %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %id
    %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %id

    ; Read A, B
    %valA = load float, ptr addrspace(1) %ptrA, align 4
    %valB = load float, ptr addrspace(1) %ptrB, align 4

    ; Compute C = pow(A, B)
    %valC = call float @__nv_powf(float %valA, float %valB)

    ; Store back to C
    store float %valC, ptr addrspace(1) %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = !{ptr @kernel, !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

  The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
  any undefined variables will default to zero. It is shown here for
  illustrative purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .pred  %p<30>;
    .reg .f32   %f<111>;
    .reg .s32   %r<21>;
    .reg .s64   %rl<8>;

  // %bb.0:                                // %entry
    ld.param.u64  %rl2, [kernel_param_0];
    mov.u32   %r3, %tid.x;
    ld.param.u64  %rl3, [kernel_param_1];
    mul.wide.s32  %rl4, %r3, 4;
    add.s64   %rl5, %rl2, %rl4;
    ld.param.u64  %rl6, [kernel_param_2];
    add.s64   %rl7, %rl3, %rl4;
    add.s64   %rl1, %rl6, %rl4;
    ld.global.f32   %f1, [%rl5];
    ld.global.f32   %f2, [%rl7];
    setp.eq.f32 %p1, %f1, 0f3F800000;
    setp.eq.f32 %p2, %f2, 0f00000000;
    or.pred   %p3, %p1, %p2;
    @%p3 bra  BB0_1;
    bra.uni   BB0_2;
  BB0_1:
    mov.f32   %f110, 0f3F800000;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_2:                                  // %__nv_isnanf.exit.i
    abs.f32   %f4, %f1;
    setp.gtu.f32  %p4, %f4, 0f7F800000;
    @%p4 bra  BB0_4;
  // %bb.3:                                // %__nv_isnanf.exit5.i
    abs.f32   %f5, %f2;
    setp.le.f32 %p5, %f5, 0f7F800000;
    @%p5 bra  BB0_5;
  BB0_4:                                  // %.critedge1.i
    add.f32   %f110, %f1, %f2;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_5:                                  // %__nv_isinff.exit.i

    ...

  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
    mul.f32   %f90, %f107, 0f3FB8AA3B;
    cvt.rzi.f32.f32 %f91, %f90;
    mov.f32   %f92, 0fBF317200;
    fma.rn.f32  %f93, %f91, %f92, %f107;
    mov.f32   %f94, 0fB5BFBE8E;
    fma.rn.f32  %f95, %f91, %f94, %f93;
    mul.f32   %f89, %f95, 0f3FB8AA3B;
    // inline asm
    ex2.approx.ftz.f32 %f88,%f89;
    // inline asm
    add.f32   %f96, %f91, 0f00000000;
    ex2.approx.f32  %f97, %f96;
    mul.f32   %f98, %f88, %f97;
    setp.lt.f32 %p15, %f107, 0fC2D20000;
    selp.f32  %f99, 0f00000000, %f98, %p15;
    setp.gt.f32 %p16, %f107, 0f42D20000;
    selp.f32  %f110, 0f7F800000, %f99, %p16;
    setp.eq.f32 %p17, %f110, 0f7F800000;
    @%p17 bra   BB0_28;
  // %bb.27:
    fma.rn.f32  %f110, %f110, %f108, %f110;
  BB0_28:                                 // %__internal_accurate_powf.exit.i
    setp.lt.f32 %p18, %f1, 0f00000000;
    setp.eq.f32 %p19, %f3, 0f3F800000;
    and.pred    %p20, %p18, %p19;
    @!%p20 bra  BB0_30;
    bra.uni   BB0_29;
  BB0_29:
    mov.b32    %r9, %f110;
    xor.b32   %r10, %r9, -2147483648;
    mov.b32    %f110, %r10;
  BB0_30:                                 // %__nv_powf.exit
    st.global.f32   [%rl1], %f110;
    ret;
  }
1682