This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the `CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
Compiling CUDA Code
===================

Prerequisites
-------------
Before you build CUDA code, you'll need to have installed the CUDA SDK.  See
`NVIDIA's CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details.
Note that clang may not support the CUDA toolkit as installed by many Linux
package managers; the most reliable way to make it work is to install CUDA in
a single directory from NVIDIA's ``.run`` package and specify its location via
the ``--cuda-path=...`` argument.
Invoking clang
--------------
(Clang detects that you're compiling CUDA code by noticing that your filename
ends with ``.cu``.  Alternatively, you can pass ``-x cuda``.)
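For reference, a minimal ``axpy.cu`` in this spirit (the listing is
illustrative; names and sizes are our choices):

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main() {
    const int kDataLen = 4;
    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to the device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch one block of kDataLen threads.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data back to the host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    for (int i = 0; i < kDataLen; ++i)
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    return 0;
  }
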
.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.
* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0 the CUDA SDK `no longer supports compilation of 32-bit applications
  <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.
* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.
  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.
You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.
The ``-L`` and ``-l`` flags only need to be passed when linking. When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda`` or
``/usr/local/cuda-X.Y``.
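Putting the pieces above together, a multi-arch compile against an SDK in a
non-default location might look like this (the archs and the
``/opt/cuda-11.8`` path are illustrative, not prescriptive):

.. code-block:: console

  $ clang++ axpy.cu -o axpy \
      --cuda-gpu-arch=sm_50 --cuda-gpu-arch=sm_70 \
      --cuda-path=/opt/cuda-11.8 \
      -L/opt/cuda-11.8/lib64 -lcudart_static -ldl -lrt -pthread
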
Flags that control numerical code
---------------------------------
* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but their use can affect numerical results.
* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations in device code may flush denormal inputs and/or
  outputs to 0.
* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions.

  This is implied by ``-ffast-math``.
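To see why contraction affects results, compare ``std::fma`` (a single
rounding, like a hardware FMA) with a separate multiply and subtract (two
roundings). This host-side example is our illustration, not part of the CUDA
toolchain:

.. code-block:: c++

  #include <cassert>
  #include <cmath>
  #include <cstdio>

  int main() {
    // x = 1 + 2^-27, so x*x = 1 + 2^-26 + 2^-54 exactly.  The 2^-54 term is
    // lost when the product is rounded to double before the subtraction.
    double x = 1.0 + std::ldexp(1.0, -27);
    volatile double prod = x * x;          // force the product to be rounded
    double unfused = prod - 1.0;           // two roundings: exactly 2^-26
    double fused = std::fma(x, x, -1.0);   // one rounding: 2^-26 + 2^-54
    std::printf("unfused = %a\nfused   = %a\n", unfused, fused);
    assert(fused != unfused);
    return 0;
  }
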
Standard library support
========================

``<math.h>`` and ``<cmath>``
----------------------------
In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.
In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace ``std`` (e.g. ``std::sinf``) are not available, and where the
standard calls for overloads that take integral arguments, these are usually
not available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) override is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``std::complex``
----------------
nvcc does not officially support ``std::complex``.  It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below).
As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
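A sketch of the kind of code this enables (our illustration, assuming a
standard library new enough per the above):

.. code-block:: c++

  #include <complex>

  // With clang, std::complex can be used directly in device code, without
  // relying on nvcc's wrong-side-rule loophole.
  __device__ std::complex<float> square(std::complex<float> z) {
    return z * z;
  }
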
``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min``
and ``std::max``) become constexpr.  You can therefore use these in device
code, when compiling with clang.
Detecting clang vs NVCC from code
=================================

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor).  So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

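As a quick sanity check (our illustration, built as plain host C++ rather than
CUDA), neither ``__CUDA__`` nor ``__CUDA_ARCH__`` is defined, so macro-based
detection falls through to the "not clang CUDA" branch:

.. code-block:: c++

  #include <cstdio>
  #include <cstring>

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  static const char* kMode = "clang CUDA, device";
  #elif defined(__clang__) && defined(__CUDA__)
  static const char* kMode = "clang CUDA, host";
  #else
  static const char* kMode = "not clang CUDA compilation";
  #endif

  int main() {
    // A plain C++ build (any compiler, no -x cuda) takes the last branch.
    std::printf("%s\n", kMode);
    return 0;
  }
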
Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language.  Below, we describe some of the differences.

Compilation Models
------------------
like).  ``F`` is packaged up into a header file which is force-included into
``H``; nvcc generates code that calls into this header to e.g. launch kernels.
clang uses *merged parsing*.  This is similar to split compilation, except all
of the host and device code is present and must be semantically-correct in both
compilation steps.
Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H" and "D" stand for ``__host__`` and ``__device__``, respectively.
nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee.  See `IdentifyCUDAPreference
<https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level, D functions prefer to call other Ds, H functions prefer
other Hs, and HD functions prefer the current side's overload.
358 "wrong-side rule", see example below.
Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // we don't codegen it and don't emit an error. inline_hd2() is called from
  // the device side, so we codegen it and emit an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device.  For that reason, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: they aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).
clang's behavior with respect to the wrong-side rule matches nvcc's, except
that nvcc only emits a warning for ``not_inline_hd``; device code is allowed
to call ``not_inline_hd``.
If you also need your code to compile with nvcc, you can ask clang to warn
about constructs that nvcc may not accept by passing:

.. code-block:: console

  -Wnvcc-compat

Using a Different Class on Host/Device
--------------------------------------
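One common pattern here (our sketch, not the original's listing) is to give
each side its own class definition, selected on ``__CUDA_ARCH__``, so that
each compilation sees a single consistent definition:

.. code-block:: c++

  // Illustrative only: a class whose members differ between host and device.
  // Because clang parses all code in both compilation modes, each mode must
  // see a well-formed definition; the #ifdef gives each mode exactly one.
  #ifdef __CUDA_ARCH__
  struct Buffer {
    __device__ float get(int i) const { return device_data[i]; }
    float* device_data;
  };
  #else
  struct Buffer {
    __host__ float get(int i) const { return host_data[i]; }
    float* host_data;
  };
  #endif

Note that this is only safe if the two definitions have identical layouts,
since objects may be passed between host and device.
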
Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to
LLVM to make it generate good GPU code.  Among these changes are:

* `Straight-line scalar optimizations <https://docs.google.com/document/d/1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-Xn2hUE>`_ -- These
  reduce redundancy within straight-line code.
* `Aggressive speculative execution
  <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.
* `Memory space inference
  <https://llvm.org/doxygen/InferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything.  Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so we must infer it where possible.
* `Bypassing 64-bit divides
  <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32-bits at runtime. This optimization provides a fast path for
  this common case.
* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer is more expensive on a GPU. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)
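The idea behind the 64-bit divide bypass can be sketched in ordinary C++ (our
illustration of the transformation, not the pass's actual code):

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  // When both operands fit in 32 bits, a 32-bit divide gives the same answer
  // and is much cheaper on NVIDIA GPUs; otherwise fall back to the slow path.
  uint64_t div_bypass(uint64_t a, uint64_t b) {
    if (((a | b) >> 32) == 0) {
      return static_cast<uint32_t>(a) / static_cast<uint32_t>(b);
    }
    return a / b;
  }

  int main() {
    assert(div_bypass(100, 7) == 14);                          // fast path
    assert(div_bypass(1ULL << 40, 3) == (1ULL << 40) / 3);     // slow path
    return 0;
  }
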
Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name:
the relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_
Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <https://llvm.org/docs/#mailing-lists>`_.