This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the `CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
Compiling CUDA Code
===================

Prerequisites
-------------
Before you build CUDA code, you'll need to have installed the CUDA SDK.  See
`NVIDIA's CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details.
Note that clang may not support the CUDA toolkit as installed by many Linux
package managers; the most reliable way to make it work is to install CUDA in
a single directory from NVIDIA's ``.run`` package and specify its location via
the ``--cuda-path=...`` argument.
Invoking clang
--------------
(Clang detects that you're compiling CUDA code by noticing that your filename
ends with ``.cu``.  Alternatively, you can pass ``-x cuda``.)
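For reference, a minimal ``axpy.cu`` in this spirit (the listing is
illustrative; names and sizes are our choices):

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main() {
    const int kDataLen = 4;
    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to the device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch one block of kDataLen threads.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data back to the host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    for (int i = 0; i < kDataLen; ++i)
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    return 0;
  }
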
.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.
* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0 the CUDA SDK `no longer supports compilation of 32-bit applications
  <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.
* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.
  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.
You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.
The ``-L`` and ``-l`` flags only need to be passed when linking. When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda`` or
``/usr/local/cuda-X.Y``.
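Putting the pieces above together, a multi-arch compile against an SDK in a
non-default location might look like this (the archs and the
``/opt/cuda-11.8`` path are illustrative, not prescriptive):

.. code-block:: console

  $ clang++ axpy.cu -o axpy \
      --cuda-gpu-arch=sm_50 --cuda-gpu-arch=sm_70 \
      --cuda-path=/opt/cuda-11.8 \
      -L/opt/cuda-11.8/lib64 -lcudart_static -ldl -lrt -pthread
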
Flags that control numerical code
---------------------------------
* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but their use can affect numerical results.
* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations in device code may flush denormal inputs and/or
  outputs to 0.
* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions.

  This is implied by ``-ffast-math``.
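To see why contraction affects results, compare ``std::fma`` (a single
rounding, like a hardware FMA) with a separate multiply and subtract (two
roundings). This host-side example is our illustration, not part of the CUDA
toolchain:

.. code-block:: c++

  #include <cassert>
  #include <cmath>
  #include <cstdio>

  int main() {
    // x = 1 + 2^-27, so x*x = 1 + 2^-26 + 2^-54 exactly.  The 2^-54 term is
    // lost when the product is rounded to double before the subtraction.
    double x = 1.0 + std::ldexp(1.0, -27);
    volatile double prod = x * x;          // force the product to be rounded
    double unfused = prod - 1.0;           // two roundings: exactly 2^-26
    double fused = std::fma(x, x, -1.0);   // one rounding: 2^-26 + 2^-54
    std::printf("unfused = %a\nfused   = %a\n", unfused, fused);
    assert(fused != unfused);
    return 0;
  }
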
Standard library support
========================

``<math.h>`` and ``<cmath>``
----------------------------
In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.
In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace ``std`` (e.g. ``std::sinf``) are not available, and where the
standard calls for overloads that take integral arguments, these are usually
not available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) override is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``std::complex``
----------------
nvcc does not officially support ``std::complex``.  It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below).
As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
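A sketch of the kind of code this enables (our illustration, assuming a
standard library new enough per the above):

.. code-block:: c++

  #include <complex>

  // With clang, std::complex can be used directly in device code, without
  // relying on nvcc's wrong-side-rule loophole.
  __device__ std::complex<float> square(std::complex<float> z) {
    return z * z;
  }
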
``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min``
and ``std::max``) become constexpr.  You can therefore use these in device
code, when compiling with clang.
Detecting clang vs NVCC from code
=================================

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor).  So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

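As a quick sanity check (our illustration, built as plain host C++ rather than
CUDA), neither ``__CUDA__`` nor ``__CUDA_ARCH__`` is defined, so macro-based
detection falls through to the "not clang CUDA" branch:

.. code-block:: c++

  #include <cstdio>
  #include <cstring>

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  static const char* kMode = "clang CUDA, device";
  #elif defined(__clang__) && defined(__CUDA__)
  static const char* kMode = "clang CUDA, host";
  #else
  static const char* kMode = "not clang CUDA compilation";
  #endif

  int main() {
    // A plain C++ build (any compiler, no -x cuda) takes the last branch.
    std::printf("%s\n", kMode);
    return 0;
  }
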
Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language.  Below, we describe some of the differences.

Compilation Models
------------------
like).  ``F`` is packaged up into a header file which is force-included into
``H``; nvcc generates code that calls into this header to e.g. launch kernels.
clang uses *merged parsing*.  This is similar to split compilation, except all
of the host and device code is present and must be semantically-correct in both
compilation steps.
Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H" and "D" stand for ``__host__`` and ``__device__``, respectively.
nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee.  See `IdentifyCUDAPreference
<https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level, D functions prefer to call other Ds, H functions prefer
other Hs, and HD functions prefer the current side's overload.
358 "wrong-side rule", see example below.
Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // we don't codegen it and don't emit an error. inline_hd2() is called from
  // the device side, so we codegen it and emit an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device.  For that reason, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: they aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).
clang's behavior with respect to the wrong-side rule matches nvcc's, except
that nvcc only emits a warning for ``not_inline_hd``; device code is allowed
to call ``not_inline_hd``.
If you also need your code to compile with nvcc, you can ask clang to warn
about constructs that nvcc may not accept by passing:

.. code-block:: console

  -Wnvcc-compat

Using a Different Class on Host/Device
--------------------------------------
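One common pattern here (our sketch, not the original's listing) is to give
each side its own class definition, selected on ``__CUDA_ARCH__``, so that
each compilation sees a single consistent definition:

.. code-block:: c++

  // Illustrative only: a class whose members differ between host and device.
  // Because clang parses all code in both compilation modes, each mode must
  // see a well-formed definition; the #ifdef gives each mode exactly one.
  #ifdef __CUDA_ARCH__
  struct Buffer {
    __device__ float get(int i) const { return device_data[i]; }
    float* device_data;
  };
  #else
  struct Buffer {
    __host__ float get(int i) const { return host_data[i]; }
    float* host_data;
  };
  #endif

Note that this is only safe if the two definitions have identical layouts,
since objects may be passed between host and device.
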
Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to
LLVM to make it generate good GPU code.  Among these changes are:

* `Straight-line scalar optimizations <https://docs.google.com/document/d/1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-Xn2hUE>`_ -- These
  reduce redundancy within straight-line code.
* `Aggressive speculative execution
  <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.
* `Memory space inference
  <https://llvm.org/doxygen/InferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything.  Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so we must infer it where possible.
* `Bypassing 64-bit divides
  <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32-bits at runtime. This optimization provides a fast path for
  this common case.
* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer is more expensive on a GPU. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)
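The idea behind the 64-bit divide bypass can be sketched in ordinary C++ (our
illustration of the transformation, not the pass's actual code):

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  // When both operands fit in 32 bits, a 32-bit divide gives the same answer
  // and is much cheaper on NVIDIA GPUs; otherwise fall back to the slow path.
  uint64_t div_bypass(uint64_t a, uint64_t b) {
    if (((a | b) >> 32) == 0) {
      return static_cast<uint32_t>(a) / static_cast<uint32_t>(b);
    }
    return a / b;
  }

  int main() {
    assert(div_bypass(100, 7) == 14);                          // fast path
    assert(div_bypass(1ULL << 40, 3) == (1ULL << 40) / 3);     // slow path
    return 0;
  }
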
Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name:
the relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_
Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <https://llvm.org/docs/#mailing-lists>`_.