=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported since LLVM 3.9. Clang currently supports CUDA 7.0 through
12.1. If clang detects a newer CUDA version, it will issue a warning and will
attempt to use the detected CUDA SDK as if it were CUDA 12.1.

Before you build CUDA code, you'll need to have installed the CUDA SDK. See
`NVIDIA's CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details. Note that clang `may not support
<https://bugs.llvm.org/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by
some Linux package managers. Clang does attempt to deal with specific details of
CUDA installation on a handful of common Linux distributions, but in general the
most reliable way to make it work is to install CUDA in a single directory from
NVIDIA's `.run` package and specify its location via the `--cuda-path=...`
argument.

CUDA compilation is supported on Linux. Compilation on MacOS and Windows may or
may not work and currently has no maintainers.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)
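If you'd rather not fetch the gist, a minimal program along the same lines
looks roughly like this (a sketch, not necessarily the exact contents of the
linked gist; error checking is omitted for brevity):

.. code-block:: c++

  #include <cstdio>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main() {
    const int kN = 4;
    float x[kN] = {1, 2, 3, 4};
    float y[kN];

    // Copy the input to the device, launch the kernel, and copy the result
    // back. cudaMemcpy on the default stream waits for the kernel to finish.
    float *device_x, *device_y;
    cudaMalloc(&device_x, kN * sizeof(float));
    cudaMalloc(&device_y, kN * sizeof(float));
    cudaMemcpy(device_x, x, kN * sizeof(float), cudaMemcpyHostToDevice);
    axpy<<<1, kN>>>(2.0f, device_x, device_y);
    cudaMemcpy(y, device_y, kN * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < kN; ++i)
      std::printf("y[%d] = %g\n", i, y[i]);

    cudaFree(device_x);
    cudaFree(device_y);
    return 0;
  }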
To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0 the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.

The `-L` and `-l` flags only need to be passed when linking. When compiling,
you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code. (See the sketch after this list.)

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
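To make the ``-ffp-contract`` settings concrete, here is a small sketch (the
function names are made up for the example; whether fusion actually happens in
any given case is ultimately up to the compiler and ptxas):

.. code-block:: c++

  // The multiply and add are in a single statement, so under
  // -ffp-contract=on or -ffp-contract=fast they may become one fma
  // instruction (e.g. ptx fma.rn.f32), in which case the intermediate
  // product a*b is not rounded.
  __device__ float fused_candidate(float a, float b, float c) {
    return a * b + c;
  }

  // The multiply and add are in separate statements, so they may be fused
  // only under -ffp-contract=fast, never under -ffp-contract=on.
  __device__ float cross_statement(float a, float b, float c) {
    float product = a * b;
    return product + c;
  }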
Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``<std::complex>``
------------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr. You can therefore use these in device code,
when compiling with clang.
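For example, the following minimal sketch works when compiled with clang in
C++14 mode or later (e.g. with ``-std=c++14``):

.. code-block:: c++

  #include <algorithm>

  // std::min and std::max are constexpr in C++14, so clang can use them in
  // device code.
  __device__ int clamp_to_byte(int x) {
    return std::min(std::max(x, 0), 255);
  }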
Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

  * Run a preprocessor over the input ``.cu`` file to split it into two source
    files: ``H``, containing source code for the host, and ``D``, containing
    source code for the device.

  * For each GPU architecture ``arch`` that we're compiling for, do:

    * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
      ``P_arch``.

    * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
      ``S_arch``, containing GPU machine code (SASS) for ``arch``.

  * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
    single "fat binary" file, ``F``.

  * Compile ``H`` using an external host compiler (gcc, clang, or whatever you
    like). ``F`` is packaged up into a header file which is force-included into
    ``H``; nvcc generates code that calls into this header to e.g. launch
    kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

  * For each GPU architecture ``arch`` that we're compiling for, do:

    * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
      is parsed and must be semantically correct, even though we're not
      generating code for the host at this time.

      The output of this step is a ``ptx`` file, ``P_arch``.

    * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
      nvcc, clang always generates SASS code.

  * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
    single fat binary file, ``F``.

  * Compile the input ``.cu`` file for host, using clang. ``__device__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the device at this time.

    ``F`` is passed to this compilation, and clang includes it in a special ELF
    section, where it can be found by tools like ``cuobjdump``.
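You can observe the two halves of this process from the command line. The
following commands are a sketch, assuming a recent clang (exact driver
behavior may vary between versions): ``-###`` prints the sub-compilations the
driver would run, and ``--cuda-device-only`` / ``--cuda-host-only`` perform
just one side of the compilation.

.. code-block:: console

  $ # Print the individual sub-compilations: one device compilation per
  $ # --cuda-gpu-arch, plus one host compilation.
  $ clang++ -### -c axpy.cu --cuda-gpu-arch=sm_35

  $ # Compile only the device side and emit PTX for inspection.
  $ clang++ --cuda-device-only -S axpy.cu --cuda-gpu-arch=sm_35 -o axpy.ptx

  $ # Compile only the host side.
  $ clang++ --cuda-host-only -c axpy.cu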
(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

  * D functions prefer to call other Ds. HDs are given lower priority.

  * Similarly, H functions prefer to call other Hs, or ``__global__`` functions
    (with equal priority). HDs are given lower priority.

  * HD functions prefer to call other HDs.

    When compiling for device, HDs will call Ds with lower priority than HD,
    and will call Hs with still lower priority. If it's forced to call an H,
    the program is malformed if we emit code for this HD function. We call
    this the "wrong-side rule"; see the example below.

    The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // it does not generate an error. inline_hd2() is called from the device
  // side, so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: they aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.
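For example, the following sketch (the class name is invented for
illustration) is legal in clang:

.. code-block:: c++

  struct Counter {
    // Constructors, like other member functions, may be overloaded on the
    // __host__ and __device__ attributes...
    __host__ Counter() : n(0) {}
    __device__ Counter() : n(1) {}

    // ...but destructors may not be; a class gets a single destructor, which
    // may be __host__, __device__, or __host__ __device__.
    __host__ __device__ ~Counter() {}

    int n;
  };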
Clang Warnings for Host and Device Function Declarations
--------------------------------------------------------

Clang can emit warnings when it detects that host (H) and device (D) functions
are declared or defined with the same signature. These warnings are not
enabled by default.

To enable these warnings, use the following compiler flag:

.. code-block:: console

  -Wnvcc-compat
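For example, compiling the following with ``-Wnvcc-compat`` should produce a
warning, since nvcc rejects this code outright (a sketch; the exact warning
text may vary):

.. code-block:: c++

  // Legal in clang, but an error in nvcc; -Wnvcc-compat flags it so the code
  // stays portable between the two compilers.
  __host__ void foo() {}
  __device__ void foo() {}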
Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D. For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes. If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes. Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
  #ifndef __CUDA_ARCH__
  __host__ HostS MakeStruct() { return HostS(); }
  #else
  __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif
  #else
  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
LLVM to make it generate good GPU code. Among these changes are:

* `Straight-line scalar optimizations
  <https://docs.google.com/document/d/1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-Xn2hUE>`_
  -- These reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <https://llvm.org/doxygen/InferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything. Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32 bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)

Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
the relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <https://llvm.org/docs/#mailing-lists>`_.