1============================== 2Convergent Operation Semantics 3============================== 4 5.. contents:: 6 :local: 7 :depth: 4 8 9Overview 10======== 11 12Some parallel execution environments execute threads in groups that allow 13efficient communication within the group using special primitives called 14*convergent* operations. The outcome of a convergent operation is sensitive to 15the set of threads that executes it "together", i.e., convergently. When control 16flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same 17group follow different 18paths through the CFG, not all threads of the group may be available to 19participate in this communication. This is the defining characteristic that 20distinguishes convergent operations from other inter-thread communication: 21 22 A convergent operation involves inter-thread communication or synchronization 23 that occurs outside of the memory model, where the set of threads which 24 participate in communication is implicitly affected by control flow. 25 26For example, in the following GPU compute kernel, communication during the 27convergent operation is expected to occur precisely among those threads of an 28implementation-defined execution scope (such as workgroup or subgroup) for 29which ``condition`` is true: 30 31.. code-block:: c++ 32 33 void example_kernel() { 34 ... 35 if (condition) 36 convergent_operation(); 37 ... 38 } 39 40In structured programming languages, there is often an intuitive and 41unambiguous way of determining the threads that are expected to communicate. 42However, this is not always the case even in structured programming languages, 43and the intuition breaks down entirely in unstructured control flow. This 44document describes the formal semantics in LLVM, i.e. how to determine the set 45of communicating threads for convergent operations. 46 47The definitions in this document leave many details open, such as how groups of 48threads are formed in the first place. It focuses on the questions that are 49relevant for deciding the correctness of generic program transforms and 50convergence-related analyses such as :ref:`uniformity analysis 51<convergence-and-uniformity>`. 52 53.. _convergent_operations: 54 55Convergent Operations 56===================== 57 58In LLVM IR, the only way to communicate between threads as described 59above is by calling target-defined convergent intrinsics. Hence, only 60a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke 61<i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a 62convergent operation. 63 64A function in LLVM IR is said to be *convergent* if it has the 65:ref:`convergent <attr_convergent>` attribute. 66 67A call-site in LLVM IR is said to be *convergent* if it is a direct 68call to a convergent function or it has the :ref:`convergent 69<attr_convergent>` attribute or a :ref:`convergencectrl operand bundle 70<convergencectrl>`. 71 72Informational notes: 73 74 A function may have to be treated as convergent if that function, or 75 transitively, any function called from it, contains a convergent call-site. A 76 frontend generating the ``convergent`` attribute should take this into account 77 when emitting functions and function calls. But this is not always the case: 78 79 A non-convergent function may contain convergent operations; such operations 80 do not directly depend on the set of threads that enter the function as a 81 single communicating group. Instead, these operations depend on an 82 implementation-defined subset of threads within the body of the function, as 83 shown in :ref:`opportunistic_convergence`. 84 85Examples of Convergent Operations 86======================================== 87 88(This section is informative.) 89 90Texture sampling in a pixel shader 91---------------------------------- 92 93The following stylized pixel shader samples a texture at a given set of 94coordinates, using the builtin function `textureSample`. Texture sampling 95requires screen-space derivatives of the coordinates to determine the level of 96detail (mipmap level) of the sample. They are commonly approximated by taking 97the difference between neighboring pixels, which are computed by different 98threads in the same group: 99 100.. code-block:: c++ 101 102 void example_shader() { 103 ... 104 color = textureSample(texture, coordinates); 105 if (condition) { 106 use(color); 107 } 108 ... 109 } 110 111From a purely single-threaded perspective, sinking the `textureSample` into 112the if-statement appears legal. However, if the condition is false for some 113neighboring pixels, then their corresponding threads will not execute together 114in the group, making it impossible to take the difference of coordinates as an 115approximation of the screen-space derivative. In practice, the outcome will be 116an undefined value. 117 118That is, the `textureSample` operation fits our definition of a convergent 119operation: 120 121 1. It communicates with a set of threads that implicitly depends on control 122 flow. 123 2. Correctness depends on this set of threads. 124 125The compiler frontend can emit IR that expresses the convergence constraints as 126follows: 127 128.. code-block:: llvm 129 130 define void @example_shader() convergent { 131 %entry = call token @llvm.experimental.convergence.entry() 132 ... 133 %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ] 134 br i1 %condition, label %then, label %end 135 136 then: 137 call void @use(T %color) 138 br label %end 139 140 end: 141 ret void 142 } 143 144The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>` 145intrinsic is itself ``convergent``, and we expect it to communicate at least 146among all threads of the same "quad" -- a group of 2x2 pixels that are 147evaluated together for the purpose of approximating screen-space derivatives. 148This fact is not part of the generic LLVM IR semantics; it would have to be 149defined somewhere else, for example as part of target-specific ABI definitions 150and/or in reference to some relevant API specs. 151 152Since the ``@textureSample`` call then uses the token produced by the entry 153intrinsic in its ``convergencectrl`` bundle, and has no additional control 154dependencies, it must communicate among the same set of threads. This indicates 155to generic program transforms that sinking the ``@textureSample`` call is 156forbidden. (A program transform can still sink the call if it can prove somehow, 157e.g. by leaning on target-specific callbacks that can analyze the program with 158additional knowledge, that ``%condition`` is always uniform across the threads 159referenced by the *convergence token* ``%entry``.) 160 161.. _convergence_example_reductions: 162 163Reductions inside divergent control flow 164---------------------------------------- 165 166The following example shows that merging common code of branches can be 167incorrect in the face of convergent operations: 168 169.. code-block:: c++ 170 171 void example_kernel() { 172 delta = ... 173 if (delta > 0) { 174 total_gains = subgroupAdd(delta); 175 ... 176 } else { 177 total_losses = subgroupAdd(delta); 178 ... 179 } 180 } 181 182The ``subgroupAdd`` computing the ``total_gains`` will be executed by the 183subset of threads with positive ``delta`` in a subgroup (wave), and so will sum 184up all the ``delta`` values of those threads; and similarly for the 185``subgroupAdd`` that computes the ``total_losses``. 186 187If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it 188would sum up the ``delta`` across *all* threads instead. 189 190The compiler frontend can emit IR that expresses the convergence constraints 191as follows: 192 193.. code-block:: llvm 194 195 define void @example_kernel() convergent { 196 %entry = call token @llvm.experimental.convergence.entry() 197 %delta = ... 198 %cc = icmp sgt i32 %delta, 0 199 br i1 %cc, label %then, label %else 200 201 then: 202 %total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ] 203 ... 204 br label %end 205 206 else: 207 %total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ] 208 ... 209 br label %end 210 211 end: 212 ... 213 } 214 215The entry intrinsic behaves like in the previous example: assuming that 216``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup" 217terminology), we expect it to communicate among all threads within the 218"subgroup". This typically maps to a SIMD vector on GPU hardware. 219 220The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic, 221but they also have an additional control dependency. According to the rules 222defined in this document, they only communicate among the subset of threads 223that actually end up executing the respective (static) call site. 224 225Hoisting them would remove the control dependency and cause them to communicate 226among the full set of threads that the entry intrinsic communicated with. 227Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform 228among the relevant set of threads: in that case, the ``@subgroupAdd`` already 229communicates among the full set of threads in the original program. 230 231Motivating Examples of Convergence Control 232========================================== 233 234(This section is informative.) 235 236Unstructured control flow 237------------------------- 238 239Consider an example of how jump threading removes structure in a way that can 240make semantics non-obvious without the convergence intrinsics described in this 241document: 242 243.. code-block:: llvm 244 245 void example_original() { 246 entry: 247 ... 248 br i1 %cond1, label %then1, label %mid 249 250 then1: 251 ... 252 %cond2 = ... 253 br label %mid 254 255 mid: 256 %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ] 257 br i1 %flag, label %then2, label %end 258 259 then2: 260 ... 261 call void @subgroupControlBarrier() 262 ... 263 br label %end 264 265 end: 266 } 267 268 void example_jumpthreaded() { 269 entry: 270 ... 271 br i1 %cond1, label %then1, label %then2 272 273 then1: 274 ... 275 %cond2 = ... 276 br i1 %cond2, label %then2, label %end 277 278 then2: 279 ... 280 call void @subgroupControlBarrier() 281 ... 282 br label %end 283 284 end: 285 } 286 287Is the control barrier guaranteed to synchronize among the same set of threads 288in both cases? Different implementations in the literature may give different 289answers to this question: 290 291* In an implementation that reconverges at post-dominators, threads reconverge 292 at ``mid`` in the first version, so that all threads (within a subgroup/wave) 293 that execute the control barrier do so together. In the second version, 294 threads that reach the control barrier via different paths synchronize 295 separately: the first (and only) post-dominator is ``end``, so threads do not 296 reconverge before then. 297 298* An implementation that sorts basic blocks topologically and ensures maximal 299 reconvergence for each basic block would behave the same way in both 300 versions. 301 302We generally take the stance that reconvergence in acyclic control flow must 303be maximal. The compiler frontend could augment the original code as follows: 304 305.. code-block:: llvm 306 307 define void @example_original() convergent { 308 entry: 309 %entry = call token @llvm.experimental.convergence.entry() 310 ... 311 br i1 %cond1, label %then1, label %mid 312 313 then1: 314 ... 315 %cond2 = ... 316 br label %mid 317 318 mid: 319 %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ] 320 br i1 %flag, label %then2, label %end 321 322 then2: 323 ... 324 call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ] 325 ... 326 br label %end 327 328 end: 329 } 330 331If S is the set of threads that the entry intrinsic communicated with, then 332the ``@subgroupControlBarrier`` call communicates with the subset of S that 333actually reaches the call site. This set of threads doesn't change after 334jump-threading, so the answer to the question posed above remains the same. 335 336.. _opportunistic_convergence: 337 338Opportunistic convergent operations 339----------------------------------- 340 341Some programs have local regions of code that contain a sequence of convergent 342operations where the code does not care about the exact set of threads with 343which it is executed, but only that the set of threads is the same for all the 344operations within the sequence. (If a subset of the convergent operations in the 345sequence have additional, non-uniform control dependencies, then this is not 346possible. However, the code may still require that the sets of threads are 347logically consistent with the conditions of those control dependencies.) In this 348case, :ref:`llvm.experimental.convergence.anchor 349<llvm.experimental.convergence.anchor>` can be used to express the desired 350semantics. 351 352The following example function could be part of a hypothetical "append buffer" 353implementation, where threads conditionally write fixed-sized records 354contiguously into a global buffer. The function ``@reserveSpaceInBuffer`` 355returns the index into the buffer at which the calling thread should store its 356data. 357 358This could be achieved by using a simple atomic operation in every thread to 359bump an allocation counter. 360 361However, the following implementation can be more performant on some hardware, 362because it uses only a single atomic operation for an entire group of threads. 363To do this, it first determines the total size of the group, which will be the 364operand to the atomic operation, and then later broadcasts the result of the 365atomic operation to all threads of the group, so that each thread can compute 366its individual position in the buffer: 367 368.. code-block:: llvm 369 370 define i32 @reserveSpaceInBuffer() { ; NOTE: _not_ a convergent function! 371 entry: 372 %anchor = call token @llvm.experimental.convergence.anchor() 373 374 %ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ] 375 %numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot) 376 %numThreads = trunc i64 %numThreads.p to i32 377 378 %absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId() 379 %absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64 380 %mask.p = shl i64 1, %absoluteThreadIdx.ext 381 %mask = sub i64 %mask.p, 1 382 383 %maskedBallot = and i64 %ballot, %mask 384 %relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot) 385 %relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32 386 387 %isFirstThread = icmp eq i32 %relativeThreadIdx, 0 388 br i1 %isFirstThread, label %then, label %end 389 390 then: 391 %baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic 392 br label %end 393 394 end: 395 %baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ] 396 %baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ] 397 %offset = add i32 %baseOffset, %relativeThreadIdx 398 ret i32 %offset 399 } 400 401The key here is that the function really doesn't care which set of threads it 402is being called with. It takes whatever set of threads it can get. What the 403implementation of the function cares about is that the initial 404``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that 405executed the anchor together -- executes with the same set of threads as the 406final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as 407far as convergence is concerned. 408 409The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers 410are free to move call sites of the function as they see fit. This can change 411the behavior in practice, by changing the sets of threads that are grouped 412together for the atomic operation. This can be visible in the output of the 413program, since the order in which outputs appear in the buffer is changed. 414However, this does not break the overall contract that ``@reserveSpaceInBuffer`` 415has with its caller -- which makes sense: the order of outputs is 416non-deterministic anyway because of the atomic operation that is involved. 417 418If the function is inlined, the use of the anchor intrinsic similarly indicates 419that certain transforms which are usually forbidden by the presence of 420convergent operations are in fact allowed, as long as they don't break up the 421region of code that is controlled by the anchor. 422 423.. _convergence_high-level_break: 424 425Extended Cycles: Divergent Exit from a Loop 426------------------------------------------- 427 428High-level languages typically provide a ``break`` statement that transfers 429control out of a loop statement. In most cases, the loop is structured and hence 430there is no ambiguity about convergence inside the loop. But an ambiguity arises 431when a ``break`` is control dependent on a divergent condition inside the loop. 432Consider the following example: 433 434.. code-block:: c++ 435 436 void example() { 437 // A 438 ... 439 for (...) { 440 // B 441 if (condition) { // divergent condition 442 // C 443 convergent_op(); 444 break; 445 } 446 // D 447 ... 448 } 449 // E 450 } 451 452In this program, the call to convergent_op() is lexically "inside" the ``for`` 453loop. But when translated to LLVM IR, the basic block B is an exiting block 454ending in a divergent branch, and the basic block C is an exit of the loop. 455Thus, the call to convergent_op() is outside the loop. This causes a mismatch 456between the programmer's expectation and the compiled program. The call should 457be executed convergently on every iteration of the loop, by threads that 458together take the branch to exit the loop. But when compiled, all threads that 459take the divergent exit on different iterations first converge at the beginning 460of basic block C and then together execute the call to convergent_op(). 461 462In this case, :ref:`llvm.experimental.convergence.loop 463<llvm.experimental.convergence.loop>` can be used to express the desired 464semantics. A call to this intrinsic is placed in the loop header, which tracks 465each iteration of the loop. The token produced by this is used as a 466``convergencectrl`` operand to the convergent call. The semantics of the 467``loop`` intrinsic ensures that the convergent call is performed convergently 468only by those threads that convergently exited the loop in a given iteration. 469 470.. code-block:: llvm 471 472 define void @example() convergent { 473 %entry = call token @llvm.experimental.convergence.entry() 474 br label %for 475 476 for: 477 %inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)] 478 %for.cond = i1 ... 479 br i1 %for.cond, label %B, label %E 480 481 B: 482 ... 483 %condition = i1 ... 484 br i1 %condition, label %C, label %D 485 486 C: 487 call void @convergent_op() ["convergencectrl"(token %inner)] 488 br label %E 489 490 D: 491 ... 492 br label %for 493 494 E: 495 ... 496 ret void 497 } 498 499The LLVM IR version of the same program shows a cycle consisting of the basic 500blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the 501divergent branch at the end of the exiting block ``%B``. But the use of 502convergence control tokens makes it clear that block ``%C`` must be executed 503convergently only by those threads that convergently take the exit edge from %B 504to ``%C``. In other words, the convergent execution of ``%C`` is governed by the 505call to the :ref:`llvm.experimental.convergence.loop 506<llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is 507effectively extended to include all uses of this token that lie outside the 508cycle. 509 510.. _dynamic_instances_and_convergence_tokens: 511 512Dynamic Instances and Convergence Tokens 513======================================== 514 515Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance 516<convergence-dynamic-instances>` of the instruction. Dynamic instances are the 517formal objects by which we talk about communicating threads in convergent 518operations. Dynamic instances are defined for *all* operations in an LLVM 519program, whether convergent or not. Convergence control is primarily about the 520dynamic instances of convergent operations since they affect execution of the 521program through inter-thread communication. The dynamic instances for 522non-convergent operations are relevant for determining :ref:`uniformity 523<convergence-and-uniformity>` of values. 524 525Dynamic instances produced by the execution of the same *convergent operation* 526by different threads may be :ref:`converged <convergence-definition>`. When 527executing a convergent operation, the set of threads that execute converged 528dynamic instances is the set of threads that communicate with each other. 529*Convergence tokens* capture this convergence as described below. 530 531*Convergence tokens* are values of ``token`` type, i.e. they cannot be used in 532``phi`` or ``select`` instructions. A convergence token value represents the 533dynamic instance of the instruction that produced it. 534 535Convergent operations may have an optional ``convergencectrl`` operand bundle with 536a convergence token operand to define the set of communicating threads relative 537to the operation that defined the token. 538 539 Let ``U`` be a convergent operation other than a call to a convergence 540 control intrinsic, and ``D`` be the convergent operation that defines 541 the token value used as the ``convergencectrl`` operand to ``U``. Two 542 threads execute converged dynamic instances of ``U`` if and only if the 543 token value in both threads was returned by converged dynamic 544 instances of ``D``. 545 546.. note:: 547 548 The text defines convergence token values as representing dynamic instances. 549 But if we were to assume that converged dynamic instances produce the same 550 token value, then we could almost think of the token value as representing a 551 set of threads instead -- specifically, the set ``S`` of threads that 552 executed converged dynamic instances of the defining instruction ``D``. 553 554 In this intuitive picture, when a convergence token value ``T`` is used by a 555 ``convergencectrl`` bundle on an instruction ``I``, then the set of threads that 556 communicates in ``I`` is a subset of the set ``S`` represented by the token value. 557 Specifically, it is the subset of threads that ends up executing ``I`` while 558 using the token value. 559 560 This by itself wouldn't quite work as a definition: what if ``I`` is executed 561 multiple times by the same threads? Which execution of ``I`` in thread 1 562 communicates with which execution of ``I`` in thread 2? Leaning on the notion 563 of dynamic instances gives a robust answer to this question as long as ``D`` 564 and ``I`` are at the same loop (or cycle) nesting level. 565 566 The case where ``D`` and ``I`` are at different loop nesting levels is 567 forbidden by the :ref:`static rules <convergence_static_rules>` -- handling 568 that case is the purpose of :ref:`llvm.experimental.convergence.loop 569 <llvm.experimental.convergence.loop>`. 570 571.. _convergence_control_intrinsics: 572 573Convergence Control Intrinsics 574============================== 575 576This section describes target-independent intrinsics that can be used to 577produce convergence tokens. 578 579Behaviour is undefined if a convergence control intrinsic is called 580indirectly. 581 582.. _llvm.experimental.convergence.entry: 583 584``llvm.experimental.convergence.entry`` 585---------------------------------------- 586 587.. code-block:: llvm 588 589 token @llvm.experimental.convergence.entry() convergent readnone 590 591This intrinsic is used to tie the dynamic instances inside of a function to 592those in the caller. 593 5941. If the function is called from outside the scope of LLVM, the convergence of 595 dynamic instances of this intrinsic are environment-defined. For example: 596 597 a. In an OpenCL *kernel launch*, the maximal set of threads that 598 can communicate outside the memory model is a *workgroup*. 599 Hence, a suitable choice is to specify that all the threads from 600 a single workgroup in OpenCL execute converged dynamic instances 601 of this intrinsic. 602 b. In a C/C++ program, threads are launched independently and they can 603 communicate only through the memory model. Hence the dynamic instances of 604 this intrinsic in a C/C++ program are never converged. 6052. If the function is called from a call-site in LLVM IR, then two 606 threads execute converged dynamic instances of this intrinsic if and 607 only if both threads entered the function by executing converged 608 dynamic instances of the call-site. 609 610This intrinsic can occur at most once in a function, and only in the entry 611block of the function. If this intrinsic occurs in a basic block, then it must 612precede any other convergent operation in the same basic block. 613 614It is an error if this intrinsic appears in a non-convergent function. 615 616It is an error to specify a ``convergencectrl`` operand bundle at a 617call to this intrinsic. 618 619Function inlining substitutes this intrinsic with the token from the operand 620bundle. For example: 621 622.. code-block:: c++ 623 624 // Before inlining: 625 626 void callee() convergent { 627 %tok = call token @llvm.experimental.convergence.entry() 628 convergent_operation(...) [ "convergencectrl"(token %tok) ] 629 } 630 631 void main() { 632 %outer = call token @llvm.experimental.convergence.anchor() 633 for (...) { 634 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 635 callee() [ "convergencectrl"(token %inner) ] 636 } 637 } 638 639 // After inlining: 640 641 void main() { 642 %outer = call token @llvm.experimental.convergence.anchor() 643 for (...) { 644 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 645 convergent_operation(...) [ "convergencectrl"(token %inner) ] 646 } 647 } 648 649.. _llvm.experimental.convergence.loop: 650 651``llvm.experimental.convergence.loop`` 652-------------------------------------- 653 654.. code-block:: llvm 655 656 token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone 657 658This intrinsic represents the place where an imaginary counter is incremented 659for determining convergence inside a control flow cycle. 660 661Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that 662defines the token value used as the ``convergencectrl`` operand to ``U``. Two 663threads execute converged dynamic instances of ``U`` if and only if: 664 6651. The token value in both threads was returned by converged dynamic 666 instances of ``D``, and, 6672. There is an integer *n* such that both threads execute ``U`` for the *n*'th time 668 with that token value. 669 670It is an error to omit the ``convergencectrl`` operand bundle on a 671call to this intrinsic. 672 673If this intrinsic occurs in a basic block, then it must precede any other 674convergent operation in the same basic block. 675 676.. _convergence_cycle_heart: 677 678**Heart of a Cycle:** 679 680 If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of 681 this intrinsic whose token operand is defined outside ``C``, then ``H`` is 682 called the heart of ``C``. 683 684 .. note:: 685 686 The static rules for cycles imply that a heart can occur only in the header 687 of a natural loop. This ensures that the heart closely represents the 688 intuitive notion of a loop iteration. If this restriction is relaxed, the 689 resulting semantics provides a new notion of "cycle iteration" even for 690 irreducible cycles. But this allows a natural loop to have a heart in a 691 node other than its header, which has interesting consequences on the 692 meaning of a loop iteration in terms of convergence. For now, we disallow 693 this situation since its practical application is very rare. 694 695.. _llvm.experimental.convergence.anchor: 696 697``llvm.experimental.convergence.anchor`` 698---------------------------------------- 699 700.. code-block:: llvm 701 702 token @llvm.experimental.convergence.anchor() convergent readnone 703 704This intrinsic produces an initial convergence token that is independent from 705any "outer scope". The set of threads executing converged dynamic instances of 706this intrinsic is implementation-defined. 707 708It is an error to pass a ``convergencectrl`` operand bundle at a 709call to this intrinsic. 710 711.. note:: 712 713 The expectation is that all threads within a group that "happen to be active 714 at the same time" will execute converged dynamic instances, so that programs 715 can detect the maximal set of threads that can communicate efficiently within 716 some local region of the program. 717 718.. _convergence_uncontrolled: 719 720Uncontrolled Convergent Operations 721================================== 722 723Convergent operations with an explicit ``convergencectrl`` operand bundle are 724called *controlled convergent operations*. All other convergent operations are 725said to be *uncontrolled*. 726 727An uncontrolled convergent operation is said to have *implicit convergence 728control* determined by the ``convergent`` attribute alone. The semantics of the 729``convergent`` attribute as implemented in LLVM differs from the documented 730semantics. The implementation tries to follow common intuition about convergent 731operations, which remains under-specified. As such, it is not possible to fully 732translate implicit convergence control into explicit convergence control tokens, 733and these two modes cannot be mixed in the same function. 734 735 If a function contains a controlled convergent operation, then all convergent 736 operations in that function must either be controlled operations or calls to 737 the convergence control intrinsics. 738 739Inferring Tokens 740---------------- 741 742(This section is informational) 743 744Sometimes, it may be necessary to reinterpret the implicit convergence control 745in terms of explicit convergence control tokens. For example, this may happen 746when a function call is inlined, and either the caller or the callee contains 747uncontrolled convergent operations. 748 749Some uses of uncontrolled convergent operations may need to satisfy the 750following property: 751 752 For an environment-defined group of threads (such as an OpenCL workgroup or 753 subgroup), if one thread in the group executes a convergent operation, then 754 all threads in the group do so convergently with that thread. 755 756In terms of explicit convergence control, this means that the 757``convergencectrl`` operand on each convergent operation ``X`` must ultimately 758originate from a call to the :ref:`llvm.experimental.convergence.entry 759<llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility 760that the group of threads that converge on reaching ``X`` is the same group that 761originally started executing the program in convergence. In comparison, the 762:ref:`llvm.experimental.convergence.anchor 763<llvm.experimental.convergence.anchor>` intrinsic captures an 764implementation-defined group of threads, which is insufficient to support the 765above property. 766 767One way to approximate implicit convergence control in terms of explicit 768convergence control tokens is the following procedure, which preserves the above 769mentioned property: 770 7711. Convert every irreducible cycle into a reducible cycle. 7722. Insert a call to :ref:`llvm.experimental.convergence.entry 773 <llvm.experimental.convergence.entry>` at the start of the entry block of the 774 function. 7753. Insert a call to :ref:`llvm.experimental.convergence.loop 776 <llvm.experimental.convergence.loop>` at the start of every loop header. If 777 this loop is an outermost loop, the ``convergencectrl`` operand is the call 778 to :ref:`llvm.experimental.convergence.entry 779 <llvm.experimental.convergence.entry>` in the entry block of the function. 780 Otherwise, the ``convergencectrl`` operand is the call to 781 :ref:`llvm.experimental.convergence.loop 782 <llvm.experimental.convergence.loop>` in the parent loop's header. 7834. For each uncontrolled convergent operation ``X``, add a ``convergencectrl`` 784 operand bundle using the token defined by a definition ``D`` that is a 785 :ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates 786 ``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to 787 :ref:`llvm.experimental.convergence.entry 788 <llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the 789 parent cycle of ``X``. 790 791.. _convergence_static_rules: 792 793Static Rules 794============ 795 796A *well-formed* program in LLVM IR must satisfy the following static 797rules about cycles and convergence regions. 798 799Closed Paths 800------------ 801 802A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of 803nodes and edges in the CFG whose start and end points are the same. 804 8051. Every closed path in the CFG that contains a use of a convergence token T other 806 than a use by 807 :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>` 808 must also contain the definition of T. 809 8102. Every closed path in the CFG that contains two different uses of a convergence 811 token T must also contain the definition of T. 812 8133. Every closed path in the CFG that contains uses of two different convergence tokens 814 T1 and T2 must also contain the definition of at least one of them. 815 816Taken together, these rules imply that for every closed path C, there can be at most 817one convergence token T which is used in C but defined outside of it, and that 818T can be used only once in C, and only by ``llvm.experimental.convergence.loop``. 819 8204. In every closed path that contains a use U of a token T but not the 821 definition of T, U must dominate all nodes in the closed path. 822 823This implies that ``llvm.experimental.convergence.loop`` can appear as a heart 824only in the header of a natural loop. 825 826**Sufficient Conditions:** From the :ref:`properties of cycles 827<cycle-closed-path>`, it is sufficient to prove the above properties 828for cycles instead of closed paths. Briefly, any closed path that violates 829one or more of the above static rules is contained in a cycle that also 830violates the same rule(s). 831 832.. _convergence_region: 833 834Convergence Regions 835------------------- 836 837The *convergence region* of a convergence token T is the minimal region in 838which T is live and used, i.e., the set of program points dominated by the 839definition D of T from which a use of T can be reached. 840 841The following static rule about convergence regions must be satisfied by 842valid programs: 843 844 If a convergence region R for a token T1 contains a use of a convergence 845 token T2, then R must also contain the definition of T2. (In other words, 846 convergence regions must be reasonably nested.) 847 848.. note:: 849 850 For brevity, this document uses the term "convergence region of a token 851 definition ``D``" to actually refer to the convergence region of the token 852 ``T`` defined by ``D``. 853 854.. _inferring_noconvergent: 855 856Inferring non-convergence 857========================= 858 859When the target or the environment guarantees that threads do not 860communicate using convergent operations or that threads never diverge, 861the dynamic instances in the program are irrelevant and an optimizer 862may remove any occurrence of the ``convergent`` attribute on a 863call-site or a function and any explicit ``convergencectrl`` operand 864bundle at a call-site. 865 866An optimizer may remove the ``convergent`` attribute and any explicit 867``convergencectrl`` operand bundle from a call-site if it can prove 868that the execution of this call-site always results in a call to a 869non-convergent function. 870 871An optimizer may remove the ``convergent`` attribute on a function if it can 872prove that the function does not contain a call to 873:ref:`llvm.experimental.convergence.entry 874<llvm.experimental.convergence.entry>`, or any uncontrolled convergent 875operations. 876 877Memory Model Non-Interaction 878============================ 879 880The fact that an operation is convergent has no effect on how it is treated for 881memory model purposes. In particular, an operation that is ``convergent`` and 882``readnone`` does not introduce additional ordering constraints as far as the 883memory model is concerned. There is no implied barrier, neither in the memory 884barrier sense nor in the control barrier sense of synchronizing the execution 885of threads. 886 887Informational note: Threads that execute converged dynamic instances do not 888necessarily do so at the same time. 889 890 891Other Interactions 892================== 893 894A function can be both ``convergent`` and 895``speculatable``, indicating that the function does not have undefined 896behavior and has no effects besides calculating its result, but is still 897affected by the set of threads executing this function. This typically 898prevents speculation of calls to the function unless the constraint imposed 899by ``convergent`` is further relaxed by some other means. 900 901Controlled Maximal Convergence 902============================== 903 904The :ref:`converged-with relation <convergence-definition>` over dynamic 905instances of each controlled convergent operation is completely defined by the 906semantics of convergence tokens. But the implementation-defined convergence at a 907call to :ref:`llvm.experimental.convergence.anchor 908<llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy 909chosen if it occurs inside an irreducible cycle. 910 911When the token defined by a convergent operation ``D`` is used at another 912convergent operation ``U``, the implementation must ensure that the threads that 913converge at ``U`` are all the threads that reached ``U`` after converging at 914``D``. On most implementations, it is reasonable to assume that only these 915threads are converged at every node they reach on any path from ``D`` to ``U``. 916In other words, the converged-with relation at ``D`` produces groups of threads 917that can converge only within each group, while inside the convergence region of 918``D``. 919 920All this affects the :ref:`maximal converged-with relation 921<convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged 922property <uniformity-analysis>` of static instances in the convergence region of 923``D``. 924 925.. _controlled_maximal_converged_with: 926 927 **Controlled Maximal converged-with Relation** 928 929 1. Dynamic instances of a *convergent operation* are related in the controlled 930 maximal converged-with relation according to the semantics of the convergence 931 control tokens. 932 2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the 933 same *non-convergent operation* ``X`` are related in the controlled maximal 934 converged-with relation if and only if: 935 936 1. Both threads executed converged dynamic instances of every token 937 definition ``D`` such that ``X`` is in the convergence region of ``D``, 938 and, 939 2. Either ``X`` is not contained in any cycle, or, for every cycle ``C`` 940 with header ``H`` that contains ``X``: 941 942 - every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the 943 respective thread is convergence-before ``X2``, and, 944 - every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the 945 respective thread is convergence-before ``X1``, 946 - without assuming that ``X1`` is converged with ``X2``. 947 948.. _controlled_m_converged: 949 950 **Controlled m-converged Static Instances** 951 952 A node ``X`` in a given CFG is reported to be m-converged if and only if: 953 954 1. For any token definition ``D`` such that ``X`` is inside the convergence region 955 of ``D``, ``D`` itself is m-converged, and, 956 2. Every cycle that contains ``X`` satisfies the following necessary 957 conditions: 958 959 a. Every divergent branch inside the cycle satisfies the :ref:`diverged 960 entry criterion<convergence-diverged-entry>`, and, 961 b. There are no :ref:`diverged paths reaching the 962 cycle<convergence-diverged-outside>` from a divergent branch outside it. 963 964Temporal Divergence at Cycle Exit 965--------------------------------- 966 967When a cycle has a divergent exit, maximal convergence assumes that all threads 968converge at the exit block. But if a controlled convergent operation outside the 969cycle uses a token defined by an operation ``D`` inside the cycle, the 970convergence region of ``D`` now extends outside the cycle. If two threads 971executed converged dynamic instances of ``D`` before exiting the cycle, then 972they continue to execute converged dynamic instances of nodes in the convergence 973region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the 974cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the 975output of converged dynamic instances of ``V``. If ``V`` is uniform, then its 976use at such a ``U`` is also uniform. In other words, temporal divergence applies 977only to a use of ``V`` that is outside the convergence region of ``D``. 978 979Rationales for Static rules about cycles 980======================================== 981 982(This section is informative.) 983 984.. note:: 985 986 For convenience, we use the operator ``==`` to represent the relation 987 ``converged-with`` and the operator ``!=`` to represent its negation. 988 989Consider a loop with (incorrect!) convergence control as in the following 990pseudocode: 991 992.. code-block:: llvm 993 994 ; WARNING: Example of incorrect convergence control! 995 996 %anchor = call token @llvm.experimental.convergence.anchor() 997 for (;;) { 998 ... 999 call void @convergent.op() [ "convergencectrl"(token %anchor) ] 1000 ... 1001 } 1002 1003This code is forbidden by the first static rule about cycles. 1004 1005A first formal argument why we have to do this is that the dynamic rule for 1006deciding whether two threads execute converged dynamic instances of 1007``@convergent.op`` leads to a logical contradiction in this code. 1008Assume two threads execute converged dynamic instances of the anchor 1009followed by two iterations of the loop. Thread 1 executes dynamic instances 1010I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2. 1011Using all the rules, we can deduce: 1012 10131. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances. 1014 10152. ``I1 == J1`` by the first dynamic rule about controlled convergent 1016 operations: both threads execute the same static instruction while using 1017 a convergence token value produced by converged dynamic instances of an 1018 instruction (the anchor). 1019 10203. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``. 1021 1022 The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2`` 1023 as being executed in different loop iterations is completely irrelevant for 1024 the *formal* argument. There is no mechanism in LLVM IR semantics for 1025 forming associations between loop iterations in different threads, *except* 1026 for the rules defined in this document -- and the rules in this document 1027 require a loop heart intrinsic for talking about loop iterations. 1028 10294. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a 1030 contradiction. 1031 1032This problem goes away by inserting a loop heart intrinsic as follows, which 1033establishes a relationship between loop iterations across threads. 1034 1035.. code-block:: llvm 1036 1037 %anchor = call token @llvm.experimental.convergence.anchor() 1038 for (;;) { 1039 %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ] 1040 ... 1041 call void @convergent.op() [ "convergencectrl"(token %loop) ] 1042 ... 1043 } 1044 1045In the same scenario of two threads executing converged dynamic instances of the 1046anchor and then two iterations of the loop, the dynamic rule about loop heart 1047intrinsics implies that both threads execute the converged dynamic instances of 1048the loop heart intrinsic in their respective first iterations and then again in 1049their respective second iterations of the loop. 1050 1051This then implies that they execute converged dynamic instances ``I1 == J1`` of 1052the ``@convergent.op`` in their first iterations and then 1053``I2 == J2`` in their second iterations. The rule is an "if and only if" rule, 1054so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions 1055see token values of ``%loop`` originating from non-converged dynamic 1056instances of the loop intrinsic. 1057 1058One may ask whether we could change the dynamic rule instead of adding the 1059static rule about cycles. That is impractical due to deeper difficulties. 1060Consider the following loop, again with incorrect convergence control: 1061 1062.. code-block:: llvm 1063 1064 ; WARNING: Example of incorrect convergence control! 1065 1066 ; (A) 1067 %anchor = call token @llvm.experimental.convergence.anchor() 1068 for (;;) { 1069 ; (B) 1070 if (condition1) { 1071 ; (C) 1072 call void @convergent.op.1() [ "convergencectrl"(token %anchor) ] 1073 } 1074 ; (D) 1075 if (condition2) { 1076 ; (E) 1077 call void @convergent.op.2() [ "convergencectrl"(token %anchor) ] 1078 } 1079 ; (F) 1080 } 1081 ; (G) 1082 1083Assume two threads execute converged dynamic instances of the anchor followed 1084by this sequence of basic blocks: 1085 1086.. code-block:: text 1087 1088 Thread 1: A B C D F B D E F G 1089 Thread 2: A B D E F B C D F G 1090 1091That is, both threads execute two iterations of the loop, but they execute 1092the different convergent operations in different iterations. Without forming a 1093relation between loop iterations across the threads, there is no reasonable way 1094of defining which dynamic instances of the convergent operations should be the 1095same across the threads, if any. 1096 1097Again, this can be addressed by adding a loop heart intrinsic, most naturally 1098as: 1099 1100.. code-block:: llvm 1101 1102 ; (A) 1103 %anchor = call token @llvm.experimental.convergence.anchor() 1104 for (;;) { 1105 ; (B) 1106 %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ] 1107 if (condition1) { 1108 ; (C) 1109 call void @convergent.op.1() [ "convergencectrl"(token %loop) ] 1110 } 1111 ; (D) 1112 if (condition2) { 1113 ; (E) 1114 call void @convergent.op.2() [ "convergencectrl"(token %loop) ] 1115 } 1116 ; (F) 1117 } 1118 ; (G) 1119 1120Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop 1121heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)`` 1122the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``. 1123Then we have: 1124 11251. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule 1126 about loop heart intrinsics. 1127 11282. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that 1129 different executions by the same thread happen in different dynamic 1130 instances. 1131 11323. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop`` 1133 referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that 1134 referring to ``%loop(2;2) == %loop(1;2)``, which is different from 1135 ``%loop(1;1)``. 1136 11374. Similarly, ``@op.2(1) != @op.2(2)``. 1138 1139However, loop heart intrinsics could be inserted differently, at the cost 1140of also inserting a free-standing anchor: 1141 1142.. code-block:: llvm 1143 1144 ; (A) 1145 %anchor = call token @llvm.experimental.convergence.anchor() 1146 for (;;) { 1147 ; (B) 1148 if (condition1) { 1149 ; (C) 1150 %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ] 1151 call void @convergent.op.1() [ "convergencectrl"(token %loop) ] 1152 } 1153 ; (D) 1154 if (condition2) { 1155 ; (E) 1156 %free = call token @llvm.experimental.convergence.anchor() 1157 call void @convergent.op.2() [ "convergencectrl"(token %free) ] 1158 } 1159 ; (F) 1160 } 1161 ; (G) 1162 1163This leads to the "unnatural counting of loop iterations" that is also mentioned 1164elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the 1165loop heart intrinsic by thread ``i`` (each thread executes it only once), and 1166let ``@op.k(i)`` be as before. Then: 1167 11681. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart 1169 intrinsics. 1170 11712. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop`` 1172 referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``. 1173 11743. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the 1175 use of the ``%free`` anchor intrinsic. 1176 1177 In practice, they almost certainly have to be non-converged dynamic 1178 instances. Consider that if an implementation strictly follows the order of 1179 instructions given in the program, the executions of the threads can be 1180 "aligned" as follows: 1181 1182 .. code-block:: text 1183 1184 Thread 1: A B C D F B D E F G 1185 Thread 2: A B D E F B C D F G 1186 1187 So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there 1188 can be no communication between the threads, which means they execute 1189 non-converged dynamic instances. 1190 1191 That said, it is conceivable that there aren't actually any data or other 1192 dependencies that would enforce this execution order. In that case, a highly 1193 out-of-order implementation could potentially allow communication. That's 1194 why the rules defined in this document are silent about whether 1195 ``@op.2(1) == @op.2(2)`` or not. 1196 1197This type of convergence control seems relatively unlikely to appear in real 1198programs. Its possibility is simply a logical consequence of the model. 1199 1200An equivalent issue arises if the convergent operations are replaced by nested 1201loops with loop heart intrinsics that directly refer to ``%anchor``, hence 1202the variants of the static rules about cycles that apply to them: 1203 1204.. code-block:: llvm 1205 1206 ; WARNING: Example of incorrect convergence control! 1207 1208 %anchor = call token @llvm.experimental.convergence.anchor() 1209 for (;;) { 1210 if (condition1) { 1211 for (;;) { 1212 %loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ] 1213 } 1214 } 1215 if (condition2) { 1216 for (;;) { 1217 %loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ] 1218 } 1219 } 1220 } 1221 1222There is a cycle (closed walk in the CFG) that goes through both loop heart 1223intrinsics using ``%anchor`` but not through the definition of ``%anchor``, 1224so this code is invalid. 1225 1226 1227Examples for the Correctness of Program Transforms 1228================================================== 1229 1230(This section is informative.) 1231 1232As implied by the rules in the previous sections, program transforms are correct 1233with respect to convergent operations if they preserve or refine their 1234semantics. This means that the set of communicating threads in the transformed 1235program must have been possible in the original program. 1236 1237Program transforms with a single-threaded focus are generally conservatively 1238correct if they do not sink or hoist convergent operations across a branch. 1239This applies even to program transforms that change the control flow graph. 1240 1241For example, unrolling a loop that does not contain convergent operations 1242cannot break any of the guarantees required for convergent operations outside 1243of the loop. 1244 1245 1246Loop unrolling examples 1247----------------------- 1248 1249We consider three kinds of loop unrolling here: 1250 1251* Partial unrolling with no known trip multiple, so a "tail" is required to 1252 collect the remaining elements. 1253* Partial unrolling by a trip multiple, so no "tail" is required. 1254* Full unrolling, which eliminates the loop. 1255 1256The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is 1257used. We illustrate the reasoning with some examples. 1258 1259First, an arbitrary loop that contains convergent operations *can* be unrolled 1260in all of these ways, even with "tail", if all convergent operations refer back 1261to an anchor inside the loop. For example (in pseudo-code): 1262 1263.. code-block:: llvm 1264 1265 while (counter > 0) { 1266 %tok = call token @llvm.experimental.convergence.anchor() 1267 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1268 counter--; 1269 } 1270 1271This can be unrolled to: 1272 1273.. code-block:: llvm 1274 1275 while (counter >= 2) { 1276 %tok = call token @llvm.experimental.convergence.anchor() 1277 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1278 %tok = call token @llvm.experimental.convergence.anchor() 1279 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1280 counter -= 2; 1281 } 1282 while (counter > 0) { 1283 %tok = call token @llvm.experimental.convergence.anchor() 1284 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1285 counter--; 1286 } 1287 1288This is likely to change the behavior of the convergent operation if there 1289are threads whose initial counter value is not a multiple of 2. In particular, 1290all threads with an odd trip count are now likely to execute the convergent 1291operation in their respective final iterations together because the 1292underlying implementation is likely to try to group as many threads together 1293as possible for the execution of the "tail". 1294 1295This change is allowed because the anchor intrinsic has implementation-defined 1296convergence behavior and the loop unrolling transform is considered to be part 1297of the implementation. Another way of reasoning is that while the *likely* 1298behavior of the code has changed, the *guarantees* about its behavior have 1299remained the same. 1300 1301If the loop contains uncontrolled convergent operations, this kind of unrolling 1302is forbidden. 1303 1304Unrolling a loop with convergent operations that refer to tokens produced 1305outside the loop is forbidden when a "tail" or "remainder" would have to 1306be introduced. Consider: 1307 1308.. code-block:: llvm 1309 1310 ; (A) 1311 %outer = call token @llvm.experimental.convergence.anchor() 1312 while (counter > 0) { 1313 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 1314 ; (B) 1315 call void @convergent.operation() [ "convergencectrl"(token %inner) ] 1316 counter--; 1317 } 1318 ; (C) 1319 1320To understand why unrolling is forbidden, consider two threads that execute 1321converged dynamic instances of the anchor and then proceed with 3 and 4 loop 1322iterations, respectively: 1323 1324.. code-block:: text 1325 1326 Thread 1: A B B B C 1327 Thread 2: A B B B B C 1328 1329By the dynamic rule on loop heart intrinsics, these threads execute converged 1330dynamic instances of the loop intrinsic for the first 3 iterations, and then 1331thread 2 executes another dynamic instance by itself. 1332 1333By the dynamic rule on general convergent operations, the threads execute 1334converged dynamic instances of the ``@convergent.operation`` in the first 3 1335iterations (that is, the dynamic instance executed by thread 1 in iteration 1336*n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*; 1337the dynamic instance executed in iteration 1 is different from that in 1338iteration 2, etc.). 1339 1340Now assume that the loop is unrolled by a factor of 2, which requires a 1341remainder as follows: 1342 1343.. code-block:: llvm 1344 1345 ; (A) 1346 %outer = call token @llvm.experimental.convergence.anchor() 1347 while (counter >= 2) { 1348 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 1349 ; (B) 1350 call void @convergent.operation() [ "convergencectrl"(token %inner) ] 1351 call void @convergent.operation() [ "convergencectrl"(token %inner) ] 1352 counter -= 2; 1353 } 1354 ; (C) 1355 if (counter > 0) { 1356 %remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 1357 ; (D) 1358 call void @convergent.operation() [ "convergencectrl"(token %remainder) ] 1359 } 1360 ; (E) 1361 1362First of all, note some interesting problems surrounding the loop intrinsic: 1363 13641. It is *not* duplicated inside the unrolled loop. This is to comply with 1365 the :ref:`convergence_static_rules`. 1366 13672. It is unclear whether the loop intrinsic ought to be duplicated in the 1368 remainder, or whether the final ``@convergent.operation`` in D should just 1369 refer to either ``%inner`` (which is possible in SSA form) or directly to 1370 ``%outer``. The decision made here is arbitrary and doesn't change the 1371 argument that follows. Ultimately, it simply doesn't matter because the 1372 transform is incorrect either way. 1373 1374The threads now execute the following sequences of blocks: 1375 1376.. code-block:: text 1377 1378 Thread 1: A B C D E 1379 Thread 2: A B B C D E 1380 1381Analogous to the argument above, they execute converged dynamic instances of the 1382``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration 1383of the unrolled loop, which corresponds to the first 2 iterations of the 1384original loop. 1385 1386However, they execute different static calls to ``@convergent.operation`` for 1387the 3rd iteration of the original loop. In thread 1, that iteration corresponds 1388to the call in the remainder, while in thread 2 it corresponds to the first 1389call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute 1390non-converged dynamic instances, which means that the set of communicating threads 1391for the 3rd iteration of the original loop is different. This is why the 1392unrolling is incorrect. 1393 1394On the other hand, unrolling without "tail" is allowed. For example, assuming 1395that the trip counter is known to be a multiple of 2, we can unroll the loop 1396as follows: 1397 1398.. code-block:: llvm 1399 1400 %outer = call token @llvm.experimental.convergence.anchor() 1401 while (counter > 0) { 1402 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 1403 call void @convergent.operation() [ "convergencectrl"(token %inner) ] 1404 call void @convergent.operation() [ "convergencectrl"(token %inner) ] 1405 counter -= 2; 1406 } 1407 1408Note again that the loop intrinsic is not duplicated. 1409 1410The 1411:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>` 1412intrinsic is typically expected to appear in the header of a natural loop. 1413However, it can also appear in non-header blocks of a loop. In that case, the 1414loop can generally not be unrolled. 1415 1416 1417Hoisting and sinking 1418-------------------- 1419 1420In general, hoisting and sinking of convergent operations is forbidden. This is 1421because moving the operation to a different point in control flow generally 1422changes the set of threads that reach the operation and therefore, the set of 1423threads that execute converged dynamic instances of the operation. By 1424definition, this changes the set of threads that participate in the 1425communication of the convergent operation, which will typically change its 1426result. 1427 1428There are a number of exceptions, though most of them require additional 1429knowledge. 1430 1431For example, hoisting and sinking across *uniform* conditional branches -- i.e., 1432conditional branches where within every possible relevant set of threads, all 1433threads will always take the same direction -- is generally allowed. See the end 1434of the :ref:`example of reductions inside control flow 1435<convergence_example_reductions>` for a brief discussion. 1436 1437Some convergent operations can be hoisted but not sunk, or vice versa. A simple 1438example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data`` 1439operand of the thread identified by ``id``, where thread IDs are fixed and 1440assigned to each thread at launch. The result is undefined (or perhaps there is 1441UB, depending on the language and environment) if thread ``id`` is not in the 1442communicating set of threads. So hoisting is allowed in the following 1443pseudo-code example: 1444 1445.. code-block:: llvm 1446 1447 define void @example(...) convergent { 1448 %entry = call token @llvm.experimental.convergence.entry() 1449 %data = ... 1450 %id = ... 1451 if (condition) { 1452 %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ] 1453 ... 1454 } else { 1455 %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ] 1456 ... 1457 } 1458 } 1459 1460After hoisting the calls to ``@subgroupShuffle``, the communicating set of 1461threads is the union of the two sets of threads in the original program, so 1462``%id`` can only go "out of range" after hoisting if it did so in the original 1463program. 1464 1465However, speculative execution of ``@subgroupShuffle`` in the following program 1466may be forbidden: 1467 1468.. code-block:: llvm 1469 1470 define void @example(...) convergent { 1471 %entry = call token @llvm.experimental.convergence.entry() 1472 %data = ... 1473 %id = ... 1474 if (condition) { 1475 %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ] 1476 ... 1477 } 1478 } 1479 1480There is no guarantee about the value of ``%id`` in the threads where 1481``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when 1482``%id`` is outside of the set of communicating threads, then speculating and 1483hoisting ``@subgroupShuffle`` might introduce UB. 1484 1485On the other hand, if ``@subgroupShuffle`` is defined such that it merely 1486produces an undefined value or poison as result when ``%id`` is "out of range", 1487then speculating is okay. 1488 1489Even though 1490:ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>` 1491is marked as ``convergent``, it can be sunk in some cases. For example, in 1492pseudo-code: 1493 1494.. code-block:: llvm 1495 1496 %tok = call token @llvm.experimental.convergence.anchor() 1497 if (condition) { 1498 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1499 } 1500 1501Assuming that ``%tok`` is only used inside the conditional block, the anchor can 1502be sunk. The rationale is two-fold. First, the anchor has implementation-defined 1503behavior, and the sinking is part of the implementation. Second, already in the 1504original program, the set of threads that communicates in the 1505``@convergent.operation`` is automatically subset to the threads for which 1506``condition`` is true. 1507 1508Anchors can be hoisted in acyclic control flow. For example: 1509 1510.. code-block:: llvm 1511 1512 if (condition) { 1513 %tok1 = call token @llvm.experimental.convergence.anchor() 1514 call void @convergent.operation() [ "convergencectrl"(token %tok1) ] 1515 } else { 1516 %tok2 = call token @llvm.experimental.convergence.anchor() 1517 call void @convergent.operation() [ "convergencectrl"(token %tok2) ] 1518 } 1519 1520The anchors can be hoisted, resulting in: 1521 1522.. code-block:: llvm 1523 1524 %tok = call token @llvm.experimental.convergence.anchor() 1525 if (condition) { 1526 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1527 } else { 1528 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1529 } 1530 1531The behavior is unchanged, since each of the static convergent operations only 1532ever communicates with threads that have the same ``condition`` value. 1533By contrast, hoisting the convergent operations themselves is forbidden. 1534 1535Hoisting and sinking anchors out of and into loops is forbidden. For example: 1536 1537.. code-block:: llvm 1538 1539 for (;;) { 1540 %tok = call token @llvm.experimental.convergence.anchor() 1541 call void @convergent.operation() [ "convergencectrl"(token %tok) ] 1542 } 1543 1544Hoisting the anchor would make the program invalid according to the static 1545validity rules. Conversely: 1546 1547.. code-block:: llvm 1548 1549 %outer = call token @llvm.experimental.convergence.anchor() 1550 while (counter > 0) { 1551 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ] 1552 call void @convergent.operation() [ "convergencectrl"(token %inner) ] 1553 counter--; 1554 } 1555 1556The program would stay valid if the anchor was sunk into the loop, but its 1557behavior could end up being different. If the anchor is inside the loop, then 1558each loop iteration has a new dynamic instance of the anchor, and the set of 1559threads participating in those dynamic instances of the anchor could be 1560different in arbitrary implementation-defined ways. Via the dynamic rules about 1561dynamic instances of convergent operations, this then implies that the set of 1562threads executing ``@convergent.operation`` could be different in each loop 1563iteration in arbitrary implementation-defined ways. 1564 1565Convergent operations can be sunk together with their anchor. Again in 1566pseudo-code: 1567 1568.. code-block:: llvm 1569 1570 %tok = call token @llvm.experimental.convergence.anchor() 1571 %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ] 1572 %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ] 1573 if (condition) { 1574 use(%a, %b) 1575 } 1576 1577Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional 1578block, all can be sunk together: 1579 1580.. code-block:: llvm 1581 1582 if (condition) { 1583 %tok = call token @llvm.experimental.convergence.anchor() 1584 %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ] 1585 %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ] 1586 use(%a, %b) 1587 } 1588 1589The rationale is that the anchor intrinsic has implementation-defined behavior, 1590and the sinking transform is considered to be part of the implementation: 1591the sinking will restrict the set of communicating threads to those for which 1592``condition`` is true, but that could have happened in the original program 1593anyway for some arbitrary other reason. 1594 1595However, sinking *only* the convergent operation producing ``%b`` would be 1596incorrect. That would allow threads for which ``condition`` is false to 1597communicate at ``%a``, but not at ``%b``, which the original program doesn't 1598allow. 1599 1600Note that the entry intrinsic behaves differently. Sinking the convergent 1601operations is forbidden in the following snippet: 1602 1603.. code-block:: llvm 1604 1605 %tok = call token @llvm.experimental.convergence.entry() 1606 %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ] 1607 %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ] 1608 if (condition) { 1609 use(%a, %b) 1610 } 1611