xref: /llvm-project/llvm/docs/ConvergentOperations.rst (revision 256d76f48060a353ba3bb885698e2ba8d1c87ec6)
1==============================
2Convergent Operation Semantics
3==============================
4
5.. contents::
6   :local:
7   :depth: 4
8
9Overview
10========
11
12Some parallel execution environments execute threads in groups that allow
13efficient communication within the group using special primitives called
14*convergent* operations. The outcome of a convergent operation is sensitive to
15the set of threads that executes it "together", i.e., convergently. When control
16flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same
17group follow different
18paths through the CFG, not all threads of the group may be available to
19participate in this communication. This is the defining characteristic that
20distinguishes convergent operations from other inter-thread communication:
21
22  A convergent operation involves inter-thread communication or synchronization
23  that occurs outside of the memory model, where the set of threads which
24  participate in communication is implicitly affected by control flow.
25
26For example, in the following GPU compute kernel, communication during the
27convergent operation is expected to occur precisely among those threads of an
28implementation-defined execution scope (such as workgroup or subgroup) for
29which ``condition`` is true:
30
31.. code-block:: c++
32
33  void example_kernel() {
34      ...
35      if (condition)
36          convergent_operation();
37      ...
38  }
39
40In structured programming languages, there is often an intuitive and
41unambiguous way of determining the threads that are expected to communicate.
42However, this is not always the case even in structured programming languages,
43and the intuition breaks down entirely in unstructured control flow. This
44document describes the formal semantics in LLVM, i.e. how to determine the set
45of communicating threads for convergent operations.
46
47The definitions in this document leave many details open, such as how groups of
48threads are formed in the first place. It focuses on the questions that are
49relevant for deciding the correctness of generic program transforms and
50convergence-related analyses such as :ref:`uniformity analysis
51<convergence-and-uniformity>`.
52
53.. _convergent_operations:
54
55Convergent Operations
56=====================
57
58In LLVM IR, the only way to communicate between threads as described
59above is by calling target-defined convergent intrinsics. Hence, only
60a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke
61<i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a
62convergent operation.
63
64A function in LLVM IR is said to be *convergent* if it has the
65:ref:`convergent <attr_convergent>` attribute.
66
67A call-site in LLVM IR is said to be *convergent* if it is a direct
68call to a convergent function or it has the :ref:`convergent
69<attr_convergent>` attribute or a :ref:`convergencectrl operand bundle
70<convergencectrl>`.
71
72Informational notes:
73
74  A function may have to be treated as convergent if that function, or
75  transitively, any function called from it, contains a convergent call-site. A
76  frontend generating the ``convergent`` attribute should take this into account
77  when emitting functions and function calls. But this is not always the case:
78
79  A non-convergent function may contain convergent operations; such operations
80  do not directly depend on the set of threads that enter the function as a
81  single communicating group. Instead, these operations depend on an
82  implementation-defined subset of threads within the body of the function, as
83  shown in :ref:`opportunistic_convergence`.
84
85Examples of Convergent Operations
86========================================
87
88(This section is informative.)
89
90Texture sampling in a pixel shader
91----------------------------------
92
93The following stylized pixel shader samples a texture at a given set of
94coordinates, using the builtin function `textureSample`. Texture sampling
95requires screen-space derivatives of the coordinates to determine the level of
96detail (mipmap level) of the sample. They are commonly approximated by taking
97the difference between neighboring pixels, which are computed by different
98threads in the same group:
99
100.. code-block:: c++
101
102  void example_shader() {
103    ...
104    color = textureSample(texture, coordinates);
105    if (condition) {
106      use(color);
107    }
108    ...
109  }
110
111From a purely single-threaded perspective, sinking the `textureSample` into
112the if-statement appears legal. However, if the condition is false for some
113neighboring pixels, then their corresponding threads will not execute together
114in the group, making it impossible to take the difference of coordinates as an
115approximation of the screen-space derivative. In practice, the outcome will be
116an undefined value.
117
118That is, the `textureSample` operation fits our definition of a convergent
119operation:
120
121 1. It communicates with a set of threads that implicitly depends on control
122    flow.
123 2. Correctness depends on this set of threads.
124
125The compiler frontend can emit IR that expresses the convergence constraints as
126follows:
127
128.. code-block:: llvm
129
130  define void @example_shader() convergent {
131    %entry = call token @llvm.experimental.convergence.entry()
132    ...
133    %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
134    br i1 %condition, label %then, label %end
135
136  then:
137    call void @use(T %color)
138    br label %end
139
140  end:
141    ret void
142  }
143
144The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
145intrinsic is itself ``convergent``, and we expect it to communicate at least
146among all threads of the same "quad" -- a group of 2x2 pixels that are
147evaluated together for the purpose of approximating screen-space derivatives.
148This fact is not part of the generic LLVM IR semantics; it would have to be
149defined somewhere else, for example as part of target-specific ABI definitions
150and/or in reference to some relevant API specs.
151
152Since the ``@textureSample`` call then uses the token produced by the entry
153intrinsic in its ``convergencectrl`` bundle, and has no additional control
154dependencies, it must communicate among the same set of threads. This indicates
155to generic program transforms that sinking the ``@textureSample`` call is
156forbidden. (A program transform can still sink the call if it can prove somehow,
157e.g. by leaning on target-specific callbacks that can analyze the program with
158additional knowledge, that ``%condition`` is always uniform across the threads
159referenced by the *convergence token* ``%entry``.)
160
161.. _convergence_example_reductions:
162
163Reductions inside divergent control flow
164----------------------------------------
165
166The following example shows that merging common code of branches can be
167incorrect in the face of convergent operations:
168
169.. code-block:: c++
170
171  void example_kernel() {
172    delta = ...
173    if (delta > 0) {
174      total_gains = subgroupAdd(delta);
175      ...
176    } else {
177      total_losses = subgroupAdd(delta);
178      ...
179    }
180  }
181
182The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
183subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
184up all the ``delta`` values of those threads; and similarly for the
185``subgroupAdd`` that computes the ``total_losses``.
186
187If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
188would sum up the ``delta`` across *all* threads instead.
189
190The compiler frontend can emit IR that expresses the convergence constraints
191as follows:
192
193.. code-block:: llvm
194
195  define void @example_kernel() convergent {
196    %entry = call token @llvm.experimental.convergence.entry()
197    %delta = ...
198    %cc = icmp sgt i32 %delta, 0
199    br i1 %cc, label %then, label %else
200
201  then:
202    %total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
203    ...
204    br label %end
205
206  else:
207    %total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
208    ...
209    br label %end
210
211  end:
212    ...
213  }
214
215The entry intrinsic behaves like in the previous example: assuming that
216``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
217terminology), we expect it to communicate among all threads within the
218"subgroup". This typically maps to a SIMD vector on GPU hardware.
219
220The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
221but they also have an additional control dependency. According to the rules
222defined in this document, they only communicate among the subset of threads
223that actually end up executing the respective (static) call site.
224
225Hoisting them would remove the control dependency and cause them to communicate
226among the full set of threads that the entry intrinsic communicated with.
227Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
228among the relevant set of threads: in that case, the ``@subgroupAdd`` already
229communicates among the full set of threads in the original program.
230
231Motivating Examples of Convergence Control
232==========================================
233
234(This section is informative.)
235
236Unstructured control flow
237-------------------------
238
239Consider an example of how jump threading removes structure in a way that can
240make semantics non-obvious without the convergence intrinsics described in this
241document:
242
243.. code-block:: llvm
244
245  void example_original() {
246  entry:
247      ...
248      br i1 %cond1, label %then1, label %mid
249
250  then1:
251      ...
252      %cond2 = ...
253      br label %mid
254
255  mid:
256      %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
257      br i1 %flag, label %then2, label %end
258
259  then2:
260      ...
261      call void @subgroupControlBarrier()
262      ...
263      br label %end
264
265  end:
266  }
267
268  void example_jumpthreaded() {
269  entry:
270      ...
271      br i1 %cond1, label %then1, label %then2
272
273  then1:
274      ...
275      %cond2 = ...
276      br i1 %cond2, label %then2, label %end
277
278  then2:
279      ...
280      call void @subgroupControlBarrier()
281      ...
282      br label %end
283
284  end:
285  }
286
287Is the control barrier guaranteed to synchronize among the same set of threads
288in both cases? Different implementations in the literature may give different
289answers to this question:
290
291* In an implementation that reconverges at post-dominators, threads reconverge
292  at ``mid`` in the first version, so that all threads (within a subgroup/wave)
293  that execute the control barrier do so together. In the second version,
294  threads that reach the control barrier via different paths synchronize
295  separately: the first (and only) post-dominator is ``end``, so threads do not
296  reconverge before then.
297
298* An implementation that sorts basic blocks topologically and ensures maximal
299  reconvergence for each basic block would behave the same way in both
300  versions.
301
302We generally take the stance that reconvergence in acyclic control flow must
303be maximal. The compiler frontend could augment the original code as follows:
304
305.. code-block:: llvm
306
307  define void @example_original() convergent {
308  entry:
309    %entry = call token @llvm.experimental.convergence.entry()
310    ...
311    br i1 %cond1, label %then1, label %mid
312
313  then1:
314    ...
315    %cond2 = ...
316    br label %mid
317
318  mid:
319    %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
320    br i1 %flag, label %then2, label %end
321
322  then2:
323    ...
324    call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
325    ...
326    br label %end
327
328  end:
329  }
330
331If S is the set of threads that the entry intrinsic communicated with, then
332the ``@subgroupControlBarrier`` call communicates with the subset of S that
333actually reaches the call site. This set of threads doesn't change after
334jump-threading, so the answer to the question posed above remains the same.
335
336.. _opportunistic_convergence:
337
338Opportunistic convergent operations
339-----------------------------------
340
341Some programs have local regions of code that contain a sequence of convergent
342operations where the code does not care about the exact set of threads with
343which it is executed, but only that the set of threads is the same for all the
344operations within the sequence. (If a subset of the convergent operations in the
345sequence have additional, non-uniform control dependencies, then this is not
346possible. However, the code may still require that the sets of threads are
347logically consistent with the conditions of those control dependencies.) In this
348case, :ref:`llvm.experimental.convergence.anchor
349<llvm.experimental.convergence.anchor>` can be used to express the desired
350semantics.
351
352The following example function could be part of a hypothetical "append buffer"
353implementation, where threads conditionally write fixed-sized records
354contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
355returns the index into the buffer at which the calling thread should store its
356data.
357
358This could be achieved by using a simple atomic operation in every thread to
359bump an allocation counter.
360
361However, the following implementation can be more performant on some hardware,
362because it uses only a single atomic operation for an entire group of threads.
363To do this, it first determines the total size of the group, which will be the
364operand to the atomic operation, and then later broadcasts the result of the
365atomic operation to all threads of the group, so that each thread can compute
366its individual position in the buffer:
367
368.. code-block:: llvm
369
370  define i32 @reserveSpaceInBuffer() {    ; NOTE: _not_ a convergent function!
371  entry:
372    %anchor = call token @llvm.experimental.convergence.anchor()
373
374    %ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
375    %numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
376    %numThreads = trunc i64 %numThreads.p to i32
377
378    %absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
379    %absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
380    %mask.p = shl i64 1, %absoluteThreadIdx.ext
381    %mask = sub i64 %mask.p, 1
382
383    %maskedBallot = and i64 %ballot, %mask
384    %relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
385    %relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32
386
387    %isFirstThread = icmp eq i32 %relativeThreadIdx, 0
388    br i1 %isFirstThread, label %then, label %end
389
390  then:
391    %baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic
392    br label %end
393
394  end:
395    %baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
396    %baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
397    %offset = add i32 %baseOffset, %relativeThreadIdx
398    ret i32 %offset
399  }
400
401The key here is that the function really doesn't care which set of threads it
402is being called with. It takes whatever set of threads it can get. What the
403implementation of the function cares about is that the initial
404``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
405executed the anchor together -- executes with the same set of threads as the
406final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
407far as convergence is concerned.
408
409The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
410are free to move call sites of the function as they see fit. This can change
411the behavior in practice, by changing the sets of threads that are grouped
412together for the atomic operation. This can be visible in the output of the
413program, since the order in which outputs appear in the buffer is changed.
414However, this does not break the overall contract that ``@reserveSpaceInBuffer``
415has with its caller -- which makes sense: the order of outputs is
416non-deterministic anyway because of the atomic operation that is involved.
417
418If the function is inlined, the use of the anchor intrinsic similarly indicates
419that certain transforms which are usually forbidden by the presence of
420convergent operations are in fact allowed, as long as they don't break up the
421region of code that is controlled by the anchor.
422
423.. _convergence_high-level_break:
424
425Extended Cycles: Divergent Exit from a Loop
426-------------------------------------------
427
428High-level languages typically provide a ``break`` statement that transfers
429control out of a loop statement. In most cases, the loop is structured and hence
430there is no ambiguity about convergence inside the loop. But an ambiguity arises
431when a ``break`` is control dependent on a divergent condition inside the loop.
432Consider the following example:
433
434.. code-block:: c++
435
436  void example() {
437    // A
438    ...
439    for (...) {
440      // B
441      if (condition) { // divergent condition
442        // C
443        convergent_op();
444        break;
445      }
446      // D
447      ...
448    }
449    // E
450  }
451
452In this program, the call to convergent_op() is lexically "inside" the ``for``
453loop. But when translated to LLVM IR, the basic block B is an exiting block
454ending in a divergent branch, and the basic block C is an exit of the loop.
455Thus, the call to convergent_op() is outside the loop. This causes a mismatch
456between the programmer's expectation and the compiled program. The call should
457be executed convergently on every iteration of the loop, by threads that
458together take the branch to exit the loop. But when compiled, all threads that
459take the divergent exit on different iterations first converge at the beginning
460of basic block C and then together execute the call to convergent_op().
461
462In this case, :ref:`llvm.experimental.convergence.loop
463<llvm.experimental.convergence.loop>` can be used to express the desired
464semantics. A call to this intrinsic is placed in the loop header, which tracks
465each iteration of the loop. The token produced by this is used as a
466``convergencectrl`` operand to the convergent call. The semantics of the
467``loop`` intrinsic ensures that the convergent call is performed convergently
468only by those threads that convergently exited the loop in a given iteration.
469
470.. code-block:: llvm
471
472  define void @example() convergent {
473    %entry = call token @llvm.experimental.convergence.entry()
474    br label %for
475
476  for:
477    %inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)]
478    %for.cond = i1 ...
479    br i1 %for.cond, label %B, label %E
480
481  B:
482    ...
483    %condition = i1 ...
484    br i1 %condition, label %C, label %D
485
486  C:
487    call void @convergent_op() ["convergencectrl"(token %inner)]
488    br label %E
489
490  D:
491    ...
492    br label %for
493
494  E:
495    ...
496    ret void
497  }
498
499The LLVM IR version of the same program shows a cycle consisting of the basic
500blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the
501divergent branch at the end of the exiting block ``%B``. But the use of
502convergence control tokens makes it clear that block ``%C`` must be executed
503convergently only by those threads that convergently take the exit edge from %B
504to ``%C``. In other words, the convergent execution of ``%C`` is governed by the
505call to the :ref:`llvm.experimental.convergence.loop
506<llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is
507effectively extended to include all uses of this token that lie outside the
508cycle.
509
510.. _dynamic_instances_and_convergence_tokens:
511
512Dynamic Instances and Convergence Tokens
513========================================
514
515Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance
516<convergence-dynamic-instances>` of the instruction. Dynamic instances are the
517formal objects by which we talk about communicating threads in convergent
518operations. Dynamic instances are defined for *all* operations in an LLVM
519program, whether convergent or not. Convergence control is primarily about the
520dynamic instances of convergent operations since they affect execution of the
521program through inter-thread communication. The dynamic instances for
522non-convergent operations are relevant for determining :ref:`uniformity
523<convergence-and-uniformity>` of values.
524
525Dynamic instances produced by the execution of the same *convergent operation*
526by different threads may be :ref:`converged <convergence-definition>`. When
527executing a convergent operation, the set of threads that execute converged
528dynamic instances is the set of threads that communicate with each other.
529*Convergence tokens* capture this convergence as described below.
530
531*Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
532``phi`` or ``select`` instructions. A convergence token value represents the
533dynamic instance of the instruction that produced it.
534
535Convergent operations may have an optional ``convergencectrl`` operand bundle with
536a convergence token operand to define the set of communicating threads relative
537to the operation that defined the token.
538
539   Let ``U`` be a convergent operation other than a call to a convergence
540   control intrinsic, and ``D`` be the convergent operation that defines
541   the token value used as the ``convergencectrl`` operand to ``U``. Two
542   threads execute converged dynamic instances of ``U`` if and only if the
543   token value in both threads was returned by converged dynamic
544   instances of ``D``.
545
546.. note::
547
548   The text defines convergence token values as representing dynamic instances.
549   But if we were to assume that converged dynamic instances produce the same
550   token value, then we could almost think of the token value as representing a
551   set of threads instead -- specifically, the set ``S`` of threads that
552   executed converged dynamic instances of the defining instruction ``D``.
553
554   In this intuitive picture, when a convergence token value ``T`` is used by a
555   ``convergencectrl`` bundle on an instruction ``I``, then the set of threads that
556   communicates in ``I`` is a subset of the set ``S`` represented by the token value.
557   Specifically, it is the subset of threads that ends up executing ``I`` while
558   using the token value.
559
560   This by itself wouldn't quite work as a definition: what if ``I`` is executed
561   multiple times by the same threads? Which execution of ``I`` in thread 1
562   communicates with which execution of ``I`` in thread 2? Leaning on the notion
563   of dynamic instances gives a robust answer to this question as long as ``D``
564   and ``I`` are at the same loop (or cycle) nesting level.
565
566   The case where ``D`` and ``I`` are at different loop nesting levels is
567   forbidden by the :ref:`static rules <convergence_static_rules>` -- handling
568   that case is the purpose of :ref:`llvm.experimental.convergence.loop
569   <llvm.experimental.convergence.loop>`.
570
571.. _convergence_control_intrinsics:
572
573Convergence Control Intrinsics
574==============================
575
576This section describes target-independent intrinsics that can be used to
577produce convergence tokens.
578
579Behaviour is undefined if a convergence control intrinsic is called
580indirectly.
581
582.. _llvm.experimental.convergence.entry:
583
584``llvm.experimental.convergence.entry``
585----------------------------------------
586
587.. code-block:: llvm
588
589  token @llvm.experimental.convergence.entry() convergent readnone
590
591This intrinsic is used to tie the dynamic instances inside of a function to
592those in the caller.
593
5941. If the function is called from outside the scope of LLVM, the convergence of
595   dynamic instances of this intrinsic are environment-defined. For example:
596
597   a. In an OpenCL *kernel launch*, the maximal set of threads that
598      can communicate outside the memory model is a *workgroup*.
599      Hence, a suitable choice is to specify that all the threads from
600      a single workgroup in OpenCL execute converged dynamic instances
601      of this intrinsic.
602   b. In a C/C++ program, threads are launched independently and they can
603      communicate only through the memory model. Hence the dynamic instances of
604      this intrinsic in a C/C++ program are never converged.
6052. If the function is called from a call-site in LLVM IR, then two
606   threads execute converged dynamic instances of this intrinsic if and
607   only if both threads entered the function by executing converged
608   dynamic instances of the call-site.
609
610This intrinsic can occur at most once in a function, and only in the entry
611block of the function. If this intrinsic occurs in a basic block, then it must
612precede any other convergent operation in the same basic block.
613
614It is an error if this intrinsic appears in a non-convergent function.
615
616It is an error to specify a ``convergencectrl`` operand bundle at a
617call to this intrinsic.
618
619Function inlining substitutes this intrinsic with the token from the operand
620bundle. For example:
621
622.. code-block:: c++
623
624  // Before inlining:
625
626  void callee() convergent {
627    %tok = call token @llvm.experimental.convergence.entry()
628    convergent_operation(...) [ "convergencectrl"(token %tok) ]
629  }
630
631  void main() {
632    %outer = call token @llvm.experimental.convergence.anchor()
633    for (...) {
634      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
635      callee() [ "convergencectrl"(token %inner) ]
636    }
637  }
638
639  // After inlining:
640
641  void main() {
642    %outer = call token @llvm.experimental.convergence.anchor()
643    for (...) {
644      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
645      convergent_operation(...) [ "convergencectrl"(token %inner) ]
646    }
647  }
648
649.. _llvm.experimental.convergence.loop:
650
651``llvm.experimental.convergence.loop``
652--------------------------------------
653
654.. code-block:: llvm
655
656  token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone
657
658This intrinsic represents the place where an imaginary counter is incremented
659for determining convergence inside a control flow cycle.
660
661Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that
662defines the token value used as the ``convergencectrl`` operand to ``U``. Two
663threads execute converged dynamic instances of ``U`` if and only if:
664
6651. The token value in both threads was returned by converged dynamic
666   instances of ``D``, and,
6672. There is an integer *n* such that both threads execute ``U`` for the *n*'th time
668   with that token value.
669
670It is an error to omit the ``convergencectrl`` operand bundle on a
671call to this intrinsic.
672
673If this intrinsic occurs in a basic block, then it must precede any other
674convergent operation in the same basic block.
675
676.. _convergence_cycle_heart:
677
678**Heart of a Cycle:**
679
680  If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of
681  this intrinsic whose token operand is defined outside ``C``, then ``H`` is
682  called the heart of ``C``.
683
684  .. note::
685
686     The static rules for cycles imply that a heart can occur only in the header
687     of a natural loop. This ensures that the heart closely represents the
688     intuitive notion of a loop iteration. If this restriction is relaxed, the
689     resulting semantics provides a new notion of "cycle iteration" even for
690     irreducible cycles. But this allows a natural loop to have a heart in a
691     node other than its header, which has interesting consequences on the
692     meaning of a loop iteration in terms of convergence. For now, we disallow
693     this situation since its practical application is very rare.
694
695.. _llvm.experimental.convergence.anchor:
696
697``llvm.experimental.convergence.anchor``
698----------------------------------------
699
700.. code-block:: llvm
701
702  token @llvm.experimental.convergence.anchor() convergent readnone
703
704This intrinsic produces an initial convergence token that is independent from
705any "outer scope". The set of threads executing converged dynamic instances of
706this intrinsic is implementation-defined.
707
708It is an error to pass a ``convergencectrl`` operand bundle at a
709call to this intrinsic.
710
711.. note::
712
713   The expectation is that all threads within a group that "happen to be active
714   at the same time" will execute converged dynamic instances, so that programs
715   can detect the maximal set of threads that can communicate efficiently within
716   some local region of the program.
717
718.. _convergence_uncontrolled:
719
720Uncontrolled Convergent Operations
721==================================
722
723Convergent operations with an explicit ``convergencectrl`` operand bundle are
724called *controlled convergent operations*. All other convergent operations are
725said to be *uncontrolled*.
726
727An uncontrolled convergent operation is said to have *implicit convergence
728control* determined by the ``convergent`` attribute alone. The semantics of the
729``convergent`` attribute as implemented in LLVM differs from the documented
730semantics. The implementation tries to follow common intuition about convergent
731operations, which remains under-specified. As such, it is not possible to fully
732translate implicit convergence control into explicit convergence control tokens,
733and these two modes cannot be mixed in the same function.
734
735  If a function contains a controlled convergent operation, then all convergent
736  operations in that function must either be controlled operations or calls to
737  the convergence control intrinsics.
738
739Inferring Tokens
740----------------
741
742(This section is informational)
743
744Sometimes, it may be necessary to reinterpret the implicit convergence control
745in terms of explicit convergence control tokens. For example, this may happen
746when a function call is inlined, and either the caller or the callee contains
747uncontrolled convergent operations.
748
749Some uses of uncontrolled convergent operations may need to satisfy the
750following property:
751
752  For an environment-defined group of threads (such as an OpenCL workgroup or
753  subgroup), if one thread in the group executes a convergent operation, then
754  all threads in the group do so convergently with that thread.
755
756In terms of explicit convergence control, this means that the
757``convergencectrl`` operand on each convergent operation ``X`` must ultimately
758originate from a call to the :ref:`llvm.experimental.convergence.entry
759<llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility
760that the group of threads that converge on reaching ``X`` is the same group that
761originally started executing the program in convergence. In comparison, the
762:ref:`llvm.experimental.convergence.anchor
763<llvm.experimental.convergence.anchor>` intrinsic captures an
764implementation-defined group of threads, which is insufficient to support the
765above property.
766
767One way to approximate implicit convergence control in terms of explicit
768convergence control tokens is the following procedure, which preserves the above
769mentioned property:
770
7711. Convert every irreducible cycle into a reducible cycle.
7722. Insert a call to :ref:`llvm.experimental.convergence.entry
773   <llvm.experimental.convergence.entry>` at the start of the entry block of the
774   function.
7753. Insert a call to :ref:`llvm.experimental.convergence.loop
776   <llvm.experimental.convergence.loop>` at the start of every loop header. If
777   this loop is an outermost loop, the ``convergencectrl`` operand is the call
778   to :ref:`llvm.experimental.convergence.entry
779   <llvm.experimental.convergence.entry>` in the entry block of the function.
780   Otherwise, the ``convergencectrl`` operand is the call to
781   :ref:`llvm.experimental.convergence.loop
782   <llvm.experimental.convergence.loop>` in the parent loop's header.
7834. For each uncontrolled convergent operation ``X``, add a ``convergencectrl``
784   operand bundle using the token defined by a definition ``D`` that is a
785   :ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates
786   ``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to
787   :ref:`llvm.experimental.convergence.entry
788   <llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the
789   parent cycle of ``X``.
790
791.. _convergence_static_rules:
792
793Static Rules
794============
795
796A *well-formed* program in LLVM IR must satisfy the following static
797rules about cycles and convergence regions.
798
799Closed Paths
800------------
801
802A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of
803nodes and edges in the CFG whose start and end points are the same.
804
8051. Every closed path in the CFG that contains a use of a convergence token T other
806   than a use by
807   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
808   must also contain the definition of T.
809
8102. Every closed path in the CFG that contains two different uses of a convergence
811   token T must also contain the definition of T.
812
8133. Every closed path in the CFG that contains uses of two different convergence tokens
814   T1 and T2 must also contain the definition of at least one of them.
815
816Taken together, these rules imply that for every closed path C, there can be at most
817one convergence token T which is used in C but defined outside of it, and that
818T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.
819
8204. In every closed path that contains a use U of a token T but not the
821   definition of T, U must dominate all nodes in the closed path.
822
823This implies that ``llvm.experimental.convergence.loop`` can appear as a heart
824only in the header of a natural loop.
825
826**Sufficient Conditions:** From the :ref:`properties of cycles
827<cycle-closed-path>`, it is sufficient to prove the above properties
828for cycles instead of closed paths. Briefly, any closed path that violates
829one or more of the above static rules is contained in a cycle that also
830violates the same rule(s).
831
832.. _convergence_region:
833
834Convergence Regions
835-------------------
836
837The *convergence region* of a convergence token T is the minimal region in
838which T is live and used, i.e., the set of program points dominated by the
839definition D of T from which a use of T can be reached.
840
841The following static rule about convergence regions must be satisfied by
842valid programs:
843
844   If a convergence region R for a token T1 contains a use of a convergence
845   token T2, then R must also contain the definition of T2. (In other words,
846   convergence regions must be reasonably nested.)
847
848.. note::
849
850   For brevity, this document uses the term "convergence region of a token
851   definition ``D``" to actually refer to the convergence region of the token
852   ``T`` defined by ``D``.
853
854.. _inferring_noconvergent:
855
856Inferring non-convergence
857=========================
858
859When the target or the environment guarantees that threads do not
860communicate using convergent operations or that threads never diverge,
861the dynamic instances in the program are irrelevant and an optimizer
862may remove any occurrence of the ``convergent`` attribute on a
863call-site or a function and any explicit ``convergencectrl`` operand
864bundle at a call-site.
865
866An optimizer may remove the ``convergent`` attribute and any explicit
867``convergencectrl`` operand bundle from a call-site if it can prove
868that the execution of this call-site always results in a call to a
869non-convergent function.
870
871An optimizer may remove the ``convergent`` attribute on a function if it can
872prove that the function does not contain a call to
873:ref:`llvm.experimental.convergence.entry
874<llvm.experimental.convergence.entry>`, or any uncontrolled convergent
875operations.
876
877Memory Model Non-Interaction
878============================
879
880The fact that an operation is convergent has no effect on how it is treated for
881memory model purposes. In particular, an operation that is ``convergent`` and
882``readnone`` does not introduce additional ordering constraints as far as the
883memory model is concerned. There is no implied barrier, neither in the memory
884barrier sense nor in the control barrier sense of synchronizing the execution
885of threads.
886
887Informational note: Threads that execute converged dynamic instances do not
888necessarily do so at the same time.
889
890
891Other Interactions
892==================
893
894A function can be both ``convergent`` and
895``speculatable``, indicating that the function does not have undefined
896behavior and has no effects besides calculating its result, but is still
897affected by the set of threads executing this function. This typically
898prevents speculation of calls to the function unless the constraint imposed
899by ``convergent`` is further relaxed by some other means.
900
901Controlled Maximal Convergence
902==============================
903
904The :ref:`converged-with relation <convergence-definition>` over dynamic
905instances of each controlled convergent operation is completely defined by the
906semantics of convergence tokens. But the implementation-defined convergence at a
907call to :ref:`llvm.experimental.convergence.anchor
908<llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy
909chosen if it occurs inside an irreducible cycle.
910
911When the token defined by a convergent operation ``D`` is used at another
912convergent operation ``U``, the implementation must ensure that the threads that
913converge at ``U`` are all the threads that reached ``U`` after converging at
914``D``. On most implementations, it is reasonable to assume that only these
915threads are converged at every node they reach on any path from ``D`` to ``U``.
916In other words, the converged-with relation at ``D`` produces groups of threads
917that can converge only within each group, while inside the convergence region of
918``D``.
919
920All this affects the :ref:`maximal converged-with relation
921<convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged
922property <uniformity-analysis>` of static instances in the convergence region of
923``D``.
924
925.. _controlled_maximal_converged_with:
926
927  **Controlled Maximal converged-with Relation**
928
929  1. Dynamic instances of a *convergent operation* are related in the controlled
930     maximal converged-with relation according to the semantics of the convergence
931     control tokens.
932  2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the
933     same *non-convergent operation* ``X`` are related in the controlled maximal
934     converged-with relation if and only if:
935
936     1. Both threads executed converged dynamic instances of every token
937        definition ``D`` such that ``X`` is in the convergence region of ``D``,
938        and,
939     2. Either ``X`` is not contained in any cycle, or, for every cycle ``C``
940        with header ``H`` that contains ``X``:
941
942        - every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the
943          respective thread is convergence-before ``X2``, and,
944        - every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the
945          respective thread is convergence-before ``X1``,
946        - without assuming that ``X1`` is converged with ``X2``.
947
948.. _controlled_m_converged:
949
950  **Controlled m-converged Static Instances**
951
952  A node ``X`` in a given CFG is reported to be m-converged if and only if:
953
954  1. For any token definition ``D`` such that ``X`` is inside the convergence region
955     of ``D``, ``D`` itself is m-converged, and,
956  2. Every cycle that contains ``X`` satisfies the following necessary
957     conditions:
958
959     a. Every divergent branch inside the cycle satisfies the :ref:`diverged
960        entry criterion<convergence-diverged-entry>`, and,
961     b. There are no :ref:`diverged paths reaching the
962        cycle<convergence-diverged-outside>` from a divergent branch outside it.
963
964Temporal Divergence at Cycle Exit
965---------------------------------
966
967When a cycle has a divergent exit, maximal convergence assumes that all threads
968converge at the exit block. But if a controlled convergent operation outside the
969cycle uses a token defined by an operation ``D`` inside the cycle, the
970convergence region of ``D`` now extends outside the cycle. If two threads
971executed converged dynamic instances of ``D`` before exiting the cycle, then
972they continue to execute converged dynamic instances of nodes in the convergence
973region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the
974cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the
975output of converged dynamic instances of ``V``. If ``V`` is uniform, then its
976use at such a ``U`` is also uniform. In other words, temporal divergence applies
977only to a use of ``V`` that is outside the convergence region of ``D``.
978
979Rationales for Static rules about cycles
980========================================
981
982(This section is informative.)
983
984.. note::
985
986   For convenience, we use the operator ``==`` to represent the relation
987   ``converged-with`` and the operator ``!=`` to represent its negation.
988
989Consider a loop with (incorrect!) convergence control as in the following
990pseudocode:
991
992.. code-block:: llvm
993
994  ; WARNING: Example of incorrect convergence control!
995
996  %anchor = call token @llvm.experimental.convergence.anchor()
997  for (;;) {
998    ...
999    call void @convergent.op() [ "convergencectrl"(token %anchor) ]
1000    ...
1001  }
1002
1003This code is forbidden by the first static rule about cycles.
1004
1005A first formal argument why we have to do this is that the dynamic rule for
1006deciding whether two threads execute converged dynamic instances of
1007``@convergent.op`` leads to a logical contradiction in this code.
1008Assume two threads execute converged dynamic instances of the anchor
1009followed by two iterations of the loop. Thread 1 executes dynamic instances
1010I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
1011Using all the rules, we can deduce:
1012
10131. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.
1014
10152. ``I1 == J1`` by the first dynamic rule about controlled convergent
1016   operations: both threads execute the same static instruction while using
1017   a convergence token value produced by converged dynamic instances of an
1018   instruction (the anchor).
1019
10203. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.
1021
1022   The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2``
1023   as being executed in different loop iterations is completely irrelevant for
1024   the *formal* argument. There is no mechanism in LLVM IR semantics for
1025   forming associations between loop iterations in different threads, *except*
1026   for the rules defined in this document -- and the rules in this document
1027   require a loop heart intrinsic for talking about loop iterations.
1028
10294. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
1030   contradiction.
1031
1032This problem goes away by inserting a loop heart intrinsic as follows, which
1033establishes a relationship between loop iterations across threads.
1034
1035.. code-block:: llvm
1036
1037  %anchor = call token @llvm.experimental.convergence.anchor()
1038  for (;;) {
1039    %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1040    ...
1041    call void @convergent.op() [ "convergencectrl"(token %loop) ]
1042    ...
1043  }
1044
1045In the same scenario of two threads executing converged dynamic instances of the
1046anchor and then two iterations of the loop, the dynamic rule about loop heart
1047intrinsics implies that both threads execute the converged dynamic instances of
1048the loop heart intrinsic in their respective first iterations and then again in
1049their respective second iterations of the loop.
1050
1051This then implies that they execute converged dynamic instances ``I1 == J1`` of
1052the ``@convergent.op`` in their first iterations and then
1053``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
1054so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
1055see token values of ``%loop`` originating from non-converged dynamic
1056instances of the loop intrinsic.
1057
1058One may ask whether we could change the dynamic rule instead of adding the
1059static rule about cycles. That is impractical due to deeper difficulties.
1060Consider the following loop, again with incorrect convergence control:
1061
1062.. code-block:: llvm
1063
1064  ; WARNING: Example of incorrect convergence control!
1065
1066  ; (A)
1067  %anchor = call token @llvm.experimental.convergence.anchor()
1068  for (;;) {
1069    ; (B)
1070    if (condition1) {
1071      ; (C)
1072      call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
1073    }
1074    ; (D)
1075    if (condition2) {
1076      ; (E)
1077      call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
1078    }
1079    ; (F)
1080  }
1081  ; (G)
1082
1083Assume two threads execute converged dynamic instances of the anchor followed
1084by this sequence of basic blocks:
1085
1086.. code-block:: text
1087
1088  Thread 1: A B C D F B D E F G
1089  Thread 2: A B D E F B C D F G
1090
1091That is, both threads execute two iterations of the loop, but they execute
1092the different convergent operations in different iterations. Without forming a
1093relation between loop iterations across the threads, there is no reasonable way
1094of defining which dynamic instances of the convergent operations should be the
1095same across the threads, if any.
1096
1097Again, this can be addressed by adding a loop heart intrinsic, most naturally
1098as:
1099
1100.. code-block:: llvm
1101
1102  ; (A)
1103  %anchor = call token @llvm.experimental.convergence.anchor()
1104  for (;;) {
1105    ; (B)
1106    %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1107    if (condition1) {
1108      ; (C)
1109      call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1110    }
1111    ; (D)
1112    if (condition2) {
1113      ; (E)
1114      call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
1115    }
1116    ; (F)
1117  }
1118  ; (G)
1119
1120Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop
1121heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)``
1122the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``.
1123Then we have:
1124
11251. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
1126   about loop heart intrinsics.
1127
11282. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
1129   different executions by the same thread happen in different dynamic
1130   instances.
1131
11323. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop``
1133   referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that
1134   referring to ``%loop(2;2) == %loop(1;2)``, which is different from
1135   ``%loop(1;1)``.
1136
11374. Similarly, ``@op.2(1) != @op.2(2)``.
1138
1139However, loop heart intrinsics could be inserted differently, at the cost
1140of also inserting a free-standing anchor:
1141
1142.. code-block:: llvm
1143
1144  ; (A)
1145  %anchor = call token @llvm.experimental.convergence.anchor()
1146  for (;;) {
1147    ; (B)
1148    if (condition1) {
1149      ; (C)
1150      %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1151      call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1152    }
1153    ; (D)
1154    if (condition2) {
1155      ; (E)
1156      %free = call token @llvm.experimental.convergence.anchor()
1157      call void @convergent.op.2() [ "convergencectrl"(token %free) ]
1158    }
1159    ; (F)
1160  }
1161  ; (G)
1162
1163This leads to the "unnatural counting of loop iterations" that is also mentioned
1164elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
1165loop heart intrinsic by thread ``i`` (each thread executes it only once), and
1166let ``@op.k(i)`` be as before. Then:
1167
11681. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
1169   intrinsics.
1170
11712. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
1172   referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``.
1173
11743. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
1175   use of the ``%free`` anchor intrinsic.
1176
1177   In practice, they almost certainly have to be non-converged dynamic
1178   instances. Consider that if an implementation strictly follows the order of
1179   instructions given in the program, the executions of the threads can be
1180   "aligned" as follows:
1181
1182   .. code-block:: text
1183
1184     Thread 1: A B         C D F B D E F G
1185     Thread 2: A B D E F B C D F         G
1186
1187   So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
1188   can be no communication between the threads, which means they execute
1189   non-converged dynamic instances.
1190
1191   That said, it is conceivable that there aren't actually any data or other
1192   dependencies that would enforce this execution order. In that case, a highly
1193   out-of-order implementation could potentially allow communication. That's
1194   why the rules defined in this document are silent about whether
1195   ``@op.2(1) == @op.2(2)`` or not.
1196
1197This type of convergence control seems relatively unlikely to appear in real
1198programs. Its possibility is simply a logical consequence of the model.
1199
1200An equivalent issue arises if the convergent operations are replaced by nested
1201loops with loop heart intrinsics that directly refer to ``%anchor``, hence
1202the variants of the static rules about cycles that apply to them:
1203
1204.. code-block:: llvm
1205
1206  ; WARNING: Example of incorrect convergence control!
1207
1208  %anchor = call token @llvm.experimental.convergence.anchor()
1209  for (;;) {
1210    if (condition1) {
1211      for (;;) {
1212        %loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1213      }
1214    }
1215    if (condition2) {
1216      for (;;) {
1217        %loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1218      }
1219    }
1220  }
1221
1222There is a cycle (closed walk in the CFG) that goes through both loop heart
1223intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
1224so this code is invalid.
1225
1226
1227Examples for the Correctness of Program Transforms
1228==================================================
1229
1230(This section is informative.)
1231
1232As implied by the rules in the previous sections, program transforms are correct
1233with respect to convergent operations if they preserve or refine their
1234semantics. This means that the set of communicating threads in the transformed
1235program must have been possible in the original program.
1236
1237Program transforms with a single-threaded focus are generally conservatively
1238correct if they do not sink or hoist convergent operations across a branch.
1239This applies even to program transforms that change the control flow graph.
1240
1241For example, unrolling a loop that does not contain convergent operations
1242cannot break any of the guarantees required for convergent operations outside
1243of the loop.
1244
1245
1246Loop unrolling examples
1247-----------------------
1248
1249We consider three kinds of loop unrolling here:
1250
1251* Partial unrolling with no known trip multiple, so a "tail" is required to
1252  collect the remaining elements.
1253* Partial unrolling by a trip multiple, so no "tail" is required.
1254* Full unrolling, which eliminates the loop.
1255
1256The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
1257used. We illustrate the reasoning with some examples.
1258
1259First, an arbitrary loop that contains convergent operations *can* be unrolled
1260in all of these ways, even with "tail", if all convergent operations refer back
1261to an anchor inside the loop. For example (in pseudo-code):
1262
1263.. code-block:: llvm
1264
1265  while (counter > 0) {
1266    %tok = call token @llvm.experimental.convergence.anchor()
1267    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1268    counter--;
1269  }
1270
1271This can be unrolled to:
1272
1273.. code-block:: llvm
1274
1275  while (counter >= 2) {
1276    %tok = call token @llvm.experimental.convergence.anchor()
1277    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1278    %tok = call token @llvm.experimental.convergence.anchor()
1279    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1280    counter -= 2;
1281  }
1282  while (counter > 0) {
1283    %tok = call token @llvm.experimental.convergence.anchor()
1284    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1285    counter--;
1286  }
1287
1288This is likely to change the behavior of the convergent operation if there
1289are threads whose initial counter value is not a multiple of 2. In particular,
1290all threads with an odd trip count are now likely to execute the convergent
1291operation in their respective final iterations together because the
1292underlying implementation is likely to try to group as many threads together
1293as possible for the execution of the "tail".
1294
1295This change is allowed because the anchor intrinsic has implementation-defined
1296convergence behavior and the loop unrolling transform is considered to be part
1297of the implementation. Another way of reasoning is that while the *likely*
1298behavior of the code has changed, the *guarantees* about its behavior have
1299remained the same.
1300
1301If the loop contains uncontrolled convergent operations, this kind of unrolling
1302is forbidden.
1303
1304Unrolling a loop with convergent operations that refer to tokens produced
1305outside the loop is forbidden when a "tail" or "remainder" would have to
1306be introduced. Consider:
1307
1308.. code-block:: llvm
1309
1310  ; (A)
1311  %outer = call token @llvm.experimental.convergence.anchor()
1312  while (counter > 0) {
1313    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1314    ; (B)
1315    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1316    counter--;
1317  }
1318  ; (C)
1319
1320To understand why unrolling is forbidden, consider two threads that execute
1321converged dynamic instances of the anchor and then proceed with 3 and 4 loop
1322iterations, respectively:
1323
1324.. code-block:: text
1325
1326  Thread 1: A B B B C
1327  Thread 2: A B B B B C
1328
1329By the dynamic rule on loop heart intrinsics, these threads execute converged
1330dynamic instances of the loop intrinsic for the first 3 iterations, and then
1331thread 2 executes another dynamic instance by itself.
1332
1333By the dynamic rule on general convergent operations, the threads execute
1334converged dynamic instances of the ``@convergent.operation`` in the first 3
1335iterations (that is, the dynamic instance executed by thread 1 in iteration
1336*n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*;
1337the dynamic instance executed in iteration 1 is different from that in
1338iteration 2, etc.).
1339
1340Now assume that the loop is unrolled by a factor of 2, which requires a
1341remainder as follows:
1342
1343.. code-block:: llvm
1344
1345  ; (A)
1346  %outer = call token @llvm.experimental.convergence.anchor()
1347  while (counter >= 2) {
1348    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1349    ; (B)
1350    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1351    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1352    counter -= 2;
1353  }
1354  ; (C)
1355  if (counter > 0) {
1356    %remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1357    ; (D)
1358    call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
1359  }
1360  ; (E)
1361
1362First of all, note some interesting problems surrounding the loop intrinsic:
1363
13641. It is *not* duplicated inside the unrolled loop. This is to comply with
1365   the :ref:`convergence_static_rules`.
1366
13672. It is unclear whether the loop intrinsic ought to be duplicated in the
1368   remainder, or whether the final ``@convergent.operation`` in D should just
1369   refer to either ``%inner`` (which is possible in SSA form) or directly to
1370   ``%outer``. The decision made here is arbitrary and doesn't change the
1371   argument that follows. Ultimately, it simply doesn't matter because the
1372   transform is incorrect either way.
1373
1374The threads now execute the following sequences of blocks:
1375
1376.. code-block:: text
1377
1378  Thread 1: A B C D E
1379  Thread 2: A B B C D E
1380
1381Analogous to the argument above, they execute converged dynamic instances of the
1382``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
1383of the unrolled loop, which corresponds to the first 2 iterations of the
1384original loop.
1385
1386However, they execute different static calls to ``@convergent.operation`` for
1387the 3rd iteration of the original loop. In thread 1, that iteration corresponds
1388to the call in the remainder, while in thread 2 it corresponds to the first
1389call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
1390non-converged dynamic instances, which means that the set of communicating threads
1391for the 3rd iteration of the original loop is different. This is why the
1392unrolling is incorrect.
1393
1394On the other hand, unrolling without "tail" is allowed. For example, assuming
1395that the trip counter is known to be a multiple of 2, we can unroll the loop
1396as follows:
1397
1398.. code-block:: llvm
1399
1400  %outer = call token @llvm.experimental.convergence.anchor()
1401  while (counter > 0) {
1402    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1403    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1404    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1405    counter -= 2;
1406  }
1407
1408Note again that the loop intrinsic is not duplicated.
1409
1410The
1411:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
1412intrinsic is typically expected to appear in the header of a natural loop.
1413However, it can also appear in non-header blocks of a loop. In that case, the
1414loop can generally not be unrolled.
1415
1416
1417Hoisting and sinking
1418--------------------
1419
1420In general, hoisting and sinking of convergent operations is forbidden. This is
1421because moving the operation to a different point in control flow generally
1422changes the set of threads that reach the operation and therefore, the set of
1423threads that execute converged dynamic instances of the operation. By
1424definition, this changes the set of threads that participate in the
1425communication of the convergent operation, which will typically change its
1426result.
1427
1428There are a number of exceptions, though most of them require additional
1429knowledge.
1430
1431For example, hoisting and sinking across *uniform* conditional branches -- i.e.,
1432conditional branches where within every possible relevant set of threads, all
1433threads will always take the same direction -- is generally allowed. See the end
1434of the :ref:`example of reductions inside control flow
1435<convergence_example_reductions>` for a brief discussion.
1436
1437Some convergent operations can be hoisted but not sunk, or vice versa. A simple
1438example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
1439operand of the thread identified by ``id``, where thread IDs are fixed and
1440assigned to each thread at launch. The result is undefined (or perhaps there is
1441UB, depending on the language and environment) if thread ``id`` is not in the
1442communicating set of threads. So hoisting is allowed in the following
1443pseudo-code example:
1444
1445.. code-block:: llvm
1446
1447  define void @example(...) convergent {
1448    %entry = call token @llvm.experimental.convergence.entry()
1449    %data = ...
1450    %id = ...
1451    if (condition) {
1452      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1453      ...
1454    } else {
1455      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1456      ...
1457    }
1458  }
1459
1460After hoisting the calls to ``@subgroupShuffle``, the communicating set of
1461threads is the union of the two sets of threads in the original program, so
1462``%id`` can only go "out of range" after hoisting if it did so in the original
1463program.
1464
1465However, speculative execution of ``@subgroupShuffle`` in the following program
1466may be forbidden:
1467
1468.. code-block:: llvm
1469
1470  define void @example(...) convergent {
1471    %entry = call token @llvm.experimental.convergence.entry()
1472    %data = ...
1473    %id = ...
1474    if (condition) {
1475      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1476      ...
1477    }
1478  }
1479
1480There is no guarantee about the value of ``%id`` in the threads where
1481``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
1482``%id`` is outside of the set of communicating threads, then speculating and
1483hoisting ``@subgroupShuffle`` might introduce UB.
1484
1485On the other hand, if ``@subgroupShuffle`` is defined such that it merely
1486produces an undefined value or poison as result when ``%id`` is "out of range",
1487then speculating is okay.
1488
1489Even though
1490:ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
1491is marked as ``convergent``, it can be sunk in some cases. For example, in
1492pseudo-code:
1493
1494.. code-block:: llvm
1495
1496  %tok = call token @llvm.experimental.convergence.anchor()
1497  if (condition) {
1498    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1499  }
1500
1501Assuming that ``%tok`` is only used inside the conditional block, the anchor can
1502be sunk. The rationale is two-fold. First, the anchor has implementation-defined
1503behavior, and the sinking is part of the implementation. Second, already in the
1504original program, the set of threads that communicates in the
1505``@convergent.operation`` is automatically subset to the threads for which
1506``condition`` is true.
1507
1508Anchors can be hoisted in acyclic control flow. For example:
1509
1510.. code-block:: llvm
1511
1512  if (condition) {
1513    %tok1 = call token @llvm.experimental.convergence.anchor()
1514    call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
1515  } else {
1516    %tok2 = call token @llvm.experimental.convergence.anchor()
1517    call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
1518  }
1519
1520The anchors can be hoisted, resulting in:
1521
1522.. code-block:: llvm
1523
1524  %tok = call token @llvm.experimental.convergence.anchor()
1525  if (condition) {
1526    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1527  } else {
1528    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1529  }
1530
1531The behavior is unchanged, since each of the static convergent operations only
1532ever communicates with threads that have the same ``condition`` value.
1533By contrast, hoisting the convergent operations themselves is forbidden.
1534
1535Hoisting and sinking anchors out of and into loops is forbidden. For example:
1536
1537.. code-block:: llvm
1538
1539  for (;;) {
1540    %tok = call token @llvm.experimental.convergence.anchor()
1541    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1542  }
1543
1544Hoisting the anchor would make the program invalid according to the static
1545validity rules. Conversely:
1546
1547.. code-block:: llvm
1548
1549  %outer = call token @llvm.experimental.convergence.anchor()
1550  while (counter > 0) {
1551    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1552    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1553    counter--;
1554  }
1555
1556The program would stay valid if the anchor was sunk into the loop, but its
1557behavior could end up being different. If the anchor is inside the loop, then
1558each loop iteration has a new dynamic instance of the anchor, and the set of
1559threads participating in those dynamic instances of the anchor could be
1560different in arbitrary implementation-defined ways. Via the dynamic rules about
1561dynamic instances of convergent operations, this then implies that the set of
1562threads executing ``@convergent.operation`` could be different in each loop
1563iteration in arbitrary implementation-defined ways.
1564
1565Convergent operations can be sunk together with their anchor. Again in
1566pseudo-code:
1567
1568.. code-block:: llvm
1569
1570  %tok = call token @llvm.experimental.convergence.anchor()
1571  %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1572  %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1573  if (condition) {
1574    use(%a, %b)
1575  }
1576
1577Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
1578block, all can be sunk together:
1579
1580.. code-block:: llvm
1581
1582  if (condition) {
1583    %tok = call token @llvm.experimental.convergence.anchor()
1584    %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1585    %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1586    use(%a, %b)
1587  }
1588
1589The rationale is that the anchor intrinsic has implementation-defined behavior,
1590and the sinking transform is considered to be part of the implementation:
1591the sinking will restrict the set of communicating threads to those for which
1592``condition`` is true, but that could have happened in the original program
1593anyway for some arbitrary other reason.
1594
1595However, sinking *only* the convergent operation producing ``%b`` would be
1596incorrect. That would allow threads for which ``condition`` is false to
1597communicate at ``%a``, but not at ``%b``, which the original program doesn't
1598allow.
1599
1600Note that the entry intrinsic behaves differently. Sinking the convergent
1601operations is forbidden in the following snippet:
1602
1603.. code-block:: llvm
1604
1605  %tok = call token @llvm.experimental.convergence.entry()
1606  %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1607  %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1608  if (condition) {
1609    use(%a, %b)
1610  }
1611