==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

LLVM supports instructions which are well-defined in the presence of threads and
asynchronous signals.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The C++ ``<atomic>`` header and C ``<stdatomic.h>`` headers. These
  were originally added in C++11 and C11. The memory model has been
  subsequently adjusted to correct errors in the initial
  specification, so LLVM currently intends to implement the version
  specified by C++20. (See the `C++20 draft standard
  <https://isocpp.org/files/papers/N4860.pdf>`_ or the unofficial
  `latest C++ draft <https://eel.is/c++draft/>`_. A `C2x draft
  <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3047.pdf>`_ is
  also available, though the text has not yet been updated with the
  errata corrected by C++20.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in the
stated order.  A couple of examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for a
pair of volatile stores. On the other hand, a non-volatile non-atomic load can
be moved across a volatile load freely, but not an Acquire load.

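For illustration, here is a minimal IR sketch of the store example above (the
function names are hypothetical). In ``@dead_store``, the first store can be
erased because the second store to the same address immediately overwrites it;
in ``@volatile_stores``, both stores must be preserved:

.. code-block:: llvm

  define void @dead_store(ptr %p) {
    ; The first seq_cst store is dead: it is immediately overwritten.
    store atomic i32 0, ptr %p seq_cst, align 4
    store atomic i32 1, ptr %p seq_cst, align 4
    ret void
  }

  define void @volatile_stores(ptr %p) {
    ; Volatile accesses may not be removed, so both stores must remain.
    store volatile i32 0, ptr %p, align 4
    store volatile i32 1, ptr %p, align 4
    ret void
  }
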
This document is intended as a guide for anyone either writing a frontend for
LLVM or working on optimization passes for LLVM, explaining how to deal with
instructions with special semantics in the presence of concurrency. It is not
intended to be a precise specification of the semantics; the details can get
extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically covers the one optimizer restriction that applies in
concurrent environments; it gets a somewhat extended description because any
optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise.  Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. That thread would read ``undef`` instead of the value it was
expecting, which can lead to undefined behavior down the line. (This example is
particularly of interest because before the concurrency model was implemented,
LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

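As a minimal IR sketch of a legal load speculation (the function names are
hypothetical, and the pointer is marked ``dereferenceable`` so that hoisting
the load is safe for reasons unrelated to concurrency):

.. code-block:: llvm

  ; Before: the load only executes when %cond is true.
  define i32 @before(i1 %cond, ptr dereferenceable(4) %p) {
  entry:
    br i1 %cond, label %then, label %exit
  then:
    %v = load i32, ptr %p, align 4
    br label %exit
  exit:
    %r = phi i32 [ %v, %then ], [ 0, %entry ]
    ret i32 %r
  }

  ; After speculation: the load always executes. If it races, it merely
  ; returns undef; it does not introduce undefined behavior. The analogous
  ; transformation that introduces a store would not be legal.
  define i32 @after(i1 %cond, ptr dereferenceable(4) %p) {
  entry:
    %v = load i32, ptr %p, align 4
    %r = select i1 %cond, i32 %v, i32 0
    ret i32 %r
  }
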
Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store.

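For example, here is a small IR sketch of an ``atomicrmw`` increment and a
``cmpxchg`` (the function names and the ``seq_cst`` orderings are chosen only
for illustration):

.. code-block:: llvm

  define i32 @increment(ptr %p) {
    ; Atomically add 1 and return the value the location held beforehand.
    %old = atomicrmw add ptr %p, i32 1 seq_cst
    ret i32 %old
  }

  define i1 @compare_exchange(ptr %p, i32 %expected, i32 %new) {
    ; Store %new only if *%p still equals %expected; no other memory
    ; operation can intervene between the load and the conditional store.
    %res = cmpxchg ptr %p, i32 %expected, i32 %new seq_cst seq_cst
    %ok = extractvalue { i32, i1 } %res, 1
    ret i1 %ok
  }
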
A ``fence`` provides Acquire and/or Release ordering which is not part
of another operation; it is normally used along with Monotonic memory
operations.  A Monotonic load followed by an Acquire fence is roughly
equivalent to an Acquire load, and a Monotonic store following a
Release fence is roughly equivalent to a Release
store. SequentiallyConsistent fences behave as both an Acquire and a
Release fence, and additionally provide a total ordering with some
complicated guarantees; see the C++ standard for details.

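A rough IR sketch of the load equivalence described above (hypothetical
function names; the two forms are roughly, not exactly, equivalent):

.. code-block:: llvm

  define i32 @acquire_load(ptr %p) {
    %v = load atomic i32, ptr %p acquire, align 4
    ret i32 %v
  }

  define i32 @monotonic_load_then_fence(ptr %p) {
    ; Roughly equivalent to the Acquire load above.
    %v = load atomic i32, ptr %p monotonic, align 4
    fence acquire
    ret i32 %v
  }
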
Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can be
impossible to generate.

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return ``undef``.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary and a race is impossible. (The
  precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable.  Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals, but
  atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not. See
  `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store.  This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends are
  also expected to generate an i8 store as an i8 store, and not an instruction
  which writes to surrounding bytes.  (If you are writing a backend for an
  architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvm-dev.)

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior.  It also
guarantees the operation to be lock-free, so it does not depend on the data
being part of a special atomic structure or depend on a separate per-process
global lock.  Note that code generation will fail for unsupported atomic
operations; if you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for accesses of a native width, but can be expensive or unavailable
  for wider accesses, like a 64-bit store on ARM. (A frontend for Java or other
  "safe" languages would normally split a 64-bit store on ARM into two 32-bit
  unordered stores, as sketched below.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms a
  single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise.  Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored.  A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

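For illustration, here is a rough IR sketch of the 64-bit split mentioned in the
frontend notes above (the function name is hypothetical, a little-endian layout
is assumed, and a real frontend would have to split the corresponding loads in
the same way):

.. code-block:: llvm

  define void @store_long_split(ptr %p, i64 %v) {
    ; Split a 64-bit store into two 32-bit unordered stores, for a target
    ; without a single lock-free 64-bit store.
    %lo = trunc i64 %v to i32
    %hi64 = lshr i64 %v, 32
    %hi = trunc i64 %hi64 to i32
    %hi.addr = getelementptr i32, ptr %p, i64 1
    store atomic i32 %lo, ptr %p unordered, align 4
    store atomic i32 %hi, ptr %hi.addr unordered, align 4
    ret void
  }
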
Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.

Relevant standard
  This corresponds to the C++/C ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.  The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct.  Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter, as sketched below), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the relevant
  memory location (and alias analysis will take advantage of that). In addition,
  it is legal to reorder non-atomic and Unordered loads around Monotonic
  loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
  operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  For loads and stores, code generation is essentially the same as for
  Unordered.  No fences are required.  ``cmpxchg`` and ``atomicrmw`` are
  required to appear as a single operation.

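For example, a simple event counter that protects no other memory might be
emitted as a Monotonic ``atomicrmw`` (the function name is hypothetical):

.. code-block:: llvm

  define i64 @count_event(ptr %counter) {
    ; No other memory is published by this operation, so monotonic
    ; ordering is sufficient.
    %old = atomicrmw add ptr %counter, i64 1 monotonic
    ret i64 %old
  }
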
Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.

Relevant standard
  This corresponds to the C++/C ``memory_order_acquire``. It should also be
  used for C++/C ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call.  It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant today
  except x86 and SPARC) require some sort of fence to maintain the Acquire
  semantics.  The precise fences required vary widely by architecture, but for
  a simple implementation, most architectures provide a barrier which is strong
  enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.).  Putting
  such a fence after the equivalent Monotonic operation is sufficient to
  maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++/C ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation (a small lock/unlock sketch appears at the end of this section).

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call.  It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

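As an illustration of the Acquire/Release pairing described above, here is a
minimal spinlock sketch (the function names are hypothetical; a realistic lock
would use some form of back-off rather than a bare retry loop):

.. code-block:: llvm

  define void @lock(ptr %l) {
  entry:
    br label %retry
  retry:
    ; Try to change the lock word from 0 (free) to 1 (held). The Acquire
    ; success ordering pairs with the Release store in @unlock.
    %res = cmpxchg ptr %l, i32 0, i32 1 acquire monotonic
    %ok = extractvalue { i32, i1 } %res, 1
    br i1 %ok, label %done, label %retry
  done:
    ret void
  }

  define void @unlock(ptr %l) {
    ; The Release store publishes everything written while the lock was held.
    store atomic i32 0, ptr %l release, align 4
    ret void
  }
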
AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++/C ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.

342  This corresponds to the C++/C ``memory_order_seq_cst``, Java volatile, and
343  the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.
344
345Notes for frontends
346  If a frontend is exposing atomic operations, these are much easier to reason
347  about for the programmer than other kinds of operations, and using them is
348  generally a practical performance tradeoff.
349
350Notes for optimizers
351  Optimizers not aware of atomics can treat this like a nothrow call.  For
352  SequentiallyConsistent loads and stores, the same reorderings are allowed as
353  for Acquire loads and Release stores, except that SequentiallyConsistent
354  operations may not be reordered.
355
356Notes for code generation
357  SequentiallyConsistent loads minimally require the same barriers as Acquire
358  operations and SequentiallyConsistent stores require Release
359  barriers. Additionally, the code generator must enforce ordering between
360  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
361  is usually done by emitting either a full fence before the loads or a full
362  fence after the stores; which is preferred varies by architecture.
363
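The classic pattern that needs this total ordering is a store to one location
followed by a load of a different location, as in Dekker-style mutual
exclusion. A minimal sketch (the function name is hypothetical; with any
ordering weaker than ``seq_cst``, both threads could observe 0):

.. code-block:: llvm

  define i32 @publish_and_check(ptr %my_flag, ptr %other_flag) {
    ; Thread A calls this with (@flag_a, @flag_b); thread B calls it with
    ; (@flag_b, @flag_a). The seq_cst total order prevents both threads
    ; from reading 0 from the other thread's flag.
    store atomic i32 1, ptr %my_flag seq_cst, align 4
    %other = load atomic i32, ptr %other_flag seq_cst, align 4
    ret i32 %other
  }
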
Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic.  This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* ``isStrongerThan`` / ``isAtLeastOrStrongerThan``: These are predicates on
  orderings. They can be useful for passes that are aware of atomics, for
  example to do DSE across a single atomic access, but not across a
  release-acquire pair (see MemoryDependencyAnalysis for an example of this).

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done.  If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores.  It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop.  It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores.  Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important. It is possible in some cases for DSE to operate across a stronger
  atomic operation, but it is fairly tricky. DSE delegates this reasoning to
  MemoryDependencyAnalysis (which is also used by other passes like GVN).

* Folding a load: Any atomic load from a constant global can be constant-folded,
  because it cannot be observed.  Similar reasoning allows sroa with
  atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
``shouldInsertFencesForAtomic()`` returns true.

The MachineMemOperand for all atomic operations is currently marked as volatile;
this is not correct in the IR sense of volatile, but CodeGen handles anything
marked volatile very conservatively.  This should get fixed at some point.

One very important property of the atomic operations is that if your backend
supports any inline lock-free atomic operations of a given size, you should
support *ALL* operations of that size in a lock-free manner.

When the target implements atomic ``cmpxchg`` or LL/SC instructions (as most
do), this is trivial: all the other operations can be implemented on top of
those primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386)
there are atomic load and store instructions, but no ``cmpxchg`` or LL/SC.
Since it would be invalid to implement ``atomic load`` using the native
instruction while implementing ``cmpxchg`` with a library call to a function
that uses a mutex, ``atomic load`` must *also* expand to a library call on such
architectures, so that it remains atomic with regard to a simultaneous
``cmpxchg`` by using the same mutex.

AtomicExpandPass can help with that: it will expand all atomic operations to the
proper ``__atomic_*`` libcalls for any size above the maximum set by
``setMaxAtomicSizeInBitsSupported`` (which defaults to 0).

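For example, on a target whose maximum supported atomic width is 64 bits, a
frontend can still emit IR like the following (the function name is
hypothetical); AtomicExpandPass then rewrites it into a call to one of the
``__atomic_load*`` library functions described in the libcall section below:

.. code-block:: llvm

  define i128 @load_i128(ptr %p) {
    ; Wider than MaxAtomicSizeInBitsSupported on this hypothetical target,
    ; so this becomes an __atomic_* libcall rather than inline code.
    %v = load atomic i128, ptr %p seq_cst, align 16
    ret i128 %v
  }
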
On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated.  ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction.  ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``.  Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.

On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
and SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions.  ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.).

It is often easiest for backends to use AtomicExpandPass to lower some of the
atomic constructs. Here are some lowerings it can do:

* cmpxchg -> loop with load-linked/store-conditional
  by overriding ``shouldExpandAtomicCmpXchgInIR()``, ``emitLoadLinked()``,
  ``emitStoreConditional()``
* large loads/stores -> ll-sc/cmpxchg
  by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
* strong atomic accesses -> monotonic accesses + fences by overriding
  ``shouldInsertFencesForAtomic()``, ``emitLeadingFence()``, and
  ``emitTrailingFence()``
* atomic rmw -> loop with cmpxchg or load-linked/store-conditional
  by overriding ``expandAtomicRMWInIR()`` (the cmpxchg-loop form is sketched
  below)
* expansion to ``__atomic_*`` libcalls for unsupported sizes.
* part-word atomicrmw/cmpxchg -> target-specific intrinsic by overriding
  ``shouldExpandAtomicRMWInIR``, ``emitMaskedAtomicRMWIntrinsic``,
  ``shouldExpandAtomicCmpXchgInIR``, and ``emitMaskedAtomicCmpXchgIntrinsic``.

For an example of these, look at the ARM (first five lowerings) or RISC-V (last
lowering) backend.

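As a rough sketch, the cmpxchg-loop expansion of an ``atomicrmw add`` produces
IR shaped like the following (the exact IR emitted by AtomicExpandPass differs
in details such as block names and the orderings used):

.. code-block:: llvm

  define i32 @atomic_add_expanded(ptr %p, i32 %val) {
  entry:
    %init = load i32, ptr %p, align 4
    br label %loop
  loop:
    %old = phi i32 [ %init, %entry ], [ %observed, %loop ]
    %new = add i32 %old, %val
    ; Try to install %new; if another thread changed *%p in the meantime,
    ; retry with the freshly observed value.
    %res = cmpxchg ptr %p, i32 %old, i32 %new seq_cst seq_cst
    %observed = extractvalue { i32, i1 } %res, 0
    %ok = extractvalue { i32, i1 } %res, 1
    br i1 %ok, label %done, label %loop
  done:
    ret i32 %old
  }
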
AtomicExpandPass supports two strategies for lowering atomicrmw/cmpxchg to
load-linked/store-conditional (LL/SC) loops. The first expands the LL/SC loop
in IR, calling target lowering hooks to emit intrinsics for the LL and SC
operations. However, many architectures have strict requirements for LL/SC
loops to ensure forward progress, such as restrictions on the number and type
of instructions in the loop. It isn't possible to enforce these restrictions
when the loop is expanded in LLVM IR, and so affected targets may prefer to
expand to LL/SC loops at a very late stage (i.e. after register allocation).
AtomicExpandPass can help support lowering of part-word atomicrmw or cmpxchg
using this strategy by producing IR for any shifting and masking that can be
performed outside of the LL/SC loop.

Libcalls: __atomic_*
====================

There are two kinds of atomic library calls that are generated by LLVM. Please
note that both sets of library functions somewhat confusingly share the names of
builtin functions defined by clang. Despite this, the library functions are
not directly related to the builtins: it is *not* the case that ``__atomic_*``
builtins lower to ``__atomic_*`` library calls and ``__sync_*`` builtins lower
to ``__sync_*`` library calls.

The first set of library functions are named ``__atomic_*``. This set has been
"standardized" by GCC, and is described below. (See also `GCC's documentation
<https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary>`_)

LLVM's AtomicExpandPass will translate atomic operations on data sizes above
``MaxAtomicSizeInBitsSupported`` into calls to these functions.

There are four generic functions, which can be called with data of any size or
alignment::

   void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
   void __atomic_store(size_t size, void *ptr, void *val, int ordering)
   void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
   bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)

There are also size-specialized versions of the above functions, which can only
be used with *naturally-aligned* pointers of the appropriate size. In the
signatures below, "N" is one of 1, 2, 4, 8, and 16, and "iN" is the appropriate
integer type of that size; if no such integer type exists, the specialization
cannot be used::

   iN __atomic_load_N(iN *ptr, int ordering)
   void __atomic_store_N(iN *ptr, iN val, int ordering)
   iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
   bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)

Finally there are some read-modify-write functions, which are only available in
the size-specific variants (any other sizes use a ``__atomic_compare_exchange``
loop)::

   iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
   iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
   iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
   iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
   iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
   iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)

This set of library functions has some interesting implementation requirements
to take note of:

- They support all sizes and alignments -- including those which cannot be
  implemented natively on any existing hardware. Therefore, they will certainly
  use mutexes for some sizes/alignments.

- As a consequence, they cannot be shipped in a statically linked
  compiler-support library, as they have state which must be shared amongst all
  DSOs loaded in the program. They must be provided in a shared library used by
  all objects.

- The set of atomic sizes supported lock-free must be a superset of the sizes
  any compiler can emit. That is: if a new compiler introduces support for
  inline-lock-free atomics of size N, the ``__atomic_*`` functions must also have a
  lock-free implementation for size N. This is a requirement so that code
  produced by an old compiler (which will have called the ``__atomic_*`` function)
  interoperates with code produced by the new compiler (which will use the
  native atomic instruction).

Note that it's possible to write an entirely target-independent implementation
of these library functions by using the compiler atomic builtins themselves to
implement the operations on naturally-aligned pointers of supported sizes, and a
generic mutex implementation otherwise.

Libcalls: __sync_*
==================

Some targets or OS/target combinations can support lock-free atomics, but for
various reasons, it is not practical to emit the instructions inline.

There are two typical examples of this.

Some CPUs support multiple instruction sets which can be switched back and forth
on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which
has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly,
has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic
instructions are not encodable. However, those instructions are available via a
function call to a function with the longer encoding.

Additionally, a few OS/target pairs provide kernel-supported lock-free
atomics. ARM/Linux is an example of this: the kernel `provides
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`_ a
function which on older CPUs contains a "magically-restartable" atomic sequence
(which looks atomic so long as there's only one CPU), and contains actual atomic
instructions on newer multicore models. This sort of functionality can typically
be provided on any architecture, if all CPUs which are missing atomic
compare-and-swap support are uniprocessor (no SMP). This is almost always the
case. The only common architecture without that property is SPARC -- SPARCV8 SMP
systems were common, yet it doesn't support any sort of compare-and-swap
operation.

Some targets (like RISC-V) support a ``+forced-atomics`` target feature, which
enables the use of lock-free atomics even if LLVM is not aware of any specific
OS support for them. In this case, the user is responsible for ensuring that the
necessary ``__sync_*`` implementations are available. Code using
``+forced-atomics`` is ABI-incompatible with code not using the feature, if
atomic variables cross the ABI boundary.

In either of these cases, the Target in LLVM can claim support for atomics of an
appropriate size, and then implement some subset of the operations via libcalls
to a ``__sync_*`` function. Such functions *must* not use locks in their
implementation, because unlike the ``__atomic_*`` routines used by
AtomicExpandPass, these may be mixed-and-matched with native instructions by the
target lowering.

Further, these routines do not need to be shared, as they are stateless. So,
there is no issue with having multiple copies included in one binary. Thus,
typically these routines are implemented by the statically-linked compiler
runtime support library.

LLVM will emit a call to an appropriate ``__sync_*`` routine if the target
ISelLowering code has set the corresponding ``ATOMIC_CMPXCHG``, ``ATOMIC_SWAP``,
or ``ATOMIC_LOAD_*`` operation to "Expand", and if it has opted into the
availability of those library functions via a call to ``initSyncLibcalls()``.

The full set of functions that may be called by LLVM is (for ``N`` being 1, 2,
4, 8, or 16)::

  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
  iN __sync_fetch_and_add_N(iN *ptr, iN val)
  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
  iN __sync_fetch_and_and_N(iN *ptr, iN val)
  iN __sync_fetch_and_or_N(iN *ptr, iN val)
  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
  iN __sync_fetch_and_max_N(iN *ptr, iN val)
  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
  iN __sync_fetch_and_min_N(iN *ptr, iN val)
  iN __sync_fetch_and_umin_N(iN *ptr, iN val)

This list doesn't include any function for atomic load or store; all known
architectures support atomic loads and stores directly (possibly by emitting a
fence on either side of a normal load or store).

There is also, somewhat separately, the possibility of lowering ``ATOMIC_FENCE``
to ``__sync_synchronize()``. This may or may not happen independently of all the
above, controlled purely by ``setOperationAction(ISD::ATOMIC_FENCE, ...)``.

On AArch64, a variant of the ``__sync_*`` routines is used which contains the
memory order as part of the function name. These routines may determine at
runtime whether the single-instruction atomic operations which were introduced
as part of the AArch64 Large System Extensions ("LSE") instruction set are
available, or whether they need to fall back to an LL/SC loop. The following
helper functions are implemented in both the ``compiler-rt`` and ``libgcc``
libraries (``N`` is one of 1, 2, 4, 8, and ``M`` is one of 1, 2, 4, 8, and 16,
and ``ORDER`` is one of 'relax', 'acq', 'rel', 'acq_rel')::

  iM __aarch64_casM_ORDER(iM expected, iM desired, iM *ptr)
  iN __aarch64_swpN_ORDER(iN val, iN *ptr)
  iN __aarch64_ldaddN_ORDER(iN val, iN *ptr)
  iN __aarch64_ldclrN_ORDER(iN val, iN *ptr)
  iN __aarch64_ldeorN_ORDER(iN val, iN *ptr)
  iN __aarch64_ldsetN_ORDER(iN val, iN *ptr)

Please note that if the LSE instruction set is specified for the AArch64 target,
out-of-line atomics calls are not generated and single-instruction atomic
operations are used instead.
659