xref: /llvm-project/llvm/docs/AArch64SME.rst (revision d313614b60ff1194f48e5f0b1bb8d63d2b7eb52d)
1*****************************************************
2Support for AArch64 Scalable Matrix Extension in LLVM
3*****************************************************
4
5.. contents::
6   :local:
7
81. Introduction
9===============
10
11The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of
12attributes for users to control PSTATE.SM and PSTATE.ZA.
13The :ref:`AArch64 SME ABI<aarch64_sme_abi>` describes the requirements for
14calls between functions when at least one of those functions uses PSTATE.SM or
15PSTATE.ZA.
16
17This document describes how the SME ACLE attributes map to LLVM IR
18attributes and how LLVM lowers these attributes to implement the rules and
19requirements of the ABI.
20
21Below we describe the LLVM IR attributes and their relation to the C/C++
22level ACLE attributes:
23
24``aarch64_pstate_sm_enabled``
25    is used for functions with ``__arm_streaming``
26
27``aarch64_pstate_sm_compatible``
28    is used for functions with ``__arm_streaming_compatible``
29
30``aarch64_pstate_sm_body``
31  is used for functions with ``__arm_locally_streaming`` and is
32  only valid on function definitions (not declarations)
33
34``aarch64_new_za``
35  is used for functions with ``__arm_new("za")``
36
37``aarch64_in_za``
38  is used for functions with ``__arm_in("za")``
39
40``aarch64_out_za``
41  is used for functions with ``__arm_out("za")``
42
43``aarch64_inout_za``
44  is used for functions with ``__arm_inout("za")``
45
46``aarch64_preserves_za``
47  is used for functions with ``__arm_preserves("za")``
48
49``aarch64_expanded_pstate_za``
50  is used for functions with ``__arm_new_za``
51
52Clang must ensure that the above attributes are added both to the
53function's declaration/definition as well as to their call-sites. This is
54important for calls to attributed function pointers, where there is no
55definition or declaration available.
56
57
582. Handling PSTATE.SM
59=====================
60
61When changing PSTATE.SM the execution of FP/vector operations may be transferred
62to another processing element. This has three important implications:
63
64* The runtime SVE vector length may change.
65
66* The contents of FP/AdvSIMD/SVE registers are zeroed.
67
68* The set of allowable instructions changes.
69
70This leads to certain restrictions on IR and optimizations. For example, it
71is undefined behaviour to share vector-length dependent state between functions
72that may operate with different values for PSTATE.SM. Front-ends must honour
73these restrictions when generating LLVM IR.
74
75Even though the runtime SVE vector length may change, for the purpose of LLVM IR
76and almost all parts of CodeGen we can assume that the runtime value for
77``vscale`` does not. If we let the compiler insert the appropriate ``smstart``
78and ``smstop`` instructions around call boundaries, then the effects on SVE
79state can be mitigated. By limiting the state changes to a very brief window
80around the call we can control how the operations are scheduled and how live
81values remain preserved between state transitions.
82
83In order to control PSTATE.SM at this level of granularity, we use function and
84callsite attributes rather than intrinsics.
85
86
87Restrictions on attributes
88--------------------------
89
90* It is undefined behaviour to pass or return (pointers to) scalable vector
91  objects to/from functions which may use a different SVE vector length.
92  This includes functions with a non-streaming interface, but marked with
93  ``aarch64_pstate_sm_body``.
94
95* It is not allowed for a function to be decorated with both
96  ``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``.
97
98* It is not allowed for a function to be decorated with more than one of the
99  following attributes:
100  ``aarch64_new_za``, ``aarch64_in_za``, ``aarch64_out_za``, ``aarch64_inout_za``,
101  ``aarch64_preserves_za``.
102
103These restrictions also apply in the higher level SME ACLE, which means we can
104emit diagnostics in Clang to signal users about incorrect behaviour.
105
106
107Compiler inserted streaming-mode changes
108----------------------------------------
109
110The table below describes the transitions in PSTATE.SM the compiler has to
111account for when doing calls between functions with different attributes.
112In this table, we use the following abbreviations:
113
114``N``
115  functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on
116  return)
117
118``S``
119  functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1
120  on return)
121
122``SC``
123  functions with a Streaming-Compatible interface (PSTATE.SM can be
124  either 0 or 1 on entry, and is unchanged on return).
125
126Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this
127table because for the caller the attribute is synonymous to 'streaming', and
128for the callee it is merely an implementation detail that is explicitly not
129exposed to the caller.
130
131.. table:: Combinations of calls for functions with different attributes
132
133   ==== ==== =============================== ============================== ==============================
134   From To   Before call                     After call                     After exception
135   ==== ==== =============================== ============================== ==============================
136   N    N
137   N    S    SMSTART                         SMSTOP
138   N    SC
139   S    N    SMSTOP                          SMSTART                        SMSTART
140   S    S                                                                   SMSTART
141   S    SC                                                                  SMSTART
142   SC   N    If PSTATE.SM before call is 1,  If PSTATE.SM before call is 1, If PSTATE.SM before call is 1,
143             then SMSTOP                     then SMSTART                   then SMSTART
144   SC   S    If PSTATE.SM before call is 0,  If PSTATE.SM before call is 0, If PSTATE.SM before call is 1,
145             then SMSTART                    then SMSTOP                    then SMSTART
146   SC   SC                                                                  If PSTATE.SM before call is 1,
147                                                                            then SMSTART
148   ==== ==== =============================== ============================== ==============================
149
150
151Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit
152the ``smstart`` and ``smstop`` instructions before register allocation, so that
153the register allocator can spill/reload registers around the mode change.
154
155The compiler should also have sufficient information on which operations are
156part of the call/function's arguments/result and which operations are part of
157the function's body, so that it can place the mode changes in exactly the right
158position. The suitable place to do this seems to be SelectionDAG, where it lowers
159the call's arguments/return values to implement the specified calling convention.
160SelectionDAG provides Chains and Glue to specify the order of operations and give
161preliminary control over the instruction's scheduling.
162
163
164Example of preserving state
165---------------------------
166
167When passing and returning a ``float`` value to/from a function
168that has a streaming interface from a function that has a normal interface, the
169call-site will need to ensure that the argument/result registers are preserved
170and that no other code is scheduled in between the ``smstart/smstop`` and the call.
171
172.. code-block:: llvm
173
174    define float @foo(float %f) nounwind {
175      %res = call float @bar(float %f) "aarch64_pstate_sm_enabled"
176      ret float %res
177    }
178
179    declare float @bar(float) "aarch64_pstate_sm_enabled"
180
181The program needs to preserve the value of the floating point argument and
182return value in register ``s0``:
183
184.. code-block:: none
185
186    foo:                                    // @foo
187    // %bb.0:
188            stp     d15, d14, [sp, #-80]!           // 16-byte Folded Spill
189            stp     d13, d12, [sp, #16]             // 16-byte Folded Spill
190            stp     d11, d10, [sp, #32]             // 16-byte Folded Spill
191            stp     d9, d8, [sp, #48]               // 16-byte Folded Spill
192            str     x30, [sp, #64]                  // 8-byte Folded Spill
193            str     s0, [sp, #76]                   // 4-byte Folded Spill
194            smstart sm
195            ldr     s0, [sp, #76]                   // 4-byte Folded Reload
196            bl      bar
197            str     s0, [sp, #76]                   // 4-byte Folded Spill
198            smstop  sm
199            ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload
200            ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload
201            ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload
202            ldr     s0, [sp, #76]                   // 4-byte Folded Reload
203            ldr     x30, [sp, #64]                  // 8-byte Folded Reload
204            ldp     d15, d14, [sp], #80             // 16-byte Folded Reload
205            ret
206
207Setting the correct register masks on the ISD nodes and inserting the
208``smstart/smstop`` in the right places should ensure this is done correctly.
209
210
211Instruction Selection Nodes
212---------------------------
213
214.. code-block:: none
215
216  AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
217  AArch64ISD::SMSTOP  Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
218
219The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState`` operand for
220the case of a conditional SMSTART/SMSTOP. The instruction will only be executed
221if CurrentState != ExpectedState.
222
223When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time
224(i.e. they are both constants) then an unconditional ``smstart/smstop``
225instruction is emitted. Otherwise the node is matched to a Pseudo instruction
226which expands to a compare/branch and a ``smstart/smstop``. This is necessary to
227implement transitions from ``SC -> N`` and ``SC -> S``.
228
229
230Unchained Function calls
231------------------------
232When a function with "``aarch64_pstate_sm_enabled``" calls a function that is not
233streaming compatible, the compiler has to insert a SMSTOP before the call and
234insert a SMSTOP after the call.
235
236If the function that is called is an intrinsic with no side-effects which in
237turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to
238``@llvm.cos()`` is not part of any Chain; it can be scheduled freely.
239
240Lowering of a Callsite creates a small chain of nodes which:
241
242- starts a call sequence
243
244- copies input values from virtual registers to physical registers specified by
245  the ABI
246
247- executes a branch-and-link
248
249- stops the call sequence
250
251- copies the output values from their physical registers to virtual registers
252
253When the callsite's Chain is not used, only the result value from the chained
254sequence is used, but the Chain itself is discarded.
255
256The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real
257values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't
258used, these nodes are not considered for scheduling and are
259removed from the DAG.  In order to prevent these nodes
260from being removed, we need a way to ensure the results from the
261``CopyFromReg`` can only be **used after** the ``SMSTART/SMSTOP`` has been
262executed.
263
264We can use a CopyToReg -> CopyFromReg sequence for this, which moves the
265value to/from a virtual register and chains these nodes with the
266SMSTART/SMSTOP to make them part of the expression that calculates
267the result value. The resulting COPY nodes are removed by the register
268allocator.
269
270The example below shows how this is used in a DAG that does not link
271together the result by a Chain, but rather by a value:
272
273.. code-block:: none
274
275               t0: ch,glue = AArch64ISD::SMSTOP ...
276             t1: ch,glue = ISD::CALL ....
277           t2: res,ch,glue = CopyFromReg t1, ...
278         t3: ch,glue = AArch64ISD::SMSTART t2:1, ....   <- this is now part of the expression that returns the result value.
279       t4: ch = CopyToReg t3, Register:f64 %vreg, t2
280     t5: res,ch = CopyFromReg t4, Register:f64 %vreg
281   t6: res = FADD t5, t9
282
283We also need this for locally streaming functions, where an ``SMSTART`` needs to
284be inserted into the DAG at the start of the function.
285
286Functions with __attribute__((arm_locally_streaming))
287-----------------------------------------------------
288
289If a function is marked as ``arm_locally_streaming``, then the runtime SVE
290vector length in the prologue/epilogue may be different from the vector length
291in the function's body. This happens because we invoke smstart after setting up
292the stack-frame and similarly invoke smstop before deallocating the stack-frame.
293
294To ensure we use the correct SVE vector length to allocate the locals with, we
295can use the streaming vector-length to allocate the stack-slots through the
296``ADDSVL`` instruction, even when the CPU is not yet in streaming mode.
297
298This only works for locals and not callee-save slots, since LLVM doesn't support
299mixing two different scalable vector lengths in one stack frame. That means that the
300case where a function is marked ``arm_locally_streaming`` and needs to spill SVE
301callee-saves in the prologue is currently unsupported.  However, it is unlikely
302for this to happen without user intervention, because ``arm_locally_streaming``
303functions cannot take or return vector-length-dependent values. This would otherwise
304require forcing both the SVE PCS using '``aarch64_sve_pcs``' combined with using
305``arm_locally_streaming`` in order to encounter this problem. This combination
306can be prevented in Clang through emitting a diagnostic.
307
308
309An example of how the prologue/epilogue would look for a function that is
310attributed with ``arm_locally_streaming``:
311
312.. code-block:: c++
313
314    #define N 64
315
316    void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *);
317
318    // Use a float argument type, to check the value isn't clobbered by smstart.
319    // Use a float return type to check the value isn't clobbered by smstop.
320    float __attribute__((noinline, arm_locally_streaming)) foo(float arg) {
321      // Create local for SVE vector to check local is created with correct
322      // size when not yet in streaming mode (ADDSVL).
323      float array[N];
324      svfloat32_t vector;
325
326      some_use(&vector);
327      svst1_f32(svptrue_b32(), &array[0], vector);
328      return array[N - 1] + arg;
329    }
330
331should use ADDSVL for allocating the stack space and should avoid clobbering
332the return/argument values.
333
334.. code-block:: none
335
336    _Z3foof:                                // @_Z3foof
337    // %bb.0:                               // %entry
338            stp     d15, d14, [sp, #-96]!           // 16-byte Folded Spill
339            stp     d13, d12, [sp, #16]             // 16-byte Folded Spill
340            stp     d11, d10, [sp, #32]             // 16-byte Folded Spill
341            stp     d9, d8, [sp, #48]               // 16-byte Folded Spill
342            stp     x29, x30, [sp, #64]             // 16-byte Folded Spill
343            add     x29, sp, #64
344            str     x28, [sp, #80]                  // 8-byte Folded Spill
345            addsvl  sp, sp, #-1
346            sub     sp, sp, #256
347            str     s0, [x29, #28]                  // 4-byte Folded Spill
348            smstart sm
349            sub     x0, x29, #64
350            addsvl  x0, x0, #-1
351            bl      _Z10some_usePu13__SVFloat32_t
352            sub     x8, x29, #64
353            ptrue   p0.s
354            ld1w    { z0.s }, p0/z, [x8, #-1, mul vl]
355            ldr     s1, [x29, #28]                  // 4-byte Folded Reload
356            st1w    { z0.s }, p0, [sp]
357            ldr     s0, [sp, #252]
358            fadd    s0, s0, s1
359            str     s0, [x29, #28]                  // 4-byte Folded Spill
360            smstop  sm
361            ldr     s0, [x29, #28]                  // 4-byte Folded Reload
362            addsvl  sp, sp, #1
363            add     sp, sp, #256
364            ldp     x29, x30, [sp, #64]             // 16-byte Folded Reload
365            ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload
366            ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload
367            ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload
368            ldr     x28, [sp, #80]                  // 8-byte Folded Reload
369            ldp     d15, d14, [sp], #96             // 16-byte Folded Reload
370            ret
371
372
373Preventing the use of illegal instructions in Streaming Mode
374------------------------------------------------------------
375
376* When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2
377  instructions and most AdvSIMD/NEON instructions are invalid.
378
379* When executing a program in normal mode (PSTATE.SM=0), a subset of SME
380  instructions are invalid.
381
382* Streaming-compatible functions must only use instructions that are valid when
383  either PSTATE.SM=0 or PSTATE.SM=1.
384
385The value of PSTATE.SM is not controlled by the feature flags, but rather by the
386function attributes. This means that we can compile for '``+sme``' and the compiler
387will code-generate any instructions, even if they are not legal under the requested
388streaming mode. The compiler needs to use the function attributes to ensure the
389compiler doesn't do transformations under the assumption that certain operations
390are available at runtime.
391
392We made a conscious choice not to model this with feature flags, because we
393still want to support inline-asm in either mode (with the user placing
394smstart/smstop manually), and this became rather complicated to implement at the
395individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_
396and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in
397TableGen.
398
399As a first step, this means we'll disable vectorization (LoopVectorize/SLP)
400entirely when the a function has either of the ``aarch64_pstate_sm_enabled``,
401``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
402in order to avoid the use of vector instructions.
403
404Later on we'll aim to relax these restrictions to enable scalable
405auto-vectorization with a subset of streaming-compatible instructions, but that
406requires changes to the CostModel, Legalization and SelectionDAG lowering.
407
408We will also emit diagnostics in Clang to prevent the use of
409non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a
410function is decorated with the streaming mode attributes.
411
412
413Other things to consider
414------------------------
415
416* Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
417  when the callee's function body is executed in a different streaming mode than
418  its caller. This is needed because function calls are the boundaries for
419  streaming mode changes.
420
421* Tail call optimization must be disabled when the call-site needs to toggle
422  PSTATE.SM, such that the caller can restore the original value of PSTATE.SM.
423
424
4253. Handling PSTATE.ZA
426=====================
427
428In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector
429length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe
430to toggle PSTATE.ZA using intrinsics. This also makes it simpler to setup a
431lazy-save mechanism for calls to private-ZA functions (i.e. functions that may
432either directly or indirectly clobber ZA state).
433
434For the purpose of handling functions marked with ``aarch64_new_za``,
435we have introduced a new LLVM IR pass (SMEABIPass) that is run just before
436SelectionDAG. Any such functions dealt with by this pass are marked with
437``aarch64_expanded_pstate_za``.
438
439Setting up a lazy-save
440----------------------
441
442Committing a lazy-save
443----------------------
444
445Exception handling and ZA
446-------------------------
447
4484. Types
449========
450
451AArch64 Predicate-as-Counter Type
452---------------------------------
453
454:Overview:
455
456The predicate-as-counter type represents the type of a predicate-as-counter
457value held in a AArch64 SVE predicate register. Such a value contains
458information about the number of active lanes, the element width and a bit that
459tells whether the generated mask should be inverted. ACLE intrinsics should be
460used to move the predicate-as-counter value to/from a predicate vector.
461
462There are certain limitations on the type:
463
464* The type can be used for function parameters and return values.
465
466* The supported LLVM operations on this type are limited to ``load``, ``store``,
467  ``phi``, ``select`` and ``alloca`` instructions.
468
469The predicate-as-counter type is a scalable type.
470
471:Syntax:
472
473::
474
475      target("aarch64.svcount")
476
477
478
4795. References
480=============
481
482    .. _aarch64_sme_acle:
483
4841.  `SME ACLE Pull-request <https://github.com/ARM-software/acle/pull/188>`__
485
486    .. _aarch64_sme_abi:
487
4882.  `SME ABI Pull-request <https://github.com/ARM-software/abi-aa/pull/123>`__
489