*****************************************************
Support for AArch64 Scalable Matrix Extension in LLVM
*****************************************************

.. contents::
   :local:

1. Introduction
===============

The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of
attributes for users to control PSTATE.SM and PSTATE.ZA.
The :ref:`AArch64 SME ABI<aarch64_sme_abi>` describes the requirements for
calls between functions when at least one of those functions uses PSTATE.SM or
PSTATE.ZA.

This document describes how the SME ACLE attributes map to LLVM IR
attributes and how LLVM lowers these attributes to implement the rules and
requirements of the ABI.

Below we describe the LLVM IR attributes and their relation to the C/C++
level ACLE attributes:

``aarch64_pstate_sm_enabled``
    is used for functions with ``__arm_streaming``

``aarch64_pstate_sm_compatible``
    is used for functions with ``__arm_streaming_compatible``

``aarch64_pstate_sm_body``
    is used for functions with ``__arm_locally_streaming`` and is
    only valid on function definitions (not declarations)

``aarch64_new_za``
    is used for functions with ``__arm_new("za")``

``aarch64_in_za``
    is used for functions with ``__arm_in("za")``

``aarch64_out_za``
    is used for functions with ``__arm_out("za")``

``aarch64_inout_za``
    is used for functions with ``__arm_inout("za")``

``aarch64_preserves_za``
    is used for functions with ``__arm_preserves("za")``

``aarch64_expanded_pstate_za``
    is used for functions with ``__arm_new_za``

Clang must ensure that the above attributes are added both to the
function's declaration/definition and to its call-sites. This is
important for calls to attributed function pointers, where there is no
definition or declaration available.


2. Handling PSTATE.SM
=====================

When changing PSTATE.SM the execution of FP/vector operations may be transferred
to another processing element. This has three important implications:

* The runtime SVE vector length may change.

* The contents of FP/AdvSIMD/SVE registers are zeroed.

* The set of allowable instructions changes.

This leads to certain restrictions on IR and optimizations. For example, it
is undefined behaviour to share vector-length dependent state between functions
that may operate with different values for PSTATE.SM. Front-ends must honour
these restrictions when generating LLVM IR.

Even though the runtime SVE vector length may change, for the purpose of LLVM IR
and almost all parts of CodeGen we can assume that the runtime value for
``vscale`` does not. If we let the compiler insert the appropriate ``smstart``
and ``smstop`` instructions around call boundaries, then the effects on SVE
state can be mitigated. By limiting the state changes to a very brief window
around the call we can control how the operations are scheduled and how live
values remain preserved between state transitions.

In order to control PSTATE.SM at this level of granularity, we use function and
callsite attributes rather than intrinsics.


Restrictions on attributes
--------------------------

* It is undefined behaviour to pass or return (pointers to) scalable vector
  objects to/from functions which may use a different SVE vector length.
  This includes functions with a non-streaming interface, but marked with
  ``aarch64_pstate_sm_body``.

* It is not allowed for a function to be decorated with both
  ``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``.

* It is not allowed for a function to be decorated with more than one of the
  following attributes:
  ``aarch64_new_za``, ``aarch64_in_za``, ``aarch64_out_za``, ``aarch64_inout_za``,
  ``aarch64_preserves_za``.
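
The two mutual-exclusion rules in the list above can be illustrated with a
small standalone check. This is only a sketch: ``has_valid_sme_attrs`` and
``check_attr_examples`` are hypothetical names invented for this example, they
are not part of LLVM or Clang, and the sketch does not cover the
undefined-behaviour rule about scalable vector objects.

.. code-block:: c++

    #include <cassert>
    #include <cstddef>
    #include <set>
    #include <string>

    // Hypothetical helper for illustration only; this is not an LLVM API.
    // Returns true if the given attribute names obey the two
    // mutual-exclusion rules from the list of restrictions.
    bool has_valid_sme_attrs(const std::set<std::string> &attrs) {
      // A function cannot be both streaming and streaming-compatible.
      if (attrs.count("aarch64_pstate_sm_enabled") &&
          attrs.count("aarch64_pstate_sm_compatible"))
        return false;

      // At most one of the ZA state attributes may be present.
      static const char *const za_attrs[] = {
          "aarch64_new_za", "aarch64_in_za", "aarch64_out_za",
          "aarch64_inout_za", "aarch64_preserves_za"};
      std::size_t za_count = 0;
      for (const char *name : za_attrs)
        za_count += attrs.count(name);
      return za_count <= 1;
    }

    // Spot checks: one accepted and one rejected combination.
    void check_attr_examples() {
      assert(has_valid_sme_attrs({"aarch64_pstate_sm_body", "aarch64_new_za"}));
      assert(!has_valid_sme_attrs({"aarch64_in_za", "aarch64_out_za"}));
    }

A set of attribute names is used here only to keep the sketch short; a real
check would operate on the attributes of a function or call-site.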

These restrictions also apply in the higher level SME ACLE, which means we can
emit diagnostics in Clang to signal users about incorrect behaviour.


Compiler inserted streaming-mode changes
----------------------------------------

The table below describes the transitions in PSTATE.SM the compiler has to
account for when doing calls between functions with different attributes.
In this table, we use the following abbreviations:

``N``
    functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on
    return)

``S``
    functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1
    on return)

``SC``
    functions with a Streaming-Compatible interface (PSTATE.SM can be
    either 0 or 1 on entry, and is unchanged on return).

Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this
table because for the caller the attribute is synonymous with 'streaming', and
for the callee it is merely an implementation detail that is explicitly not
exposed to the caller.

.. table:: Combinations of calls for functions with different attributes

    ==== ==== =============================== ============================== ==============================
    From To   Before call                     After call                     After exception
    ==== ==== =============================== ============================== ==============================
    N    N
    N    S    SMSTART                         SMSTOP
    N    SC
    S    N    SMSTOP                          SMSTART                        SMSTART
    S    S                                                                   SMSTART
    S    SC                                                                  SMSTART
    SC   N    If PSTATE.SM before call is 1,  If PSTATE.SM before call is 1, If PSTATE.SM before call is 1,
              then SMSTOP                     then SMSTART                   then SMSTART
    SC   S    If PSTATE.SM before call is 0,  If PSTATE.SM before call is 0, If PSTATE.SM before call is 1,
              then SMSTART                    then SMSTOP                    then SMSTART
    SC   SC                                                                  If PSTATE.SM before call is 1,
                                                                             then SMSTART
    ==== ==== =============================== ============================== ==============================


Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit
the ``smstart`` and ``smstop`` instructions before register allocation, so that
the register allocator can spill/reload registers around the mode change.

The compiler should also have sufficient information on which operations are
part of the call/function's arguments/result and which operations are part of
the function's body, so that it can place the mode changes in exactly the right
position. A suitable place to do this seems to be SelectionDAG, where it lowers
the call's arguments/return values to implement the specified calling convention.
SelectionDAG provides Chains and Glue to specify the order of operations and give
preliminary control over instruction scheduling.
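
The rows of the table above follow a pattern that can be sketched as a small
standalone model. This is only an illustration under stated assumptions, not
LLVM code: the names ``Interface``, ``Op``, ``Transitions``, ``change``,
``transitions`` and ``check_table_rows`` are hypothetical, and the rule used for
the "After exception" column (PSTATE.SM is 0 when the landing pad is entered, so
it has to be restored to the caller's value) is an assumption read off the
table rather than something stated in the text.

.. code-block:: c++

    #include <cassert>

    // Hypothetical model of the table; none of these names exist in LLVM.
    enum class Interface { Normal, Streaming, StreamingCompatible };
    enum class Op { None, SMStart, SMStop };

    struct Transitions {
      Op before_call;
      Op after_call;
      Op after_exception;
    };

    // Operation needed to move PSTATE.SM from value `from` to value `to`.
    Op change(bool from, bool to) {
      if (from == to)
        return Op::None;
      return to ? Op::SMStart : Op::SMStop;
    }

    // `caller_sm` is the value of PSTATE.SM in the caller just before the call:
    // false for an N caller, true for an S caller, and only known at runtime
    // for an SC caller.
    Transitions transitions(bool caller_sm, Interface callee) {
      // A streaming-compatible callee runs in whatever mode the caller is in.
      bool callee_sm = callee == Interface::StreamingCompatible
                           ? caller_sm
                           : callee == Interface::Streaming;
      return {change(caller_sm, callee_sm),  // enter the callee's mode
              change(callee_sm, caller_sm),  // restore the caller's mode
              change(false, caller_sm)};     // assumed PSTATE.SM=0 after exception
    }

    // Spot check of the "S -> N" row of the table.
    void check_table_rows() {
      Transitions t = transitions(true, Interface::Normal);
      assert(t.before_call == Op::SMStop);
      assert(t.after_call == Op::SMStart);
      assert(t.after_exception == Op::SMStart);
    }

For an ``SC`` caller the value of ``caller_sm`` is not known at compile time,
which is why the corresponding rows of the table are conditional on the value
of PSTATE.SM before the call.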


Example of preserving state
---------------------------

When passing and returning a ``float`` value to/from a function
that has a streaming interface from a function that has a normal interface, the
call-site will need to ensure that the argument/result registers are preserved
and that no other code is scheduled in between the ``smstart/smstop`` and the call.

.. code-block:: llvm

    define float @foo(float %f) nounwind {
      %res = call float @bar(float %f) "aarch64_pstate_sm_enabled"
      ret float %res
    }

    declare float @bar(float) "aarch64_pstate_sm_enabled"

The program needs to preserve the value of the floating point argument and
return value in register ``s0``:

.. code-block:: none

    foo:                                    // @foo
    // %bb.0:
            stp     d15, d14, [sp, #-80]!           // 16-byte Folded Spill
            stp     d13, d12, [sp, #16]             // 16-byte Folded Spill
            stp     d11, d10, [sp, #32]             // 16-byte Folded Spill
            stp     d9, d8, [sp, #48]               // 16-byte Folded Spill
            str     x30, [sp, #64]                  // 8-byte Folded Spill
            str     s0, [sp, #76]                   // 4-byte Folded Spill
            smstart sm
            ldr     s0, [sp, #76]                   // 4-byte Folded Reload
            bl      bar
            str     s0, [sp, #76]                   // 4-byte Folded Spill
            smstop  sm
            ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload
            ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload
            ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload
            ldr     s0, [sp, #76]                   // 4-byte Folded Reload
            ldr     x30, [sp, #64]                  // 8-byte Folded Reload
            ldp     d15, d14, [sp], #80             // 16-byte Folded Reload
            ret

Setting the correct register masks on the ISD nodes and inserting the
``smstart/smstop`` in the right places should ensure this is done correctly.


Instruction Selection Nodes
---------------------------

.. code-block:: none

    AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
    AArch64ISD::SMSTOP  Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]

The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState`` operands for
the case of a conditional SMSTART/SMSTOP. The instruction will only be executed
if CurrentState != ExpectedState.

When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time
(i.e. they are both constants) then an unconditional ``smstart/smstop``
instruction is emitted. Otherwise the node is matched to a Pseudo instruction
which expands to a compare/branch and a ``smstart/smstop``. This is necessary to
implement transitions from ``SC -> N`` and ``SC -> S``.


Unchained Function calls
------------------------

When a function with "``aarch64_pstate_sm_enabled``" calls a function that is not
streaming compatible, the compiler has to insert a SMSTOP before the call and
insert a SMSTART after the call.

If the function that is called is an intrinsic with no side-effects which in
turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to
``@llvm.cos()`` is not part of any Chain; it can be scheduled freely.

Lowering of a Callsite creates a small chain of nodes which:

- starts a call sequence

- copies input values from virtual registers to physical registers specified by
  the ABI

- executes a branch-and-link

- stops the call sequence

- copies the output values from their physical registers to virtual registers

When the callsite's Chain is not used, only the result value from the chained
sequence is used, but the Chain itself is discarded.

The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real
values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't
used, these nodes are not considered for scheduling and are
removed from the DAG. In order to prevent these nodes
from being removed, we need a way to ensure the results from the
``CopyFromReg`` can only be **used after** the ``SMSTART/SMSTOP`` has been
executed.

We can use a CopyToReg -> CopyFromReg sequence for this, which moves the
value to/from a virtual register and chains these nodes with the
SMSTART/SMSTOP to make them part of the expression that calculates
the result value. The resulting COPY nodes are removed by the register
allocator.

The example below shows how this is used in a DAG that does not link
together the result by a Chain, but rather by a value:

.. code-block:: none

                t0: ch,glue = AArch64ISD::SMSTOP ...
              t1: ch,glue = ISD::CALL ....
            t2: res,ch,glue = CopyFromReg t1, ...
          t3: ch,glue = AArch64ISD::SMSTART t2:1, ....   <- this is now part of the expression that returns the result value.
        t4: ch = CopyToReg t3, Register:f64 %vreg, t2
      t5: res,ch = CopyFromReg t4, Register:f64 %vreg
    t6: res = FADD t5, t9

We also need this for locally streaming functions, where an ``SMSTART`` needs to
be inserted into the DAG at the start of the function.

Functions with __attribute__((arm_locally_streaming))
-----------------------------------------------------

If a function is marked as ``arm_locally_streaming``, then the runtime SVE
vector length in the prologue/epilogue may be different from the vector length
in the function's body. This happens because we invoke smstart after setting up
the stack-frame and similarly invoke smstop before deallocating the stack-frame.

To ensure we use the correct SVE vector length to allocate the locals with, we
can use the streaming vector-length to allocate the stack-slots through the
``ADDSVL`` instruction, even when the CPU is not yet in streaming mode.

This only works for locals and not callee-save slots, since LLVM doesn't support
mixing two different scalable vector lengths in one stack frame. That means that the
case where a function is marked ``arm_locally_streaming`` and needs to spill SVE
callee-saves in the prologue is currently unsupported. However, it is unlikely
for this to happen without user intervention, because ``arm_locally_streaming``
functions cannot take or return vector-length-dependent values. Encountering
this problem would require forcing the SVE PCS with '``aarch64_sve_pcs``' in
combination with ``arm_locally_streaming``. This combination
can be prevented in Clang through emitting a diagnostic.


An example of how the prologue/epilogue would look for a function that is
attributed with ``arm_locally_streaming``:

.. code-block:: c++

    #include <arm_sve.h>

    #define N 64

    void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *);

    // Use a float argument type, to check the value isn't clobbered by smstart.
    // Use a float return type to check the value isn't clobbered by smstop.
    float __attribute__((noinline, arm_locally_streaming)) foo(float arg) {
      // Create local for SVE vector to check local is created with correct
      // size when not yet in streaming mode (ADDSVL).
      float array[N];
      svfloat32_t vector;

      some_use(&vector);
      svst1_f32(svptrue_b32(), &array[0], vector);
      return array[N - 1] + arg;
    }

This function should use ``ADDSVL`` for allocating the stack space and should
avoid clobbering the return/argument values.

.. code-block:: none

    _Z3foof:                                // @_Z3foof
    // %bb.0:                               // %entry
            stp     d15, d14, [sp, #-96]!           // 16-byte Folded Spill
            stp     d13, d12, [sp, #16]             // 16-byte Folded Spill
            stp     d11, d10, [sp, #32]             // 16-byte Folded Spill
            stp     d9, d8, [sp, #48]               // 16-byte Folded Spill
            stp     x29, x30, [sp, #64]             // 16-byte Folded Spill
            add     x29, sp, #64
            str     x28, [sp, #80]                  // 8-byte Folded Spill
            addsvl  sp, sp, #-1
            sub     sp, sp, #256
            str     s0, [x29, #28]                  // 4-byte Folded Spill
            smstart sm
            sub     x0, x29, #64
            addsvl  x0, x0, #-1
            bl      _Z10some_usePu13__SVFloat32_t
            sub     x8, x29, #64
            ptrue   p0.s
            ld1w    { z0.s }, p0/z, [x8, #-1, mul vl]
            ldr     s1, [x29, #28]                  // 4-byte Folded Reload
            st1w    { z0.s }, p0, [sp]
            ldr     s0, [sp, #252]
            fadd    s0, s0, s1
            str     s0, [x29, #28]                  // 4-byte Folded Spill
            smstop  sm
            ldr     s0, [x29, #28]                  // 4-byte Folded Reload
            addsvl  sp, sp, #1
            add     sp, sp, #256
            ldp     x29, x30, [sp, #64]             // 16-byte Folded Reload
            ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload
            ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload
            ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload
            ldr     x28, [sp, #80]                  // 8-byte Folded Reload
            ldp     d15, d14, [sp], #96             // 16-byte Folded Reload
            ret


Preventing the use of illegal instructions in Streaming Mode
------------------------------------------------------------

* When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2
  instructions and most AdvSIMD/NEON instructions are invalid.

* When executing a program in normal mode (PSTATE.SM=0), a subset of SME
  instructions is invalid.

* Streaming-compatible functions must only use instructions that are valid when
  either PSTATE.SM=0 or PSTATE.SM=1.

The value of PSTATE.SM is not controlled by the feature flags, but rather by the
function attributes. This means that we can compile for '``+sme``' and the compiler
will code-generate any instructions, even if they are not legal under the requested
streaming mode. The compiler needs to use the function attributes to ensure that
it doesn't do transformations under the assumption that certain operations
are available at runtime.

We made a conscious choice not to model this with feature flags, because we
still want to support inline-asm in either mode (with the user placing
smstart/smstop manually), and this became rather complicated to implement at the
individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_
and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in
TableGen.

As a first step, this means we'll disable vectorization (LoopVectorize/SLP)
entirely when a function has any of the ``aarch64_pstate_sm_enabled``,
``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
in order to avoid the use of vector instructions.

Later on we'll aim to relax these restrictions to enable scalable
auto-vectorization with a subset of streaming-compatible instructions, but that
requires changes to the CostModel, Legalization and SelectionDAG lowering.

We will also emit diagnostics in Clang to prevent the use of
non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a
function is decorated with the streaming mode attributes.


Other things to consider
------------------------

* Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
  when the callee's function body is executed in a different streaming mode than
  its caller. This is needed because function calls are the boundaries for
  streaming mode changes.

* Tail call optimization must be disabled when the call-site needs to toggle
  PSTATE.SM, such that the caller can restore the original value of PSTATE.SM.


3. Handling PSTATE.ZA
=====================

In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector
length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe
to toggle PSTATE.ZA using intrinsics. This also makes it simpler to set up a
lazy-save mechanism for calls to private-ZA functions (i.e. functions that may
either directly or indirectly clobber ZA state).

For the purpose of handling functions marked with ``aarch64_new_za``,
we have introduced a new LLVM IR pass (SMEABIPass) that is run just before
SelectionDAG. Any such functions dealt with by this pass are marked with
``aarch64_expanded_pstate_za``.

Setting up a lazy-save
----------------------

Committing a lazy-save
----------------------

Exception handling and ZA
-------------------------

4. Types
========

AArch64 Predicate-as-Counter Type
---------------------------------

:Overview:

The predicate-as-counter type represents the type of a predicate-as-counter
value held in an AArch64 SVE predicate register. Such a value contains
information about the number of active lanes, the element width and a bit that
tells whether the generated mask should be inverted. ACLE intrinsics should be
used to move the predicate-as-counter value to/from a predicate vector.

There are certain limitations on the type:

* The type can be used for function parameters and return values.

* The supported LLVM operations on this type are limited to ``load``, ``store``,
  ``phi``, ``select`` and ``alloca`` instructions.

The predicate-as-counter type is a scalable type.

:Syntax:

::

   target("aarch64.svcount")



5. References
=============

    .. _aarch64_sme_acle:

1. `SME ACLE Pull-request <https://github.com/ARM-software/acle/pull/188>`__

    .. _aarch64_sme_abi:

2. `SME ABI Pull-request <https://github.com/ARM-software/abi-aa/pull/123>`__