llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code in a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help diagnose potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 --target=x86_64 -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 --target=x86_64 -masm=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
directive at the beginning of the input. By default its output syntax matches
that of its input.)

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.
If you see that the performance report is not accurate for a processor,
please `file a bug <https://github.com/llvm/llvm-project/issues>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input. If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.


.. option:: -help

  Print a summary of command line options.

.. option:: -o <filename>

  Use ``<filename>`` as the output filename. See the summary above for more
  details.

.. option:: -mtriple=<target triple>

  Specify a target triple string.

.. option:: -march=<arch>

  Specify the architecture for which to analyze the code. It defaults to the
  host default target.

.. option:: -mcpu=<cpuname>

  Specify the processor for which to analyze the code. By default, the cpu name
  is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

  Specify the output assembly variant for the report generated by the tool.
  On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
  format, while a value of 1 selects the Intel assembly format for the code
  printed out by the tool in the analysis report.

.. option:: -print-imm-hex

  Prefer hex format for numeric literals in the output assembly printed as part
  of the report.

.. option:: -dispatch=<width>

  Specify a different dispatch width for the processor. The dispatch width
  defaults to field 'IssueWidth' in the processor scheduling model. If width is
  zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

  Specify the size of the register file.
  When specified, this flag limits how
  many physical registers are available for register renaming purposes. A value
  of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

  Specify the number of iterations to run. If this flag is set to 0, then the
  tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

  If set, the tool assumes that loads and stores don't alias. This is the
  default behavior.

.. option:: -lqueue=<load queue size>

  Specify the size of the load queue in the load/store unit emulated by the tool.
  By default, the tool assumes an unbounded number of entries in the load queue.
  A value of zero for this flag is ignored, and the default load queue size is
  used instead.

.. option:: -squeue=<store queue size>

  Specify the size of the store queue in the load/store unit emulated by the
  tool. By default, the tool assumes an unbounded number of entries in the store
  queue. A value of zero for this flag is ignored, and the default store queue
  size is used instead.

.. option:: -timeline

  Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

  Limit the number of iterations to print in the timeline view. By default, the
  timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

  Limit the number of cycles in the timeline view, or use 0 for no limit. By
  default, the number of cycles is set to 80.

.. option:: -resource-pressure

  Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

  Enable register file usage statistics.

.. option:: -dispatch-stats

  Enable extra dispatch statistics.
  This view collects and analyzes instruction
  dispatch events, as well as static/dynamic dispatch stall events. This view
  is disabled by default.

.. option:: -scheduler-stats

  Enable extra scheduler statistics. This view collects and analyzes instruction
  issue events. This view is disabled by default.

.. option:: -retire-stats

  Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

  Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

  Enable the printing of instruction encodings within the instruction info view.

.. option:: -show-barriers

  Enable the printing of LoadBarrier and StoreBarrier flags within the
  instruction info view.

.. option:: -all-stats

  Print all hardware statistics. This enables extra statistics related to the
  dispatch logic, the hardware schedulers, the register file(s), and the retire
  control unit. This option is disabled by default.

.. option:: -all-views

  Enable all the views.

.. option:: -instruction-tables

  Prints resource pressure information based on the static information
  available from the processor model. This differs from the resource pressure
  view because it doesn't require the code to be simulated. It instead prints
  the theoretical uniform distribution of resource pressure for every
  instruction in sequence.

.. option:: -bottleneck-analysis

  Print information about bottlenecks that affect the throughput. This analysis
  can be expensive, and it is disabled by default. Bottlenecks are highlighted
  in the summary view. Bottleneck analysis is currently not supported for
  processors with an in-order backend.

.. option:: -json

  Print the requested views in valid JSON format.
  The instructions and the
  processor resources are printed as members of special top level JSON objects.
  The individual views refer to them by index. However, not all views are
  currently supported. For example, the report from the bottleneck analysis is
  not printed out in JSON. All the default views are currently supported.

.. option:: -disable-cb

  Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
  than using the target specific implementation. The generic classes never
  detect any custom hazards or make any post processing modifications to
  instructions.

.. option:: -disable-im

  Force usage of the generic InstrumentManager rather than using the target
  specific implementation. The generic class creates Instruments that provide
  no extra information, and InstrumentManager never overrides the default
  schedule class for a given instruction.

.. option:: -skip-unsupported-instructions=<reason>

  Force :program:`llvm-mca` to continue in the presence of instructions which do
  not parse or lack key scheduling information. Note that the resulting analysis
  is impacted, since those unsupported instructions are ignored as if they were
  not supplied as part of the input.

  The choice of `<reason>` controls when :program:`llvm-mca` will report an
  error. `<reason>` may be `none` (default), `lack-sched`, `parse-failure`, or
  `any`.

EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------

:program:`llvm-mca` allows for the optional usage of special code comments to
mark regions of the assembly code to be analyzed. A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of an analysis region.
A comment starting with substring ``LLVM-MCA-END`` marks the end of a region.
For example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file. Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every analysis region.

Analysis regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active user
defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
    # LLVM-MCA-BEGIN bar
    sub %eax, %edx
    # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
    # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
    a += 42;
    __asm volatile("# LLVM-MCA-END":::"memory");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.

INSTRUMENT REGIONS
------------------

An InstrumentRegion describes a region of assembly code guarded by
special LLVM-MCA comment directives.

.. code-block:: none

  # LLVM-MCA-<INSTRUMENT_TYPE> <data>
  ...  ## asm

where `INSTRUMENT_TYPE` is a type defined by the target, which expects
to use `data`.

A comment starting with substring `LLVM-MCA-<INSTRUMENT_TYPE>`
brings data into scope for llvm-mca to use in its analysis for
all following instructions.

If a comment with the same `INSTRUMENT_TYPE` is found later in the
instruction list, then the original InstrumentRegion will be
automatically ended, and a new InstrumentRegion will begin.

If there are comments with different `INSTRUMENT_TYPE` values,
then both data sets remain available. In contrast with an AnalysisRegion,
an InstrumentRegion does not need a comment to end the region.
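
As an illustrative sketch (using the RISCV LMUL instrument type covered in
detail later in this section), a second comment with the same
`INSTRUMENT_TYPE` implicitly ends the first region:

.. code-block:: none

  # LLVM-MCA-RISCV-LMUL M1
  vadd.vv v2, v2, v2        # analyzed with the M1 data in scope
  # LLVM-MCA-RISCV-LMUL M8  # the M1 region ends here; a new region begins
  vadd.vv v4, v4, v4        # analyzed with the M8 data in scope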
Comments that are prefixed with `LLVM-MCA-` but do not correspond to
a valid `INSTRUMENT_TYPE` for the target cause an error, except for
`BEGIN` and `END`, since those correspond to AnalysisRegions. Comments
that do not start with `LLVM-MCA-` are ignored by :program:`llvm-mca`.

An instruction (a MCInst) is added to an InstrumentRegion R only
if its location is in range [R.RangeStart, R.RangeEnd].

On RISCV targets, vector instructions have different behaviour depending
on the LMUL. Code can be instrumented with a comment that takes the
following form:

.. code-block:: none

  # LLVM-MCA-RISCV-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>

The RISCV InstrumentManager will override the schedule class for vector
instructions to use the scheduling behaviour of its pseudo-instruction
which is LMUL dependent. It makes sense to place RISCV instrument
comments directly after `vset{i}vl{i}` instructions, although
they can be placed anywhere in the program.

Example of a program with no call to `vset{i}vl{i}`:

.. code-block:: none

  # LLVM-MCA-RISCV-LMUL M2
  vadd.vv v2, v2, v2

Example of a program with a call to `vset{i}vl{i}`:

.. code-block:: none

  vsetvli zero, a0, e8, m1, tu, mu
  # LLVM-MCA-RISCV-LMUL M1
  vadd.vv v2, v2, v2

Example of a program with multiple calls to `vset{i}vl{i}`:

.. code-block:: none

  vsetvli zero, a0, e8, m1, tu, mu
  # LLVM-MCA-RISCV-LMUL M1
  vadd.vv v2, v2, v2
  vsetvli zero, a0, e8, m8, tu, mu
  # LLVM-MCA-RISCV-LMUL M8
  vadd.vv v2, v2, v2

Example of a program with a call to `vsetvl`:

.. code-block:: none

  vsetvl rd, rs1, rs2
  # LLVM-MCA-RISCV-LMUL M1
  vadd.vv v12, v12, v12
  vsetvl rd, rs1, rs2
  # LLVM-MCA-RISCV-LMUL M4
  vadd.vv v12, v12, v12

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input.
The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. This report can be produced with the
following command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps   %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps   %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps    %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps   %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps   %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *Dispatch Width* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle.
For processors with an
in-order backend, *Dispatch Width* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop carried dependencies. Block throughput is bounded from above by the
dispatch rate and the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.

Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed 'uOps Per Cycle' should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the `Block RThroughput`.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle.
A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00) and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

The second section of the report is the `instruction info view`. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of a same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycle/instruction. That is because the FP multiplier JFPM is only available
from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
`-show-encoding` is specified.

Below is an example of `-show-encoding` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:      Instructions:
   1      2     1.00                         4     c5 f0 59 d0     vmulps    %xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da     vhaddps   %xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3     vhaddps   %xmm3, %xmm3, %xmm4

The `Encoding Size` column shows the size in bytes of instructions. The
`Encodings` column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
^^^^^^^^^^^^^

The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``.
As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER   .    vhaddps  %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R   .    vmulps   %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R  .    vhaddps  %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER  .    vhaddps  %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R .    vmulps   %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER .    vhaddps  %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution.
It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations (``-iterations=3``), the iteration
indices range from 0 to 2 inclusive.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).
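
These event cycles can be read mechanically off the timeline characters. As a
small illustrative sketch (not part of :program:`llvm-mca`; the function name
is hypothetical), the following Python decodes one timeline row using the
state legend described earlier:

.. code-block:: python

  def decode_timeline_row(row):
      """Recover event cycles from one timeline row.

      Characters follow the timeline legend: D = dispatched,
      e = executing, E = executed (write-back), R = retired,
      '=' = waiting to execute, '-' = waiting to retire.
      A character's position in the row is its cycle number.
      """
      events = {}
      for cycle, state in enumerate(row):
          if state == 'D':
              events['dispatch'] = cycle
          elif state == 'e' and 'issue' not in events:
              events['issue'] = cycle
          elif state == 'E':
              events['write_back'] = cycle
          elif state == 'R':
              events['retire'] = cycle
      return events

  # Row for instruction [1,0] (vmulps); '.' marks idle padding cycles.
  ev = decode_timeline_row(".DeeE-----R")
  print(ev)   # {'dispatch': 1, 'issue': 2, 'write_back': 4, 'retire': 10}
  # The five '-' characters are the 5-cycle gap between write-back and retire:
  print(ev['retire'] - ev['write_back'] - 1)   # 5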
In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at register renaming stage (at the cost of allocating register aliases,
and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. The last row, ``<total>``, shows a global average over
all instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large of an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^

The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive.
It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps  %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps   %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps  %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps  %xmm3, %xmm3, %xmm4   ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps  %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]

According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.
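
A report like the one above can be produced with a command along these lines
(assuming the same ``dot-product.s`` input used in the earlier examples):

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s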
Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.
   [1]    [2]    [3]    [4]
  JALU01   0      0     20
  JFPU01   17     18    18
  JLSAGU   0      0     12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see that the
counter for SCHEDQ reports 272 cycles. This counter is incremented every time
the dispatch logic is unable to dispatch a full group because the scheduler's
queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able
to dispatch two micro opcodes 51.5% of the time. The dispatch group was limited
to one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by using either the ``-all-stats`` or
``-dispatch-stats`` command option.

The next table, *Schedulers*, presents a histogram of the number of micro
opcodes issued per cycle. In this case, of the 610 simulated cycles, single
opcodes were issued 306 times (50.2%), and there were 7 cycles where no opcodes
were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries).
Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using either the ``-all-stats`` or
``-scheduler-stats`` command option.

The next table, *Retire Control Unit*, presents a histogram of the number of
instructions retired per cycle. In this case, of the 610 simulated cycles, two
instructions were retired during the same cycle 399 times (65.4%), and there
were 109 cycles where no instructions were retired. The retire statistics are
displayed by using either the ``-all-stats`` or ``-retire-stats`` command
option.

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers
(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
the 900 instructions processed, there were 900 mappings created. Since this
dot-product example utilized only floating point registers, the JFpuPRF was
responsible for creating the 900 mappings.
However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example,
and that it was never resource constrained. The register file statistics are
displayed by using either the ``-all-stats`` or ``-register-file-stats``
command option.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions:

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).

:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources.
The processor dispatch width defaults to the value
of the ``IssueWidth`` field in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch
  width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and for retiring them in program order. The
number of entries in the reorder buffer defaults to the value specified by
field `MicroOpBufferSize` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available.
Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order) to the underlying
pipelines. Instruction latencies are computed by :program:`llvm-mca` with the
help of the scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to
a resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.
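The per-cycle movement between the WaitSet, ReadySet, and IssuedSet described
above can be sketched as follows. This is an illustrative Python model, not
llvm-mca's actual (C++) implementation; the dictionary fields are invented for
the example:

```python
# Sketch of one simulated scheduler cycle: wake up instructions whose
# operands became ready, then issue the oldest ready instructions.

def advance_cycle(wait_set, ready_set, issued_set, issue_width):
    # WaitSet -> ReadySet: operands became available.
    for instr in [i for i in wait_set if i["operands_ready"]]:
        wait_set.remove(instr)
        ready_set.append(instr)
    # ReadySet -> IssuedSet: issue up to issue_width instructions,
    # prioritizing older (lower program-order index) instructions.
    ready_set.sort(key=lambda i: i["index"])
    issued_set.extend(ready_set[:issue_width])
    del ready_set[:issue_width]

wait_set = [{"index": 2, "operands_ready": False}]
ready_set = [{"index": 1, "operands_ready": True},
             {"index": 0, "operands_ready": True}]
issued_set = []
advance_cycle(wait_set, ready_set, issued_set, issue_width=2)
# Both ready instructions are issued oldest-first; the waiting one stays put.
```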
When an instruction is executed, the retire control unit flags it as
"ready to retire."

Instructions are retired in program order. The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
utilizes a simulated load/store unit (LSUnit) to model the speculative
execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify the flags ``-lqueue`` and ``-squeue`` to limit the number of entries in
the load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``). Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option).
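The four ordering rules, together with the ``-noalias`` assumption, can be
sketched as a small predicate. This is an illustrative Python model (not
llvm-mca code): ``may_pass`` answers whether a younger memory operation may be
reordered before an older one, ignoring barriers and intervening operations
for simplicity:

```python
# Sketch of the LSUnit ordering rules. Each operation is a kind string,
# either "load" or "store"; barriers and intervening stores are not modeled.

def may_pass(younger_kind, older_kind, noalias=True):
    if younger_kind == "store":
        # Rules 3 and 4: a younger store never passes an older store or load.
        return False
    if older_kind == "store":
        # Rule 2: a younger load passes an older store only if they don't
        # alias; with -noalias=true the LSUnit optimistically assumes they
        # never do.
        return noalias
    # Rule 1: load-over-load reordering is allowed when there are no
    # intervening stores or barriers (not modeled here).
    return True

assert may_pass("load", "store") is True                  # optimistic default
assert may_pass("load", "store", noalias=False) is False  # conservative mode
assert may_pass("store", "load") is False                 # rule 4
```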
Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through) and consequently to
weaken or strengthen the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For loads,
the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not (on its own) know about serializing operations or
memory-barrier-like instructions. The LSUnit used to conservatively use an
instruction's "MayLoad", "MayStore", and unmodeled side effects flags to
determine whether an instruction should be treated as a memory barrier. This
was inaccurate in general, and was changed so that now each instruction has an
IsAStoreBarrier and an IsALoadBarrier flag. These flags are mca-specific and
default to false for every instruction. If any instruction should have either
of these flags set, it should be done within the target's InstrPostProcess
class. For an example, look at the
`X86InstrPostProcess::postProcessInstruction` method within
`llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp`.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute.
A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, that all of the older loads/stores have been
executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.

In-order Issue and Execute
""""""""""""""""""""""""""""""""""""
In-order processors are modeled as a single ``InOrderIssueStage`` stage. This
stage bypasses the Dispatch, Scheduler, and Load/Store units. Instructions are
issued as soon as their operand registers are available and resource
requirements are met. Multiple instructions can be issued in one cycle
according to the value of the ``IssueWidth`` parameter in LLVM's scheduling
model.

Once issued, an instruction is moved to the ``IssuedInst`` set until it is
ready to retire. :program:`llvm-mca` ensures that writes are committed
in-order. However, an instruction is allowed to commit writes and retire
out-of-order if the ``RetireOOO`` property is true for at least one of its
writes.

Custom Behaviour
""""""""""""""""""""""""""""""""""""
Because certain instructions are not expressed perfectly within their
scheduling model, :program:`llvm-mca` isn't always able to simulate them
perfectly. Modifying the scheduling model isn't always a viable option though
(maybe because the instruction is modeled incorrectly on purpose, or because
the instruction's behaviour is quite complex).
The
CustomBehaviour class can be used in these cases to enforce proper
instruction modeling (often by customizing data dependencies and detecting
hazards that :program:`llvm-mca` has no way of knowing about).

:program:`llvm-mca` comes with one generic and multiple target specific
CustomBehaviour classes. The generic class is used if the ``-disable-cb``
flag is set or if a target specific CustomBehaviour class doesn't exist for
that target. (The generic class does nothing.) Currently, the CustomBehaviour
class is only a part of the in-order pipeline, but there are plans to add it
to the out-of-order pipeline in the future.

CustomBehaviour's main method is `checkCustomHazard()`, which uses the
current instruction and a list of all instructions still executing within
the pipeline to determine if the current instruction should be dispatched.
As output, the method returns an integer representing the number of cycles
that the current instruction must stall for (this can be an underestimate
if you don't know the exact number; a value of 0 represents no stall).

If you'd like to add a CustomBehaviour class for a target that doesn't
already have one, refer to an existing implementation to see how to set it
up. The classes are implemented within the target specific backend (for
example, `/llvm/lib/Target/AMDGPU/MCA/`) so that they can access backend
symbols.

Instrument Manager
""""""""""""""""""""""""""""""""""""
On certain architectures, the scheduling information for certain instructions
does not contain all of the information required to identify the most precise
schedule class. For example, data that can have an impact on scheduling can
be stored in CSR registers.

One example of this is on RISC-V, where values in registers such as `vtype`
and `vl` change the scheduling behaviour of vector instructions.
Since MCA
does not keep track of the values in registers, instrument comments can
be used to specify these values.

InstrumentManager's main function is `getSchedClassID()`, which has access
to the MCInst and all of the instruments that are active for that MCInst.
This function can use the instruments to override the schedule class of
the MCInst.

On RISC-V, instrument comments containing LMUL information are used
by `getSchedClassID()` to map a vector instruction and the active
LMUL to the scheduling class of the pseudo-instruction that describes
that base instruction and the active LMUL.

Custom Views
""""""""""""""""""""""""""""""""""""
:program:`llvm-mca` comes with several Views, such as the Timeline View and
the Summary View. These Views are generic and can work with most (if not all)
targets. If you wish to add a new View to :program:`llvm-mca` and it does not
require any backend functionality that is not already exposed through MC layer
classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
`/tools/llvm-mca/Views/` directory. However, if your new View is target
specific AND requires unexposed backend symbols or functionality, you can
define it in the `/lib/Target/<TargetName>/MCA/` directory.

To enable this target specific View, you will have to use this target's
CustomBehaviour class to override the `CustomBehaviour::getViews()` methods.
There are three variations of these methods, based on where you want your View
to appear in the output: `getStartViews()`, `getPostInstrInfoViews()`, and
`getEndViews()`. These methods return a vector of Views, so you will want to
return a vector containing all of the target specific Views for the target in
question.
Because these target specific (and backend dependent) Views require the
`CustomBehaviour::getViews()` variants, these Views will not be enabled if
the `-disable-cb` flag is used.

Enabling these custom Views does not affect the non-custom (generic) Views.
Continue to use the usual command line arguments to enable/disable those
Views.