llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
directive at the beginning of the input. By default its output syntax matches
that of its input.)

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its
output to standard output if the input is from standard input. If the
:option:`-o` option specifies "``-``", then the output will also be sent to
standard output.
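
For example, a minimal sketch of both I/O modes; ``foo.s`` and ``report.txt``
are placeholder file names:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -o report.txt foo.s
  $ cat foo.s | llvm-mca -mcpu=btver2 -o -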

.. option:: -help

 Print a summary of command line options.

.. option:: -o <filename>

 Use ``<filename>`` as the output filename. See the summary above for more
 details.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

 Specify the processor for which to analyze the code. By default, the CPU name
 is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format, while a value of 1 selects the Intel assembly format for the code
 printed out by the tool in the analysis report.

.. option:: -print-imm-hex

 Prefer hex format for numeric literals in the output assembly printed as part
 of the report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model. If width is
 zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

 If set, the tool assumes that loads and stores don't alias. This is the
 default behavior.

.. option:: -lqueue=<load queue size>

 Specify the size of the load queue in the load/store unit emulated by the tool.
 By default, the tool assumes an unbounded number of entries in the load queue.
 A value of zero for this flag is ignored, and the default load queue size is
 used instead.

.. option:: -squeue=<store queue size>

 Specify the size of the store queue in the load/store unit emulated by the
 tool. By default, the tool assumes an unbounded number of entries in the store
 queue. A value of zero for this flag is ignored, and the default store queue
 size is used instead.
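
As an illustration only, the simulated hardware parameters above can be
combined in a single what-if run; the dispatch width and iteration count below
are arbitrary values, and ``foo.s`` is a placeholder input:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -dispatch=4 -iterations=200 foo.s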

.. option:: -timeline

 Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

 Limit the number of iterations to print in the timeline view. By default, the
 timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

 Limit the number of cycles in the timeline view. By default, the number of
 cycles is set to 80.

.. option:: -resource-pressure

 Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

 Enable register file usage statistics.

.. option:: -dispatch-stats

 Enable extra dispatch statistics. This view collects and analyzes instruction
 dispatch events, as well as static/dynamic dispatch stall events. This view
 is disabled by default.

.. option:: -scheduler-stats

 Enable extra scheduler statistics. This view collects and analyzes instruction
 issue events. This view is disabled by default.

.. option:: -retire-stats

 Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

 Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

 Enable the printing of instruction encodings within the instruction info view.

.. option:: -all-stats

 Print all hardware statistics. This enables extra statistics related to the
 dispatch logic, the hardware schedulers, the register file(s), and the retire
 control unit. This option is disabled by default.

.. option:: -all-views

 Enable all the views.

.. option:: -instruction-tables

 Print resource pressure information based on the static information
 available from the processor model. This differs from the resource pressure
 view because it doesn't require the code to be simulated. It instead prints
 the theoretical uniform distribution of resource pressure for every
 instruction in the sequence.

.. option:: -bottleneck-analysis

 Print information about bottlenecks that affect the throughput. This analysis
 can be expensive, and it is disabled by default. Bottlenecks are highlighted
 in the summary view. Bottleneck analysis is currently not supported for
 processors with an in-order backend.

.. option:: -json

 Print the requested views in JSON format. The instructions and the processor
 resources are printed as members of special top level JSON objects. The
 individual views refer to them by index.
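
As a sketch, the view options above can be freely combined on the command line;
``foo.s`` is a placeholder input file:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -all-views foo.s
  $ llvm-mca -mcpu=btver2 -json foo.s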


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------
:program:`llvm-mca` allows the optional use of special code comments to
mark regions of the assembly code to be analyzed. A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region. For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file. Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active
user-defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. The following report can be produced with
this command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps     %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps     %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps    %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps    %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle. For processors with an
in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles. In this example, the reported IPC of 1.48 is simply 900
instructions divided by 610 cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop-carried dependencies.
Block throughput is limited from above by the
dispatch rate, and the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.

Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed 'uOps Per Cycle' should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the `Block RThroughput`.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

The second section of the report is the `instruction info view`. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycle/instruction. That is because the FP multiplier JFPM is only available
from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
`-show-encoding` is specified.
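
For instance, a sketch of such an invocation, reusing the dot-product input
from the earlier examples:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -show-encoding dot-product.s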

Below is an example of `-show-encoding` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:      Instructions:
   1      2     1.00                         4     c5 f0 59 d0     vmulps     %xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da     vhaddps    %xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3     vhaddps    %xmm3, %xmm3, %xmm4

The `Encoding Size` column shows the size in bytes of instructions. The
`Encodings` column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps     %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps    %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps    %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps     %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps    %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps    %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps     %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps    %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps    %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps     %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps    %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps    %xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations: ``-iterations=3``, the iteration
indices range from 0-2 inclusively.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available.
By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. The last row, ``<total>``, shows a global average over
all instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.
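
As a sketch, the analysis can be requested by adding the flag to the usual
invocation, here reusing the dot-product input from the earlier examples:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s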

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.

Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.
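
As a sketch, these counters can be requested by adding ``-all-stats`` to the
usual invocation, here reusing the dot-product input from the earlier examples:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -all-stats dot-product.s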

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by either using the command option
``-all-stats`` or ``-dispatch-stats``.

The next table, *Schedulers*, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of cycles.
In
this case, of the 610 simulated cycles, single opcodes were issued 306 times
(50.2%) and there were 7 cycles where no opcodes were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram displaying a count,
representing the number of instructions retired on some number of cycles. In
this case, of the 610 simulated cycles, two instructions were retired during the
same cycle 399 times (65.4%) and there were 109 cycles where no instructions
were retired. The retire statistics are displayed by using the command option
``-all-stats`` or ``-retire-stats``.

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF). The table shows that of the 900
instructions processed, there were 900 mappings created. Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings. However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained. The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.


Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).

:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.
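
For instance, a hypothetical what-if run that artificially constrains register
renaming can be combined with ``-register-file-stats`` to observe the effect;
the size of 32 below is an arbitrary illustration:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -register-file-size=32 -register-file-stats dot-product.s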

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order. The
number of entries in the reorder buffer defaults to the value specified by field
`MicroOpBufferSize` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order).
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

When instructions are executed, the retire control unit flags the instruction as
"ready to retire."

Instructions are retired in program order.
The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
uses a simulated load/store unit (LSUnit) to model the speculative execution of
loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
(`-noalias=true`) store operations. Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through; etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not know about serializing operations or memory-barrier
like instructions.
The LSUnit conservatively assumes that an instruction which
has both "MayLoad" and unmodeled side effects behaves like a "soft"
load-barrier. That means it serializes loads without forcing a flush of the
load queue. Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers. A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.

In-order Issue and Execute
""""""""""""""""""""""""""
In-order processors are modelled as a single ``InOrderIssueStage`` stage. It
bypasses the Dispatch, Scheduler and Load/Store unit. Instructions are issued as
soon as their operand registers are available and resource requirements are
met. Multiple instructions can be issued in one cycle according to the value of
the ``IssueWidth`` parameter in LLVM's scheduling model.

Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
to retire. :program:`llvm-mca` ensures that writes are committed in-order.
However, an instruction is allowed to commit writes and retire out-of-order if
the ``RetireOOO`` property is true for at least one of its writes.
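
As a sketch, analyzing code for an in-order processor only requires selecting a
CPU whose LLVM scheduling model is marked as in-order; the example below assumes
that the installed LLVM provides such a model for ``cortex-a55`` and that
``foo.s`` is a placeholder AArch64 assembly file:

.. code-block:: bash

  $ llvm-mca -mtriple=aarch64 -mcpu=cortex-a55 foo.s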