llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help diagnose potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
directive at the beginning of the input.  By default its output syntax matches
that of its input.)

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input.  If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.

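For example, the following invocations illustrate these defaults; ``foo.s`` and
``report.txt`` are placeholder names used only for illustration:

.. code-block:: bash

  # Read from a file, write the report to a file.
  $ llvm-mca -mcpu=btver2 foo.s -o report.txt
  # Read from standard input, write the report to standard output.
  $ cat foo.s | llvm-mca -mcpu=btver2 -o -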

.. option:: -help

 Print a summary of command line options.

.. option:: -o <filename>

 Use ``<filename>`` as the output filename. See the summary above for more
 details.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

  Specify the processor for which to analyze the code.  By default, the cpu name
  is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format, while a value of 1 selects the Intel assembly format for the code
 printed out by the tool in the analysis report.

.. option:: -print-imm-hex

 Prefer hex format for numeric literals in the output assembly printed as part
 of the report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model.  If width is
 zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

  If set, the tool assumes that loads and stores don't alias. This is the
  default behavior.

.. option:: -lqueue=<load queue size>

  Specify the size of the load queue in the load/store unit emulated by the tool.
  By default, the tool assumes an unbounded number of entries in the load queue.
  A value of zero for this flag is ignored, and the default load queue size is
  used instead.

.. option:: -squeue=<store queue size>

  Specify the size of the store queue in the load/store unit emulated by the
  tool. By default, the tool assumes an unbounded number of entries in the store
  queue. A value of zero for this flag is ignored, and the default store queue
  size is used instead.

.. option:: -timeline

  Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

  Limit the number of iterations to print in the timeline view. By default, the
  timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

  Limit the number of cycles in the timeline view. By default, the number of
  cycles is set to 80.
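
  For example, a sketch of an invocation that combines the timeline flags
  described above (``foo.s`` is a placeholder input file name):

  .. code-block:: bash

    $ llvm-mca -mcpu=btver2 -timeline -timeline-max-iterations=2 -timeline-max-cycles=200 foo.s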

.. option:: -resource-pressure

  Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

  Enable register file usage statistics.

.. option:: -dispatch-stats

  Enable extra dispatch statistics. This view collects and analyzes instruction
  dispatch events, as well as static/dynamic dispatch stall events. This view
  is disabled by default.

.. option:: -scheduler-stats

  Enable extra scheduler statistics. This view collects and analyzes instruction
  issue events. This view is disabled by default.

.. option:: -retire-stats

  Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

  Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

  Enable the printing of instruction encodings within the instruction info view.

.. option:: -all-stats

  Print all hardware statistics. This enables extra statistics related to the
  dispatch logic, the hardware schedulers, the register file(s), and the retire
  control unit. This option is disabled by default.

.. option:: -all-views

  Enable all the views.

.. option:: -instruction-tables

  Print resource pressure information based on the static information
  available from the processor model. This differs from the resource pressure
  view because it doesn't require the code to be simulated. It instead prints
  the theoretical uniform distribution of resource pressure for every
  instruction in the sequence.
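
  For example, a static report that does not simulate the code can be requested
  as follows (``foo.s`` is a placeholder input file name):

  .. code-block:: bash

    $ llvm-mca -mcpu=btver2 -instruction-tables foo.s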

.. option:: -bottleneck-analysis

  Print information about bottlenecks that affect the throughput. This analysis
  can be expensive, and it is disabled by default.  Bottlenecks are highlighted
  in the summary view. Bottleneck analysis is currently not supported for
  processors with an in-order backend.

.. option:: -json

  Print the requested views in JSON format. The instructions and the processor
  resources are printed as members of special top level JSON objects.  The
  individual views refer to them by index.
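
  For example, a minimal sketch that emits the default views as JSON
  (``foo.s`` is a placeholder input file name):

  .. code-block:: bash

    $ llvm-mca -mcpu=btver2 -json foo.s -o report.json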


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------
:program:`llvm-mca` allows for the optional usage of special code comments to
mark regions of the assembly code to be analyzed.  A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region.  For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file.  Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active user
defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.
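
Building on the workaround above, the marked function can be compiled to
assembly and piped straight into :program:`llvm-mca`, mirroring the earlier
examples; ``foo.cpp`` is a placeholder name for the file containing the marked
function:

.. code-block:: bash

  $ clang++ foo.cpp -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2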

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2.  This report can be produced by running
the following command on the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps	%xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps	%xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps	%xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps	%xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections.  The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and  **Block RThroughput** (Block Reciprocal
Throughput).

Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle. For processors with an
in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop carried dependencies. Block throughput is bounded from above by the
dispatch rate and by the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.

Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed 'uOps Per Cycle' should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the `Block RThroughput`.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources. In general, the lower the Block RThroughput, the
better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.
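
As a sanity check, the headline numbers in the summary can be reproduced by
hand; the lines below simply restate the formulas described above using the
values from the report (three instructions, each a single uOp, per iteration):

.. code-block:: bash

  # IPC            = Instructions / Total Cycles            = 900 / 610 ~= 1.48
  # uOps Per Cycle = Total uOps   / Total Cycles            = 900 / 610 ~= 1.48
  # Theoretical max uOps Per Cycle
  #                = uOps per iteration / Block RThroughput = 3 / 2.0    = 1.50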

The second section of the report is the `instruction info view`. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1 cycle per
instruction.  That is because the FP multiplier JFPM is only available from
pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
`-show-encoding` is specified.
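
For reference, the encoding-annotated report shown below can be produced with a
command along these lines, reusing the dot-product example from the previous
sections:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -show-encoding dot-product.s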

Below is an example of `-show-encoding` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
   1      2     1.00                         4     c5 f0 59 d0                   vmulps	%xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da                   vhaddps	%xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3                   vhaddps	%xmm3, %xmm3, %xmm4

The `Encoding Size` column shows the size in bytes of instructions.  The
`Encodings` column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*.  This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target.  Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources.  Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided.  Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline.  This view is enabled by the
command line option ``-timeline``.  As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps	%xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps	%xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps	%xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps	%xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution.  It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables.  The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence).  Since this
example was generated using 3 iterations: ``-iterations=3``, the iteration
indices range from 0-2 inclusively.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction.  So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain.  Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations.  However, those dependencies can be
removed at register renaming stage (at the cost of allocating register aliases,
and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Last row, ``<total>``, shows a global average over all
instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue.  The difference between the two counters is a good indicator
of how large of an impact data dependencies had on the execution of the
instructions.  When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small.  However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
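
A report of this kind can be obtained with a command along these lines, again
reusing the dot-product example:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s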

.. code-block:: none


  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]


According to the analysis, throughput is limited by resource pressure and not by
data dependencies.  The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.

Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.
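
This report can be reproduced with a command along these lines:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -all-stats dot-product.s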

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
dispatch statistics are displayed by either using the command option
``-all-stats`` or ``-dispatch-stats``.

The next table, *Schedulers*, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of cycles. In
this case, of the 610 simulated cycles, single opcodes were issued 306 times
(50.2%) and there were 7 cycles where no opcodes were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds).  That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources.  Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies.  The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram displaying a count,
representing the number of instructions retired on some number of cycles.  In
this case, of the 610 simulated cycles, two instructions were retired during the
same cycle 399 times (65.4%) and there were 109 cycles where no instructions
were retired.  The retire statistics are displayed by using the command option
``-all-stats`` or ``-retire-stats``.

The last table presented is *Register File statistics*.  Each physical register
file (PRF) used by the pipeline is presented in this table.  In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF).  The table shows that of the 900
instructions processed, there were 900 mappings created.  Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings.  However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained.  The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).

:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources.  The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors.  Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``.  A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.
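
For example, the effect of a small rename register pool can be observed by
artificially constraining it and enabling the dispatch statistics; this is only
an illustrative sketch, and ``foo.s`` is a placeholder input file name:

.. code-block:: bash

  # foo.s is a placeholder input file name; 8 is an arbitrarily small limit.
  $ llvm-mca -mcpu=btver2 -register-file-size=8 -dispatch-stats foo.s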

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model.  The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order.  The
number of entries in the reorder buffer defaults to the value specified by field
`MicroOpBufferSize` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction.  Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions.  An instruction
has to wait in the scheduler's buffer until input register operands become
available.  Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order).  Instruction latencies
are computed by :program:`llvm-mca` with the help of the scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers.  The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager.  The resource manager is responsible for selecting resource
units that are consumed by instructions.  For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on the operands availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet.  There,
instructions wait until they reach the write-back stage.  At that point, they
get removed from the queue and the retire control unit is notified.

When an instruction is executed, the retire control unit flags it as
"ready to retire."

Instructions are retired in program order.  The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
uses a simulated load/store unit (LSUnit) to model the speculative execution of
loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.
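
For example, the following sketch bounds both queues and disables the no-alias
assumption described below; the queue sizes are arbitrary and ``foo.s`` is a
placeholder input file name:

.. code-block:: bash

  # Queue sizes are arbitrary; foo.s is a placeholder input file name.
  $ llvm-mca -mcpu=btver2 -lqueue=8 -squeue=8 -noalias=false foo.s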

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with
store operations (`-noalias=true`).  Under this assumption, younger loads are
always allowed to pass older stores.  Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations.  That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option).  Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through; etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not know about serializing operations or memory-barrier
like instructions.  The LSUnit conservatively assumes that an instruction which
has both "MayLoad" and unmodeled side effects behaves like a "soft"
load-barrier.  That means, it serializes loads without forcing a flush of the
load queue.  Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers.  A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects.  This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue.  A load/store
barrier enforces ordering of loads/stores.  A younger load cannot pass a load
barrier.  Also, a younger store cannot pass a store barrier.  A younger load
has to wait for the memory/load barrier to execute.  A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.

In-order Issue and Execute
""""""""""""""""""""""""""
In-order processors are modelled as a single ``InOrderIssueStage`` stage. This
stage bypasses the Dispatch stage, the Scheduler, and the Load/Store unit.
Instructions are issued as soon as their operand registers are available and
resource requirements are met. Multiple instructions can be issued in one cycle
according to the value of the ``IssueWidth`` parameter in LLVM's scheduling
model.

Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
to retire. :program:`llvm-mca` ensures that writes are committed in-order.
However, an instruction is allowed to commit writes and retire out-of-order if
the ``RetireOOO`` property is true for at least one of its writes.
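
As a sketch, an in-order target is analyzed with the same command-line
interface as an out-of-order one. The example below assumes an AArch64 CPU
whose LLVM scheduling model is marked in-order (recent LLVM releases model
cortex-a55 this way); ``foo.s`` is a placeholder input file name:

.. code-block:: bash

  # Assumes the selected CPU uses an in-order scheduling model.
  $ llvm-mca -mtriple=aarch64-unknown-unknown -mcpu=cortex-a55 foo.s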