llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
directive at the beginning of the input. By default its output syntax matches
that of its input.)

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its
output to standard output if the input is from standard input. If the
:option:`-o` option specifies "``-``", then the output will also be sent to
standard output.
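
For example, a minimal sketch of both I/O modes; ``foo.s`` and ``report.txt``
are placeholder file names:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -o report.txt foo.s
  $ cat foo.s | llvm-mca -mcpu=btver2 -o -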

.. option:: -help

 Print a summary of command line options.

.. option:: -o <filename>

 Use ``<filename>`` as the output filename. See the summary above for more
 details.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

 Specify the processor for which to analyze the code. By default, the CPU name
 is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format, while a value of 1 selects the Intel assembly format for the code
 printed out by the tool in the analysis report.

.. option:: -print-imm-hex

 Prefer hex format for numeric literals in the output assembly printed as part
 of the report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model. If width is
 zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

 If set, the tool assumes that loads and stores don't alias. This is the
 default behavior.

.. option:: -lqueue=<load queue size>

 Specify the size of the load queue in the load/store unit emulated by the tool.
 By default, the tool assumes an unbounded number of entries in the load queue.
 A value of zero for this flag is ignored, and the default load queue size is
 used instead.

.. option:: -squeue=<store queue size>

 Specify the size of the store queue in the load/store unit emulated by the
 tool. By default, the tool assumes an unbounded number of entries in the store
 queue. A value of zero for this flag is ignored, and the default store queue
 size is used instead.
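
As an illustration only, the simulated hardware parameters above can be
combined in a single what-if run; the dispatch width and iteration count below
are arbitrary values, and ``foo.s`` is a placeholder input:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -dispatch=4 -iterations=200 foo.s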

.. option:: -timeline

 Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

 Limit the number of iterations to print in the timeline view. By default, the
 timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

 Limit the number of cycles in the timeline view. By default, the number of
 cycles is set to 80.

.. option:: -resource-pressure

 Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

 Enable register file usage statistics.

.. option:: -dispatch-stats

 Enable extra dispatch statistics. This view collects and analyzes instruction
 dispatch events, as well as static/dynamic dispatch stall events. This view
 is disabled by default.

.. option:: -scheduler-stats

 Enable extra scheduler statistics. This view collects and analyzes instruction
 issue events. This view is disabled by default.

.. option:: -retire-stats

 Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

 Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

 Enable the printing of instruction encodings within the instruction info view.

.. option:: -all-stats

 Print all hardware statistics. This enables extra statistics related to the
 dispatch logic, the hardware schedulers, the register file(s), and the retire
 control unit. This option is disabled by default.

.. option:: -all-views

 Enable all the views.

.. option:: -instruction-tables

 Print resource pressure information based on the static information
 available from the processor model. This differs from the resource pressure
 view because it doesn't require the code to be simulated. It instead prints
 the theoretical uniform distribution of resource pressure for every
 instruction in the sequence.

.. option:: -bottleneck-analysis

 Print information about bottlenecks that affect the throughput. This analysis
 can be expensive, and it is disabled by default. Bottlenecks are highlighted
 in the summary view. Bottleneck analysis is currently not supported for
 processors with an in-order backend.

.. option:: -json

 Print the requested views in JSON format. The instructions and the processor
 resources are printed as members of special top level JSON objects. The
 individual views refer to them by index.
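
As a sketch, the view options above can be freely combined on the command line;
``foo.s`` is a placeholder input file:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -all-views foo.s
  $ llvm-mca -mcpu=btver2 -json foo.s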


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------
:program:`llvm-mca` allows the optional use of special code comments to
mark regions of the assembly code to be analyzed. A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region. For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file. Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active
user-defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. The following report can be produced with
this command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps     %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps     %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps    %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps    %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle. For processors with an
in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles. In this example, the reported IPC of 1.48 is simply 900
instructions divided by 610 cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop-carried dependencies.
Block throughput is limited from above by the
dispatch rate, and the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.

Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed 'uOps Per Cycle' should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the `Block RThroughput`.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

The second section of the report is the `instruction info view`. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycle/instruction. That is because the FP multiplier JFPM is only available
from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
`-show-encoding` is specified.
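
For instance, a sketch of such an invocation, reusing the dot-product input
from the earlier examples:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -show-encoding dot-product.s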

Below is an example of `-show-encoding` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:      Instructions:
   1      2     1.00                         4     c5 f0 59 d0     vmulps     %xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da     vhaddps    %xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3     vhaddps    %xmm3, %xmm3, %xmm4

The `Encoding Size` column shows the size in bytes of instructions. The
`Encodings` column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps     %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps    %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps    %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps     %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps    %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps    %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps     %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps    %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps    %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps     %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps    %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps    %xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations: ``-iterations=3``, the iteration
indices range from 0-2 inclusively.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available.
By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. The last row, ``<total>``, shows a global average over
all instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.
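
As a sketch, the analysis can be requested by adding the flag to the usual
invocation, here reusing the dot-product input from the earlier examples:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s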

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.

Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.
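
As a sketch, these counters can be requested by adding ``-all-stats`` to the
usual invocation, here reusing the dot-product input from the earlier examples:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -all-stats dot-product.s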

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by either using the command option
``-all-stats`` or ``-dispatch-stats``.

The next table, *Schedulers*, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of cycles.
In
this case, of the 610 simulated cycles, single opcodes were issued 306 times
(50.2%) and there were 7 cycles where no opcodes were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram displaying a count,
representing the number of instructions retired on some number of cycles. In
this case, of the 610 simulated cycles, two instructions were retired during the
same cycle 399 times (65.4%) and there were 109 cycles where no instructions
were retired. The retire statistics are displayed by using the command option
``-all-stats`` or ``-retire-stats``.

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF). The table shows that of the 900
instructions processed, there were 900 mappings created. Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings. However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained. The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.


Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).

:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.
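
For instance, a hypothetical what-if run that artificially constrains register
renaming can be combined with ``-register-file-stats`` to observe the effect;
the size of 32 below is an arbitrary illustration:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -register-file-size=32 -register-file-stats dot-product.s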

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order. The
number of entries in the reorder buffer defaults to the value specified by field
`MicroOpBufferSize` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order).
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

When instructions are executed, the retire control unit flags the instruction as
"ready to retire."

Instructions are retired in program order.
The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
uses a simulated load/store unit (LSUnit) to model the speculative execution of
loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
(`-noalias=true`) store operations. Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through; etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not know about serializing operations or memory-barrier
like instructions.
The LSUnit conservatively assumes that an instruction which
has both "MayLoad" and unmodeled side effects behaves like a "soft"
load-barrier. That means it serializes loads without forcing a flush of the
load queue. Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers. A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.

In-order Issue and Execute
""""""""""""""""""""""""""
In-order processors are modelled as a single ``InOrderIssueStage`` stage. It
bypasses the Dispatch, Scheduler and Load/Store unit. Instructions are issued as
soon as their operand registers are available and resource requirements are
met. Multiple instructions can be issued in one cycle according to the value of
the ``IssueWidth`` parameter in LLVM's scheduling model.

Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
to retire. :program:`llvm-mca` ensures that writes are committed in-order.
However, an instruction is allowed to commit writes and retire out-of-order if
the ``RetireOOO`` property is true for at least one of its writes.
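
As a sketch, analyzing code for an in-order processor only requires selecting a
CPU whose LLVM scheduling model is marked as in-order; the example below assumes
that the installed LLVM provides such a model for ``cortex-a55`` and that
``foo.s`` is a placeholder AArch64 assembly file:

.. code-block:: bash

  $ llvm-mca -mtriple=aarch64 -mcpu=cortex-a55 foo.s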