llvm-mca.rst - OpenGrok cross reference for /llvm-project/llvm/docs/CommandGuide/llvm-mca.rst

Lines Matching full:by
28 reporting style were inspired by the IACA tool from Intel.
43 (:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
44 directive at the beginning of the input.  By default its output syntax matches
51 By design, the quality of the analysis conducted by :program:`llvm-mca` is
52 inevitably affected by the quality of the scheduling models in LLVM.
89   Specify the processor for which to analyze the code.  By default, the cpu name
94  Specify the output assembly variant for the report generated by the tool.
96  the AT&T (resp. Intel) assembly format for the code printed out by the tool in
128   Specify the size of the load queue in the load/store unit emulated by the tool.
129   By default, the tool assumes an unbounded number of entries in the load queue.
135   Specify the size of the store queue in the load/store unit emulated by the
136   tool. By default, the tool assumes an unbounded number of entries in the store
146   Limit the number of iterations to print in the timeline view. By default, the
151   Limit the number of cycles in the timeline view, or use 0 for no limit. By
156   Enable the resource pressure view. This is enabled by default.
166   is disabled by default.
171   issue events. This view is disabled by default.
175   Enable extra retire control unit statistics. This view is disabled by default.
179   Enable the instruction info view. This is enabled by default.
194   control unit. This option is disabled by default.
211   can be expensive, and it is disabled by default. Bottlenecks are highlighted
219   The individual views refer to them by index. However, not all views are
337 An InstrumentRegion describes a region of assembly code guarded by
345 where `INSTRUMENT_TYPE` is a type defined by the target and expects
363 that do not start with `LLVM-MCA-` are ignored by :program:`llvm-mca`.
424 parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
432 Here is an example of a performance report generated by the tool for a
490   Resource pressure by instruction:
511 IPC is computed by dividing the total number of simulated instructions by the total
517 of loop-carried dependencies. Block throughput is bounded from above by the
521 theoretical maximum which can be computed by dividing the number of instructions
522 of a single iteration by the `Block RThroughput`.
525 opcodes by the total number of cycles. A delta between Dispatch Width and this
528 maximum throughput which can be computed by dividing the number of uOps of a
529 single iteration by the `Block RThroughput`.
531 Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
533 and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
537 Cycle (computed by dividing the number of uOps of a single iteration by the
538 `Block RThroughput`) is an indicator of a performance bottleneck caused by the
546 an indicator of a performance bottleneck caused by the lack of hardware
588 the average number of resource cycles consumed every iteration by instructions
599 The resource pressure view helps with identifying bottlenecks caused by high
607 transitions through an instruction pipeline.  This view is enabled by the
610 These states are represented by the following characters:
620 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
663 help diagnose performance bottlenecks caused by long data dependencies and
666 An instruction in the timeline view is identified by a pair of indices, where
683 scheduler's queue for the operands to become available. By the time vmulps is
686 JFPU1 pipeline. That is demonstrated by the fact that the instruction only
694 chain.  Register %xmm2 written by vmulps is immediately used by the first
695 vhaddps, and register %xmm3 written by the first vhaddps is used by the second
699 In the dot-product example, there are anti-dependencies introduced by
704 Table *Average Wait times* helps diagnose performance issues that are caused by
707 instructions measured. Note that :program:`llvm-mca`, by default, assumes at
710 When the performance is limited by data dependencies and/or long latency
715 instructions.  When performance is mostly limited by the lack of hardware
726 backend pressure (caused by pipeline resource pressure and data dependencies) to
729 Below is an example of ``-bottleneck-analysis`` output generated by
760 According to the analysis, throughput is limited by resource pressure and not by
763 caused by contention on processor resources JFPA/JFPU0.
770 performance. By construction, the accuracy of this analysis is strongly
771 dependent on the simulation and (as always) by the quality of the processor
783 Below is an example of ``-all-stats`` output generated by  :program:`llvm-mca`
855 dispatch statistics are displayed by either using the command option
873 multiply followed by two horizontal adds).  That explains why only the floating
876 A full scheduler queue is either caused by data dependency chains or by a
878 mitigated by rewriting the kernel using different instructions that consume
880 to bottlenecks caused by the presence of long data dependencies.  The scheduler
881 statistics are displayed by using the command option ``-all-stats`` or
888 were retired.  The retire statistics are displayed by using the command option
892 file (PRF) used by the pipeline is presented in this table.  In the case of AMD
901 displayed by using the command option ``-all-stats`` or
904 In this example, we can conclude that the IPC is mostly limited by data
905 dependencies, and not by resource pressure.
950 globally available for register renaming by using the command option
951 ``-register-file-size``.  A value of zero for this option means *unbounded*. By
953 dispatch stalls caused by the lack of physical registers.
955 The number of reorder buffer entries consumed by an instruction depends on the
956 number of micro-opcodes specified for that instruction by the target scheduling
959 number of entries in the reorder buffer defaults to the value specified by field
964 of buffered resources consumed by an instruction.  Buffered resources are
973 Instruction latencies are computed by :program:`llvm-mca` with the help of the
978 dynamically selecting which processor resources are consumed by instructions.
981 units that are consumed by instructions.  For example, if an instruction
983 available units from the group; by default, the resource manager uses a
1022 load and store queues respectively. The queues are unbounded by default.
1034 By default, the LSUnit optimistically assumes that loads do not alias
1075 also means, by construction, all of the older loads/stores have been executed.
1107 instruction modeling (often by customizing data dependencies and detecting
1147 by `getSchedClassID()` to map a vector instruction and the active
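
The matches at lines 511-538 describe how the summary fields relate to one another: IPC and uOps Per Cycle are totals divided by the total number of cycles, and each has a theoretical maximum obtained by dividing the per-iteration count by the `Block RThroughput`. A minimal Python sketch of that arithmetic follows. The function name, its parameters, and the input numbers are hypothetical and chosen only for illustration; they are not taken from an actual :program:`llvm-mca` report.

```python
def throughput_summary(instrs_per_iter, uops_per_iter, iterations,
                       total_cycles, block_rthroughput):
    """Recompute the summary ratios described in the report documentation.

    All parameter names are illustrative assumptions, not llvm-mca identifiers.
    """
    total_instrs = instrs_per_iter * iterations
    total_uops = uops_per_iter * iterations
    return {
        # Simulated instructions divided by simulated cycles.
        "ipc": total_instrs / total_cycles,
        # Simulated micro opcodes divided by simulated cycles.
        "uops_per_cycle": total_uops / total_cycles,
        # Upper bounds: per-iteration counts divided by Block RThroughput.
        "max_ipc": instrs_per_iter / block_rthroughput,
        "max_uops_per_cycle": uops_per_iter / block_rthroughput,
    }


# Made-up inputs: 3 single-uOp instructions per iteration, 100 iterations,
# 210 simulated cycles, and a Block RThroughput of 2.0 cycles per iteration.
summary = throughput_summary(3, 3, 100, 210, 2.0)
gap = summary["max_ipc"] - summary["ipc"]  # distance from the ideal throughput
```

Under these assumed inputs, a small positive ``gap`` means the simulated IPC is close to the bound set by the `Block RThroughput`, while a large one points to some other limiting factor, such as data dependencies.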